Here is an updated version of the huge pages support implementation in
tmpfs/shmem.
All known issues have been fixed. I'll continue with validation.
I will also send a follow-up patch with a documentation update.
Hugh, I would be glad to hear your opinion on this patchset.
The main difference from Hugh's approach[1] is that I continue with
compound pages, where Hugh invents a new way to couple pages: team pages.
I believe the THP refcounting rework made team pages unnecessary: compound
pages are flexible enough to serve the needs of the page cache.
Many ideas and some patches were stolen from Hugh's patchset. Having this
patchset around was very helpful.
I will continue with code validation. I expect mlock to require some
more attention.
Please, review and test the code.
Git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v4
Rebased onto linux-next:
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v4-next
== Changelog ==
v4:
- first four patches were applied to the -mm tree;
- drop pages beyond i_size on split_huge_pages;
- a few small random bugfixes;
v3:
- the huge= mount option can now have the values always, within_size,
  advise and never;
- the sysctl handle is replaced with a sysfs knob;
- MADV_HUGEPAGE/MADV_NOHUGEPAGE are now respected on page allocation via
  page fault;
- mlock() handling has been fixed;
- a bunch of smaller bugfixes and cleanups.
== Design overview ==
Huge pages are allocated by shmem when it's allowed (by mount option) and
there are no entries for the range in the radix-tree. A huge page is
represented by HPAGE_PMD_NR entries in the radix-tree.
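For illustration, a minimal userspace sketch of enabling this per mount.
The mount point path is hypothetical; it only assumes the huge= mount
option introduced by this patchset, and needs CAP_SYS_ADMIN:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Mount a tmpfs instance that is allowed to use huge pages. */
	if (mount("tmpfs", "/mnt/hugetmpfs", "tmpfs", 0,
		  "huge=always,size=1G")) {
		perror("mount");
		return 1;
	}
	return 0;
}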
MM core maps a page with a PMD if ->fault() returns a huge page and the VMA
is suitable for huge pages (size, alignment). There's no need for two
requests to the filesystem: the filesystem returns a huge page if it can,
with graceful fallback to small pages otherwise.
As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we can
re-fault the page with PTEs later.
The basic scheme for split_huge_page() is the same as for anon-THP.
A few differences:
- File pages are in the radix-tree, so head->_count is offset by
  HPAGE_PMD_NR. The count gets distributed to the small pages during split.
- mapping->tree_lock prevents non-lockless access via the radix-tree to
  pages under split;
- Lockless access is prevented by setting head->_count to 0 during
  split, so get_page_unless_zero() fails;
- After split, some pages can be beyond i_size. We drop them from the
  radix-tree.
- We don't set up migration entries, we just unmap the pages. This helps
  handle cases when i_size is in the middle of the page: there's no need
  to handle unmapping of pages beyond i_size manually.
COW mappings are handled at the PTE level. It's not clear how beneficial
allocating huge pages on COW faults would be, and it would require some
code to make them work.
I think at some point we can consider teaching khugepaged to collapse
pages in COW mappings, but allocating huge pages on fault is probably
overkill.
As with anon THP, we mlock a file huge page only if it is mapped with a
PMD. PTE-mapped THPs are never mlocked. This way we avoid all sorts of
scenarios where we could leak an mlocked page.
As with anon THP, we split the huge page on swap-out.
Truncate and punch-hole operations that cover only part of a THP range are
implemented by zeroing out that part of the THP.
This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour:
as we don't really create a hole in this case, lseek(SEEK_HOLE) may give
inconsistent results depending on which pages happened to be allocated.
I don't think this will be a problem. A short userspace sketch of the
behaviour is shown below.
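A minimal sketch of the scenario above (the file path is hypothetical;
whether SEEK_HOLE reports a hole depends on whether the range happened to
be backed by a huge page):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096] = { 1 };
	off_t off;
	int fd = open("/mnt/hugetmpfs/file", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	/* Populate 2MB so shmem may back the range with a single THP. */
	for (off = 0; off < (2L << 20); off += sizeof(buf))
		pwrite(fd, buf, sizeof(buf), off);
	/* Punch a hole covering only part of the (possible) huge page:
	 * the range is zeroed, but no real hole is created. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  4096, 64 * 4096);
	/* May report a hole at 4096, or no hole before i_size. */
	printf("first hole at %lld\n", (long long)lseek(fd, 0, SEEK_HOLE));
	close(fd);
	return 0;
}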
== Patchset overview ==
[01-04/25]
Rework the fault path and rmap to handle file PMDs. Unlike DAX with
vm_ops->pmd_fault, we don't need to ask the filesystem twice -- first
for a huge page and then for a small one. If ->fault happens to return
a huge page and the VMA is suitable for mapping it as huge, we do so.
[05/25]
Add support for huge file pages in rmap;
[06-16/25]
Various preparations of the THP core for file pages.
[17-21/25]
Various preparations of the MM core for file pages.
[22-24/25]
And finally, bring huge pages into tmpfs/shmem.
[25/25]
Wire up existing madvise() hints for file THP.
We can implement fadvise() later.
[1] http://lkml.kernel.org/g/[email protected]
Hugh Dickins (1):
shmem: get_unmapped_area align huge page
Kirill A. Shutemov (24):
mm: do not pass mm_struct into handle_mm_fault
mm: introduce fault_env
mm: postpone page table allocation until we have page to map
rmap: support file thp
mm: introduce do_set_pmd()
mm, rmap: account file thp pages
thp, vmstats: add counters for huge file pages
thp: support file pages in zap_huge_pmd()
thp: handle file pages in split_huge_pmd()
thp: handle file COW faults
thp: handle file pages in mremap()
thp: skip file huge pmd on copy_huge_pmd()
thp: prepare change_huge_pmd() for file thp
thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
thp: file pages support for split_huge_page()
thp, mlock: do not mlock PTE-mapped file huge pages
vmscan: split file huge pages before paging them out
page-flags: relax policy for PG_mappedtodisk and PG_reclaim
radix-tree: implement radix_tree_maybe_preload_order()
filemap: prepare find and delete operations for huge pages
truncate: handle file thp
shmem: prepare huge= mount option and sysfs knob
shmem: add huge pages support
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
Documentation/filesystems/Locking | 10 +-
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/avr32/mm/fault.c | 2 +-
arch/cris/mm/fault.c | 2 +-
arch/frv/mm/fault.c | 2 +-
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m32r/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/metag/mm/fault.c | 2 +-
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/mn10300/mm/fault.c | 2 +-
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/mm/copro_fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/score/mm/fault.c | 2 +-
arch/sh/mm/fault.c | 2 +-
arch/sparc/mm/fault_32.c | 4 +-
arch/sparc/mm/fault_64.c | 2 +-
arch/tile/mm/fault.c | 2 +-
arch/um/kernel/trap.c | 2 +-
arch/unicore32/mm/fault.c | 2 +-
arch/x86/mm/fault.c | 2 +-
arch/xtensa/mm/fault.c | 2 +-
drivers/base/node.c | 10 +-
drivers/char/mem.c | 24 ++
drivers/iommu/amd_iommu_v2.c | 2 +-
drivers/iommu/intel-svm.c | 2 +-
fs/proc/meminfo.c | 5 +-
fs/userfaultfd.c | 22 +-
include/linux/huge_mm.h | 26 +-
include/linux/mm.h | 51 ++-
include/linux/mmzone.h | 3 +-
include/linux/page-flags.h | 19 +-
include/linux/radix-tree.h | 1 +
include/linux/rmap.h | 2 +-
include/linux/shmem_fs.h | 5 +-
include/linux/userfaultfd_k.h | 8 +-
include/linux/vm_event_item.h | 7 +
ipc/shm.c | 6 +-
lib/radix-tree.c | 68 ++-
mm/filemap.c | 220 +++++++---
mm/gup.c | 7 +-
mm/huge_memory.c | 564 ++++++++++++++-----------
mm/internal.h | 4 +-
mm/ksm.c | 3 +-
mm/memory.c | 852 +++++++++++++++++++++-----------------
mm/mempolicy.c | 4 +-
mm/migrate.c | 5 +-
mm/mmap.c | 26 +-
mm/mremap.c | 22 +-
mm/nommu.c | 3 +-
mm/page-writeback.c | 1 +
mm/page_alloc.c | 2 +
mm/rmap.c | 76 +++-
mm/shmem.c | 621 +++++++++++++++++++++++----
mm/swap.c | 2 +
mm/truncate.c | 22 +-
mm/util.c | 6 +
mm/vmscan.c | 15 +-
mm/vmstat.c | 3 +
68 files changed, 1866 insertions(+), 925 deletions(-)
--
2.7.0
We always have vma->vm_mm around.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/avr32/mm/fault.c | 2 +-
arch/cris/mm/fault.c | 2 +-
arch/frv/mm/fault.c | 2 +-
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m32r/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/metag/mm/fault.c | 2 +-
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/mn10300/mm/fault.c | 2 +-
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/mm/copro_fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/score/mm/fault.c | 2 +-
arch/sh/mm/fault.c | 2 +-
arch/sparc/mm/fault_32.c | 4 ++--
arch/sparc/mm/fault_64.c | 2 +-
arch/tile/mm/fault.c | 2 +-
arch/um/kernel/trap.c | 2 +-
arch/unicore32/mm/fault.c | 2 +-
arch/x86/mm/fault.c | 2 +-
arch/xtensa/mm/fault.c | 2 +-
drivers/iommu/amd_iommu_v2.c | 2 +-
drivers/iommu/intel-svm.c | 2 +-
include/linux/mm.h | 9 ++++-----
mm/gup.c | 5 ++---
mm/ksm.c | 3 +--
mm/memory.c | 13 +++++++------
36 files changed, 47 insertions(+), 49 deletions(-)
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 4a905bd667e2..83e9eee57a55 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -147,7 +147,7 @@ retry:
/* If for any reason at all we couldn't handle the fault,
make sure we exit gracefully rather than endlessly redo
the fault. */
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index af63f4a13e60..e94e5aa33985 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -137,7 +137,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
if (unlikely(fatal_signal_pending(current))) {
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index ad5841856007..3a2e678b8d30 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -243,7 +243,7 @@ good_area:
goto out;
}
- return handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+ return handle_mm_fault(vma, addr & PAGE_MASK, flags);
check_stack:
/* Don't allow expansion below FIRST_USER_ADDRESS */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 6c8d9bb5dc51..a338148d61ef 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -183,7 +183,7 @@ good_area:
goto out;
}
- return handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
+ return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);
check_stack:
if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index c03533937a9f..a4b7edac8f10 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -134,7 +134,7 @@ good_area:
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 3066d40a6db1..112ef26c7f2e 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -168,7 +168,7 @@ retry:
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 61d99767fe16..614a46c413d2 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -164,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, ear0, flags);
+ fault = handle_mm_fault(vma, ear0, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 8704c9320032..bd7c251e2bce 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -101,7 +101,7 @@ good_area:
break;
}
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 70b40d1205a6..fa6ad95e992e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -159,7 +159,7 @@ retry:
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 8f9875b7933d..a3785d3644c2 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -196,7 +196,7 @@ good_area:
*/
addr = (address & PAGE_MASK);
set_thread_fault_code(error_code);
- fault = handle_mm_fault(mm, vma, addr, flags);
+ fault = handle_mm_fault(vma, addr, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 6a94cdd0c830..bd66a0b20c6b 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -136,7 +136,7 @@ good_area:
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
pr_debug("handle_mm_fault returns %d\n", fault);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index f57edca63609..372783a67dda 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -133,7 +133,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 177dfc003643..abb678ccde6f 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -216,7 +216,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 4b88fa031891..9560ad731120 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -153,7 +153,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 4a1d181ed32f..f23781d6bbb3 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -254,7 +254,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index b51878b0c6b8..affc4eb3f89e 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -131,7 +131,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 230ac20ae794..e94cd225e816 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -163,7 +163,7 @@ good_area:
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index a762864ec92e..87436cd7ea1a 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -243,7 +243,7 @@ good_area:
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 6527882ce05e..bb0354222b11 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -75,7 +75,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
}
ret = 0;
- *flt = handle_mm_fault(mm, vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
+ *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
if (unlikely(*flt & VM_FAULT_ERROR)) {
if (*flt & VM_FAULT_OOM) {
ret = -ENOMEM;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a67c6d781c52..a4db22f65021 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -429,7 +429,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
if (fault & VM_FAULT_SIGSEGV)
goto bad_area;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index ec1a30d0d11a..9579c92e35b9 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -455,7 +455,7 @@ retry:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
/* No reason to continue if interrupted by SIGKILL. */
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
fault = VM_FAULT_SIGNAL;
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 37a6c2e0e969..995b71e4db4b 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -111,7 +111,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 79d8276377d1..9bf876780cef 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -487,7 +487,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
if (mm_fault_error(regs, error_code, address, fault))
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index c399e7b3b035..3d2d6686d8e9 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -241,7 +241,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
@@ -411,7 +411,7 @@ good_area:
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
}
- switch (handle_mm_fault(mm, vma, address, flags)) {
+ switch (handle_mm_fault(vma, address, flags)) {
case VM_FAULT_SIGBUS:
case VM_FAULT_OOM:
goto do_sigbus;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a33da59..6c43b924a7a2 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -436,7 +436,7 @@ good_area:
goto bad_area;
}
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
goto exit_exception;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 13eac59bf16a..4be04f649396 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -435,7 +435,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 98783dd0fa2e..ad8f206ab5e8 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -73,7 +73,7 @@ good_area:
do {
int fault;
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
goto out_nosemaphore;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 2ec3d3adcefc..6c7f70bcaae3 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -194,7 +194,7 @@ good_area:
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the fault.
*/
- fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+ fault = handle_mm_fault(vma, addr & PAGE_MASK, flags);
return fault;
check_stack:
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..d01fd75adfcb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1235,7 +1235,7 @@ good_area:
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;
/*
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 7f4a1fdb1502..2725e08ef353 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -110,7 +110,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 7caf2fa237f2..d2f978245ca6 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -539,7 +539,7 @@ static void do_fault(struct work_struct *work)
goto out;
}
- ret = handle_mm_fault(mm, vma, address, write);
+ ret = handle_mm_fault(vma, address, write);
if (ret & VM_FAULT_ERROR) {
/* failed to service fault */
up_read(&mm->mmap_sem);
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 50464833d0b8..48c0f18e7a84 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -559,7 +559,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (access_error(vma, req))
goto invalid;
- ret = handle_mm_fault(svm->mm, vma, address,
+ ret = handle_mm_fault(vma, address,
req->wr_req ? FAULT_FLAG_WRITE : 0);
if (ret & VM_FAULT_ERROR)
goto invalid;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 19e0cf56e8f3..1e19c653e7c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1193,15 +1193,14 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
int invalidate_inode_page(struct page *page);
#ifdef CONFIG_MMU
-extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags);
+extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags);
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
#else
-static inline int handle_mm_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- unsigned int flags)
+static inline int handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
/* should never happen if there's no MMU */
BUG();
diff --git a/mm/gup.c b/mm/gup.c
index 7bf19ffa2199..60f422a0af8b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -349,7 +349,6 @@ unmap:
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
unsigned long address, unsigned int *flags, int *nonblocking)
{
- struct mm_struct *mm = vma->vm_mm;
unsigned int fault_flags = 0;
int ret;
@@ -372,7 +371,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_TRIED;
}
- ret = handle_mm_fault(mm, vma, address, fault_flags);
+ ret = handle_mm_fault(vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return -ENOMEM;
@@ -659,7 +658,7 @@ retry:
if (!(vm_flags & vma->vm_flags))
return -EFAULT;
- ret = handle_mm_fault(mm, vma, address, fault_flags);
+ ret = handle_mm_fault(vma, address, fault_flags);
major |= ret & VM_FAULT_MAJOR;
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
diff --git a/mm/ksm.c b/mm/ksm.c
index 823d78b2a055..1bd92327bf01 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -440,8 +440,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
if (IS_ERR_OR_NULL(page))
break;
if (PageKsm(page))
- ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ ret = handle_mm_fault(vma, addr, FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index c544458d60a6..0462ed431a4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3341,9 +3341,10 @@ unlock:
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
{
+ struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -3425,15 +3426,15 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
{
int ret;
__set_current_state(TASK_RUNNING);
count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGFAULT);
/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
@@ -3445,7 +3446,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();
- ret = __handle_mm_fault(mm, vma, address, flags);
+ ret = __handle_mm_fault(vma, address, flags);
if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
--
2.7.0
The idea (and most of the code) is borrowed again from Hugh's patchset on
huge tmpfs[1].
Instead of allocating the PTE page table upfront, we postpone this until
we have a page to map in hand. This approach opens the possibility of
mapping the page as huge if the filesystem supports it.
Compared to Hugh's patch, I've pushed page table allocation a bit
further: into do_set_pte(). This way we can postpone allocation even in
the faultaround case without moving do_fault_around() after __do_fault().
do_set_pte() got renamed to alloc_set_pte() as it can allocate a page
table if required.
[1] http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 10 +-
mm/filemap.c | 17 +--
mm/memory.c | 297 +++++++++++++++++++++++++++++++----------------------
3 files changed, 197 insertions(+), 127 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1eb4f324fdb5..e023bb6419cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -300,6 +300,13 @@ struct fault_env {
* Protects pte page table if 'pte'
* is not NULL, otherwise pmd.
*/
+ pgtable_t prealloc_pte; /* Pre-allocated pte page table.
+ * vm_ops->map_pages() calls
+ * alloc_set_pte() from atomic context.
+ * do_fault_around() pre-allocates
+ * page table to avoid allocation from
+ * atomic context.
+ */
};
/*
@@ -580,7 +587,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}
-void do_set_pte(struct fault_env *fe, struct page *page);
+int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+ struct page *page);
#endif
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 3ec791e13f74..e7b55ec0221f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2170,11 +2170,6 @@ void filemap_map_pages(struct fault_env *fe,
start_pgoff) {
if (iter.index > end_pgoff)
break;
- fe->pte += iter.index - last_pgoff;
- fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
- last_pgoff = iter.index;
- if (!pte_none(*fe->pte))
- goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2211,8 +2206,18 @@ repeat:
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
- do_set_pte(fe, page);
+
+ fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ if (fe->pte)
+ fe->pte += iter.index - last_pgoff;
+ last_pgoff = iter.index;
+ alloc_set_pte(fe, NULL, page);
unlock_page(page);
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*fe->pmd))
+ break;
+ /* Failed to setup page table? */
+ VM_BUG_ON(!fe->pte);
goto next;
unlock:
unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 850071fcbd52..9c896a52565c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2657,8 +2657,6 @@ static int do_anonymous_page(struct fault_env *fe)
struct page *page;
pte_t entry;
- pte_unmap(fe->pte);
-
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
@@ -2667,6 +2665,23 @@ static int do_anonymous_page(struct fault_env *fe)
if (check_stack_guard_page(vma, fe->address) < 0)
return VM_FAULT_SIGSEGV;
+ /*
+ * Use pte_alloc() instead of pte_alloc_map(). We can't run
+ * pte_offset_map() on pmds where a huge pmd might be created
+ * from a different thread.
+ *
+ * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+ * parallel threads are excluded by other means.
+ *
+ * Here we only have down_read(mmap_sem).
+ */
+ if (pte_alloc(vma->vm_mm, fe->pmd, fe->address))
+ return VM_FAULT_OOM;
+
+ /* See the comment in pte_alloc_one_map() */
+ if (unlikely(pmd_trans_unstable(fe->pmd)))
+ return 0;
+
/* Use the zero-page for reads */
if (!(fe->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
@@ -2782,23 +2797,76 @@ static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
return ret;
}
+static int pte_alloc_one_map(struct fault_env *fe)
+{
+ struct vm_area_struct *vma = fe->vma;
+
+ if (!pmd_none(*fe->pmd))
+ goto map_pte;
+ if (fe->prealloc_pte) {
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd))) {
+ spin_unlock(fe->ptl);
+ goto map_pte;
+ }
+
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
+ spin_unlock(fe->ptl);
+ fe->prealloc_pte = 0;
+ } else if (unlikely(pte_alloc(vma->vm_mm, fe->pmd, fe->address))) {
+ return VM_FAULT_OOM;
+ }
+map_pte:
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+ * didn't become pmd_trans_huge under us and then back to pmd_none, as
+ * a result of MADV_DONTNEED running immediately after a huge pmd fault
+ * in a different thread of this mm, in turn leading to a misleading
+ * pmd_trans_huge() retval. All we have to ensure is that it is a
+ * regular pmd that we can walk with pte_offset_map() and we can do that
+ * through an atomic read in C, which is what pmd_trans_unstable()
+ * provides.
+ */
+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+ return VM_FAULT_NOPAGE;
+
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ return 0;
+}
+
/**
- * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
+ * alloc_set_pte - setup new PTE entry for given page and add reverse page
+ * mapping. If needed, the function allocates a page table or uses a pre-allocated one.
*
* @fe: fault environment
+ * @memcg: memcg to charge page (only for private mappings)
* @page: page to map
*
- * Caller must hold page table lock relevant for @fe->pte.
+ * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
*
* Target users are page handler itself and implementations of
* vm_ops->map_pages.
*/
-void do_set_pte(struct fault_env *fe, struct page *page)
+int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+ struct page *page)
{
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
+ if (!fe->pte) {
+ int ret = pte_alloc_one_map(fe);
+ if (ret)
+ return ret;
+ }
+
+ /* Re-check under ptl */
+ if (unlikely(!pte_none(*fe->pte)))
+ return VM_FAULT_NOPAGE;
+
flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
if (write)
@@ -2807,6 +2875,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, fe->address, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
+ lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page);
@@ -2815,6 +2885,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, fe->address, fe->pte);
+
+ return 0;
}
static unsigned long fault_around_bytes __read_mostly =
@@ -2881,19 +2953,17 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
+static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
{
- unsigned long address = fe->address, start_addr, nr_pages, mask;
- pte_t *pte = fe->pte;
+ unsigned long address = fe->address, nr_pages, mask;
pgoff_t end_pgoff;
- int off;
+ int off, ret = 0;
nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
- start_addr = max(fe->address & mask, fe->vma->vm_start);
- off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- fe->pte -= off;
+ fe->address = max(address & mask, fe->vma->vm_start);
+ off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
start_pgoff -= off;
/*
@@ -2901,30 +2971,45 @@ static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
* or fault_around_pages() from start_pgoff, depending what is nearest.
*/
end_pgoff = start_pgoff -
- ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ ((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
start_pgoff + nr_pages - 1);
- /* Check if it makes any sense to call ->map_pages */
- fe->address = start_addr;
- while (!pte_none(*fe->pte)) {
- if (++start_pgoff > end_pgoff)
- goto out;
- fe->address += PAGE_SIZE;
- if (fe->address >= fe->vma->vm_end)
- goto out;
- fe->pte++;
+ if (pmd_none(*fe->pmd)) {
+ fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
+ smp_wmb(); /* See comment in __pte_alloc() */
}
fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+
+ /* preallocated pagetable is unused: free it */
+ if (fe->prealloc_pte) {
+ pte_free(fe->vma->vm_mm, fe->prealloc_pte);
+ fe->prealloc_pte = 0;
+ }
+ /* Huge page is mapped? Page fault is solved */
+ if (pmd_trans_huge(*fe->pmd)) {
+ ret = VM_FAULT_NOPAGE;
+ goto out;
+ }
+
+ /* ->map_pages() haven't done anything useful. Cold page cache? */
+ if (!fe->pte)
+ goto out;
+
+ /* check if the page fault is solved */
+ fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
+ if (!pte_none(*fe->pte))
+ ret = VM_FAULT_NOPAGE;
+ pte_unmap_unlock(fe->pte, fe->ptl);
out:
- /* restore fault_env */
- fe->pte = pte;
fe->address = address;
+ fe->pte = NULL;
+ return ret;
}
-static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
@@ -2936,33 +3021,25 @@ static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- do_fault_around(fe, pgoff);
- if (!pte_same(*fe->pte, orig_pte))
- goto unlock_out;
- pte_unmap_unlock(fe->pte, fe->ptl);
+ ret = do_fault_around(fe, pgoff);
+ if (ret)
+ return ret;
}
ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, NULL, fault_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
- unlock_page(fault_page);
- page_cache_release(fault_page);
- return ret;
- }
- do_set_pte(fe, fault_page);
unlock_page(fault_page);
-unlock_out:
- pte_unmap_unlock(fe->pte, fe->ptl);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ page_cache_release(fault_page);
return ret;
}
-static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
@@ -2990,26 +3067,9 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
copy_user_highpage(new_page, fault_page, fe->address, vma);
__SetPageUptodate(new_page);
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, memcg, new_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
- if (fault_page) {
- unlock_page(fault_page);
- page_cache_release(fault_page);
- } else {
- /*
- * The fault handler has no page to lock, so it holds
- * i_mmap_lock for read to protect against truncate.
- */
- i_mmap_unlock_read(vma->vm_file->f_mapping);
- }
- goto uncharge_out;
- }
- do_set_pte(fe, new_page);
- mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
- pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
page_cache_release(fault_page);
@@ -3020,6 +3080,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
*/
i_mmap_unlock_read(vma->vm_file->f_mapping);
}
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ goto uncharge_out;
return ret;
uncharge_out:
mem_cgroup_cancel_charge(new_page, memcg, false);
@@ -3027,7 +3089,7 @@ uncharge_out:
return ret;
}
-static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
@@ -3053,16 +3115,15 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
}
}
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, NULL, fault_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+ VM_FAULT_RETRY))) {
unlock_page(fault_page);
page_cache_release(fault_page);
return ret;
}
- do_set_pte(fe, fault_page);
- pte_unmap_unlock(fe->pte, fe->ptl);
if (set_page_dirty(fault_page))
dirtied = 1;
@@ -3094,20 +3155,19 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int do_fault(struct fault_env *fe, pte_t orig_pte)
+static int do_fault(struct fault_env *fe)
{
struct vm_area_struct *vma = fe->vma;
pgoff_t pgoff = linear_page_index(vma, fe->address);
- pte_unmap(fe->pte);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
if (!(fe->flags & FAULT_FLAG_WRITE))
- return do_read_fault(fe, pgoff, orig_pte);
+ return do_read_fault(fe, pgoff);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(fe, pgoff, orig_pte);
- return do_shared_fault(fe, pgoff, orig_pte);
+ return do_cow_fault(fe, pgoff);
+ return do_shared_fault(fe, pgoff);
}
static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3247,37 +3307,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with pte unmapped and unlocked.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
+ * concurrent faults).
*
- * The mmap_sem may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * The mmap_sem may have been released depending on flags and our return value.
+ * See filemap_fault() and __lock_page_or_retry().
*/
static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;
- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
- * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
- * The code below just needs a consistent view for the ifs and
- * we later double check anyway with the ptl lock held. So here
- * a barrier will do.
- */
- entry = *fe->pte;
- barrier();
- if (!pte_present(entry)) {
+ if (unlikely(pmd_none(*fe->pmd))) {
+ /*
+ * Leave __pte_alloc() until later: because vm_ops->fault may
+ * want to allocate huge page, and if we expose page table
+ * for an instant, it will be difficult to retract from
+ * concurrent faults and from rmap lookups.
+ */
+ } else {
+ /* See comment in pte_alloc_one_map() */
+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+ return 0;
+ /*
+ * A regular pmd is established and it can't morph into a huge
+ * pmd from under us anymore at this point because we hold the
+ * mmap_sem read mode and khugepaged takes it in write mode.
+ * So now it's safe to run pte_offset_map().
+ */
+ fe->pte = pte_offset_map(fe->pmd, fe->address);
+
+ entry = *fe->pte;
+
+ /*
+ * some architectures can have larger ptes than wordsize,
+ * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
+ * CONFIG_32BIT=y, so READ_ONCE or ACCESS_ONCE cannot guarantee
+ * atomic accesses. The code below just needs a consistent
+ * view for the ifs and we later double check anyway with the
+ * ptl lock held. So here a barrier will do.
+ */
+ barrier();
if (pte_none(entry)) {
- if (vma_is_anonymous(fe->vma))
- return do_anonymous_page(fe);
- else
- return do_fault(fe, entry);
+ pte_unmap(fe->pte);
+ fe->pte = NULL;
}
- return do_swap_page(fe, entry);
}
+ if (!fe->pte) {
+ if (vma_is_anonymous(fe->vma))
+ return do_anonymous_page(fe);
+ else
+ return do_fault(fe);
+ }
+
+ if (!pte_present(entry))
+ return do_swap_page(fe, entry);
+
if (pte_protnone(entry))
return do_numa_page(fe, entry);
@@ -3359,34 +3444,6 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
}
- /*
- * Use pte_alloc() instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pte_alloc(fe.vma->vm_mm, fe.pmd, fe.address)))
- return VM_FAULT_OOM;
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
- * didn't become pmd_trans_huge under us and then back to pmd_none, as
- * a result of MADV_DONTNEED running immediately after a huge pmd fault
- * in a different thread of this mm, in turn leading to a misleading
- * pmd_trans_huge() retval. All we have to ensure is that it is a
- * regular pmd that we can walk with pte_offset_map() and we can do that
- * through an atomic read in C, which is what pmd_trans_unstable()
- * provides.
- */
- if (unlikely(pmd_trans_unstable(fe.pmd) || pmd_devmap(*fe.pmd)))
- return 0;
- /*
- * A regular pmd is established and it can't morph into a huge pmd
- * from under us anymore at this point because we hold the mmap_sem
- * read mode and khugepaged takes it in write mode. So now it's
- * safe to run pte_offset_map().
- */
- fe.pte = pte_offset_map(fe.pmd, fe.address);
-
return handle_pte_fault(&fe);
}
--
2.7.0
We need to mirror the logic in move_ptes() wrt need_rmap_locks to get
proper serialization for file THP.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/mremap.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index 3fa0a467df66..88fa7ab1a8ce 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -192,17 +192,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
break;
if (pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE) {
+ struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = NULL;
bool moved;
- VM_BUG_ON_VMA(vma->vm_file || !vma->anon_vma,
- vma);
/* See comment in move_ptes() */
- if (need_rmap_locks)
- anon_vma_lock_write(vma->anon_vma);
+ if (need_rmap_locks) {
+ if (vma->vm_file) {
+ mapping = vma->vm_file->f_mapping;
+ i_mmap_lock_write(mapping);
+ }
+ if (vma->anon_vma) {
+ anon_vma = vma->anon_vma;
+ anon_vma_lock_write(anon_vma);
+ }
+ }
moved = move_huge_pmd(vma, new_vma, old_addr,
new_addr, old_end,
old_pmd, new_pmd);
- if (need_rmap_locks)
- anon_vma_unlock_write(vma->anon_vma);
+ if (anon_vma)
+ anon_vma_unlock_write(anon_vma);
+ if (mapping)
+ i_mmap_unlock_write(mapping);
if (moved) {
need_flush = true;
continue;
--
2.7.0
change_huge_pmd() has an assert which is not relevant for file pages.
For a shared mapping it's perfectly fine to have the page table entry
writable, without an explicit mkwrite.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0ef06d47bd24..93f5b50490df 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1786,7 +1786,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
entry = pmd_mkwrite(entry);
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(!preserve_write && pmd_write(entry));
+ BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
+ pmd_write(entry));
}
spin_unlock(ptl);
}
--
2.7.0
vma_adjust_trans_huge() splits the pmd if it crosses a VMA boundary.
During the split we munlock the huge page, which requires an rmap walk.
rmap wants to take the lock on its own.
Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/mmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 86c069e7d86a..94979671b42c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -677,6 +677,8 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ vma_adjust_trans_huge(vma, start, end, adjust_next);
+
if (file) {
mapping = file->f_mapping;
root = &mapping->i_mmap;
@@ -697,8 +699,6 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
- vma_adjust_trans_huge(vma, start, end, adjust_next);
-
anon_vma = vma->anon_vma;
if (!anon_vma && adjust_next)
anon_vma = next->anon_vma;
--
2.7.0
Let's add a FileHugeMapped field to meminfo. It indicates how much file
THP is currently mapped.
NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/base/node.c | 10 ++++++----
fs/proc/meminfo.c | 5 +++--
include/linux/mmzone.h | 3 ++-
mm/huge_memory.c | 2 +-
mm/rmap.c | 12 ++++++------
mm/vmstat.c | 1 +
6 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..9cc4e9dad47e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -113,6 +113,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d FileHugeMapped: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -131,10 +132,11 @@ static ssize_t node_read_meminfo(struct device *dev,
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
- , nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR));
+ nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+ nid, K(node_page_state(nid, NR_ANON_THPS) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_FILE_THP_MAPPED) *
+ HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
#endif
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..50666e987fbd 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -105,6 +105,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "FileHugeMapped: %8lu kB\n"
#endif
#ifdef CONFIG_CMA
"CmaTotal: %8lu kB\n"
@@ -162,8 +163,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- , K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR)
+ , K(global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_FILE_THP_MAPPED) * HPAGE_PMD_NR)
#endif
#ifdef CONFIG_CMA
, K(totalcma_pages)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c60df9257cc7..85fd4aac53a1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,7 +158,8 @@ enum zone_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
- NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_ANON_THPS,
+ NR_FILE_THP_MAPPED,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2e9e6f4afe40..44468fb7cdbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2982,7 +2982,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/* No need in mapcount reference anymore */
for (i = 0; i < HPAGE_PMD_NR; i++)
diff --git a/mm/rmap.c b/mm/rmap.c
index 66808955a5e0..359ec5cff9b0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1227,10 +1227,8 @@ void do_page_add_anon_rmap(struct page *page,
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (compound) {
- __inc_zone_page_state(page,
- NR_ANON_TRANSPARENT_HUGEPAGES);
- }
+ if (compound)
+ __inc_zone_page_state(page, NR_ANON_THPS);
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
}
if (unlikely(PageKsm(page)))
@@ -1268,7 +1266,7 @@ void page_add_new_anon_rmap(struct page *page,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
/* increment count (starts at -1) */
atomic_set(compound_mapcount_ptr(page), 0);
- __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __inc_zone_page_state(page, NR_ANON_THPS);
} else {
/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1298,6 +1296,7 @@ void page_add_file_rmap(struct page *page, bool compound)
}
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
+ __inc_zone_page_state(page, NR_FILE_THP_MAPPED);
} else {
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
@@ -1330,6 +1329,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
goto out;
+ __dec_zone_page_state(page, NR_FILE_THP_MAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))
goto out;
@@ -1363,7 +1363,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return;
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 74f8c918ac4b..943b37f17007 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -762,6 +762,7 @@ const char * const vmstat_text[] = {
"workingset_activate",
"workingset_nodereclaim",
"nr_anon_transparent_hugepages",
+ "nr_file_transparent_hugepages_mapped",
"nr_free_cma",
/* enum writeback_stat_item counters */
--
2.7.0
Let's wire up the existing madvise() hugepage hints for file mappings.
MADV_HUGEPAGE advises shmem to allocate a huge page on page fault in the
VMA. It only has an effect if the filesystem is mounted with huge=advise
or huge=within_size.
MADV_NOHUGEPAGE prevents a huge page from being allocated on page fault in
the VMA. It doesn't prevent a huge page from being allocated by other
means, e.g. a page fault into a different mapping or write(2) into the
file. A short usage sketch follows.
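A minimal userspace sketch (the file path is hypothetical and is assumed
to live on a tmpfs mounted with huge=advise; 2MB alignment of the mapping
relies on the get_unmapped_area change in this series):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4UL << 20;	/* 4MB */
	int fd = open("/mnt/hugetmpfs/file", O_RDWR | O_CREAT, 0600);
	void *p;

	if (fd < 0 || ftruncate(fd, len))
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	/* Ask for huge pages in this VMA before faulting it in. */
	madvise(p, len, MADV_HUGEPAGE);
	((char *)p)[0] = 1;	/* the first fault may now map a PMD */
	munmap(p, len);
	close(fd);
	return 0;
}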
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 13 ++-----------
mm/shmem.c | 20 +++++++++++++++++---
2 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a9d642ff7aa2..8d5b552b7963 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1844,11 +1844,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
if (mm_has_pgste(vma->vm_mm))
return 0;
#endif
- /*
- * Be somewhat over-protective like KSM for now!
- */
- if (*vm_flags & VM_NO_THP)
- return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
@@ -1856,15 +1851,11 @@ int hugepage_madvise(struct vm_area_struct *vma,
* register it here without waiting a page fault that
* may not happen any time soon.
*/
- if (unlikely(khugepaged_enter_vma_merge(vma, *vm_flags)))
+ if ((*vm_flags & VM_NO_THP) &&
+ khugepaged_enter_vma_merge(vma, *vm_flags))
return -ENOMEM;
break;
case MADV_NOHUGEPAGE:
- /*
- * Be somewhat over-protective like KSM for now!
- */
- if (*vm_flags & VM_NO_THP)
- return -EINVAL;
*vm_flags &= ~VM_HUGEPAGE;
*vm_flags |= VM_NOHUGEPAGE;
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index c31216806721..ee8a15c24123 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -101,6 +101,8 @@ struct shmem_falloc {
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */
+ SGP_HUGE, /* like SGP_CACHE, huge pages preferred */
SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
@@ -1371,6 +1373,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ enum sgp_type sgp_huge = sgp;
pgoff_t hindex = index;
int error;
int once = 0;
@@ -1378,6 +1381,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
return -EFBIG;
+ if (sgp == SGP_NOHUGE || sgp == SGP_HUGE)
+ sgp = SGP_CACHE;
repeat:
swap.val = 0;
page = find_lock_entry(mapping, index);
@@ -1491,7 +1496,7 @@ repeat:
/* shmem_symlink() */
if (mapping->a_ops != &shmem_aops)
goto alloc_nohuge;
- if (shmem_huge == SHMEM_HUGE_DENY)
+ if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
goto alloc_nohuge;
if (shmem_huge == SHMEM_HUGE_FORCE)
goto alloc_huge;
@@ -1507,7 +1512,9 @@ repeat:
goto alloc_huge;
/* fallthrough */
case SHMEM_HUGE_ADVISE:
- /* TODO: wire up fadvise()/madvise() */
+ if (sgp_huge == SGP_HUGE)
+ goto alloc_huge;
+ /* TODO: implement fadvise() hints */
goto alloc_nohuge;
}
@@ -1639,6 +1646,7 @@ unlock:
static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct inode *inode = file_inode(vma->vm_file);
+ enum sgp_type sgp;
int error;
int ret = VM_FAULT_LOCKED;
@@ -1700,7 +1708,13 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
spin_unlock(&inode->i_lock);
}
- error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
+ sgp = SGP_CACHE;
+ if (vma->vm_flags & VM_HUGEPAGE)
+ sgp = SGP_HUGE;
+ else if (vma->vm_flags & VM_NOHUGEPAGE)
+ sgp = SGP_NOHUGE;
+
+ error = shmem_getpage(inode, vmf->pgoff, &vmf->page, sgp, &ret);
if (error)
return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
--
2.7.0
Splitting a THP PMD is simple: just unmap it, as in the DAX case.
Unlike DAX, we also remove the page from rmap and drop the reference.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c22144e3fe11..51c54d5a945b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2930,10 +2930,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
count_vm_event(THP_SPLIT_PMD);
- if (vma_is_dax(vma)) {
- pmd_t _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ if (!vma_is_anonymous(vma)) {
+ _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
if (is_huge_zero_pmd(_pmd))
put_huge_zero_page();
+ if (vma_is_dax(vma))
+ return;
+ page = pmd_page(_pmd);
+ page_remove_rmap(page, true);
+ put_page(page);
+ add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
return;
} else if (is_huge_zero_pmd(*pmd)) {
return __split_huge_zero_page_pmd(vma, haddr, pmd);
--
2.7.0
With postponed page table allocation we have a chance to set up huge pages.
do_set_pte() calls do_set_pmd() if the following criteria are met (see the
condensed sketch after this list):
- the page is compound;
- the PMD entry is pmd_none();
- the VMA has suitable size and alignment.
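The sketch, simplified from the mm/memory.c hunk below (locking and error
handling omitted):

	/* in the fault path: try to map the whole compound page with a PMD */
	if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
		ret = do_set_pmd(fe, page);	/* checks VMA size/alignment */
		if (ret != VM_FAULT_FALLBACK)
			return ret;		/* mapped with a PMD, done */
	}
	/* otherwise fall back to mapping a single PTE */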
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 8 ------
mm/memory.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++-
mm/migrate.c | 3 +--
4 files changed, 74 insertions(+), 11 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 24918897f073..193fccdc275d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -147,6 +147,8 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
struct page *get_huge_zero_page(void);
+#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b111d5c0312..2e9e6f4afe40 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -780,14 +780,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}
-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
- pmd_t entry;
- entry = mk_pmd(page, prot);
- entry = pmd_mkhuge(entry);
- return entry;
-}
-
static inline struct list_head *page_deferred_list(struct page *page)
{
/*
diff --git a/mm/memory.c b/mm/memory.c
index a6c1c4955560..0109db96fdff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2837,6 +2837,66 @@ map_pte:
return 0;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+ unsigned long haddr)
+{
+ if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+ return false;
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return false;
+ return true;
+}
+
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+ struct vm_area_struct *vma = fe->vma;
+ bool write = fe->flags & FAULT_FLAG_WRITE;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
+ pmd_t entry;
+ int i, ret;
+
+ if (!transhuge_vma_suitable(vma, haddr))
+ return VM_FAULT_FALLBACK;
+
+ ret = VM_FAULT_FALLBACK;
+ page = compound_head(page);
+
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd)))
+ goto out;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ flush_icache_page(vma, page + i);
+
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ if (write)
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+
+ add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR);
+ page_add_file_rmap(page, true);
+
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+
+ update_mmu_cache_pmd(vma, haddr, fe->pmd);
+
+ /* fault is handled */
+ ret = 0;
+out:
+ spin_unlock(fe->ptl);
+ return ret;
+}
+#else
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+ BUILD_BUG();
+ return 0;
+}
+#endif
+
/**
* alloc_set_pte - setup new PTE entry for given page and add reverse page
* mapping. If needed, the fucntion allocates page table or use pre-allocated.
@@ -2856,9 +2916,19 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
+ int ret;
+
+ if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
+ /* THP on COW? */
+ VM_BUG_ON_PAGE(memcg, page);
+
+ ret = do_set_pmd(fe, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
if (!fe->pte) {
- int ret = pte_alloc_one_map(fe);
+ ret = pte_alloc_one_map(fe);
if (ret)
return ret;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index d20276fffce7..5c9cd90334ea 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1820,8 +1820,7 @@ fail_putback:
}
orig_entry = *pmd;
- entry = mk_pmd(new_page, vma->vm_page_prot);
- entry = pmd_mkhuge(entry);
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
/*
--
2.7.0
copy_page_range() has a check for "Don't copy ptes where a page fault
will fill them correctly." It works at the VMA level: we still copy all
page table entries from private mappings, even if they map page cache.
We can simplify copy_huge_pmd() a bit by skipping file PMDs.
We don't map file-private pages with PMDs, so file PMDs can only map page
cache. It's safe to skip them on fork(): they can be re-faulted later.
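The whole simplification boils down to an early return for non-anonymous
VMAs at the top of copy_huge_pmd() (sketch of the hunk below):

	/* file PMDs are not copied on fork(): re-fault them from page cache */
	if (!vma_is_anonymous(vma))
		return 0;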
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 34 ++++++++++++++++------------------
1 file changed, 16 insertions(+), 18 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51c54d5a945b..0ef06d47bd24 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1082,14 +1082,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct page *src_page;
pmd_t pmd;
pgtable_t pgtable = NULL;
- int ret;
+ int ret = -ENOMEM;
- if (!vma_is_dax(vma)) {
- ret = -ENOMEM;
- pgtable = pte_alloc_one(dst_mm, addr);
- if (unlikely(!pgtable))
- goto out;
- }
+ /* Skip if it can be re-filled on fault */
+ if (!vma_is_anonymous(vma))
+ return 0;
+
+ pgtable = pte_alloc_one(dst_mm, addr);
+ if (unlikely(!pgtable))
+ goto out;
dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1097,7 +1098,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
- if (unlikely(!pmd_trans_huge(pmd) && !pmd_devmap(pmd))) {
+ if (unlikely(!pmd_trans_huge(pmd))) {
pte_free(dst_mm, pgtable);
goto out_unlock;
}
@@ -1120,16 +1121,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
- if (!vma_is_dax(vma)) {
- /* thp accounting separate from pmd_devmap accounting */
- src_page = pmd_page(pmd);
- VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
- get_page(src_page);
- page_dup_rmap(src_page, true);
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
- atomic_long_inc(&dst_mm->nr_ptes);
- pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- }
+ src_page = pmd_page(pmd);
+ VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ get_page(src_page);
+ page_dup_rmap(src_page, true);
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ atomic_long_inc(&dst_mm->nr_ptes);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
--
2.7.0
THP_FILE_ALLOC: how many times a huge page was allocated and put into the
page cache.
THP_FILE_MAPPED: how many times a file huge page was mapped into user
address space.
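Once this lands, the counters show up in /proc/vmstat next to the other THP
events; the values below are only illustrative:

	$ grep thp_file /proc/vmstat
	thp_file_alloc 128
	thp_file_mapped 96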
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 7 +++++++
mm/memory.c | 1 +
mm/vmstat.c | 2 ++
3 files changed, 10 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ec084321fe09..42604173f122 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -70,6 +70,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_FILE_ALLOC,
+ THP_FILE_MAPPED,
THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED,
THP_DEFERRED_SPLIT_PAGE,
@@ -100,4 +102,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NR_VM_EVENT_ITEMS
};
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_FILE_MAPPED ({ BUILD_BUG(); 0; })
+#endif
+
#endif /* VM_EVENT_ITEM_H_INCLUDED */
diff --git a/mm/memory.c b/mm/memory.c
index 0109db96fdff..f7a73b6cabac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2885,6 +2885,7 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)
/* fault is handled */
ret = 0;
+ count_vm_event(THP_FILE_MAPPED);
out:
spin_unlock(fe->ptl);
return ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 943b37f17007..78e94198b35f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -847,6 +847,8 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_file_alloc",
+ "thp_file_mapped",
"thp_split_page",
"thp_split_page_failed",
"thp_deferred_split_page",
--
2.7.0
Here's the basic implementation of huge pages support for shmem/tmpfs.
It's all pretty straightforward:
- shmem_getpage() allocates a huge page if it can, and tries to insert it
into the radix tree with shmem_add_to_page_cache();
- shmem_add_to_page_cache() puts the page onto the radix tree if there's
space for it;
- shmem_undo_range() removes huge pages if they are fully within the range.
A partial truncate of a huge page zeroes out that part of the THP.
This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
behaviour: as we don't really create a hole in this case,
lseek(SEEK_HOLE) may give inconsistent results depending on which
pages happened to be allocated;
- no need to change shmem_fault(): core mm will map a compound page as
huge if the VMA is suitable.
A condensed sketch of the allocation path follows.
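Roughly, simplified from the mm/shmem.c hunks below (accounting and error
handling omitted):

	/* try huge first (when the mount/policy allows it), else small */
	page = shmem_alloc_and_acct_page(gfp, info, sbinfo, index, true);
	if (IS_ERR(page))
		page = shmem_alloc_and_acct_page(gfp, info, sbinfo, index, false);
	if (IS_ERR(page))
		return PTR_ERR(page);

	/* a huge page is indexed by its head: round the index down */
	hindex = PageTransHuge(page) ? round_down(index, HPAGE_PMD_NR) : index;

	/* charge memcg and preload the radix tree for compound_order(page) */
	error = shmem_add_to_page_cache(page, mapping, hindex, NULL);

	/* the caller always gets the subpage that covers 'index' */
	*pagep = page + index - hindex;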
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 +
mm/memory.c | 5 +-
mm/mempolicy.c | 2 +-
mm/page-writeback.c | 1 +
mm/shmem.c | 355 +++++++++++++++++++++++++++++++++++++-----------
mm/swap.c | 2 +
6 files changed, 282 insertions(+), 85 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d40eab8c9d00..2696a61d1bdc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,6 +160,8 @@ struct page *get_huge_zero_page(void);
#define transparent_hugepage_enabled(__vma) 0
+static inline void prep_transhuge_page(struct page *page) {}
+
#define transparent_hugepage_flags 0UL
static inline int
split_huge_page_to_list(struct page *page, struct list_head *list)
diff --git a/mm/memory.c b/mm/memory.c
index c0a2510b9e15..939499dbdff5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1090,7 +1090,7 @@ again:
* unmap shared but keep private pages.
*/
if (details->check_mapping &&
- details->check_mapping != page->mapping)
+ details->check_mapping != page_rmapping(page))
continue;
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -1181,7 +1181,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
- VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
+ VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
+ !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
split_huge_pmd(vma, pmd, addr);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 76a0e6e56e05..5d1d9dd9a379 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -534,7 +534,7 @@ retry:
nid = page_to_nid(page);
if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
continue;
- if (PageTransCompound(page) && PageAnon(page)) {
+ if (PageTransCompound(page)) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..2c8d5386665d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2554,6 +2554,7 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);
+ page = compound_head(page);
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index e7c2a69ff410..c31216806721 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -173,10 +173,13 @@ static inline int shmem_reacct_size(unsigned long flags,
* shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
* so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
*/
-static inline int shmem_acct_block(unsigned long flags)
+static inline int shmem_acct_block(unsigned long flags, long pages)
{
- return (flags & VM_NORESERVE) ?
- security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0;
+ if (!(flags & VM_NORESERVE))
+ return 0;
+
+ return security_vm_enough_memory_mm(current->mm,
+ pages * VM_ACCT(PAGE_CACHE_SIZE));
}
static inline void shmem_unacct_blocks(unsigned long flags, long pages)
@@ -376,30 +379,55 @@ static int shmem_add_to_page_cache(struct page *page,
struct address_space *mapping,
pgoff_t index, void *expected)
{
- int error;
+ int error, nr = hpage_nr_pages(page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ VM_BUG_ON_PAGE(index != round_down(index, nr), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ VM_BUG_ON(expected && PageTransHuge(page));
- page_cache_get(page);
+ atomic_add(nr, &page->_count);
page->mapping = mapping;
page->index = index;
spin_lock_irq(&mapping->tree_lock);
- if (!expected)
+ if (PageTransHuge(page)) {
+ void __rcu **results;
+ pgoff_t idx;
+ int i;
+
+ error = 0;
+ if (radix_tree_gang_lookup_slot(&mapping->page_tree,
+ &results, &idx, index, 1) &&
+ idx < index + HPAGE_PMD_NR) {
+ error = -EEXIST;
+ }
+
+ if (!error) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ error = radix_tree_insert(&mapping->page_tree,
+ index + i, page + i);
+ VM_BUG_ON(error);
+ }
+ count_vm_event(THP_FILE_ALLOC);
+ }
+ } else if (!expected) {
error = radix_tree_insert(&mapping->page_tree, index, page);
- else
+ } else {
error = shmem_radix_tree_replace(mapping, index, expected,
page);
+ }
+
if (!error) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- __inc_zone_page_state(page, NR_SHMEM);
+ mapping->nrpages += nr;
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, nr);
spin_unlock_irq(&mapping->tree_lock);
} else {
page->mapping = NULL;
spin_unlock_irq(&mapping->tree_lock);
- page_cache_release(page);
+ atomic_sub(nr, &page->_count);
}
return error;
}
@@ -412,6 +440,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
struct address_space *mapping = page->mapping;
int error;
+ VM_BUG_ON_PAGE(PageCompound(page), page);
+
spin_lock_irq(&mapping->tree_lock);
error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
page->mapping = NULL;
@@ -596,10 +626,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
continue;
}
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page);
+
if (!trylock_page(page))
continue;
+
+ if (PageTransTail(page)) {
+ /* Middle of THP: zero out the page */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ } else if (PageTransHuge(page)) {
+ if (index == round_down(end, HPAGE_PMD_NR)) {
+ /*
+ * Range ends in the middle of THP:
+ * zero out the page
+ */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ }
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ }
+
if (!unfalloc || !PageUptodate(page)) {
- if (page->mapping == mapping) {
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (page_mapping(page) == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
truncate_inode_page(mapping, page);
}
@@ -675,8 +728,36 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
lock_page(page);
+
+ if (PageTransTail(page)) {
+ /* Middle of THP: zero out the page */
+ clear_highpage(page);
+ unlock_page(page);
+ /*
+ * Partial thp truncate due 'start' in middle
+ * of THP: don't need to look on these pages
+ * again on !pvec.nr restart.
+ */
+ if (index != round_down(end, HPAGE_PMD_NR))
+ start++;
+ continue;
+ } else if (PageTransHuge(page)) {
+ if (index == round_down(end, HPAGE_PMD_NR)) {
+ /*
+ * Range ends in the middle of THP:
+ * zero out the page
+ */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ }
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ }
+
if (!unfalloc || !PageUptodate(page)) {
- if (page->mapping == mapping) {
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (page_mapping(page) == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
truncate_inode_page(mapping, page);
} else {
@@ -935,6 +1016,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
swp_entry_t swap;
pgoff_t index;
+ VM_BUG_ON_PAGE(PageCompound(page), page);
BUG_ON(!PageLocked(page));
mapping = page->mapping;
index = page->index;
@@ -1034,8 +1116,8 @@ redirty:
return 0;
}
-#ifdef CONFIG_NUMA
#ifdef CONFIG_TMPFS
+#ifdef CONFIG_NUMA
static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
char buffer[64];
@@ -1059,68 +1141,129 @@ static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
}
return mpol;
}
+
+#else
+
+static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
+{
+}
+#endif /* CONFIG_NUMA */
#endif /* CONFIG_TMPFS */
+static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
+ struct shmem_inode_info *info, pgoff_t index)
+{
+ /* Create a pseudo vma that just contains the policy */
+ vma->vm_start = 0;
+ /* Bias interleave by inode number to distribute better across nodes */
+ vma->vm_pgoff = index + info->vfs_inode.i_ino;
+ vma->vm_ops = NULL;
+
+#ifdef CONFIG_NUMA
+ vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+#endif /* CONFIG_NUMA */
+}
+
+static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_NUMA
+ /* Drop reference taken by mpol_shared_policy_lookup() */
+ mpol_cond_put(vma->vm_policy);
+#endif
+}
+
static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
struct vm_area_struct pvma;
struct page *page;
- /* Create a pseudo vma that just contains the policy */
- pvma.vm_start = 0;
- /* Bias interleave by inode number to distribute better across nodes */
- pvma.vm_pgoff = index + info->vfs_inode.i_ino;
- pvma.vm_ops = NULL;
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
-
+ shmem_pseudo_vma_init(&pvma, info, index);
page = swapin_readahead(swap, gfp, &pvma, 0);
-
- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(pvma.vm_policy);
+ shmem_pseudo_vma_destroy(&pvma);
return page;
}
-static struct page *shmem_alloc_page(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+static struct page *shmem_alloc_hugepage(gfp_t gfp,
+ struct shmem_inode_info *info, pgoff_t index)
{
struct vm_area_struct pvma;
+ struct inode *inode = &info->vfs_inode;
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t idx, hindex = round_down(index, HPAGE_PMD_NR);
+ void __rcu **results;
struct page *page;
- /* Create a pseudo vma that just contains the policy */
- pvma.vm_start = 0;
- /* Bias interleave by inode number to distribute better across nodes */
- pvma.vm_pgoff = index + info->vfs_inode.i_ino;
- pvma.vm_ops = NULL;
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
-
- page = alloc_page_vma(gfp, &pvma, 0);
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return NULL;
- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(pvma.vm_policy);
+ rcu_read_lock();
+ if (radix_tree_gang_lookup_slot(&mapping->page_tree, &results, &idx,
+ hindex, 1) && idx < hindex + HPAGE_PMD_NR) {
+ rcu_read_unlock();
+ return NULL;
+ }
+ rcu_read_unlock();
+ shmem_pseudo_vma_init(&pvma, info, hindex);
+ page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
+ HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+ shmem_pseudo_vma_destroy(&pvma);
+ if (page)
+ prep_transhuge_page(page);
return page;
}
-#else /* !CONFIG_NUMA */
-#ifdef CONFIG_TMPFS
-static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
-{
-}
-#endif /* CONFIG_TMPFS */
-static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
+static struct page *shmem_alloc_page(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
- return swapin_readahead(swap, gfp, NULL, 0);
+ struct vm_area_struct pvma;
+ struct page *page;
+
+ shmem_pseudo_vma_init(&pvma, info, index);
+ page = alloc_page_vma(gfp, &pvma, 0);
+ shmem_pseudo_vma_destroy(&pvma);
+
+ return page;
}
-static inline struct page *shmem_alloc_page(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
+ struct shmem_inode_info *info, struct shmem_sb_info *sbinfo,
+ pgoff_t index, bool huge)
{
- return alloc_page(gfp);
+ struct page *page;
+ int nr;
+ int err = -ENOSPC;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ huge = false;
+ nr = huge ? HPAGE_PMD_NR : 1;
+
+ if (shmem_acct_block(info->flags, nr))
+ goto failed;
+ if (sbinfo->max_blocks) {
+ if (percpu_counter_compare(&sbinfo->used_blocks,
+ sbinfo->max_blocks + nr) > 0)
+ goto unacct;
+ percpu_counter_add(&sbinfo->used_blocks, nr);
+ }
+
+ if (huge)
+ page = shmem_alloc_hugepage(gfp, info, index);
+ else
+ page = shmem_alloc_page(gfp, info, index);
+ if (page)
+ return page;
+
+ err = -ENOMEM;
+ if (sbinfo->max_blocks)
+ percpu_counter_add(&sbinfo->used_blocks, -nr);
+unacct:
+ shmem_unacct_blocks(info->flags, nr);
+failed:
+ return ERR_PTR(err);
}
-#endif /* CONFIG_NUMA */
#if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS)
static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
@@ -1228,6 +1371,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ pgoff_t hindex = index;
int error;
int once = 0;
int alloced = 0;
@@ -1344,50 +1488,75 @@ repeat:
swap_free(swap);
} else {
- if (shmem_acct_block(info->flags)) {
- error = -ENOSPC;
- goto failed;
- }
- if (sbinfo->max_blocks) {
- if (percpu_counter_compare(&sbinfo->used_blocks,
- sbinfo->max_blocks) >= 0) {
- error = -ENOSPC;
- goto unacct;
- }
- percpu_counter_inc(&sbinfo->used_blocks);
+ /* shmem_symlink() */
+ if (mapping->a_ops != &shmem_aops)
+ goto alloc_nohuge;
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ goto alloc_nohuge;
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ goto alloc_huge;
+ switch (sbinfo->huge) {
+ loff_t i_size;
+ pgoff_t off;
+ case SHMEM_HUGE_NEVER:
+ goto alloc_nohuge;
+ case SHMEM_HUGE_WITHIN_SIZE:
+ off = round_up(index, HPAGE_PMD_NR);
+ i_size = round_up(i_size_read(inode), PAGE_CACHE_SIZE);
+ if (i_size >> PAGE_CACHE_SHIFT >= off)
+ goto alloc_huge;
+ /* fallthrough */
+ case SHMEM_HUGE_ADVISE:
+ /* TODO: wire up fadvise()/madvise() */
+ goto alloc_nohuge;
}
- page = shmem_alloc_page(gfp, info, index);
- if (!page) {
- error = -ENOMEM;
- goto decused;
+alloc_huge:
+ page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+ index, true);
+ if (IS_ERR(page)) {
+alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+ index, false);
+ }
+ if (IS_ERR(page)) {
+ error = PTR_ERR(page);
+ page = NULL;
+ goto failed;
}
+ if (PageTransHuge(page))
+ hindex = round_down(index, HPAGE_PMD_NR);
+ else
+ hindex = index;
+
__SetPageSwapBacked(page);
__SetPageLocked(page);
if (sgp == SGP_WRITE)
__SetPageReferenced(page);
error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
- false);
+ PageTransHuge(page));
if (error)
- goto decused;
- error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
+ goto unacct;
+ error = radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
+ compound_order(page));
if (!error) {
- error = shmem_add_to_page_cache(page, mapping, index,
+ error = shmem_add_to_page_cache(page, mapping, hindex,
NULL);
radix_tree_preload_end();
}
if (error) {
- mem_cgroup_cancel_charge(page, memcg, false);
- goto decused;
+ mem_cgroup_cancel_charge(page, memcg,
+ PageTransHuge(page));
+ goto unacct;
}
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false,
+ PageTransHuge(page));
lru_cache_add_anon(page);
spin_lock(&info->lock);
- info->alloced++;
- inode->i_blocks += BLOCKS_PER_PAGE;
+ info->alloced += 1 << compound_order(page);
+ inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
shmem_recalc_inode(inode);
spin_unlock(&info->lock);
alloced = true;
@@ -1403,10 +1572,15 @@ clear:
* but SGP_FALLOC on a page fallocated earlier must initialize
* it now, lest undo on failure cancel our earlier guarantee.
*/
- if (sgp != SGP_WRITE) {
- clear_highpage(page);
- flush_dcache_page(page);
- SetPageUptodate(page);
+ if (sgp != SGP_WRITE && !PageUptodate(page)) {
+ struct page *head = compound_head(page);
+ int i;
+
+ for (i = 0; i < (1 << compound_order(head)); i++) {
+ clear_highpage(head + i);
+ flush_dcache_page(head + i);
+ }
+ SetPageUptodate(head);
}
if (sgp == SGP_DIRTY)
set_page_dirty(page);
@@ -1425,17 +1599,23 @@ clear:
error = -EINVAL;
goto unlock;
}
- *pagep = page;
+ *pagep = page + index - hindex;
return 0;
/*
* Error recovery.
*/
-decused:
- if (sbinfo->max_blocks)
- percpu_counter_add(&sbinfo->used_blocks, -1);
unacct:
- shmem_unacct_blocks(info->flags, 1);
+ if (sbinfo->max_blocks)
+ percpu_counter_add(&sbinfo->used_blocks,
+ 1 << compound_order(page));
+ shmem_unacct_blocks(info->flags, 1 << compound_order(page));
+
+ if (PageTransHuge(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto alloc_nohuge;
+ }
failed:
if (swap.val && !shmem_confirm_swap(mapping, index, swap))
error = -EEXIST;
@@ -1776,12 +1956,23 @@ shmem_write_end(struct file *file, struct address_space *mapping,
i_size_write(inode, pos + copied);
if (!PageUptodate(page)) {
+ struct page *head = compound_head(page);
+ if (PageTransCompound(page)) {
+ int i;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ if (head + i == page)
+ continue;
+ clear_highpage(head + i);
+ flush_dcache_page(head + i);
+ }
+ }
if (copied < PAGE_CACHE_SIZE) {
unsigned from = pos & (PAGE_CACHE_SIZE - 1);
zero_user_segments(page, 0, from,
from + copied, PAGE_CACHE_SIZE);
}
- SetPageUptodate(page);
+ SetPageUptodate(head);
}
set_page_dirty(page);
unlock_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e97714a..5ee5118f45d4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -291,6 +291,7 @@ static bool need_activate_page_drain(int cpu)
void activate_page(struct page *page)
{
+ page = compound_head(page);
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
@@ -315,6 +316,7 @@ void activate_page(struct page *page)
{
struct zone *zone = page_zone(page);
+ page = compound_head(page);
spin_lock_irq(&zone->lru_lock);
__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
spin_unlock_irq(&zone->lru_lock);
--
2.7.0
These page flags are in use for file THP: allow them on head pages of
compound pages (PF_NO_TAIL) instead of refusing all compound pages
(PF_NO_COMPOUND).
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/page-flags.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 517707ae8cd1..9d518876dd6a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -292,11 +292,11 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
*/
TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
-PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_COMPOUND)
+PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
/* PG_readahead is only used for reads; PG_reclaim is only for writes */
-PAGEFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
- TESTCLEARFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
+ TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
--
2.7.0
This prepares vmscan for file huge pages. We cannot write out huge pages,
so we need to split them on the way out.
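The change boils down to splitting a file THP right before pageout(), and
keeping the page if the split fails (sketch of the hunk below):

	if (!is_page_cache_freeable(page))
		goto keep_locked;
	if (unlikely(PageTransHuge(page)) &&
	    split_huge_page_to_list(page, page_list))
		goto keep_locked;	/* could not split: keep the huge page */
	/* every page that reaches pageout() is now small */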
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/vmscan.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c72032dbe8db..9fa9e15594e9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -473,12 +473,14 @@ void drop_slab(void)
static inline int is_page_cache_freeable(struct page *page)
{
+ int radix_tree_pins = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
+
/*
* A freeable page cache page is referenced only by the caller
* that isolated the page, the page cache radix tree and
* optional buffer heads at page->private.
*/
- return page_count(page) - page_has_private(page) == 2;
+ return page_count(page) - page_has_private(page) == 1 + radix_tree_pins;
}
static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
@@ -548,8 +550,6 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* swap_backing_dev_info is bust: it doesn't reflect the
* congestion state of the swapdevs. Easy to fix, if needed.
*/
- if (!is_page_cache_freeable(page))
- return PAGE_KEEP;
if (!mapping) {
/*
* Some data journaling orphaned pages can have
@@ -1112,6 +1112,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* starts and then write it out here.
*/
try_to_unmap_flush_dirty();
+
+ if (!is_page_cache_freeable(page))
+ goto keep_locked;
+
+ if (unlikely(PageTransHuge(page))) {
+ if (split_huge_page_to_list(page, page_list))
+ goto keep_locked;
+ }
+
switch (pageout(page, mapping, sc)) {
case PAGE_KEEP:
goto keep_locked;
--
2.7.0
For shmem/tmpfs we only need to tweak truncate_inode_page() and
invalidate_mapping_pages().
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/truncate.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c
index 7598b552ae03..40d3730a8e62 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -157,10 +157,14 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
int truncate_inode_page(struct address_space *mapping, struct page *page)
{
+ loff_t holelen;
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ holelen = PageTransHuge(page) ? HPAGE_PMD_SIZE : PAGE_CACHE_SIZE;
if (page_mapped(page)) {
unmap_mapping_range(mapping,
(loff_t)page->index << PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
+ holelen, 0);
}
return truncate_complete_page(mapping, page);
}
@@ -489,7 +493,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (!trylock_page(page))
continue;
- WARN_ON(page->index != index);
+
+ WARN_ON(page_to_pgoff(page) != index);
+
+ /* Middle of THP: skip */
+ if (PageTransTail(page)) {
+ unlock_page(page);
+ continue;
+ } else if (PageTransHuge(page)) {
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ /* 'end' is in the middle of THP */
+ if (index == round_down(end, HPAGE_PMD_NR))
+ continue;
+ }
+
ret = invalidate_inode_page(page);
unlock_page(page);
/*
--
2.7.0
This patch adds a new mount option, "huge=". It can take the following values:
- "always":
Attempt to allocate huge pages every time we need a new page;
- "never":
Do not allocate huge pages;
- "within_size":
Only allocate huge page if it will be fully within i_size.
Also respect fadvise()/madvise() hints;
- "advise:
Only allocate huge pages if requested with fadvise()/madvise();
Default is "never" for now.
"mount -o remount,huge= /mountpoint" works fine after mount: remounting
huge=never will not attempt to break up huge pages at all, just stop
more from being allocated.
No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
which is the appropriate option to protect those who don't want
the new bloat, and with which we shall share some pmd code.
Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
explicit).
Allow enabling THP only if the machine has_transparent_hugepage().
But what about Shmem with no user-visible mount? SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
DRM objects, Ashmem. Though unlikely to suit all usages, provide
sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled to
experiment with huge on those.
And allow shmem_enabled two further values (usage examples follow the list):
- "deny":
For use in emergencies, to force the huge option off from
all mounts;
- "force":
Force the huge option on for all - very useful for testing;
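For example (the mount point and size here are illustrative), typical usage
of the mount option and the sysfs knob would look like:

	# mount -t tmpfs -o huge=within_size,size=2G tmpfs /mnt/huge
	# mount -o remount,huge=never /mnt/huge
	# echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled
	# cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
	always within_size advise never deny [force]

The last line follows the format of shmem_enabled_show() below: all values
listed, with the current one in brackets.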
Based on patch by Hugh Dickins.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 +
include/linux/shmem_fs.h | 3 +-
mm/huge_memory.c | 3 +
mm/shmem.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 159 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 193fccdc275d..d40eab8c9d00 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -43,6 +43,8 @@ enum transparent_hugepage_flag {
#endif
};
+extern struct kobj_attribute shmem_enabled_attr;
+
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index a43f41cb3c43..03490d1554ba 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -31,9 +31,10 @@ struct shmem_sb_info {
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
+ umode_t mode; /* Mount mode for root directory */
+ unsigned char huge; /* Whether to try for hugepages */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
- umode_t mode; /* Mount mode for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a2680b2112e4..a9d642ff7aa2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -441,6 +441,9 @@ static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
&use_zero_page_attr.attr,
+#ifdef CONFIG_SHMEM
+ &shmem_enabled_attr.attr,
+#endif
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
#endif
diff --git a/mm/shmem.c b/mm/shmem.c
index 09d7201dba3c..41faccc3347d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -289,6 +289,87 @@ static bool shmem_confirm_swap(struct address_space *mapping,
}
/*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
+ *
+ * SHMEM_HUGE_NEVER:
+ * disables huge pages for the mount;
+ * SHMEM_HUGE_ALWAYS:
+ * enables huge pages for the mount;
+ * SHMEM_HUGE_WITHIN_SIZE:
+ * only allocate huge pages if the page will be fully within i_size,
+ * also respect fadvise()/madvise() hints;
+ * SHMEM_HUGE_ADVISE:
+ * only allocate huge pages if requested with fadvise()/madvise();
+ */
+
+#define SHMEM_HUGE_NEVER 0
+#define SHMEM_HUGE_ALWAYS 1
+#define SHMEM_HUGE_WITHIN_SIZE 2
+#define SHMEM_HUGE_ADVISE 3
+
+/*
+ * Special values.
+ * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
+ *
+ * SHMEM_HUGE_DENY:
+ * disables huge on shm_mnt and all mounts, for emergency use;
+ * SHMEM_HUGE_FORCE:
+ * enables huge on shm_mnt and all mounts, w/o needing option, for testing;
+ *
+ */
+#define SHMEM_HUGE_DENY (-1)
+#define SHMEM_HUGE_FORCE (-2)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* ifdef here to avoid bloating shmem.o when not necessary */
+
+int shmem_huge __read_mostly;
+
+static int shmem_parse_huge(const char *str)
+{
+ if (!strcmp(str, "never"))
+ return SHMEM_HUGE_NEVER;
+ if (!strcmp(str, "always"))
+ return SHMEM_HUGE_ALWAYS;
+ if (!strcmp(str, "within_size"))
+ return SHMEM_HUGE_WITHIN_SIZE;
+ if (!strcmp(str, "advise"))
+ return SHMEM_HUGE_ADVISE;
+ if (!strcmp(str, "deny"))
+ return SHMEM_HUGE_DENY;
+ if (!strcmp(str, "force"))
+ return SHMEM_HUGE_FORCE;
+ return -EINVAL;
+}
+
+static const char *shmem_format_huge(int huge)
+{
+ switch (huge) {
+ case SHMEM_HUGE_NEVER:
+ return "never";
+ case SHMEM_HUGE_ALWAYS:
+ return "always";
+ case SHMEM_HUGE_WITHIN_SIZE:
+ return "within_size";
+ case SHMEM_HUGE_ADVISE:
+ return "advise";
+ case SHMEM_HUGE_DENY:
+ return "deny";
+ case SHMEM_HUGE_FORCE:
+ return "force";
+ default:
+ VM_BUG_ON(1);
+ return "bad_val";
+ }
+}
+
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+
+#define shmem_huge SHMEM_HUGE_DENY
+
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+/*
* Like add_to_page_cache_locked, but error if expected item has gone.
*/
static int shmem_add_to_page_cache(struct page *page,
@@ -2914,11 +2995,24 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
sbinfo->gid = make_kgid(current_user_ns(), gid);
if (!gid_valid(sbinfo->gid))
goto bad_val;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ } else if (!strcmp(this_char, "huge")) {
+ int huge;
+ huge = shmem_parse_huge(value);
+ if (huge < 0)
+ goto bad_val;
+ if (!has_transparent_hugepage() &&
+ huge != SHMEM_HUGE_NEVER)
+ goto bad_val;
+ sbinfo->huge = huge;
+#endif
+#ifdef CONFIG_NUMA
} else if (!strcmp(this_char,"mpol")) {
mpol_put(mpol);
mpol = NULL;
if (mpol_parse_str(value, &mpol))
goto bad_val;
+#endif
} else {
pr_err("tmpfs: Bad mount option %s\n", this_char);
goto error;
@@ -2964,6 +3058,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
goto out;
error = 0;
+ sbinfo->huge = config.huge;
sbinfo->max_blocks = config.max_blocks;
sbinfo->max_inodes = config.max_inodes;
sbinfo->free_inodes = config.max_inodes - inodes;
@@ -2997,6 +3092,11 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u",
from_kgid_munged(&init_user_ns, sbinfo->gid));
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
+ if (sbinfo->huge)
+ seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
+#endif
shmem_show_mpol(seq, sbinfo->mpol);
return 0;
}
@@ -3345,6 +3445,58 @@ out3:
return error;
}
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS)
+static ssize_t shmem_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int values[] = {
+ SHMEM_HUGE_ALWAYS,
+ SHMEM_HUGE_WITHIN_SIZE,
+ SHMEM_HUGE_ADVISE,
+ SHMEM_HUGE_NEVER,
+ SHMEM_HUGE_DENY,
+ SHMEM_HUGE_FORCE,
+ };
+ int i, count;
+
+ for (i = 0, count = 0; i < ARRAY_SIZE(values); i++) {
+ const char *fmt = shmem_huge == values[i] ? "[%s] " : "%s ";
+
+ count += sprintf(buf + count, fmt,
+ shmem_format_huge(values[i]));
+ }
+ buf[count - 1] = '\n';
+ return count;
+}
+
+static ssize_t shmem_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ char tmp[16];
+ int huge;
+
+ if (count + 1 > sizeof(tmp))
+ return -EINVAL;
+ memcpy(tmp, buf, count);
+ tmp[count] = '\0';
+ if (count && tmp[count - 1] == '\n')
+ tmp[count - 1] = '\0';
+
+ huge = shmem_parse_huge(tmp);
+ if (huge == -EINVAL)
+ return -EINVAL;
+ if (!has_transparent_hugepage() &&
+ huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
+ return -EINVAL;
+
+ shmem_huge = huge;
+ return count;
+}
+
+struct kobj_attribute shmem_enabled_attr =
+ __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
+
#else /* !CONFIG_SHMEM */
/*
--
2.7.0
The basic scheme is the same as for anon THP.
Main differences:
- File pages are in the radix tree, so head->_count is offset by
HPAGE_PMD_NR. The count gets distributed to the small pages during split.
- mapping->tree_lock prevents non-lockless access over the radix tree to
pages under split;
- Lockless access is prevented by setting head->_count to 0 during
split;
- After the split, some pages can be beyond i_size. We drop them from the
radix tree.
- We don't set up migration entries, just unmap the pages. This helps
handle the case when i_size is in the middle of a page: no need to
handle unmapping of pages beyond i_size manually.
A note on the refcounting follows.
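As a worked example of the pin accounting in the hunk below (assuming no
extra pins from GUP or the like): a file THP that is ready to be split is
held by the caller (which has it locked), by HPAGE_PMD_NR radix-tree slots,
and by its remaining mappings, so

	page_count(head) == 1 + HPAGE_PMD_NR + total_mapcount(head)

which is exactly the racy check below with extra_pins == HPAGE_PMD_NR, and
why the split proceeds only once page_ref_freeze(head, 1 + extra_pins)
succeeds after the page has been fully unmapped.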
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/gup.c | 2 +
mm/huge_memory.c | 162 +++++++++++++++++++++++++++++++++++++++----------------
mm/mempolicy.c | 2 +
3 files changed, 121 insertions(+), 45 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 60f422a0af8b..76148816c0cd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -285,6 +285,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
ret = split_huge_page(page);
unlock_page(page);
put_page(page);
+ if (pmd_none(*pmd))
+ return no_page_table(vma, flags);
}
return ret ? ERR_PTR(ret) :
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 93f5b50490df..ba5fdf654f27 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -29,6 +29,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/page_idle.h>
#include <linux/swapops.h>
+#include <linux/shmem_fs.h>
#include <linux/debugfs.h>
#include <asm/tlb.h>
@@ -3132,12 +3133,15 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
static void freeze_page(struct page *page)
{
- enum ttu_flags ttu_flags = TTU_MIGRATION | TTU_IGNORE_MLOCK |
- TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED;
+ enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
+ TTU_RMAP_LOCKED;
int i, ret;
VM_BUG_ON_PAGE(!PageHead(page), page);
+ if (PageAnon(page))
+ ttu_flags |= TTU_MIGRATION;
+
/* We only need TTU_SPLIT_HUGE_PMD once */
ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
for (i = 1; !ret && i < HPAGE_PMD_NR; i++) {
@@ -3147,7 +3151,7 @@ static void freeze_page(struct page *page)
ret = try_to_unmap(page + i, ttu_flags);
}
- VM_BUG_ON(ret);
+ VM_BUG_ON_PAGE(ret, page + i - 1);
}
static void unfreeze_page(struct page *page)
@@ -3169,15 +3173,20 @@ static void __split_huge_page_tail(struct page *head, int tail,
/*
* tail_page->_count is zero and not changing from under us. But
* get_page_unless_zero() may be running from under us on the
- * tail_page. If we used atomic_set() below instead of atomic_inc(), we
- * would then run atomic_set() concurrently with
+ * tail_page. If we used atomic_set() below instead of atomic_inc() or
+ * atomic_add(), we would then run atomic_set() concurrently with
* get_page_unless_zero(), and atomic_set() is implemented in C not
* using locked ops. spin_unlock on x86 sometime uses locked ops
* because of PPro errata 66, 92, so unless somebody can guarantee
* atomic_set() here would be safe on all archs (and not only on x86),
- * it's safer to use atomic_inc().
+ * it's safer to use atomic_inc()/atomic_add().
*/
- page_ref_inc(page_tail);
+ if (PageAnon(head)) {
+ page_ref_inc(page_tail);
+ } else {
+ /* Additional pin to radix tree */
+ page_ref_add(page_tail, 2);
+ }
page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
page_tail->flags |= (head->flags &
@@ -3213,25 +3222,46 @@ static void __split_huge_page_tail(struct page *head, int tail,
lru_add_page_tail(head, page_tail, lruvec, list);
}
-static void __split_huge_page(struct page *page, struct list_head *list)
+static void __split_huge_page(struct page *page, struct list_head *list,
+ unsigned long flags)
{
struct page *head = compound_head(page);
struct zone *zone = page_zone(head);
struct lruvec *lruvec;
+ pgoff_t end = -1;
int i;
- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irq(&zone->lru_lock);
lruvec = mem_cgroup_page_lruvec(head, zone);
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);
- for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
+ if (!PageAnon(page)) {
+ end = DIV_ROUND_UP(i_size_read(head->mapping->host),
+ PAGE_CACHE_SIZE);
+ }
+
+ for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
+ /* Some pages can be beyond i_size: drop them from page cache */
+ if (head[i].index >= end) {
+ __ClearPageDirty(head + i);
+ __delete_from_page_cache(head + i, NULL);
+ put_page(head + i);
+ }
+ }
ClearPageCompound(head);
- spin_unlock_irq(&zone->lru_lock);
+ /* See comment in __split_huge_page_tail() */
+ if (PageAnon(head)) {
+ atomic_inc(&head->_count);
+ } else {
+ /* Additional pin to radix tree */
+ atomic_add(2, &head->_count);
+ spin_unlock(&head->mapping->tree_lock);
+ }
+
+ spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
unfreeze_page(head);
@@ -3298,36 +3328,54 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
- struct anon_vma *anon_vma;
- int count, mapcount, ret;
+ struct anon_vma *anon_vma = NULL;
+ struct address_space *mapping = NULL;
+ int count, mapcount, extra_pins, ret;
bool mlocked;
unsigned long flags;
VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
- VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
VM_BUG_ON_PAGE(!PageCompound(page), page);
- /*
- * The caller does not necessarily hold an mmap_sem that would prevent
- * the anon_vma disappearing so we first we take a reference to it
- * and then lock the anon_vma for write. This is similar to
- * page_lock_anon_vma_read except the write lock is taken to serialise
- * against parallel split or collapse operations.
- */
- anon_vma = page_get_anon_vma(head);
- if (!anon_vma) {
- ret = -EBUSY;
- goto out;
+ if (PageAnon(head)) {
+ /*
+ * The caller does not necessarily hold an mmap_sem that would
+ * prevent the anon_vma disappearing so we first we take a
+ * reference to it and then lock the anon_vma for write. This
+ * is similar to page_lock_anon_vma_read except the write lock
+ * is taken to serialise against parallel split or collapse
+ * operations.
+ */
+ anon_vma = page_get_anon_vma(head);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
+ }
+ extra_pins = 0;
+ mapping = NULL;
+ anon_vma_lock_write(anon_vma);
+ } else {
+ mapping = head->mapping;
+
+ /* Truncated ? */
+ if (!mapping) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ /* Additional pins from radix tree */
+ extra_pins = HPAGE_PMD_NR;
+ anon_vma = NULL;
+ i_mmap_lock_read(mapping);
}
- anon_vma_lock_write(anon_vma);
/*
* Racy check if we can split the page, before freeze_page() will
* split PMDs
*/
- if (total_mapcount(head) != page_count(head) - 1) {
+ if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
ret = -EBUSY;
goto out_unlock;
}
@@ -3340,35 +3388,60 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (mlocked)
lru_add_drain();
+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock_irqsave(&page_zone(head)->lru_lock, flags);
+
+ if (mapping) {
+ void **pslot;
+
+ spin_lock(&mapping->tree_lock);
+ pslot = radix_tree_lookup_slot(&mapping->page_tree,
+ page_index(head));
+ /*
+ * Check if the head page is present in radix tree.
+ * We assume all tail are present too, if head is there.
+ */
+ if (radix_tree_deref_slot_protected(pslot,
+ &mapping->tree_lock) != head)
+ goto fail;
+ }
+
/* Prevent deferred_split_scan() touching ->_count */
- spin_lock_irqsave(&pgdata->split_queue_lock, flags);
+ spin_lock(&pgdata->split_queue_lock);
count = page_count(head);
mapcount = total_mapcount(head);
- if (!mapcount && count == 1) {
+ if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
if (!list_empty(page_deferred_list(head))) {
pgdata->split_queue_len--;
list_del(page_deferred_list(head));
}
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- __split_huge_page(page, list);
+ spin_unlock(&pgdata->split_queue_lock);
+ __split_huge_page(page, list, flags);
ret = 0;
- } else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- pr_alert("total_mapcount: %u, page_count(): %u\n",
- mapcount, count);
- if (PageTail(page))
- dump_page(head, NULL);
- dump_page(page, "total_mapcount(head) > 0");
- BUG();
} else {
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
+ if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
+ pr_alert("total_mapcount: %u, page_count(): %u\n",
+ mapcount, count);
+ if (PageTail(page))
+ dump_page(head, NULL);
+ dump_page(page, "total_mapcount(head) > 0");
+ BUG();
+ }
+ spin_unlock(&pgdata->split_queue_lock);
+fail: if (mapping)
+ spin_unlock(&mapping->tree_lock);
+ spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
unfreeze_page(head);
ret = -EBUSY;
}
out_unlock:
- anon_vma_unlock_write(anon_vma);
- put_anon_vma(anon_vma);
+ if (anon_vma) {
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+ }
+ if (mapping)
+ i_mmap_unlock_read(mapping);
out:
count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
return ret;
@@ -3491,8 +3564,7 @@ static int split_huge_pages_set(void *data, u64 val)
if (zone != page_zone(page))
goto next;
- if (!PageHead(page) || !PageAnon(page) ||
- PageHuge(page))
+ if (!PageHead(page) || PageHuge(page) || !PageLRU(page))
goto next;
total++;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b25de27b83d0..76a0e6e56e05 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -515,6 +515,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
}
}
+ if (pmd_trans_unstable(pmd))
+ return 0;
retry:
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
--
2.7.0
The new helper is similar to radix_tree_maybe_preload(), but tries to
preload the number of nodes required to insert (1 << order) contiguous,
naturally aligned elements.
This is required to push huge pages into the page cache.
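A rough worked example, assuming the usual RADIX_TREE_MAP_SHIFT of 6 and
64-bit indices (so RADIX_TREE_MAX_PATH == 11): for order == 9, the 512
elements fit into 8 fully populated height-1 subtrees of one node each, so
the helper preloads 11 + (11 - 1) - 1 + 8 = 28 nodes. shmem then calls it
as radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
compound_order(page)) before inserting a huge page.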
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/radix-tree.h | 1 +
lib/radix-tree.c | 68 ++++++++++++++++++++++++++++++++++++++++------
2 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 32623d26b62a..20b626160430 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -288,6 +288,7 @@ unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
unsigned long first_index, unsigned int max_items);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 224b369f5a5e..84d417665ddc 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -42,6 +42,9 @@
*/
static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly;
+/* Number of nodes in fully populated tree of given height */
+static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
+
/*
* Radix tree node cache.
*/
@@ -261,7 +264,7 @@ radix_tree_node_free(struct radix_tree_node *node)
* To make use of this facility, the radix tree must be initialised without
* __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
*/
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload(gfp_t gfp_mask, int nr)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
@@ -269,14 +272,14 @@ static int __radix_tree_preload(gfp_t gfp_mask)
preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
- while (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+ while (rtp->nr < nr) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
- if (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+ if (rtp->nr < nr) {
node->private_data = rtp->nodes;
rtp->nodes = node;
rtp->nr++;
@@ -302,7 +305,7 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
}
EXPORT_SYMBOL(radix_tree_preload);
@@ -314,7 +317,7 @@ EXPORT_SYMBOL(radix_tree_preload);
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
if (gfpflags_allow_blocking(gfp_mask))
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
return 0;
@@ -322,6 +325,51 @@ int radix_tree_maybe_preload(gfp_t gfp_mask)
EXPORT_SYMBOL(radix_tree_maybe_preload);
/*
+ * The same as function above, but preload number of nodes required to insert
+ * (1 << order) continuous naturally-aligned elements.
+ */
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
+{
+ unsigned long nr_subtrees;
+ int nr_nodes, subtree_height;
+
+ /* Preloading doesn't help anything with this gfp mask, skip it */
+ if (!gfpflags_allow_blocking(gfp_mask)) {
+ preempt_disable();
+ return 0;
+ }
+
+ /*
+ * Calculate number and height of fully populated subtrees it takes to
+ * store (1 << order) elements.
+ */
+ nr_subtrees = 1 << order;
+ for (subtree_height = 0; nr_subtrees > RADIX_TREE_MAP_SIZE;
+ subtree_height++)
+ nr_subtrees >>= RADIX_TREE_MAP_SHIFT;
+
+ /*
+ * The worst case is zero height tree with a single item at index 0 and
+ * then inserting items starting at ULONG_MAX - (1 << order).
+ *
+ * This requires RADIX_TREE_MAX_PATH nodes to build branch from root to
+ * 0-index item.
+ */
+ nr_nodes = RADIX_TREE_MAX_PATH;
+
+ /* Plus branch to fully populated subtrees. */
+ nr_nodes += RADIX_TREE_MAX_PATH - subtree_height;
+
+ /* Root node is shared. */
+ nr_nodes--;
+
+ /* Plus nodes required to build subtrees. */
+ nr_nodes += nr_subtrees * height_to_maxnodes[subtree_height];
+
+ return __radix_tree_preload(gfp_mask, nr_nodes);
+}
+
+/*
* Return the maximum key which can be store into a
* radix tree with height HEIGHT.
*/
@@ -1472,12 +1520,16 @@ static __init unsigned long __maxindex(unsigned int height)
return ~0UL >> shift;
}
-static __init void radix_tree_init_maxindex(void)
+static __init void radix_tree_init_arrays(void)
{
- unsigned int i;
+ unsigned int i, j;
for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
height_to_maxindex[i] = __maxindex(i);
+ for (i = 0; i < ARRAY_SIZE(height_to_maxnodes); i++) {
+ for (j = i; j > 0; j--)
+ height_to_maxnodes[i] += height_to_maxindex[j - 1] + 1;
+ }
}
static int radix_tree_callback(struct notifier_block *nfb,
@@ -1507,6 +1559,6 @@ void __init radix_tree_init(void)
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
- radix_tree_init_maxindex();
+ radix_tree_init_arrays();
hotcpu_notifier(radix_tree_callback, 0);
}
--
2.7.0
From: Hugh Dickins <[email protected]>
Provide a shmem_get_unmapped_area method in file_operations, called
at mmap time to decide the mapping address. It could be conditional
on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by
making it unconditional.
shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout). Lots of conditions, and in most cases
it just goes with the address that it chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back
to ask for a larger arena, within which to align the mapping suitably.
There have to be some direct calls to shmem_get_unmapped_area(),
not via the file_operations: because of the way shmem_zero_setup()
is called to create a shmem object late in the mmap sequence, when
MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero. Though
this only matters when the shmem_enabled knob (see the previous patch) has
been set.
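A minimal userspace sketch of the effect (the tmpfs mount point and file
name are assumptions; the returned address is only expected to be PMD
aligned when the conditions above are met):

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* file on a tmpfs mounted with huge=always (assumed path) */
		int fd = open("/mnt/huge/test", O_CREAT | O_RDWR, 0600);
		size_t len = 4UL << 20;		/* 4MB, >= HPAGE_PMD_SIZE */

		ftruncate(fd, len);
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		printf("addr %p, 2M aligned: %d\n", p,
		       ((unsigned long)p & ((2UL << 20) - 1)) == 0);
		return 0;
	}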
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/char/mem.c | 24 ++++++++++++
include/linux/shmem_fs.h | 2 +
ipc/shm.c | 6 ++-
mm/mmap.c | 16 +++++++-
mm/shmem.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 142 insertions(+), 4 deletions(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 6b1721f978c2..a4c3ce0c9ece 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -22,6 +22,7 @@
#include <linux/device.h>
#include <linux/highmem.h>
#include <linux/backing-dev.h>
+#include <linux/shmem_fs.h>
#include <linux/splice.h>
#include <linux/pfn.h>
#include <linux/export.h>
@@ -661,6 +662,28 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma)
return 0;
}
+static unsigned long get_unmapped_area_zero(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+#ifdef CONFIG_MMU
+ if (flags & MAP_SHARED) {
+ /*
+ * mmap_zero() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge;
+ * and pass NULL for file as in mmap.c's get_unmapped_area(),
+ * so as not to confuse shmem with our handle on "/dev/zero".
+ */
+ return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
+ }
+
+ /* Otherwise flags & MAP_PRIVATE: with no shmem object beneath it */
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+#else
+ return -ENOSYS;
+#endif
+}
+
static ssize_t write_full(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -768,6 +791,7 @@ static const struct file_operations zero_fops = {
.read_iter = read_iter_zero,
.write_iter = write_iter_zero,
.mmap = mmap_zero,
+ .get_unmapped_area = get_unmapped_area_zero,
#ifndef CONFIG_MMU
.mmap_capabilities = zero_mmap_capabilities,
#endif
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 03490d1554ba..25ac75b424d3 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -53,6 +53,8 @@ extern struct file *shmem_file_setup(const char *name,
extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
unsigned long flags);
extern int shmem_zero_setup(struct vm_area_struct *);
+extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
extern bool shmem_mapping(struct address_space *mapping);
extern void shmem_unlock_mapping(struct address_space *mapping);
diff --git a/ipc/shm.c b/ipc/shm.c
index 3174634ca4e5..b797a6e49d78 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -476,13 +476,15 @@ static const struct file_operations shm_file_operations = {
.mmap = shm_mmap,
.fsync = shm_fsync,
.release = shm_release,
-#ifndef CONFIG_MMU
.get_unmapped_area = shm_get_unmapped_area,
-#endif
.llseek = noop_llseek,
.fallocate = shm_fallocate,
};
+/*
+ * shm_file_operations_huge is now identical to shm_file_operations,
+ * but we keep it distinct for the sake of is_file_shm_hugepages().
+ */
static const struct file_operations shm_file_operations_huge = {
.mmap = shm_mmap,
.fsync = shm_fsync,
diff --git a/mm/mmap.c b/mm/mmap.c
index 1786d0b0244f..f6c825d828b5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -25,6 +25,7 @@
#include <linux/personality.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
#include <linux/profile.h>
#include <linux/export.h>
#include <linux/mount.h>
@@ -1892,8 +1893,19 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
return -ENOMEM;
get_area = current->mm->get_unmapped_area;
- if (file && file->f_op->get_unmapped_area)
- get_area = file->f_op->get_unmapped_area;
+ if (file) {
+ if (file->f_op->get_unmapped_area)
+ get_area = file->f_op->get_unmapped_area;
+ } else if (flags & MAP_SHARED) {
+ /*
+ * mmap_region() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge.
+ * do_mmap_pgoff() will clear pgoff, so match alignment.
+ */
+ pgoff = 0;
+ get_area = shmem_get_unmapped_area;
+ }
+
addr = get_area(file, addr, len, pgoff, flags);
if (IS_ERR_VALUE(addr))
return addr;
diff --git a/mm/shmem.c b/mm/shmem.c
index 41faccc3347d..e7c2a69ff410 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,94 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return ret;
}
+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long uaddr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ unsigned long (*get_area)(struct file *,
+ unsigned long, unsigned long, unsigned long, unsigned long);
+ unsigned long addr;
+ unsigned long offset;
+ unsigned long inflated_len;
+ unsigned long inflated_addr;
+ unsigned long inflated_offset;
+
+ if (len > TASK_SIZE)
+ return -ENOMEM;
+
+ get_area = current->mm->get_unmapped_area;
+ addr = get_area(file, uaddr, len, pgoff, flags);
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return addr;
+ if (IS_ERR_VALUE(addr))
+ return addr;
+ if (addr & ~PAGE_MASK)
+ return addr;
+ if (addr > TASK_SIZE - len)
+ return addr;
+
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return addr;
+ if (len < HPAGE_PMD_SIZE)
+ return addr;
+ if (flags & MAP_FIXED)
+ return addr;
+ /*
+ * Our priority is to support MAP_SHARED mapped hugely;
+ * and support MAP_PRIVATE mapped hugely too, until it is COWed.
+ * But if caller specified an address hint, respect that as before.
+ */
+ if (uaddr)
+ return addr;
+
+ if (shmem_huge != SHMEM_HUGE_FORCE) {
+ struct super_block *sb;
+
+ if (file) {
+ VM_BUG_ON(file->f_op != &shmem_file_operations);
+ sb = file_inode(file)->i_sb;
+ } else {
+ /*
+ * Called directly from mm/mmap.c, or drivers/char/mem.c
+ * for "/dev/zero", to create a shared anonymous object.
+ */
+ if (IS_ERR(shm_mnt))
+ return addr;
+ sb = shm_mnt->mnt_sb;
+ }
+ if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER)
+ return addr;
+ }
+
+ offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
+ if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+ return addr;
+ if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+ return addr;
+
+ inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+ if (inflated_len > TASK_SIZE)
+ return addr;
+ if (inflated_len < len)
+ return addr;
+
+ inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+ if (IS_ERR_VALUE(inflated_addr))
+ return addr;
+ if (inflated_addr & ~PAGE_MASK)
+ return addr;
+
+ inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+ inflated_addr += offset - inflated_offset;
+ if (inflated_offset > offset)
+ inflated_addr += HPAGE_PMD_SIZE;
+
+ if (inflated_addr > TASK_SIZE - len)
+ return addr;
+ return inflated_addr;
+}
+
#ifdef CONFIG_NUMA
static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
{
@@ -3313,6 +3401,7 @@ static const struct address_space_operations shmem_aops = {
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
+ .get_unmapped_area = shmem_get_unmapped_area,
#ifdef CONFIG_TMPFS
.llseek = shmem_file_llseek,
.read_iter = shmem_file_read_iter,
@@ -3539,6 +3628,15 @@ void shmem_unlock_mapping(struct address_space *mapping)
{
}
+#ifdef CONFIG_MMU
+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+#endif
+
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
truncate_inode_pages_range(inode->i_mapping, lstart, lend);
--
2.7.0
File COW for THP is handled on the pte level: just split the pmd.
It's not clear how beneficial allocating huge pages on COW faults would
be, and it would require some code to make them work.
I think at some point we can consider teaching khugepaged to collapse
pages in COW mappings, but allocating huge on fault is probably
overkill.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index f7a73b6cabac..c0a2510b9e15 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3366,6 +3366,11 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
if (fe->vma->vm_ops->pmd_fault)
return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
fe->flags);
+
+ /* COW handled on pte level: split pmd */
+ VM_BUG_ON_VMA(fe->vma->vm_flags & VM_SHARED, fe->vma);
+ split_huge_pmd(fe->vma, fe->pmd, fe->address);
+
return VM_FAULT_FALLBACK;
}
--
2.7.0
As with anon THP, we only mlock file huge pages if we can prove that the
page is not mapped with PTEs. This way we avoid leaking mlocked pages
into non-mlocked VMAs on split.
We rely on PageDoubleMap() under lock_page() to check whether the page
may be PTE-mapped. PG_double_map is set by page_add_file_rmap() when the
page is mapped with PTEs.
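The rule reduces to a simple predicate. A minimal userspace sketch of it
(made-up names standing in for the real page-flag checks in
follow_trans_huge_pmd() and page_add_file_rmap() below):

#include <stdbool.h>
#include <stdio.h>

/* Toy model: plain booleans stand in for the page state. */
static bool may_mlock_file_thp(bool pmd_mapped, bool double_map)
{
	/* Only mlock while the THP is PMD-mapped and PG_double_map is
	 * clear, i.e. we can show there are no PTE mappings that could
	 * leak the mlock on split. */
	return pmd_mapped && !double_map;
}

int main(void)
{
	printf("PMD-mapped only: %d\n", may_mlock_file_thp(true, false));
	printf("also PTE-mapped: %d\n", may_mlock_file_thp(true, true));
	return 0;
}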
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/page-flags.h | 13 ++++++++++++-
mm/huge_memory.c | 27 ++++++++++++++++++++-------
mm/mmap.c | 6 ++++++
mm/page_alloc.c | 2 ++
mm/rmap.c | 16 ++++++++++++++--
5 files changed, 54 insertions(+), 10 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4ed4f1b0c77..517707ae8cd1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -544,6 +544,17 @@ static inline int PageDoubleMap(struct page *page)
return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
}
+static inline void SetPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ set_bit(PG_double_map, &page[1].flags);
+}
+
+static inline void ClearPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ clear_bit(PG_double_map, &page[1].flags);
+}
static inline int TestSetPageDoubleMap(struct page *page)
{
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -560,7 +571,7 @@ static inline int TestClearPageDoubleMap(struct page *page)
TESTPAGEFLAG_FALSE(TransHuge)
TESTPAGEFLAG_FALSE(TransCompound)
TESTPAGEFLAG_FALSE(TransTail)
-TESTPAGEFLAG_FALSE(DoubleMap)
+PAGEFLAG_FALSE(DoubleMap)
TESTSETFLAG_FALSE(DoubleMap)
TESTCLEARFLAG_FALSE(DoubleMap)
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ba5fdf654f27..a2680b2112e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1427,6 +1427,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
* We don't mlock() pte-mapped THPs. This way we can avoid
* leaking mlocked pages into non-VM_LOCKED VMAs.
*
+ * For anon THP:
+ *
* In most cases the pmd is the only mapping of the page as we
* break COW for the mlock() -- see gup_flags |= FOLL_WRITE for
* writable private mappings in populate_vma_page_range().
@@ -1434,15 +1436,26 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
* The only scenario when we have the page shared here is if we
* mlocking read-only mapping shared over fork(). We skip
* mlocking such pages.
+ *
+ * For file THP:
+ *
+ * We can expect PageDoubleMap() to be stable under page lock:
+ * for file pages we set it in page_add_file_rmap(), which
+ * requires page to be locked.
*/
- if (compound_mapcount(page) == 1 && !PageDoubleMap(page) &&
- page->mapping && trylock_page(page)) {
- lru_add_drain();
- if (page->mapping)
- mlock_vma_page(page);
- unlock_page(page);
- }
+
+ if (PageAnon(page) && compound_mapcount(page) != 1)
+ goto skip_mlock;
+ if (PageDoubleMap(page) || !page->mapping)
+ goto skip_mlock;
+ if (!trylock_page(page))
+ goto skip_mlock;
+ lru_add_drain();
+ if (page->mapping && !PageDoubleMap(page))
+ mlock_vma_page(page);
+ unlock_page(page);
}
+skip_mlock:
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page), page);
if (flags & FOLL_GET)
diff --git a/mm/mmap.c b/mm/mmap.c
index 94979671b42c..1786d0b0244f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2576,6 +2576,12 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
/* drop PG_Mlocked flag for over-mapped range */
for (tmp = vma; tmp->vm_start >= start + size;
tmp = tmp->vm_next) {
+ /*
+ * Split pmd and munlock page on the border
+ * of the range.
+ */
+ vma_adjust_trans_huge(tmp, start, start + size, 0);
+
munlock_vma_pages_range(tmp,
max(tmp->vm_start, start),
min(tmp->vm_end, start + size));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d6f042673e01..bce8b320fcce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1009,6 +1009,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
if (PageAnon(page))
page->mapping = NULL;
+ if (compound)
+ ClearPageDoubleMap(page);
bad += free_pages_check(page);
for (i = 1; i < (1 << order); i++) {
if (compound)
diff --git a/mm/rmap.c b/mm/rmap.c
index 359ec5cff9b0..f48258a78c8a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1298,6 +1298,12 @@ void page_add_file_rmap(struct page *page, bool compound)
goto out;
__inc_zone_page_state(page, NR_FILE_THP_MAPPED);
} else {
+ if (PageTransCompound(page)) {
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ SetPageDoubleMap(compound_head(page));
+ if (PageMlocked(page))
+ clear_page_mlock(compound_head(page));
+ }
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
}
@@ -1472,8 +1478,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
- /* Holding pte lock, we do *not* need mmap_sem here */
- mlock_vma_page(page);
+ /* PTE-mapped THP are never mlocked */
+ if (!PageTransCompound(page)) {
+ /*
+ * Holding pte lock, we do *not* need
+ * mmap_sem here
+ */
+ mlock_vma_page(page);
+ }
ret = SWAP_MLOCK;
goto out_unmap;
}
--
2.7.0
For now, we have HPAGE_PMD_NR entries in the radix tree for every huge
page. That's suboptimal; it will be changed to use Matthew's multi-order
entries later.
The 'add' operation is not changed, because we don't need it to implement
hugetmpfs: shmem uses its own implementation.
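For a rough illustration of that layout (assuming 4kB base pages and 2MB
huge pages, i.e. HPAGE_PMD_NR == 512; a standalone example, not kernel
code), mapping a page-cache index to the huge page's slot range is plain
mask arithmetic:

#include <stdio.h>

#define HPAGE_PMD_NR 512UL	/* 2MB huge page / 4kB base page on x86-64 */

int main(void)
{
	unsigned long index = 1027;	/* arbitrary page-cache index */
	unsigned long head  = index & ~(HPAGE_PMD_NR - 1);
	unsigned long sub   = index &  (HPAGE_PMD_NR - 1);

	/* A huge page covers HPAGE_PMD_NR consecutive slots starting at a
	 * naturally aligned index; 'sub' is which subpage this slot holds. */
	printf("slots [%lu, %lu), subpage %lu\n",
	       head, head + HPAGE_PMD_NR, sub);
	return 0;
}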
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 187 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 134 insertions(+), 53 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index e7b55ec0221f..bf5611d2fc81 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,43 +110,18 @@
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/
-static void page_cache_tree_delete(struct address_space *mapping,
- struct page *page, void *shadow)
+static void __page_cache_tree_delete(struct address_space *mapping,
+ struct radix_tree_node *node, void **slot, unsigned long index,
+ void *shadow)
{
- struct radix_tree_node *node;
- unsigned long index;
- unsigned int offset;
unsigned int tag;
- void **slot;
-
- VM_BUG_ON(!PageLocked(page));
- __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
-
- if (shadow) {
- mapping->nrexceptional++;
- /*
- * Make sure the nrexceptional update is committed before
- * the nrpages update so that final truncate racing
- * with reclaim does not see both counters 0 at the
- * same time and miss a shadow entry.
- */
- smp_wmb();
- }
- mapping->nrpages--;
-
- if (!node) {
- /* Clear direct pointer tags in root node */
- mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
- radix_tree_replace_slot(slot, shadow);
- return;
- }
+ VM_BUG_ON(node == NULL);
+ VM_BUG_ON(*slot == NULL);
/* Clear tree tags for the removed page */
- index = page->index;
- offset = index & RADIX_TREE_MAP_MASK;
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
- if (test_bit(offset, node->tags[tag]))
+ if (test_bit(index & RADIX_TREE_MAP_MASK, node->tags[tag]))
radix_tree_tag_clear(&mapping->page_tree, index, tag);
}
@@ -173,6 +148,54 @@ static void page_cache_tree_delete(struct address_space *mapping,
}
}
+static void page_cache_tree_delete(struct address_space *mapping,
+ struct page *page, void *shadow)
+{
+ struct radix_tree_node *node;
+ unsigned long index;
+ void **slot;
+ int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+ if (shadow) {
+ mapping->nrexceptional += nr;
+ /*
+ * Make sure the nrexceptional update is committed before
+ * the nrpages update so that final truncate racing
+ * with reclaim does not see both counters 0 at the
+ * same time and miss a shadow entry.
+ */
+ smp_wmb();
+ }
+ mapping->nrpages -= nr;
+
+ if (!node) {
+ /* Clear direct pointer tags in root node */
+ mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+ VM_BUG_ON(nr != 1);
+ radix_tree_replace_slot(slot, shadow);
+ return;
+ }
+
+ index = page->index;
+ VM_BUG_ON_PAGE(index & (nr - 1), page);
+ for (i = 0; i < nr; i++) {
+ /* Cross node border */
+ if (i && ((index + i) & RADIX_TREE_MAP_MASK) == 0) {
+ __radix_tree_lookup(&mapping->page_tree,
+ page->index + i, &node, &slot);
+ }
+
+ __page_cache_tree_delete(mapping, node,
+ slot + (i & RADIX_TREE_MAP_MASK), index + i,
+ shadow);
+ }
+}
+
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
@@ -181,6 +204,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
void __delete_from_page_cache(struct page *page, void *shadow)
{
struct address_space *mapping = page->mapping;
+ int nr = hpage_nr_pages(page);
trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -193,6 +217,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
else
cleancache_invalidate_page(mapping, page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(page_mapped(page), page);
if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(page_mapped(page))) {
int mapcount;
@@ -224,9 +249,9 @@ void __delete_from_page_cache(struct page *page, void *shadow)
/* hugetlb pages do not participate in page cache accounting. */
if (!PageHuge(page))
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
/*
* At this point page must be either written or cleaned by truncate.
@@ -250,9 +275,8 @@ void __delete_from_page_cache(struct page *page, void *shadow)
*/
void delete_from_page_cache(struct page *page)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page_mapping(page);
unsigned long flags;
-
void (*freepage)(struct page *);
BUG_ON(!PageLocked(page));
@@ -265,7 +289,13 @@ void delete_from_page_cache(struct page *page)
if (freepage)
freepage(page);
- page_cache_release(page);
+
+ if (PageTransHuge(page) && !PageHuge(page)) {
+ atomic_sub(HPAGE_PMD_NR, &page->_count);
+ VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ } else {
+ page_cache_release(page);
+ }
}
EXPORT_SYMBOL(delete_from_page_cache);
@@ -1058,7 +1088,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
{
void **pagep;
- struct page *page;
+ struct page *head, *page;
rcu_read_lock();
repeat:
@@ -1078,9 +1108,17 @@ repeat:
*/
goto out;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
/*
* Has the page moved?
* This is part of the lockless pagecache protocol. See
@@ -1123,12 +1161,12 @@ repeat:
if (page && !radix_tree_exception(page)) {
lock_page(page);
/* Has the page been truncated? */
- if (unlikely(page->mapping != mapping)) {
+ if (unlikely(page_mapping(page) != mapping)) {
unlock_page(page);
page_cache_release(page);
goto repeat;
}
- VM_BUG_ON_PAGE(page->index != offset, page);
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
}
return page;
}
@@ -1261,7 +1299,7 @@ unsigned find_get_entries(struct address_space *mapping,
rcu_read_lock();
restart:
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1276,8 +1314,16 @@ repeat:
*/
goto export;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
goto repeat;
+ }
/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1323,7 +1369,7 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
rcu_read_lock();
restart:
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1347,9 +1393,16 @@ repeat:
continue;
}
- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
/* Has the page moved? */
if (unlikely(page != *slot)) {
page_cache_release(page);
@@ -1390,7 +1443,7 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
rcu_read_lock();
restart:
radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
/* The hole, there no reason to continue */
@@ -1414,8 +1467,14 @@ repeat:
break;
}
- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
+ goto repeat;
+ }
/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1428,7 +1487,7 @@ repeat:
* otherwise we can get both false positives and false
* negatives, which is just confusing to the caller.
*/
- if (page->mapping == NULL || page->index != iter.index) {
+ if (page->mapping == NULL || page_to_pgoff(page) != iter.index) {
page_cache_release(page);
break;
}
@@ -1467,7 +1526,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
restart:
radix_tree_for_each_tagged(slot, &mapping->page_tree,
&iter, *index, tag) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1496,8 +1555,15 @@ repeat:
continue;
}
- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
goto repeat;
+ }
/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1546,7 +1612,7 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
restart:
radix_tree_for_each_tagged(slot, &mapping->page_tree,
&iter, start, tag) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1568,9 +1634,17 @@ repeat:
*/
goto export;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
/* Has the page moved? */
if (unlikely(page != *slot)) {
page_cache_release(page);
@@ -2163,7 +2237,7 @@ void filemap_map_pages(struct fault_env *fe,
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
loff_t size;
- struct page *page;
+ struct page *head, *page;
rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
@@ -2181,8 +2255,15 @@ repeat:
goto next;
}
- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
goto repeat;
+ }
/* Has the page moved? */
if (unlikely(page != *slot)) {
--
2.7.0
split_huge_pmd() for file mappings (and DAX too) is implemented by just
clearing the pmd entry, as we can re-fill this area from the page cache
on the pte level later.
This means we don't need to deposit page tables when a file THP is
mapped. Therefore we shouldn't try to withdraw a page table when
zap_huge_pmd() hits a file THP PMD.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 44468fb7cdbf..c22144e3fe11 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1684,10 +1684,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page = pmd_page(orig_pmd);
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON_PAGE(!PageHead(page), page);
- pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
- atomic_long_dec(&tlb->mm->nr_ptes);
+ if (PageAnon(page)) {
+ pgtable_t pgtable;
+ pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+ pte_free(tlb->mm, pgtable);
+ atomic_long_dec(&tlb->mm->nr_ptes);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ } else {
+ add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+ }
spin_unlock(ptl);
tlb_remove_page(tlb, page);
}
--
2.7.0
The idea is borrowed from Peter's patch from the patchset on speculative
page faults[1]:
Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context
without endless function signature changes.
The changes are mostly mechanical, with the exception of the faultaround
code: filemap_map_pages() got reworked a bit.
This patch is preparation for the next one.
[1] http://lkml.kernel.org/r/[email protected]
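A toy standalone version of the pattern (the names are illustrative, not
the kernel's struct fault_env): bundle the fault context into one object
instead of threading many arguments through every helper.

#include <stdio.h>

struct fault_ctx {
	unsigned long address;	/* faulting virtual address */
	unsigned int flags;	/* FAULT_FLAG_xxx-style flags */
};

static int handle_fault(struct fault_ctx *fc)
{
	/* Every helper takes the one context pointer; adding a new field
	 * later does not touch any function signature. */
	printf("fault at %#lx, flags %#x\n", fc->address, fc->flags);
	return 0;
}

int main(void)
{
	struct fault_ctx fc = {
		.address = 0x7f0000001000UL,
		.flags = 0x1,
	};

	return handle_fault(&fc);
}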
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
Documentation/filesystems/Locking | 10 +-
fs/userfaultfd.c | 22 +-
include/linux/huge_mm.h | 20 +-
include/linux/mm.h | 34 ++-
include/linux/userfaultfd_k.h | 8 +-
mm/filemap.c | 28 +-
mm/huge_memory.c | 280 +++++++++----------
mm/internal.h | 4 +-
mm/memory.c | 569 ++++++++++++++++++--------------------
mm/nommu.c | 3 +-
10 files changed, 468 insertions(+), 510 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 06d443450f21..0e499a7944a5 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -546,13 +546,13 @@ subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.
->map_pages() is called when VM asks to map easy accessible pages.
-Filesystem should find and map pages associated with offsets from "pgoff"
-till "max_pgoff". ->map_pages() is called with page table locked and must
+Filesystem should find and map pages associated with offsets from "start_pgoff"
+till "end_pgoff". ->map_pages() is called with page table locked and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
-page table entry. Pointer to entry associated with offset "pgoff" is
-passed in "pte" field in vm_fault structure. Pointers to entries for other
-offsets should be calculated relative to "pte".
+page table entry. Pointer to entry associated with the page is passed in
+"pte" field in fault_env structure. Pointers to entries for other offsets
+should be calculated relative to "pte".
->page_mkwrite() is called when a previously read-only pte is
about to become writeable. The filesystem again must ensure that there are
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 50311703135b..0a08143dbc87 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -257,10 +257,9 @@ out:
* fatal_signal_pending()s, and the mmap_sem must be released before
* returning it.
*/
-int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, unsigned long reason)
+int handle_userfault(struct fault_env *fe, unsigned long reason)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct mm_struct *mm = fe->vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
@@ -269,7 +268,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
ret = VM_FAULT_SIGBUS;
- ctx = vma->vm_userfaultfd_ctx.ctx;
+ ctx = fe->vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;
@@ -296,17 +295,17 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* without first stopping userland access to the memory. For
* VM_UFFD_MISSING userfaults this is enough for now.
*/
- if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+ if (unlikely(!(fe->flags & FAULT_FLAG_ALLOW_RETRY))) {
/*
* Validate the invariant that nowait must allow retry
* to be sure not to return SIGBUS erroneously on
* nowait invocations.
*/
- BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+ BUG_ON(fe->flags & FAULT_FLAG_RETRY_NOWAIT);
#ifdef CONFIG_DEBUG_VM
if (printk_ratelimit()) {
printk(KERN_WARNING
- "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+ "FAULT_FLAG_ALLOW_RETRY missing %x\n", fe->flags);
dump_stack();
}
#endif
@@ -318,7 +317,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* and wait.
*/
ret = VM_FAULT_RETRY;
- if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ if (fe->flags & FAULT_FLAG_RETRY_NOWAIT)
goto out;
/* take the reference before dropping the mmap_sem */
@@ -326,10 +325,11 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
- uwq.msg = userfault_msg(address, flags, reason);
+ uwq.msg = userfault_msg(fe->address, fe->flags, reason);
uwq.ctx = ctx;
- return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+ return_to_userland =
+ (fe->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
spin_lock(&ctx->fault_pending_wqh.lock);
@@ -347,7 +347,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
TASK_KILLABLE);
spin_unlock(&ctx->fault_pending_wqh.lock);
- must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
+ must_wait = userfaultfd_must_wait(ctx, fe->address, fe->flags, reason);
up_read(&mm->mmap_sem);
if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5307dfb3f8ec..24918897f073 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,20 +1,12 @@
#ifndef _LINUX_HUGE_MM_H
#define _LINUX_HUGE_MM_H
-extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags);
+extern int do_huge_pmd_anonymous_page(struct fault_env *fe);
extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma);
-extern void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd, int dirty);
-extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd);
+extern void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd);
+extern int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd);
extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
@@ -139,8 +131,7 @@ static inline int hpage_nr_pages(struct page *page)
return 1;
}
-extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd);
extern struct page *huge_zero_page;
@@ -200,8 +191,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
return NULL;
}
-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+static inline int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd)
{
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1e19c653e7c0..1eb4f324fdb5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -279,10 +279,27 @@ struct vm_fault {
* is set (which is also implied by
* VM_FAULT_ERROR).
*/
- /* for ->map_pages() only */
- pgoff_t max_pgoff; /* map pages for offset from pgoff till
- * max_pgoff inclusive */
- pte_t *pte; /* pte entry associated with ->pgoff */
+};
+
+/*
+ * Page fault context: passes though page fault handler instead of endless list
+ * of function arguments.
+ */
+struct fault_env {
+ struct vm_area_struct *vma; /* Target VMA */
+ unsigned long address; /* Faulting virtual address */
+ unsigned int flags; /* FAULT_FLAG_xxx flags */
+ pmd_t *pmd; /* Pointer to pmd entry matching
+ * the 'address'
+ */
+ pte_t *pte; /* Pointer to pte entry matching
+ * the 'address'. NULL if the page
+ * table hasn't been allocated.
+ */
+ spinlock_t *ptl; /* Page table lock.
+ * Protects pte page table if 'pte'
+ * is not NULL, otherwise pmd.
+ */
};
/*
@@ -297,7 +314,8 @@ struct vm_operations_struct {
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
pmd_t *, unsigned int flags);
- void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ void (*map_pages)(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff);
/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -562,8 +580,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}
-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon);
+void do_set_pte(struct fault_env *fe, struct page *page);
#endif
/*
@@ -2040,7 +2057,8 @@ extern void truncate_inode_pages_final(struct address_space *);
/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
-extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff);
extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
/* mm/page-writeback.c */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480ad41b7..dd66a952e8cd 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -27,8 +27,7 @@
#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
-extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, unsigned long reason);
+extern int handle_userfault(struct fault_env *fe, unsigned long reason);
extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len);
@@ -56,10 +55,7 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
#else /* CONFIG_USERFAULTFD */
/* mm helpers */
-static inline int handle_userfault(struct vm_area_struct *vma,
- unsigned long address,
- unsigned int flags,
- unsigned long reason)
+static inline int handle_userfault(struct fault_env *fe, unsigned long reason)
{
return VM_FAULT_SIGBUS;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index d2d745acf1b5..3ec791e13f74 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2154,22 +2154,27 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);
-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
{
struct radix_tree_iter iter;
void **slot;
- struct file *file = vma->vm_file;
+ struct file *file = fe->vma->vm_file;
struct address_space *mapping = file->f_mapping;
+ pgoff_t last_pgoff = start_pgoff;
loff_t size;
struct page *page;
- unsigned long address = (unsigned long) vmf->virtual_address;
- unsigned long addr;
- pte_t *pte;
rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
- if (iter.index > vmf->max_pgoff)
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+ start_pgoff) {
+ if (iter.index > end_pgoff)
break;
+ fe->pte += iter.index - last_pgoff;
+ fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ last_pgoff = iter.index;
+ if (!pte_none(*fe->pte))
+ goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2204,14 +2209,9 @@ repeat:
if (page->index >= size >> PAGE_CACHE_SHIFT)
goto unlock;
- pte = vmf->pte + page->index - vmf->pgoff;
- if (!pte_none(*pte))
- goto unlock;
-
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
- addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
- do_set_pte(vma, addr, page, pte, false, false);
+ do_set_pte(fe, page);
unlock_page(page);
goto next;
unlock:
@@ -2219,7 +2219,7 @@ unlock:
skip:
page_cache_release(page);
next:
- if (iter.index == vmf->max_pgoff)
+ if (iter.index == end_pgoff)
break;
}
rcu_read_unlock();
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3a03310c2de5..1f58c6026860 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -809,26 +809,23 @@ void prep_transhuge_page(struct page *page)
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
}
-static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- struct page *page, gfp_t gfp,
- unsigned int flags)
+static int __do_huge_pmd_anonymous_page(struct fault_env *fe, struct page *page,
+ gfp_t gfp)
{
+ struct vm_area_struct *vma = fe->vma;
struct mem_cgroup *memcg;
pgtable_t pgtable;
- spinlock_t *ptl;
- unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
VM_BUG_ON_PAGE(!PageCompound(page), page);
- if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- pgtable = pte_alloc_one(mm, haddr);
+ pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable)) {
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
@@ -843,12 +840,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
*/
__SetPageUptodate(page);
- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_none(*pmd))) {
- spin_unlock(ptl);
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd))) {
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
} else {
pmd_t entry;
@@ -856,12 +853,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
if (userfaultfd_missing(vma)) {
int ret;
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
- pte_free(mm, pgtable);
- ret = handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ pte_free(vma->vm_mm, pgtable);
+ ret = handle_userfault(fe, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
return ret;
}
@@ -871,11 +867,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false, true);
lru_cache_add_active_or_unevictable(page, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, haddr, pmd, entry);
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
- atomic_long_inc(&mm->nr_ptes);
- spin_unlock(ptl);
+ pgtable_trans_huge_deposit(vma->vm_mm, fe->pmd, pgtable);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ spin_unlock(fe->ptl);
count_vm_event(THP_FAULT_ALLOC);
}
@@ -925,13 +921,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
return true;
}
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags)
+int do_huge_pmd_anonymous_page(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
gfp_t gfp;
struct page *page;
- unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
return VM_FAULT_FALLBACK;
@@ -939,42 +934,40 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm) &&
+ if (!(fe->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm) &&
transparent_hugepage_use_zero_page()) {
- spinlock_t *ptl;
pgtable_t pgtable;
struct page *zero_page;
bool set;
int ret;
- pgtable = pte_alloc_one(mm, haddr);
+ pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable))
return VM_FAULT_OOM;
zero_page = get_huge_zero_page();
if (unlikely(!zero_page)) {
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- ptl = pmd_lock(mm, pmd);
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
ret = 0;
set = false;
- if (pmd_none(*pmd)) {
+ if (pmd_none(*fe->pmd)) {
if (userfaultfd_missing(vma)) {
- spin_unlock(ptl);
- ret = handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ spin_unlock(fe->ptl);
+ ret = handle_userfault(fe, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
} else {
- set_huge_zero_page(pgtable, mm, vma,
- haddr, pmd,
- zero_page);
- spin_unlock(ptl);
+ set_huge_zero_page(pgtable, vma->vm_mm, vma,
+ haddr, fe->pmd, zero_page);
+ spin_unlock(fe->ptl);
set = true;
}
} else
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
if (!set) {
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
put_huge_zero_page();
}
return ret;
@@ -986,8 +979,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
}
prep_transhuge_page(page);
- return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
- flags);
+ return __do_huge_pmd_anonymous_page(fe, page, gfp);
}
static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1159,38 +1151,31 @@ out:
return ret;
}
-void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmd, pmd_t orig_pmd,
- int dirty)
+void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd)
{
- spinlock_t *ptl;
pmd_t entry;
unsigned long haddr;
- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ fe->ptl = pmd_lock(fe->vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto unlock;
entry = pmd_mkyoung(orig_pmd);
- haddr = address & HPAGE_PMD_MASK;
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
- update_mmu_cache_pmd(vma, address, pmd);
+ haddr = fe->address & HPAGE_PMD_MASK;
+ if (pmdp_set_access_flags(fe->vma, haddr, fe->pmd, entry,
+ fe->flags & FAULT_FLAG_WRITE))
+ update_mmu_cache_pmd(fe->vma, fe->address, fe->pmd);
unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
}
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmd, pmd_t orig_pmd,
- struct page *page,
- unsigned long haddr)
+static int do_huge_pmd_wp_page_fallback(struct fault_env *fe, pmd_t orig_pmd,
+ struct page *page)
{
+ struct vm_area_struct *vma = fe->vma;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
struct mem_cgroup *memcg;
- spinlock_t *ptl;
pgtable_t pgtable;
pmd_t _pmd;
int ret = 0, i;
@@ -1207,11 +1192,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
for (i = 0; i < HPAGE_PMD_NR; i++) {
pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE |
- __GFP_OTHER_NODE,
- vma, address, page_to_nid(page));
+ __GFP_OTHER_NODE, vma,
+ fe->address, page_to_nid(page));
if (unlikely(!pages[i] ||
- mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
- &memcg, false))) {
+ mem_cgroup_try_charge(pages[i], vma->vm_mm,
+ GFP_KERNEL, &memcg, false))) {
if (pages[i])
put_page(pages[i]);
while (--i >= 0) {
@@ -1237,41 +1222,41 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_free_pages;
VM_BUG_ON_PAGE(!PageHead(page), page);
- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
/* leave pmd empty until pte is filled */
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
+ pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, fe->pmd);
+ pmd_populate(vma->vm_mm, &_pmd, pgtable);
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
+ pte_t entry;
entry = mk_pte(pages[i], vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- page_add_new_anon_rmap(pages[i], vma, haddr, false);
+ page_add_new_anon_rmap(pages[i], fe->vma, haddr, false);
mem_cgroup_commit_charge(pages[i], memcg, false, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
- pte = pte_offset_map(&_pmd, haddr);
- VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
+ fe->pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*fe->pte));
+ set_pte_at(vma->vm_mm, haddr, fe->pte, entry);
+ pte_unmap(fe->pte);
}
kfree(pages);
smp_wmb(); /* make pte visible before pmd */
- pmd_populate(mm, pmd, pgtable);
+ pmd_populate(vma->vm_mm, fe->pmd, pgtable);
page_remove_rmap(page, true);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1280,8 +1265,8 @@ out:
return ret;
out_free_pages:
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ spin_unlock(fe->ptl);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1292,25 +1277,23 @@ out_free_pages:
goto out;
}
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
{
- spinlock_t *ptl;
- int ret = 0;
+ struct vm_area_struct *vma = fe->vma;
struct page *page = NULL, *new_page;
struct mem_cgroup *memcg;
- unsigned long haddr;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
gfp_t huge_gfp; /* for allocation and charge */
+ int ret = 0;
- ptl = pmd_lockptr(mm, pmd);
+ fe->ptl = pmd_lockptr(vma->vm_mm, fe->pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
- haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
- spin_lock(ptl);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ spin_lock(fe->ptl);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_unlock;
page = pmd_page(orig_pmd);
@@ -1329,13 +1312,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
- update_mmu_cache_pmd(vma, address, pmd);
+ if (pmdp_set_access_flags(vma, haddr, fe->pmd, entry, 1))
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
ret |= VM_FAULT_WRITE;
goto out_unlock;
}
get_page(page);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow()) {
@@ -1348,13 +1331,12 @@ alloc:
prep_transhuge_page(new_page);
} else {
if (!page) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
ret |= VM_FAULT_FALLBACK;
} else {
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
+ ret = do_huge_pmd_wp_page_fallback(fe, orig_pmd, page);
if (ret & VM_FAULT_OOM) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
ret |= VM_FAULT_FALLBACK;
}
put_page(page);
@@ -1363,14 +1345,12 @@ alloc:
goto out;
}
- if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg,
- true))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
+ huge_gfp, &memcg, true))) {
put_page(new_page);
- if (page) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
+ if (page)
put_page(page);
- } else
- split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -1386,13 +1366,13 @@ alloc:
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
- spin_lock(ptl);
+ spin_lock(fe->ptl);
if (page)
put_page(page);
- if (unlikely(!pmd_same(*pmd, orig_pmd))) {
- spin_unlock(ptl);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd))) {
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(new_page, memcg, true);
put_page(new_page);
goto out_mn;
@@ -1400,14 +1380,14 @@ alloc:
pmd_t entry;
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
- set_pmd_at(mm, haddr, pmd, entry);
- update_mmu_cache_pmd(vma, address, pmd);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
if (!page) {
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
put_huge_zero_page();
} else {
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -1416,13 +1396,13 @@ alloc:
}
ret |= VM_FAULT_WRITE;
}
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
out:
return ret;
out_unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
return ret;
}
@@ -1482,13 +1462,12 @@ out:
}
/* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t pmd)
{
- spinlock_t *ptl;
+ struct vm_area_struct *vma = fe->vma;
struct anon_vma *anon_vma = NULL;
struct page *page;
- unsigned long haddr = addr & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
int target_nid, last_cpupid = -1;
bool page_locked;
@@ -1499,8 +1478,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* A PROT_NONE fault should not end up here */
BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
- ptl = pmd_lock(mm, pmdp);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(pmd, *fe->pmd)))
goto out_unlock;
/*
@@ -1508,9 +1487,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* without disrupting NUMA hinting information. Do not relock and
* check_same as the page may no longer be mapped.
*/
- if (unlikely(pmd_trans_migrating(*pmdp))) {
- page = pmd_page(*pmdp);
- spin_unlock(ptl);
+ if (unlikely(pmd_trans_migrating(*fe->pmd))) {
+ page = pmd_page(*fe->pmd);
+ spin_unlock(fe->ptl);
wait_on_page_locked(page);
goto out;
}
@@ -1543,7 +1522,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Migration could have started since the pmd_trans_migrating check */
if (!page_locked) {
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
wait_on_page_locked(page);
page_nid = -1;
goto out;
@@ -1554,12 +1533,12 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* to serialises splits
*/
get_page(page);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PMD did not change while page_table_lock was released */
- spin_lock(ptl);
- if (unlikely(!pmd_same(pmd, *pmdp))) {
+ spin_lock(fe->ptl);
+ if (unlikely(!pmd_same(pmd, *fe->pmd))) {
unlock_page(page);
put_page(page);
page_nid = -1;
@@ -1577,9 +1556,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Migrate the THP to the requested node, returns with page unlocked
* and access rights restored.
*/
- spin_unlock(ptl);
- migrated = migrate_misplaced_transhuge_page(mm, vma,
- pmdp, pmd, addr, page, target_nid);
+ spin_unlock(fe->ptl);
+ migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
+ fe->pmd, pmd, fe->address, page, target_nid);
if (migrated) {
flags |= TNF_MIGRATED;
page_nid = target_nid;
@@ -1594,18 +1573,18 @@ clear_pmdnuma:
pmd = pmd_mkyoung(pmd);
if (was_writable)
pmd = pmd_mkwrite(pmd);
- set_pmd_at(mm, haddr, pmdp, pmd);
- update_mmu_cache_pmd(vma, addr, pmdp);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, pmd);
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
unlock_page(page);
out_unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);
if (page_nid != -1)
- task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
+ task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, fe->flags);
return 0;
}
@@ -2387,29 +2366,32 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd)
{
- unsigned long _address;
- pte_t *pte, pteval;
+ pte_t pteval;
int swapped_in = 0, ret = 0;
-
- pte = pte_offset_map(pmd, address);
- for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
- pte++, _address += PAGE_SIZE) {
- pteval = *pte;
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+ .pmd = pmd,
+ };
+
+ fe.pte = pte_offset_map(pmd, address);
+ for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
+ fe.pte++, fe.address += PAGE_SIZE) {
+ pteval = *fe.pte;
if (!is_swap_pte(pteval))
continue;
swapped_in++;
- ret = do_swap_page(mm, vma, _address, pte, pmd,
- FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
- pteval);
+ ret = do_swap_page(&fe, pteval);
if (ret & VM_FAULT_ERROR) {
trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
return;
}
/* pte is unmapped now, we need to map it */
- pte = pte_offset_map(pmd, _address);
+ fe.pte = pte_offset_map(pmd, fe.address);
}
- pte--;
- pte_unmap(pte);
+ fe.pte--;
+ pte_unmap(fe.pte);
trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
}
diff --git a/mm/internal.h b/mm/internal.h
index c984ceab3431..885b56807b0d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -35,9 +35,7 @@
/* Do not use these with a slab allocator */
#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
-extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte);
+int do_swap_page(struct fault_env *fe, pte_t orig_pte);
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/memory.c b/mm/memory.c
index 0462ed431a4d..850071fcbd52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1990,13 +1990,11 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
* case, all we need to do here is to mark the page as writable and update
* any related book-keeping.
*/
-static inline int wp_page_reuse(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
- struct page *page, int page_mkwrite,
- int dirty_shared)
- __releases(ptl)
+static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
+ struct page *page, int page_mkwrite, int dirty_shared)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
pte_t entry;
/*
* Clear the pages cpupid information as the existing
@@ -2006,12 +2004,12 @@ static inline int wp_page_reuse(struct mm_struct *mm,
if (page)
page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
- flush_cache_page(vma, address, pte_pfn(orig_pte));
+ flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (ptep_set_access_flags(vma, address, page_table, entry, 1))
- update_mmu_cache(vma, address, page_table);
- pte_unmap_unlock(page_table, ptl);
+ if (ptep_set_access_flags(vma, fe->address, fe->pte, entry, 1))
+ update_mmu_cache(vma, fe->address, fe->pte);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (dirty_shared) {
struct address_space *mapping;
@@ -2057,30 +2055,31 @@ static inline int wp_page_reuse(struct mm_struct *mm,
* held to the old page, as well as updating the rmap.
* - In any case, unlock the PTL and drop the reference we took to the old page.
*/
-static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- pte_t orig_pte, struct page *old_page)
+static int wp_page_copy(struct fault_env *fe, pte_t orig_pte,
+ struct page *old_page)
{
+ struct vm_area_struct *vma = fe->vma;
+ struct mm_struct *mm = vma->vm_mm;
struct page *new_page = NULL;
- spinlock_t *ptl = NULL;
pte_t entry;
int page_copied = 0;
- const unsigned long mmun_start = address & PAGE_MASK; /* For mmu_notifiers */
- const unsigned long mmun_end = mmun_start + PAGE_SIZE; /* For mmu_notifiers */
+ const unsigned long mmun_start = fe->address & PAGE_MASK;
+ const unsigned long mmun_end = mmun_start + PAGE_SIZE;
struct mem_cgroup *memcg;
if (unlikely(anon_vma_prepare(vma)))
goto oom;
if (is_zero_pfn(pte_pfn(orig_pte))) {
- new_page = alloc_zeroed_user_highpage_movable(vma, address);
+ new_page = alloc_zeroed_user_highpage_movable(vma, fe->address);
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
+ fe->address);
if (!new_page)
goto oom;
- cow_user_page(new_page, old_page, address, vma);
+ cow_user_page(new_page, old_page, fe->address, vma);
}
if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
@@ -2093,8 +2092,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Re-check the pte - we dropped the lock
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (likely(pte_same(*page_table, orig_pte))) {
+ fe->pte = pte_offset_map_lock(mm, fe->pmd, fe->address, &fe->ptl);
+ if (likely(pte_same(*fe->pte, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
dec_mm_counter_fast(mm,
@@ -2104,7 +2103,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
- flush_cache_page(vma, address, pte_pfn(orig_pte));
+ flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
/*
@@ -2113,8 +2112,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush_notify(vma, address, page_table);
- page_add_new_anon_rmap(new_page, vma, address, false);
+ ptep_clear_flush_notify(vma, fe->address, fe->pte);
+ page_add_new_anon_rmap(new_page, vma, fe->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
@@ -2122,8 +2121,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
- set_pte_at_notify(mm, address, page_table, entry);
- update_mmu_cache(vma, address, page_table);
+ set_pte_at_notify(mm, fe->address, fe->pte, entry);
+ update_mmu_cache(vma, fe->address, fe->pte);
if (old_page) {
/*
* Only after switching the pte to the new page may
@@ -2160,7 +2159,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
if (new_page)
page_cache_release(new_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
if (old_page) {
/*
@@ -2188,44 +2187,43 @@ oom:
* Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
* mapping
*/
-static int wp_pfn_shared(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
- pmd_t *pmd)
+static int wp_pfn_shared(struct fault_env *fe, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
+
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
struct vm_fault vmf = {
.page = NULL,
- .pgoff = linear_page_index(vma, address),
- .virtual_address = (void __user *)(address & PAGE_MASK),
+ .pgoff = linear_page_index(vma, fe->address),
+ .virtual_address =
+ (void __user *)(fe->address & PAGE_MASK),
.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
};
int ret;
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
if (ret & VM_FAULT_ERROR)
return ret;
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
/*
* We might have raced with another page fault while we
* released the pte_offset_map_lock.
*/
- if (!pte_same(*page_table, orig_pte)) {
- pte_unmap_unlock(page_table, ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
}
- return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
- NULL, 0, 0);
+ return wp_page_reuse(fe, orig_pte, NULL, 0, 0);
}
-static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table,
- pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
- struct page *old_page)
- __releases(ptl)
+static int wp_page_shared(struct fault_env *fe, pte_t orig_pte,
+ struct page *old_page)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
int page_mkwrite = 0;
page_cache_get(old_page);
@@ -2233,8 +2231,8 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
int tmp;
- pte_unmap_unlock(page_table, ptl);
- tmp = do_page_mkwrite(vma, old_page, address);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ tmp = do_page_mkwrite(vma, old_page, fe->address);
if (unlikely(!tmp || (tmp &
(VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
page_cache_release(old_page);
@@ -2246,19 +2244,18 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* they did, we just return, as we can count on the
* MMU to tell us if they didn't also make it writable.
*/
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
page_cache_release(old_page);
return 0;
}
page_mkwrite = 1;
}
- return wp_page_reuse(mm, vma, address, page_table, ptl,
- orig_pte, old_page, page_mkwrite, 1);
+ return wp_page_reuse(fe, orig_pte, old_page, page_mkwrite, 1);
}
/*
@@ -2279,14 +2276,13 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* but allow concurrent faults), with pte both mapped and locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- spinlock_t *ptl, pte_t orig_pte)
- __releases(ptl)
+static int do_wp_page(struct fault_env *fe, pte_t orig_pte)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *old_page;
- old_page = vm_normal_page(vma, address, orig_pte);
+ old_page = vm_normal_page(vma, fe->address, orig_pte);
if (!old_page) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -2297,12 +2293,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))
- return wp_pfn_shared(mm, vma, address, page_table, ptl,
- orig_pte, pmd);
+ return wp_pfn_shared(fe, orig_pte);
- pte_unmap_unlock(page_table, ptl);
- return wp_page_copy(mm, vma, address, page_table, pmd,
- orig_pte, old_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return wp_page_copy(fe, orig_pte, old_page);
}
/*
@@ -2312,13 +2306,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (PageAnon(old_page) && !PageKsm(old_page)) {
if (!trylock_page(old_page)) {
page_cache_get(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
lock_page(old_page);
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+ fe->address, &fe->ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
page_cache_release(old_page);
return 0;
}
@@ -2330,16 +2324,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
* the rmap code will not search our parent or siblings.
* Protected against the rmap code by the page lock.
*/
- page_move_anon_rmap(old_page, vma, address);
+ page_move_anon_rmap(old_page, vma, fe->address);
unlock_page(old_page);
- return wp_page_reuse(mm, vma, address, page_table, ptl,
- orig_pte, old_page, 0, 0);
+ return wp_page_reuse(fe, orig_pte, old_page, 0, 0);
}
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
- return wp_page_shared(mm, vma, address, page_table, pmd,
- ptl, orig_pte, old_page);
+ return wp_page_shared(fe, orig_pte, old_page);
}
/*
@@ -2347,9 +2339,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
page_cache_get(old_page);
- pte_unmap_unlock(page_table, ptl);
- return wp_page_copy(mm, vma, address, page_table, pmd,
- orig_pte, old_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return wp_page_copy(fe, orig_pte, old_page);
}
static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -2440,11 +2431,9 @@ EXPORT_SYMBOL(unmap_mapping_range);
* We return with the mmap_sem locked or unlocked in the same cases
* as does filemap_fault().
*/
-int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+int do_swap_page(struct fault_env *fe, pte_t orig_pte)
{
- spinlock_t *ptl;
+ struct vm_area_struct *vma = fe->vma;
struct page *page, *swapcache;
struct mem_cgroup *memcg;
swp_entry_t entry;
@@ -2453,17 +2442,17 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
int exclusive = 0;
int ret = 0;
- if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
+ if (!pte_unmap_same(vma->vm_mm, fe->pmd, fe->pte, orig_pte))
goto out;
entry = pte_to_swp_entry(orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else {
- print_bad_pte(vma, address, orig_pte, NULL);
+ print_bad_pte(vma, fe->address, orig_pte, NULL);
ret = VM_FAULT_SIGBUS;
}
goto out;
@@ -2472,14 +2461,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = lookup_swap_cache(entry);
if (!page) {
page = swapin_readahead(entry,
- GFP_HIGHUSER_MOVABLE, vma, address);
+ GFP_HIGHUSER_MOVABLE, vma, fe->address);
if (!page) {
/*
* Back out if somebody else faulted in this pte
* while we released the pte lock.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (likely(pte_same(*page_table, orig_pte)))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+ fe->address, &fe->ptl);
+ if (likely(pte_same(*fe->pte, orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto unlock;
@@ -2488,7 +2478,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
- mem_cgroup_count_vm_event(mm, PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
} else if (PageHWPoison(page)) {
/*
* hwpoisoned dirty swapcache pages are kept for killing
@@ -2501,7 +2491,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
swapcache = page;
- locked = lock_page_or_retry(page, mm, flags);
+ locked = lock_page_or_retry(page, vma->vm_mm, fe->flags);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
if (!locked) {
@@ -2518,14 +2508,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
goto out_page;
- page = ksm_might_need_to_copy(page, vma, address);
+ page = ksm_might_need_to_copy(page, vma, fe->address);
if (unlikely(!page)) {
ret = VM_FAULT_OOM;
page = swapcache;
goto out_page;
}
- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
ret = VM_FAULT_OOM;
goto out_page;
}
@@ -2533,8 +2524,9 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Back out if somebody else already faulted in this pte.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*page_table, orig_pte)))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte)))
goto out_nomap;
if (unlikely(!PageUptodate(page))) {
@@ -2552,24 +2544,24 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
* must be called after the swap_free(), or it will never succeed.
*/
- inc_mm_counter_fast(mm, MM_ANONPAGES);
- dec_mm_counter_fast(mm, MM_SWAPENTS);
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
- if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+ if ((fe->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
- flags &= ~FAULT_FLAG_WRITE;
+ fe->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
- set_pte_at(mm, address, page_table, pte);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
if (page == swapcache) {
- do_page_add_anon_rmap(page, vma, address, exclusive);
+ do_page_add_anon_rmap(page, vma, fe->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, address, false);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}
@@ -2592,22 +2584,22 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_cache_release(swapcache);
}
- if (flags & FAULT_FLAG_WRITE) {
- ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+ if (fe->flags & FAULT_FLAG_WRITE) {
+ ret |= do_wp_page(fe, pte);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
goto out;
}
/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, page_table);
+ update_mmu_cache(vma, fe->address, fe->pte);
unlock:
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
out:
return ret;
out_nomap:
mem_cgroup_cancel_charge(page, memcg, false);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
out_page:
unlock_page(page);
out_release:
@@ -2658,37 +2650,36 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags)
+static int do_anonymous_page(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
struct mem_cgroup *memcg;
struct page *page;
- spinlock_t *ptl;
pte_t entry;
- pte_unmap(page_table);
+ pte_unmap(fe->pte);
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
/* Check if we need to add a guard page to the stack */
- if (check_stack_guard_page(vma, address) < 0)
+ if (check_stack_guard_page(vma, fe->address) < 0)
return VM_FAULT_SIGSEGV;
/* Use the zero-page for reads */
- if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
- entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
+ if (!(fe->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm)) {
+ entry = pte_mkspecial(pfn_pte(my_zero_pfn(fe->address),
vma->vm_page_prot));
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_none(*page_table))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_none(*fe->pte))
goto unlock;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(page_table, ptl);
- return handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return handle_userfault(fe, VM_UFFD_MISSING);
}
goto setpte;
}
@@ -2696,11 +2687,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_zeroed_user_highpage_movable(vma, address);
+ page = alloc_zeroed_user_highpage_movable(vma, fe->address);
if (!page)
goto oom;
- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
goto oom_free_page;
/*
@@ -2714,30 +2705,30 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_none(*page_table))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_none(*fe->pte))
goto release;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
mem_cgroup_cancel_charge(page, memcg, false);
page_cache_release(page);
- return handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ return handle_userfault(fe, VM_UFFD_MISSING);
}
- inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address, false);
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
- set_pte_at(mm, address, page_table, entry);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, page_table);
+ update_mmu_cache(vma, fe->address, fe->pte);
unlock:
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
release:
mem_cgroup_cancel_charge(page, memcg, false);
@@ -2754,16 +2745,16 @@ oom:
* released depending on flags and vma->vm_ops->fault() return value.
* See filemap_fault() and __lock_page_retry().
*/
-static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags,
- struct page *cow_page, struct page **page)
+static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
+ struct page *cow_page, struct page **page)
{
+ struct vm_area_struct *vma = fe->vma;
struct vm_fault vmf;
int ret;
- vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.virtual_address = (void __user *)(fe->address & PAGE_MASK);
vmf.pgoff = pgoff;
- vmf.flags = flags;
+ vmf.flags = fe->flags;
vmf.page = NULL;
vmf.gfp_mask = __get_fault_gfp_mask(vma);
vmf.cow_page = cow_page;
@@ -2794,38 +2785,36 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
/**
* do_set_pte - setup new PTE entry for given page and add reverse page mapping.
*
- * @vma: virtual memory area
- * @address: user virtual address
+ * @fe: fault environment
* @page: page to map
- * @pte: pointer to target page table entry
- * @write: true, if new entry is writable
- * @anon: true, if it's anonymous page
*
- * Caller must hold page table lock relevant for @pte.
+ * Caller must hold page table lock relevant for @fe->pte.
*
* Target users are page handler itself and implementations of
* vm_ops->map_pages.
*/
-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon)
+void do_set_pte(struct fault_env *fe, struct page *page)
{
+ struct vm_area_struct *vma = fe->vma;
+ bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (anon) {
+ /* copy-on-write page */
+ if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address, false);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page);
}
- set_pte_at(vma->vm_mm, address, pte, entry);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
/* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, address, pte);
+ update_mmu_cache(vma, fe->address, fe->pte);
}
static unsigned long fault_around_bytes __read_mostly =
@@ -2892,57 +2881,53 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pgoff_t pgoff, unsigned int flags)
+static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
{
- unsigned long start_addr, nr_pages, mask;
- pgoff_t max_pgoff;
- struct vm_fault vmf;
+ unsigned long address = fe->address, start_addr, nr_pages, mask;
+ pte_t *pte = fe->pte;
+ pgoff_t end_pgoff;
int off;
nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
- start_addr = max(address & mask, vma->vm_start);
- off = ((address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- pte -= off;
- pgoff -= off;
+ start_addr = max(fe->address & mask, fe->vma->vm_start);
+ off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+ fe->pte -= off;
+ start_pgoff -= off;
/*
- * max_pgoff is either end of page table or end of vma
- * or fault_around_pages() from pgoff, depending what is nearest.
+ * end_pgoff is either end of page table or end of vma
+ * or fault_around_pages() from start_pgoff, depending what is nearest.
*/
- max_pgoff = pgoff - ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ end_pgoff = start_pgoff -
+ ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
- max_pgoff = min3(max_pgoff, vma_pages(vma) + vma->vm_pgoff - 1,
- pgoff + nr_pages - 1);
+ end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
+ start_pgoff + nr_pages - 1);
/* Check if it makes any sense to call ->map_pages */
- while (!pte_none(*pte)) {
- if (++pgoff > max_pgoff)
- return;
- start_addr += PAGE_SIZE;
- if (start_addr >= vma->vm_end)
- return;
- pte++;
+ fe->address = start_addr;
+ while (!pte_none(*fe->pte)) {
+ if (++start_pgoff > end_pgoff)
+ goto out;
+ fe->address += PAGE_SIZE;
+ if (fe->address >= fe->vma->vm_end)
+ goto out;
+ fe->pte++;
}
- vmf.virtual_address = (void __user *) start_addr;
- vmf.pte = pte;
- vmf.pgoff = pgoff;
- vmf.max_pgoff = max_pgoff;
- vmf.flags = flags;
- vmf.gfp_mask = __get_fault_gfp_mask(vma);
- vma->vm_ops->map_pages(vma, &vmf);
+ fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+out:
+ /* restore fault_env */
+ fe->pte = pte;
+ fe->address = address;
}
-static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
- spinlock_t *ptl;
- pte_t *pte;
int ret = 0;
/*
@@ -2951,64 +2936,64 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- do_fault_around(vma, address, pte, pgoff, flags);
- if (!pte_same(*pte, orig_pte))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ do_fault_around(fe, pgoff);
+ if (!pte_same(*fe->pte, orig_pte))
goto unlock_out;
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
}
- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
page_cache_release(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, false, false);
+ do_set_pte(fe, fault_page);
unlock_page(fault_page);
unlock_out:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return ret;
}
-static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
struct mem_cgroup *memcg;
- spinlock_t *ptl;
- pte_t *pte;
int ret;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, fe->address);
if (!new_page)
return VM_FAULT_OOM;
- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
+ if (mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
page_cache_release(new_page);
return VM_FAULT_OOM;
}
- ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+ ret = __do_fault(fe, pgoff, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
if (fault_page)
- copy_user_highpage(new_page, fault_page, address, vma);
+ copy_user_highpage(new_page, fault_page, fe->address, vma);
__SetPageUptodate(new_page);
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
page_cache_release(fault_page);
@@ -3021,10 +3006,10 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
goto uncharge_out;
}
- do_set_pte(vma, address, new_page, pte, true, true);
+ do_set_pte(fe, new_page);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
page_cache_release(fault_page);
@@ -3042,18 +3027,15 @@ uncharge_out:
return ret;
}
-static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
struct address_space *mapping;
- spinlock_t *ptl;
- pte_t *pte;
int dirtied = 0;
int ret, tmp;
- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
@@ -3063,7 +3045,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (vma->vm_ops->page_mkwrite) {
unlock_page(fault_page);
- tmp = do_page_mkwrite(vma, fault_page, address);
+ tmp = do_page_mkwrite(vma, fault_page, fe->address);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
page_cache_release(fault_page);
@@ -3071,15 +3053,16 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
}
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
page_cache_release(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, true, false);
- pte_unmap_unlock(pte, ptl);
+ do_set_pte(fe, fault_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (set_page_dirty(fault_page))
dirtied = 1;
@@ -3111,23 +3094,20 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+static int do_fault(struct fault_env *fe, pte_t orig_pte)
{
- pgoff_t pgoff = linear_page_index(vma, address);
+ struct vm_area_struct *vma = fe->vma;
+ pgoff_t pgoff = linear_page_index(vma, fe->address);
- pte_unmap(page_table);
+ pte_unmap(fe->pte);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
- if (!(flags & FAULT_FLAG_WRITE))
- return do_read_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
+ if (!(fe->flags & FAULT_FLAG_WRITE))
+ return do_read_fault(fe, pgoff, orig_pte);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
- return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+ return do_cow_fault(fe, pgoff, orig_pte);
+ return do_shared_fault(fe, pgoff, orig_pte);
}
static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3145,11 +3125,10 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
return mpol_misplaced(page, vma, addr);
}
-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static int do_numa_page(struct fault_env *fe, pte_t pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *page = NULL;
- spinlock_t *ptl;
int page_nid = -1;
int last_cpupid;
int target_nid;
@@ -3169,10 +3148,10 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* page table entry is not accessible, so there would be no
* concurrent hardware modifications to the PTE.
*/
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*ptep, pte))) {
- pte_unmap_unlock(ptep, ptl);
+ fe->ptl = pte_lockptr(vma->vm_mm, fe->pmd);
+ spin_lock(fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
goto out;
}
@@ -3181,18 +3160,18 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
- set_pte_at(mm, addr, ptep, pte);
- update_mmu_cache(vma, addr, ptep);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
+ update_mmu_cache(vma, fe->address, fe->pte);
- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_page(vma, fe->address, pte);
if (!page) {
- pte_unmap_unlock(ptep, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
/* TODO: handle PTE-mapped THP */
if (PageCompound(page)) {
- pte_unmap_unlock(ptep, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
@@ -3216,8 +3195,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
- pte_unmap_unlock(ptep, ptl);
+ target_nid = numa_migrate_prep(page, vma, fe->address, page_nid,
+ &flags);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (target_nid == -1) {
put_page(page);
goto out;
@@ -3237,24 +3217,24 @@ out:
return 0;
}
-static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+static int create_huge_pmd(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
if (vma_is_anonymous(vma))
- return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
+ return do_huge_pmd_anonymous_page(fe);
if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ return vma->vm_ops->pmd_fault(vma, fe->address, fe->pmd,
+ fe->flags);
return VM_FAULT_FALLBACK;
}
-static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, pmd_t orig_pmd,
- unsigned int flags)
+static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
{
- if (vma_is_anonymous(vma))
- return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
- if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ if (vma_is_anonymous(fe->vma))
+ return do_huge_pmd_wp_page(fe, orig_pmd);
+ if (fe->vma->vm_ops->pmd_fault)
+ return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
+ fe->flags);
return VM_FAULT_FALLBACK;
}
@@ -3274,12 +3254,9 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;
- spinlock_t *ptl;
/*
* some architectures can have larger ptes than wordsize,
@@ -3289,37 +3266,34 @@ static int handle_pte_fault(struct mm_struct *mm,
* we later double check anyway with the ptl lock held. So here
* a barrier will do.
*/
- entry = *pte;
+ entry = *fe->pte;
barrier();
if (!pte_present(entry)) {
if (pte_none(entry)) {
- if (vma_is_anonymous(vma))
- return do_anonymous_page(mm, vma, address,
- pte, pmd, flags);
+ if (vma_is_anonymous(fe->vma))
+ return do_anonymous_page(fe);
else
- return do_fault(mm, vma, address, pte, pmd,
- flags, entry);
+ return do_fault(fe, entry);
}
- return do_swap_page(mm, vma, address,
- pte, pmd, flags, entry);
+ return do_swap_page(fe, entry);
}
if (pte_protnone(entry))
- return do_numa_page(mm, vma, address, entry, pte, pmd);
+ return do_numa_page(fe, entry);
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*pte, entry)))
+ fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
+ spin_lock(fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, entry)))
goto unlock;
- if (flags & FAULT_FLAG_WRITE) {
+ if (fe->flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
- return do_wp_page(mm, vma, address,
- pte, pmd, ptl, entry);
+ return do_wp_page(fe, entry);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
- if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
- update_mmu_cache(vma, address, pte);
+ if (ptep_set_access_flags(fe->vma, fe->address, fe->pte, entry,
+ fe->flags & FAULT_FLAG_WRITE)) {
+ update_mmu_cache(fe->vma, fe->address, fe->pte);
} else {
/*
* This is needed only for protection faults but the arch code
@@ -3327,11 +3301,11 @@ static int handle_pte_fault(struct mm_struct *mm,
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
- if (flags & FAULT_FLAG_WRITE)
- flush_tlb_fix_spurious_fault(vma, address);
+ if (fe->flags & FAULT_FLAG_WRITE)
+ flush_tlb_fix_spurious_fault(fe->vma, fe->address);
}
unlock:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
@@ -3344,46 +3318,42 @@ unlock:
static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
{
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = flags,
+ };
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- if (unlikely(is_vm_hugetlb_page(vma)))
- return hugetlb_fault(mm, vma, address, flags);
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
if (!pud)
return VM_FAULT_OOM;
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
+ fe.pmd = pmd_alloc(mm, pud, address);
+ if (!fe.pmd)
return VM_FAULT_OOM;
- if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
- int ret = create_huge_pmd(mm, vma, address, pmd, flags);
+ if (pmd_none(*fe.pmd) && transparent_hugepage_enabled(vma)) {
+ int ret = create_huge_pmd(&fe);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- pmd_t orig_pmd = *pmd;
+ pmd_t orig_pmd = *fe.pmd;
int ret;
barrier();
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
- unsigned int dirty = flags & FAULT_FLAG_WRITE;
-
if (pmd_protnone(orig_pmd))
- return do_huge_pmd_numa_page(mm, vma, address,
- orig_pmd, pmd);
+ return do_huge_pmd_numa_page(&fe, orig_pmd);
- if (dirty && !pmd_write(orig_pmd)) {
- ret = wp_huge_pmd(mm, vma, address, pmd,
- orig_pmd, flags);
+ if ((fe.flags & FAULT_FLAG_WRITE) &&
+ !pmd_write(orig_pmd)) {
+ ret = wp_huge_pmd(&fe, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- huge_pmd_set_accessed(mm, vma, address, pmd,
- orig_pmd, dirty);
+ huge_pmd_set_accessed(&fe, orig_pmd);
return 0;
}
}
@@ -3394,7 +3364,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* run pte_offset_map on the pmd, if an huge pmd could
* materialize from under us from a different thread.
*/
- if (unlikely(pte_alloc(mm, pmd, address)))
+ if (unlikely(pte_alloc(fe.vma->vm_mm, fe.pmd, fe.address)))
return VM_FAULT_OOM;
/*
* If a huge pmd materialized under us just retry later. Use
@@ -3407,7 +3377,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* through an atomic read in C, which is what pmd_trans_unstable()
* provides.
*/
- if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
+ if (unlikely(pmd_trans_unstable(fe.pmd) || pmd_devmap(*fe.pmd)))
return 0;
/*
* A regular pmd is established and it can't morph into a huge pmd
@@ -3415,9 +3385,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* read mode and khugepaged takes it in write mode. So now it's
* safe to run pte_offset_map().
*/
- pte = pte_offset_map(pmd, address);
+ fe.pte = pte_offset_map(fe.pmd, fe.address);
- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+ return handle_pte_fault(&fe);
}
/*
@@ -3446,7 +3416,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();
- ret = __handle_mm_fault(vma, address, flags);
+ if (unlikely(is_vm_hugetlb_page(vma)))
+ ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+ else
+ ret = __handle_mm_fault(vma, address, flags);
if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
diff --git a/mm/nommu.c b/mm/nommu.c
index 6402f2715d48..06c374d73797 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1813,7 +1813,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);
-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
{
BUG();
}
--
2.7.0
Naive approach: on mapping/unmapping the page as compound we update
->_mapcount on each 4k subpage. That's not efficient, but it's not obvious
how to optimize it; we can look into that later.
The PG_double_map optimization doesn't work for file pages since the
lifecycle of file pages differs from that of anon pages: a file page can
be mapped again at any time.
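For illustration, here is a toy userspace model of this accounting (not
kernel code; the names and HPAGE_PMD_NR=512 simply mirror the patch). Each
PMD mapping bumps every subpage ->_mapcount plus compound_mapcount, and the
file-page branch of total_mapcount() subtracts the double-counted compound
contribution:

	#include <stdio.h>

	#define HPAGE_PMD_NR 512

	/* stand-ins for page[i]._mapcount and compound_mapcount_ptr(page) */
	static int subpage[HPAGE_PMD_NR];
	static int compound = -1;

	/* models page_add_file_rmap(page, true): one more PMD mapping */
	static void map_pmd(void)
	{
		int i;

		for (i = 0; i < HPAGE_PMD_NR; i++)
			subpage[i]++;
		compound++;
	}

	/* models page_add_file_rmap(page + i, false): one more PTE mapping */
	static void map_pte(int i)
	{
		subpage[i]++;
	}

	/* file-page branch of total_mapcount() as changed by this patch */
	static int total_mapcount(void)
	{
		int i, ret = compound + 1;

		for (i = 0; i < HPAGE_PMD_NR; i++)
			ret += subpage[i] + 1;
		return ret - (compound + 1) * HPAGE_PMD_NR;
	}

	int main(void)
	{
		int i;

		for (i = 0; i < HPAGE_PMD_NR; i++)
			subpage[i] = -1;	/* _mapcount starts at -1 */

		map_pmd();		/* one PMD mapping */
		map_pte(0);		/* plus two PTE mappings */
		map_pte(3);
		printf("total_mapcount = %d\n", total_mapcount());	/* 3 */
		return 0;
	}

One PMD mapping plus two PTE mappings yields a total mapcount of 3, which
is the intended semantics.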
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/rmap.h | 2 +-
mm/huge_memory.c | 10 +++++++---
mm/memory.c | 4 ++--
mm/migrate.c | 2 +-
mm/rmap.c | 48 +++++++++++++++++++++++++++++++++++-------------
mm/util.c | 6 ++++++
6 files changed, 52 insertions(+), 20 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 49eb4f8ebac9..5704f101b52e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -165,7 +165,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, int);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
-void page_add_file_rmap(struct page *);
+void page_add_file_rmap(struct page *, bool);
void page_remove_rmap(struct page *, bool);
void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1f58c6026860..1b111d5c0312 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3251,18 +3251,22 @@ static void __split_huge_page(struct page *page, struct list_head *list)
int total_mapcount(struct page *page)
{
- int i, ret;
+ int i, compound, ret;
VM_BUG_ON_PAGE(PageTail(page), page);
if (likely(!PageCompound(page)))
return atomic_read(&page->_mapcount) + 1;
- ret = compound_mapcount(page);
+ compound = compound_mapcount(page);
if (PageHuge(page))
- return ret;
+ return compound;
+ ret = compound;
for (i = 0; i < HPAGE_PMD_NR; i++)
ret += atomic_read(&page[i]._mapcount) + 1;
+ /* File pages have compound_mapcount included in _mapcount */
+ if (!PageAnon(page))
+ return ret - compound * HPAGE_PMD_NR;
if (PageDoubleMap(page))
ret -= HPAGE_PMD_NR;
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 9c896a52565c..a6c1c4955560 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1436,7 +1436,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
/* Ok, finally just insert the thing.. */
get_page(page);
inc_mm_counter_fast(mm, mm_counter_file(page));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, false);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
retval = 0;
@@ -2879,7 +2879,7 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c822a7b27e0..d20276fffce7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -170,7 +170,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
} else if (PageAnon(new))
page_add_anon_rmap(new, vma, addr, false);
else
- page_add_file_rmap(new);
+ page_add_file_rmap(new, false);
if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
diff --git a/mm/rmap.c b/mm/rmap.c
index c399a0d41b31..66808955a5e0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1285,18 +1285,34 @@ void page_add_new_anon_rmap(struct page *page,
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, bool compound)
{
+ int i, nr = 1;
+
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
lock_page_memcg(page);
- if (atomic_inc_and_test(&page->_mapcount)) {
- __inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+ if (compound && PageTransHuge(page)) {
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_inc_and_test(&page[i]._mapcount))
+ nr++;
+ }
+ if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
+ goto out;
+ } else {
+ if (!atomic_inc_and_test(&page->_mapcount))
+ goto out;
}
+ __mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, nr);
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+out:
unlock_page_memcg(page);
}
-static void page_remove_file_rmap(struct page *page)
+static void page_remove_file_rmap(struct page *page, bool compound)
{
+ int i, nr = 1;
+
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
lock_page_memcg(page);
/* Hugepages are not counted in NR_FILE_MAPPED for now. */
@@ -1307,15 +1323,24 @@ static void page_remove_file_rmap(struct page *page)
}
/* page still mapped by someone else? */
- if (!atomic_add_negative(-1, &page->_mapcount))
- goto out;
+ if (compound && PageTransHuge(page)) {
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_add_negative(-1, &page[i]._mapcount))
+ nr++;
+ }
+ if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+ goto out;
+ } else {
+ if (!atomic_add_negative(-1, &page->_mapcount))
+ goto out;
+ }
/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- __dec_zone_page_state(page, NR_FILE_MAPPED);
+ __mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, -nr);
mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
if (unlikely(PageMlocked(page)))
@@ -1371,11 +1396,8 @@ static void page_remove_anon_compound_rmap(struct page *page)
*/
void page_remove_rmap(struct page *page, bool compound)
{
- if (!PageAnon(page)) {
- VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
- page_remove_file_rmap(page);
- return;
- }
+ if (!PageAnon(page))
+ return page_remove_file_rmap(page, compound);
if (compound)
return page_remove_anon_compound_rmap(page);
diff --git a/mm/util.c b/mm/util.c
index 362f4cd8ab3a..2e92f231796b 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,12 @@ int __page_mapcount(struct page *page)
int ret;
ret = atomic_read(&page->_mapcount) + 1;
+ /*
+ * For file THP, page->_mapcount contains the total number of mappings
+ * of the page: no need to look into compound_mapcount.
+ */
+ if (!PageAnon(page) && !PageHuge(page))
+ return ret;
page = compound_head(page);
ret += atomic_read(compound_mapcount_ptr(page)) + 1;
if (PageDoubleMap(page))
--
2.7.0
Let's add a FileHugeMapped field to meminfo and smaps. It indicates how
many file THP pages are currently mapped to userspace.
NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.
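Once applied, the new counter can be read from userspace like any other
meminfo field; a minimal sketch (only the field name comes from this patch,
the rest is generic):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f) {
			perror("/proc/meminfo");
			return 1;
		}
		/* print the FileHugeMapped line added by this patch */
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "FileHugeMapped:", 15))
				fputs(line, stdout);
		fclose(f);
		return 0;
	}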
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
v5:
- add FileHugeMapped to /proc/PID/smaps;
- make FileHugeMapped in meminfo aligned with other fields;
---
drivers/base/node.c | 10 ++++++----
fs/proc/meminfo.c | 5 +++--
fs/proc/task_mmu.c | 8 +++++++-
include/linux/mmzone.h | 3 ++-
mm/huge_memory.c | 2 +-
mm/rmap.c | 12 ++++++------
mm/vmstat.c | 1 +
7 files changed, 26 insertions(+), 15 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..9cc4e9dad47e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -113,6 +113,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d FileHugeMapped: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -131,10 +132,11 @@ static ssize_t node_read_meminfo(struct device *dev,
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
- , nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR));
+ nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+ nid, K(node_page_state(nid, NR_ANON_THPS) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_FILE_THP_MAPPED) *
+ HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
#endif
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..359e713d5fca 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -105,6 +105,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "FileHugeMapped: %8lu kB\n"
#endif
#ifdef CONFIG_CMA
"CmaTotal: %8lu kB\n"
@@ -162,8 +163,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- , K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR)
+ , K(global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_FILE_THP_MAPPED) * HPAGE_PMD_NR)
#endif
#ifdef CONFIG_CMA
, K(totalcma_pages)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e6ee7329cfab..1e71fee77c1d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -448,6 +448,7 @@ struct mem_size_stats {
unsigned long referenced;
unsigned long anonymous;
unsigned long anonymous_thp;
+ unsigned long file_thp;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -576,7 +577,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
if (IS_ERR_OR_NULL(page))
return;
- mss->anonymous_thp += HPAGE_PMD_SIZE;
+ if (PageAnon(page))
+ mss->anonymous_thp += HPAGE_PMD_SIZE;
+ else
+ mss->file_thp += HPAGE_PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
}
#else
@@ -757,6 +761,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
"Referenced: %8lu kB\n"
"Anonymous: %8lu kB\n"
"AnonHugePages: %8lu kB\n"
+ "FileHugeMapped: %8lu kB\n"
"Shared_Hugetlb: %8lu kB\n"
"Private_Hugetlb: %7lu kB\n"
"Swap: %8lu kB\n"
@@ -774,6 +779,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
mss.referenced >> 10,
mss.anonymous >> 10,
mss.anonymous_thp >> 10,
+ mss.file_thp >> 10,
mss.shared_hugetlb >> 10,
mss.private_hugetlb >> 10,
mss.swap >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c60df9257cc7..85fd4aac53a1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,7 +158,8 @@ enum zone_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
- NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_ANON_THPS,
+ NR_FILE_THP_MAPPED,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2e9e6f4afe40..44468fb7cdbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2982,7 +2982,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/* No need in mapcount reference anymore */
for (i = 0; i < HPAGE_PMD_NR; i++)
diff --git a/mm/rmap.c b/mm/rmap.c
index 66808955a5e0..359ec5cff9b0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1227,10 +1227,8 @@ void do_page_add_anon_rmap(struct page *page,
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (compound) {
- __inc_zone_page_state(page,
- NR_ANON_TRANSPARENT_HUGEPAGES);
- }
+ if (compound)
+ __inc_zone_page_state(page, NR_ANON_THPS);
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
}
if (unlikely(PageKsm(page)))
@@ -1268,7 +1266,7 @@ void page_add_new_anon_rmap(struct page *page,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
/* increment count (starts at -1) */
atomic_set(compound_mapcount_ptr(page), 0);
- __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __inc_zone_page_state(page, NR_ANON_THPS);
} else {
/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1298,6 +1296,7 @@ void page_add_file_rmap(struct page *page, bool compound)
}
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
+ __inc_zone_page_state(page, NR_FILE_THP_MAPPED);
} else {
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
@@ -1330,6 +1329,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
goto out;
+ __dec_zone_page_state(page, NR_FILE_THP_MAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))
goto out;
@@ -1363,7 +1363,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return;
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5e4300482897..ca9268e4c69b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -762,6 +762,7 @@ const char * const vmstat_text[] = {
"workingset_activate",
"workingset_nodereclaim",
"nr_anon_transparent_hugepages",
+ "nr_file_transparent_hugepages_mapped",
"nr_free_cma",
/* enum writeback_stat_item counters */
--
2.7.0
Add info about tmpfs/shmem with huge pages.
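As a usage sketch for the "huge=" option documented below, the same option
string can be passed via mount(2); the mount point and size here are
illustrative (the directory must already exist and CAP_SYS_ADMIN is
required):

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* same as: mount -t tmpfs -o huge=always,size=1G tmpfs /mnt/huge */
		if (mount("tmpfs", "/mnt/huge", "tmpfs", 0,
			  "huge=always,size=1G") < 0) {
			perror("mount");
			return 1;
		}
		return 0;
	}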
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 130 +++++++++++++++++++++++++++++------------
1 file changed, 93 insertions(+), 37 deletions(-)
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index d9cb65cf5cfd..96a49f123cac 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -9,8 +9,8 @@ using huge pages for the backing of virtual memory with huge pages
that supports the automatic promotion and demotion of page sizes and
without the shortcomings of hugetlbfs.
-Currently it only works for anonymous memory mappings but in the
-future it can expand over the pagecache layer starting with tmpfs.
+Currently it only works for anonymous memory mappings and tmpfs/shmem.
+But in the future it can expand to other filesystems.
The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
@@ -48,7 +48,7 @@ miss is going to run faster.
- if some task quits and more hugepages become available (either
immediately in the buddy or through the VM), guest physical memory
backed by regular pages should be relocated on hugepages
- automatically (with khugepaged)
+ automatically (with khugepaged, limited to anonymous huge pages for now)
- it doesn't require memory reservation and in turn it uses hugepages
whenever possible (the only possible reservation here is kernelcore=
@@ -57,10 +57,6 @@ miss is going to run faster.
feature that applies to all dynamic high order allocations in the
kernel)
-- this initial support only offers the feature in the anonymous memory
- regions but it'd be ideal to move it to tmpfs and the pagecache
- later
-
Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable
@@ -94,21 +90,21 @@ madvise(MADV_HUGEPAGE) on their critical mmapped regions.
== sysfs ==
-Transparent Hugepage Support can be entirely disabled (mostly for
-debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
-avoid the risk of consuming more memory resources) or enabled system
-wide. This can be achieved with one of:
+Transparent Hugepage Support for anonymous memory can be entirely disabled
+(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
+regions (to avoid the risk of consuming more memory resources) or enabled
+system wide. This can be achieved with one of:
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled
It's also possible to limit defrag efforts in the VM to generate
-hugepages in case they're not immediately free to madvise regions or
-to never try to defrag memory and simply fallback to regular pages
-unless hugepages are immediately available. Clearly if we spend CPU
-time to defrag memory, we would expect to gain even more by the fact
-we use hugepages later instead of regular pages. This isn't always
+anonymous hugepages in case they're not immediately free to madvise
+regions or to never try to defrag memory and simply fallback to regular
+pages unless hugepages are immediately available. Clearly if we spend CPU
+time to defrag memory, we would expect to gain even more by the fact we
+use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.
@@ -133,9 +129,9 @@ that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
"never" should be self-explanatory.
-By default kernel tries to use huge zero page on read page fault.
-It's possible to disable huge zero page by writing 0 or enable it
-back by writing 1:
+By default kernel tries to use huge zero page on read page fault to
+anonymous mapping. It's possible to disable huge zero page by writing 0
+or enable it back by writing 1:
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
@@ -204,21 +200,67 @@ Support by passing the parameter "transparent_hugepage=always" or
"transparent_hugepage=madvise" or "transparent_hugepage=never"
(without "") to the kernel command line.
+== Hugepages in tmpfs/shmem ==
+
+You can control hugepage allocation policy in tmpfs with mount option
+"huge=". It can have following values:
+
+ - "always":
+ Attempt to allocate huge pages every time we need a new page;
+
+ - "never":
+ Do not allocate huge pages;
+
+ - "within_size":
+ Only allocate huge page if it will be fully within i_size.
+ Also respect fadvise()/madvise() hints;
+
+ - "advise":
+ Only allocate huge pages if requested with fadvise()/madvise();
+
+The default policy is "never".
+
+"mount -o remount,huge= /mountpoint" works fine after mount: remounting
+huge=never will not attempt to break up huge pages at all, just stop more
+from being allocated.
+
+There's also a sysfs knob to control hugepage allocation policy for the internal
+shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
+is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
+MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+In addition to policies listed above, shmem_enabled allows two further
+values:
+
+ - "deny":
+ For use in emergencies, to force the huge option off from
+ all mounts;
+ - "force":
+ Force the huge option on for all - very useful for testing;
+
== Need of application restart ==
-The transparent_hugepage/enabled values only affect future
-behavior. So to make them effective you need to restart any
-application that could have been using hugepages. This also applies to
-the regions registered in khugepaged.
+The transparent_hugepage/enabled values and tmpfs mount option only affect
+future behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to the
+regions registered in khugepaged.
== Monitoring usage ==
-The number of transparent huge pages currently used by the system is
-available by reading the AnonHugePages field in /proc/meminfo. To
-identify what applications are using transparent huge pages, it is
-necessary to read /proc/PID/smaps and count the AnonHugePages fields
-for each mapping. Note that reading the smaps file is expensive and
-reading it frequently will incur overhead.
+The number of anonymous transparent huge pages currently used by the
+system is available by reading the AnonHugePages field in /proc/meminfo.
+To identify what applications are using anonymous transparent huge pages,
+it is necessary to read /proc/PID/smaps and count the AnonHugePages fields
+for each mapping.
+
+The number of file transparent huge pages mapped to userspace is available
+by reading the FileHugeMapped field in /proc/meminfo. To identify what
+applications are mapping file transparent huge pages, it is necessary
+to read /proc/PID/smaps and count the FileHugeMapped fields for each
+mapping.
+
+Note that reading the smaps file is expensive and reading it
+frequently will incur overhead.
There are a number of counters in /proc/vmstat that may be used to
monitor how successfully the system is providing huge pages for use.
@@ -238,6 +280,12 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.
+thp_file_alloc is incremented every time a file huge page is successfully
 allocated.
+
+thp_file_mapped is incremented every time a file huge page is mapped into
+ user address space.
+
thp_split_page is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
@@ -403,19 +451,27 @@ pages:
on relevant sub-page of the compound page.
- map/unmap of the whole compound page accounted in compound_mapcount
- (stored in first tail page).
+ (stored in first tail page). For file huge pages, we also increment
+ ->_mapcount of all sub-pages in order to have race-free detection of
+ last unmap of subpages.
-PageDoubleMap() indicates that ->_mapcount in all subpages is offset up by one.
-This additional reference is required to get race-free detection of unmap of
-subpages when we have them mapped with both PMDs and PTEs.
+PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.
+
+For anonymous pages PageDoubleMap() also indicates ->_mapcount in all
+subpages is offset up by one. This additional reference is required to
+get race-free detection of unmap of subpages when we have them mapped with
+both PMDs and PTEs.
This is optimization required to lower overhead of per-subpage mapcount
tracking. The alternative is alter ->_mapcount in all subpages on each
map/unmap of the whole compound page.
-We set PG_double_map when a PMD of the page got split for the first time,
-but still have PMD mapping. The addtional references go away with last
-compound_mapcount.
+For anonymous pages, we set PG_double_map when a PMD of the page got split
+for the first time, but still have PMD mapping. The additional references
+go away with last compound_mapcount.
+
+File pages get PG_double_map set on the first PTE map of the page; it
+goes away when the page gets evicted from the page cache.
split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
@@ -427,7 +483,7 @@ sum of mapcount of all sub-pages plus one (split_huge_page caller must
have reference for head page).
split_huge_page uses migration entries to stabilize page->_count and
-page->_mapcount.
+page->_mapcount of anonymous pages. File pages are just unmapped.
We safe against physical memory scanners too: the only legitimate way
scanner can get reference to a page is get_page_unless_zero().
--
2.7.0
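As a usage illustration of the interfaces documented in the patch above
(the tmpfs "huge=" mount option and the shmem_enabled sysfs knob), here is
a minimal userspace sketch. The mountpoint path is arbitrary, error
handling is minimal, and nothing here comes from the patchset itself:

/*
 * Illustrative sketch only: exercising the interfaces documented above
 * from C.  The mountpoint is arbitrary and error handling is minimal.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
	int fd;

	/* Mount a tmpfs that allocates huge pages only within i_size. */
	if (mount("tmpfs", "/mnt/huge-tmpfs", "tmpfs", 0, "huge=within_size"))
		perror("mount");

	/* Switch the internal shmem mount (SysV SHM, memfds, shared
	 * anonymous mmaps) to honour madvise()/fadvise() hints only. */
	fd = open("/sys/kernel/mm/transparent_hugepage/shmem_enabled", O_WRONLY);
	if (fd >= 0) {
		if (write(fd, "advise", strlen("advise")) < 0)
			perror("write");
		close(fd);
	}
	return 0;
}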
"Kirill A. Shutemov" <[email protected]> writes:
> [ text/plain ]
> Naive approach: on mapping/unmapping the page as compound we update
> ->_mapcount on each 4k page. That's not efficient, but it's not obvious
> how we can optimize this. We can look into optimization later.
>
> PG_double_map optimization doesn't work for file pages since lifecycle
> of file pages is different comparing to anon pages: file page can be
> mapped again at any time.
>
Can you explain this more? We added PG_double_map so that we can keep
page_remove_rmap simpler. So if it isn't a compound page we can still do
if (!atomic_add_negative(-1, &page->_mapcount))
I am trying to understand why we can't use that with file pages?
-aneesh
"Kirill A. Shutemov" <[email protected]> writes:
> [ text/plain ]
> split_huge_pmd() for file mappings (and DAX too) is implemented by just
> clearing pmd entry as we can re-fill this area from page cache on pte
> level later.
>
> This means we don't need deposit page tables when file THP is mapped.
> Therefore we shouldn't try to withdraw a page table on zap_huge_pmd()
> file THP PMD.
Archs like ppc64 use the deposited page table to track the hardware page
table slot information. We may want to add hooks which the arch can
use to achieve the same even with file THP.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/huge_memory.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 44468fb7cdbf..c22144e3fe11 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1684,10 +1684,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> struct page *page = pmd_page(orig_pmd);
> page_remove_rmap(page, true);
> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> VM_BUG_ON_PAGE(!PageHead(page), page);
> - pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
> - atomic_long_dec(&tlb->mm->nr_ptes);
> + if (PageAnon(page)) {
> + pgtable_t pgtable;
> + pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> + pte_free(tlb->mm, pgtable);
> + atomic_long_dec(&tlb->mm->nr_ptes);
> + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + } else {
> + add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> + }
> spin_unlock(ptl);
> tlb_remove_page(tlb, page);
> }
> --
> 2.7.0
>
On Fri, Mar 18, 2016 at 03:10:06PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > [ text/plain ]
> > Naive approach: on mapping/unmapping the page as compound we update
> > ->_mapcount on each 4k page. That's not efficient, but it's not obvious
> > how we can optimize this. We can look into optimization later.
> >
> > PG_double_map optimization doesn't work for file pages since lifecycle
> > of file pages is different comparing to anon pages: file page can be
> > mapped again at any time.
> >
>
> Can you explain this more ?. We added PG_double_map so that we can keep
> page_remove_rmap simpler. So if it isn't a compound page we still can do
>
> if (!atomic_add_negative(-1, &page->_mapcount))
>
> I am trying to understand why we can't use that with file pages ?
The first thing: for non-compound pages we still have simple
atomic_inc_and_test() / atomic_add_negative(-1), nothing changed here.
About compound pages:
For anon-THP, PG_double_map allows us not to touch _mapcount in the
subpages until a PMD which maps the page is split. This way we
significantly lower refcounting overhead as long as the page is mapped
PMD-only, since we only need to increment compound_mapcount().
The optimization is possible due to the relatively simple lifecycle of an
anonymous THP page:
- anon-THPs are always mapped with a PMD first;
- a new mapping of the THP can only be created via fork();
- the page can only get mapped with PTEs via split_huge_pmd();
For file-THP the situation is different. Once we have allocated a huge
page and put it on the radix tree, the page can be mapped with PTEs or
PMDs at any time. That makes the same optimization inapplicable there.
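To make the contrast concrete, here is a much-simplified sketch of the two
bookkeeping schemes described above. This is not the real
page_add_*_rmap() code, just the shape of it; treat it as an illustration
under that assumption:

#include <linux/huge_mm.h>
#include <linux/mm.h>

/* Sketch only: naive file-THP accounting on a PMD map. */
static void sketch_map_file_thp_pmd(struct page *head)
{
	int i;

	/* Bump the compound mapcount for the PMD mapping ... */
	atomic_inc(compound_mapcount_ptr(head));
	/* ... and, naively, every subpage's _mapcount as well, so the last
	 * unmap of any subpage can be detected without PG_double_map. */
	for (i = 0; i < HPAGE_PMD_NR; i++)
		atomic_inc(&head[i]._mapcount);
}

/* Sketch only: anon-THP accounting on a PMD map. */
static void sketch_map_anon_thp_pmd(struct page *head)
{
	/* Only compound_mapcount is touched; subpage _mapcounts stay
	 * untouched until the first split_huge_pmd() sets PG_double_map. */
	atomic_inc(compound_mapcount_ptr(head));
}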
I think there *can* be some room for optimization, but I don't want to
invest more time here until it's identified as a bottleneck. It would
lead to more complex code on the rmap side.
--
Kirill A. Shutemov
On Fri, Mar 18, 2016 at 07:23:41PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > [ text/plain ]
> > split_huge_pmd() for file mappings (and DAX too) is implemented by just
> > clearing pmd entry as we can re-fill this area from page cache on pte
> > level later.
> >
> > This means we don't need deposit page tables when file THP is mapped.
> > Therefore we shouldn't try to withdraw a page table on zap_huge_pmd()
> > file THP PMD.
>
> Archs like ppc64 use deposited page table to track the hardware page
> table slot information. We probably may want to add hooks which arch can
> use to achieve the same even with file THP
Could you describe in more detail what kind of information you're talking about?
--
Kirill A. Shutemov
"Kirill A. Shutemov" <[email protected]> writes:
> [ text/plain ]
> On Fri, Mar 18, 2016 at 07:23:41PM +0530, Aneesh Kumar K.V wrote:
>> "Kirill A. Shutemov" <[email protected]> writes:
>>
>> > [ text/plain ]
>> > split_huge_pmd() for file mappings (and DAX too) is implemented by just
>> > clearing pmd entry as we can re-fill this area from page cache on pte
>> > level later.
>> >
>> > This means we don't need deposit page tables when file THP is mapped.
>> > Therefore we shouldn't try to withdraw a page table on zap_huge_pmd()
>> > file THP PMD.
>>
>> Archs like ppc64 use deposited page table to track the hardware page
>> table slot information. We probably may want to add hooks which arch can
>> use to achieve the same even with file THP
>
> Could you describe more on what kind of information you're talking about?
>
Hardware page table in ppc64 requires us to map each subpage of the huge
page. This is needed because at low level we use segment base page size
to find the hash slot and on TLB miss, we use the faulting address and
base page size (which is 64k even with THP) to find whether we have
the page mapped in hash page table. Since we use base page size of 64K,
we need to make sure that subpages are mapped (on demand) in hash page
table. If we have them mapped, we also need to track their hash table
slot information so that we can clear it on invalidate of hugepage.
With THP we used the deposited page table to store the hash slot
information.
-aneesh
On Mon, Mar 21, 2016 at 10:03:29AM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > [ text/plain ]
> > On Fri, Mar 18, 2016 at 07:23:41PM +0530, Aneesh Kumar K.V wrote:
> >> "Kirill A. Shutemov" <[email protected]> writes:
> >>
> >> > [ text/plain ]
> >> > split_huge_pmd() for file mappings (and DAX too) is implemented by just
> >> > clearing pmd entry as we can re-fill this area from page cache on pte
> >> > level later.
> >> >
> >> > This means we don't need deposit page tables when file THP is mapped.
> >> > Therefore we shouldn't try to withdraw a page table on zap_huge_pmd()
> >> > file THP PMD.
> >>
> >> Archs like ppc64 use deposited page table to track the hardware page
> >> table slot information. We probably may want to add hooks which arch can
> >> use to achieve the same even with file THP
> >
> > Could you describe more on what kind of information you're talking about?
> >
>
> Hardware page table in ppc64 requires us to map each subpage of the huge
> page. This is needed because at low level we use segment base page size
> to find the hash slot and on TLB miss, we use the faulting address and
> base page size (which is 64k even with THP) to find whether we have
> the page mapped in hash page table. Since we use base page size of 64K,
> we need to make sure that subpages are mapped (on demand) in hash page
> table. If we have then mapped we also need to track their hash table
> slot information so that we can clear it on invalidate of hugepage.
>
> With THP we used the deposited page table to store the hash slot
> information.
Could you prepare a patch with these hooks so we can discuss the details?
I still have a hard time wrapping my head around this.
I think you have the same problem with DAX huge pages. Or you don't care
about DAX?
--
Kirill A. Shutemov
"Kirill A. Shutemov" <[email protected]> writes:
> [ text/plain ]
> On Mon, Mar 21, 2016 at 10:03:29AM +0530, Aneesh Kumar K.V wrote:
>> "Kirill A. Shutemov" <[email protected]> writes:
>>
>> > [ text/plain ]
>> > On Fri, Mar 18, 2016 at 07:23:41PM +0530, Aneesh Kumar K.V wrote:
>> >> "Kirill A. Shutemov" <[email protected]> writes:
>> >>
>> >> > [ text/plain ]
>> >> > split_huge_pmd() for file mappings (and DAX too) is implemented by just
>> >> > clearing pmd entry as we can re-fill this area from page cache on pte
>> >> > level later.
>> >> >
>> >> > This means we don't need deposit page tables when file THP is mapped.
>> >> > Therefore we shouldn't try to withdraw a page table on zap_huge_pmd()
>> >> > file THP PMD.
>> >>
>> >> Archs like ppc64 use deposited page table to track the hardware page
>> >> table slot information. We probably may want to add hooks which arch can
>> >> use to achieve the same even with file THP
>> >
>> > Could you describe more on what kind of information you're talking about?
>> >
>>
>> Hardware page table in ppc64 requires us to map each subpage of the huge
>> page. This is needed because at low level we use segment base page size
>> to find the hash slot and on TLB miss, we use the faulting address and
>> base page size (which is 64k even with THP) to find whether we have
>> the page mapped in hash page table. Since we use base page size of 64K,
>> we need to make sure that subpages are mapped (on demand) in hash page
>> table. If we have then mapped we also need to track their hash table
>> slot information so that we can clear it on invalidate of hugepage.
>>
>> With THP we used the deposited page table to store the hash slot
>> information.
>
> Could you prepare the patch with these hooks so we can discuss it details?
> I still have hard time wrap my had around this.
ok
>
> I think you have the same problem with DAX huge pages. Or you don't care
> about DAX?
>
Yes, DAX hugepages will not be working on ppc64 yet. It is on the TODO
list to get them working :).
-aneesh
On Sat, 12 Mar 2016, Kirill A. Shutemov wrote:
> Here is an updated version of huge pages support implementation in
> tmpfs/shmem.
>
> All known issues has been fixed. I'll continue with validation.
> I will also send follow up patch with documentation update.
>
> Hugh, I would be glad to hear your opinion on this patchset.
At long last I've managed to spend some time getting into it.
My opinion is simple: generally I liked your patches,
but I'm disappointed by where they leave tmpfs functionally,
so wouldn't want the patchset to go in as is.
As soon as I ran it up and tried to copy a tree in there for testing,
my (huge) tmpfs filled up with all these 2MB files: which is just the
same as when you started out two or three years ago on ramfs.
As I've said on several occasions, I am not interested in emulating
the limitations of hugetlbfs inside tmpfs: there might one day be a
case for such a project, but it's transparent hugepages that we want to
have now - blocksize 4kB, but in 2MB extents when possible (on x86_64).
I accept that supporting small files correctly is not the #1
requirement for a "huge" tmpfs; and if it were impossible or
unreasonable to ask for, I'd give up on it. But since we (Google)
have been using a successful implementation for a year now, I see
no reason to give up on it at all: it allows for much wider and
easier adoption (and testing) of huge tmpfs - without "but"s.
The small files thing formed my first impression. My second
impression was similar, when I tried mmap(NULL, size_of_RAM,
PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
cycled around the arena touching all the pages (which of
course has to push a little into swap): that soon OOMed.
But there I think you probably just have some minor bug to be fixed:
I spent a little while trying to debug it, but then decided I'd
better get back to writing to you. I didn't really understand what
I was seeing, but when I hacked some stats into shrink_page_list(),
converting !is_page_cache_freeable(page) to page_cache_references(page)
to return the difference instead of the bool, a large proportion of
huge tmpfs pages seemed to have count 1 too high to be freeable at
that point (and one huge tmpfs page had a count of 3477).
Hmm, I just went to repeat that test, to see if MemFree goes down and
down on each run, but got a VM_BUG_ON_PAGE(page_ref_count(page) == 0)
in put_page_testzero() in find_get_entries() during shmem_evict_inode().
So, something not quite right with the refcounting when under pressure.
My third impression was much better, when I just did a straight
cp /dev/zero /tmpfs_twice_size_of_RAM: that went smoothly,
pushed into swap just as it should, nice.
>
> The main difference with Hugh's approach[1] is that I continue with
> compound pages, where Hugh invents new way couple pages: team pages.
> I believe THP refcounting rework made team pages unnecessary: compound
> page are flexible enough to serve needs of page cache.
I have disagreed with you about the suitability of compound pages;
but at some point on Sunday night, after reading your patchset,
found myself looking at it differently, and now agreeing with you.
They are not flexible enough yet, but I believe you have done a large
part of the work (by diverting all those tail ops to the head), and it
just needs a little more. I say "a little", but it might not be easy.
What I realize now, is that you should be able to do almost everything
I did with team pages, instead with your compound pages. Just abandon
the idea that the whole compound page has to be initialized in one go:
initialize (charge to memcg, add to page cache, fill, SetPageUptodate)
each subpage as it is touched. (But of course, the whole extent has
to be initialized before it can be pmd-mapped.)
And the convenient thing about a compound page is that you have plenty
of space to hold your metadata: where I have been squeezing everything
into page->private (which makes it difficult then to extend the same
design to miscellaneous filesystems), you're free to use (certain)
fields of 511 more struct pages.
I said above that you should be able to do almost everything: I think
that "almost" involves dropping a few things that I have, that we can
well do without. With team pages, it is possible for each member to
be charged to a different memcg; but I don't have anyone calling out
for that behaviour, it's more an awkward possibility that we have to
be on guard against in a few places, and I'd be quite happy just to
stop it (charging instead to whichever memcg first touched the extent).
And it's been nice to entertain the idea of the individual team tails,
travelling separately up and down their own lrus, subject to their own
pagesize access patterns even while the whole is pmd-mapped. But in
practice, we continue to throw them on the unevictable list once they
get in the way, with no attempt to bring them back when accessed
individually; so I think it's a nice idea that we can live without.
And a compound page implementation would avoid the overhead I have,
of putting all those tails on lru in the first place.
Another thing I did, that I think you can safely fail to emulate:
I was careful to allow sparse population, and not insist that the
head page be instantiated. That might turn out to be more of a
problem for compound pages, and has been an awkward edge case for me.
Although in principle there are sparse files which would be doubled
in charge by insisting that the head be instantiated, I doubt they
will hurt anyone in practice: perhaps best treated as a bug to be
revisited if it actually comes up as an issue later.
(Thinking about that last paragraph later, I'm not so sure.
Forcing the instantiation of the head is okay when first allocating,
but when trying to recover a fragmented extent into a hugepage later,
it might prove to be seriously limiting to require that the head be
migrated into place before anything else is done.)
>
> Many ideas and some patches were stolen from Hugh's patchset. Having this
> patchset around was very helpful.
People are probably wondering why I didn't repost it, and whether
it's vapourware. No, it's worked out very well, better than we were
expecting; and several of the TODOs that I felt needed attention,
have not yet caused anyone trouble - though wider adoption is sure
to meet new usecases and new bugs. I didn't repost it because, as I
said at the time, I felt it wasn't ready for real use until recovery
of huge pages after fragmentation and swapping (what khugepaged does
for anon THP) was added. After that went in (and then quite a period
of fixing the bugs, I admit!), other priorities intervened until now.
Your patchset also lacks recovery at present, so equally that's a
reason why I don't consider yours ready for merging.
I'll repost mine, including recovery and a few minor additions, against
v4.6-rc2. I still haven't got around to doing any NUMA work on it, but
it does appear to be useful to many people even without that. I'd have
preferred to spend longer rebasing it, it may turn out to be a little
raw: most of the rebases from one release to another have been easy,
but to v4.5 was messier - not particularly due to your THP refcounting
work going in, just various different conflicts that I feel I've not
properly digested yet. But v4.6-rc2 looks about as early and as late
as I can do it, with LSF/MM coming up later in April; and should be a
good base for people to try it out on. (I'll work on preparing it in
advance, then I guess a last-minute edit of PAGE_CACHEs before posting.)
I now believe that the right way forward is to get that team page
implementation in as soon as we can, so users have the functionality;
and then (if you're still interested) you work on converting it over
from teams to compound pages, both as an optimization, and to satisfy
your understandable preference for a single way of managing huge pages.
(I would say "How does that sound to you?" but I'd prefer you to think
it over, maybe wait to see what the new team patchset looks like. Though
aside from the addition of recovery, it is very much like the one against
v3.19 from February last year: no changes to the design at all, which
turned out to fit recovery well; but quite a few bug fixes folded back
in. Or you may prefer to wait to discuss the way forward at LSF/MM.)
>
> I will continue with code validation. I would expect mlock require some
> more attention.
>
> Please, review and test the code.
>
> Git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v4
>
> Rebased onto linux-next:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v4-next
It wasn't obvious to me what tree your posted patchset was based on,
so I made a patch out of the top your hugetmpfs/v4-next branch, and
applied that on top of Andrew's 4.5-rc7-mm1: which appeared to be
fine, just a trivial clash with mmotm's
"do_shared_fault(): check that mmap_sem is held".
[...]
Some random thoughts on your patches FWIW:
01/25 mm: do not pass mm_struct into handle_mm_fault
02/25 mm: introduce fault_env
These two inflate your patchset considerably, with seemingly unrelated
changes: the first with arch files otherwise untouched, the second with
lots of change to memory.c that makes it harder to compare our patchsets.
I think it would be better to leave the handle_mm_fault(mm) removal to
another patchset, and I'm ambivalent about the struct fault_env changes:
I prefer to avoid such argument blocks (and think I saw Linus once say
he detested my zap_details), which do make it harder to see what the
real arguments are. I wonder what your 03/25 would look like without
the fault_env, perhaps I'd agree it's too cumbersome that way (and
perhaps I should look back at earlier versions of your patchset).
But I do understand your desire to get some cleanups out of the way
before basing the real work on the new structure: I'm going to have
to reconsider the initial patches of my series in the light of this
criticism of yours, not sure yet if I'll try to cut them out or not.
03/25 mm: postpone page table allocation until we have page to map
I've not stopped to consider whether your prealloc_pte work (which
perhaps was the trigger for fault_env) meets the case or not, but
want to mention that we recently discovered a drawback to my delaying
the pagetable allocation: I'm allocating the pagetable while holding
the lock on the page from the filesystem (any filesystem, though the
change was to suit tmpfs), which presented a deadlock to the OOM killer.
The base kernel involved was not a recent one, and there have been
rewrites in the OOMing area since, so I'm not sure whether it would
be a deadlock today: but I am sure that it would be better to do the
allocation while not holding the page lock. We have a fix in place,
but I don't think it's quite the right fix, so I want to revisit it:
unless I get a gift of time from somewhere, I'll post my series with
the ordering here just as I had it before.
You remark on yours: "This way we can postpone allocation even in
faultaround case without moving do_fault_around() after __do_fault()."
Was there something wrong or inefficient with my moving do_fault_around()
after __do_fault()? I thought it was fine; but I did also think when
I made the change, that it would be fixing a bug in VM_FAULT_MAJOR
accounting too; yet when I went to confirm that, couldn't find evidence
for such a bug before; but I moved on before getting to the bottom of it.
04/25 rmap: support file thp
05/25 mm: introduce do_set_pmd()
06/25 mm, rmap: account file thp pages
A firm NAK on this one, for a simple reason: we have enough times
regretted mixing up the swap-backed Shmem into File stats; so although
you intend to support huge file pages by the same mechanism later,
the ones you support in this patchset are Shmem: ShmemPmdMapped is
the name I was using.
07/25 thp, vmstats: add counters for huge file pages
08/25 thp: support file pages in zap_huge_pmd()
I see there's an ongoing discussion with Aneesh about whether you
need to deposit/withdraw pagetables. I blindly assumed that I would
have to participate in that mechanism, so my series should be safe
in that regard, though not as nice as yours. (But I did get the
sequence needed for PowerPC wrong somewhere in my original posting.)
09/25 thp: handle file pages in split_huge_pmd()
That's neat, just unmapping the pmd for the shared file/shmem case,
I didn't think to do it that way. Or did I have some other reason
for converting the pmd to ptes? I think it is incorrect not to
provide ptes when splitting when it's an mlock()ed area; but
you've said yourself that mlock probably needs more work.
10/25 thp: handle file COW faults
We made the same decision (and Matthew for DAX too IIRC): don't
bother to COW hugely, it's more code, and often not beneficial.
11/25 thp: handle file pages in mremap()
Right, I missed that from my v3.19 rebase, but added it later
when I woke up to your v3.15 mod: I prefer my version of this patch,
which adds take_rmap_locks(vma) and drop_rmap_locks(vma) helpers.
12/25 thp: skip file huge pmd on copy_huge_pmd()
Oh, that looks good, I'd forgotten about that optimization.
And I think in this case you don't have to worry about VM_LOCKED
or deposit/withdraw, it's just fine to skip over it. I think I'll
post my one as it is, and let you scoff at it then.
13/25 thp: prepare change_huge_pmd() for file thp
You're very dedicated to inserting the vma_is_anonymous(),
where I just delete the BUG_ON. Good for you.
14/25 thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
Interesting. I haven't spent long enough looking at this, just noted
it as something to come back to: I haven't decided whether this is a
good safe obvious change (always better to do something outside the
lock if it's safe), or not. I don't have that issue of munlock()ing
when splitting, so hadn't noticed the peculiar placing of
vma_adjust_trans_huge() inside the one lock but outside the other.
15/25 thp: file pages support for split_huge_page()
I didn't go much beyond your very helpful commit comment on this one.
I'd been worried by how you would freeze for the split, but it sounds
right. Very interested to see your comment about dropping pages beyond
i_size from the radix-tree: I think that's absolutely right, but never
occurred to me to do - not something that tests or users notice, but
needed for correctness.
16/25 thp, mlock: do not mlock PTE-mapped file huge pages
I think we both wish there were no mlock(), it gave me more trouble
than anything. And it wouldn't be a surprise if what I have and
think works, actually turns out to be very broken. So I'm going to
be careful about criticising your workarounds, they might be better.
17/25 vmscan: split file huge pages before paging them out
As mentioned at the beginning, seemed to work well for the straight
cp, but not when the huge pages were mapped.
18/25 page-flags: relax policy for PG_mappedtodisk and PG_reclaim
19/25 radix-tree: implement radix_tree_maybe_preload_order()
20/25 filemap: prepare find and delete operations for huge pages
21/25 truncate: handle file thp
22/25 shmem: prepare huge= mount option and sysfs knob
23/25 shmem: get_unmapped_area align huge page
24/25 shmem: add huge pages support
25/25 shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
Not much to say on these at the moment, but clearly we'll need to
settle the user interfaces of 22/25 and 25/25 before anything reaches
Linus. For simplicity, I've carried on with my distinct mount options
and knobs, but agree that some rapprochement with anon THP conventions
will probably be required before finalizing.
That's all for now - thanks,
Hugh
On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> The small files thing formed my first impression. My second
> impression was similar, when I tried mmap(NULL, size_of_RAM,
> PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> cycled around the arena touching all the pages (which of
> course has to push a little into swap): that soon OOMed.
>
> But there I think you probably just have some minor bug to be fixed:
> I spent a little while trying to debug it, but then decided I'd
> better get back to writing to you. I didn't really understand what
> I was seeing, but when I hacked some stats into shrink_page_list(),
> converting !is_page_cache_freeable(page) to page_cache_references(page)
> to return the difference instead of the bool, a large proportion of
> huge tmpfs pages seemed to have count 1 too high to be freeable at
> that point (and one huge tmpfs page had a count of 3477).
I'll reply to your other points later, but first I wanted to address this
obvious bug.
I cannot really explain page_count() == 3477, but otherwise:
The root cause is that try_to_unmap() doesn't handle PMD-mapped huge
pages, so we hit 'case SWAP_AGAIN' all the time.
The patch below effectively rewrites 17/25: now we split the huge page
before trying to unmap it.
split_huge_page() has its own check similar to is_page_cache_freeable(),
so we wouldn't split pages we cannot free later on.
And split_huge_page() for file pages would unmap the page, so we wouldn't
need to go to try_to_unmap() after that.
The patch looks rather simple, but I haven't done a full validation cycle
for it. Regressions are unlikely, but possible.
At some point we would need to teach try_to_unmap() to handle huge pages.
It would be required for filesystems with backing storage. But I don't see
a need for it to get huge tmpfs/shmem working.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9fa9e15594e9..86008f8f1f9b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -473,14 +473,12 @@ void drop_slab(void)
static inline int is_page_cache_freeable(struct page *page)
{
- int radix_tree_pins = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
-
/*
* A freeable page cache page is referenced only by the caller
* that isolated the page, the page cache radix tree and
* optional buffer heads at page->private.
*/
- return page_count(page) - page_has_private(page) == 1 + radix_tree_pins;
+ return page_count(page) - page_has_private(page) == 2;
}
static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
@@ -550,6 +548,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* swap_backing_dev_info is bust: it doesn't reflect the
* congestion state of the swapdevs. Easy to fix, if needed.
*/
+ if (!is_page_cache_freeable(page))
+ return PAGE_KEEP;
if (!mapping) {
/*
* Some data journaling orphaned pages can have
@@ -1055,8 +1055,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
/* Adding to swap updated mapping */
mapping = page_mapping(page);
+ } else if (unlikely(PageTransHuge(page))) {
+ /* Split file THP */
+ if (split_huge_page_to_list(page, page_list))
+ goto keep_locked;
}
+ VM_BUG_ON_PAGE(PageTransHuge(page), page);
+
/*
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
@@ -1112,15 +1118,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* starts and then write it out here.
*/
try_to_unmap_flush_dirty();
-
- if (!is_page_cache_freeable(page))
- goto keep_locked;
-
- if (unlikely(PageTransHuge(page))) {
- if (split_huge_page_to_list(page, page_list))
- goto keep_locked;
- }
-
switch (pageout(page, mapping, sc)) {
case PAGE_KEEP:
goto keep_locked;
--
Kirill A. Shutemov
On Thu, 24 Mar 2016, Kirill A. Shutemov wrote:
> On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> > The small files thing formed my first impression. My second
> > impression was similar, when I tried mmap(NULL, size_of_RAM,
> > PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> > cycled around the arena touching all the pages (which of
> > course has to push a little into swap): that soon OOMed.
> >
> > But there I think you probably just have some minor bug to be fixed:
> > I spent a little while trying to debug it, but then decided I'd
> > better get back to writing to you. I didn't really understand what
> > I was seeing, but when I hacked some stats into shrink_page_list(),
> > converting !is_page_cache_freeable(page) to page_cache_references(page)
> > to return the difference instead of the bool, a large proportion of
> > huge tmpfs pages seemed to have count 1 too high to be freeable at
> > that point (and one huge tmpfs page had a count of 3477).
>
> I'll reply to your other points later, but first I wanted to address this
> obvious bug.
Thanks. That works better, but is not yet right: memory isn't freed
as it should be, so when I exit then try to run a second time, the
mmap() just gets ENOMEM (with /proc/sys/vm/overcommit_memory 0):
MemFree is low. No rush to fix, I've other stuff to do.
I don't get as far as that on the laptop, since the first run is OOM
killed while swapping; but I can't vouch for the OOM-kill-correctness
of the base tree I'm using, and this laptop has a history of OOMing
rather too easily if all's not right.
Hugh
On Thu, Mar 24, 2016 at 12:08:55PM -0700, Hugh Dickins wrote:
> On Thu, 24 Mar 2016, Kirill A. Shutemov wrote:
> > On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> > > The small files thing formed my first impression. My second
> > > impression was similar, when I tried mmap(NULL, size_of_RAM,
> > > PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> > > cycled around the arena touching all the pages (which of
> > > course has to push a little into swap): that soon OOMed.
> > >
> > > But there I think you probably just have some minor bug to be fixed:
> > > I spent a little while trying to debug it, but then decided I'd
> > > better get back to writing to you. I didn't really understand what
> > > I was seeing, but when I hacked some stats into shrink_page_list(),
> > > converting !is_page_cache_freeable(page) to page_cache_references(page)
> > > to return the difference instead of the bool, a large proportion of
> > > huge tmpfs pages seemed to have count 1 too high to be freeable at
> > > that point (and one huge tmpfs page had a count of 3477).
> >
> > I'll reply to your other points later, but first I wanted to address this
> > obvious bug.
>
> Thanks. That works better, but is not yet right: memory isn't freed
> as it should be, so when I exit then try to run a second time, the
> mmap() just gets ENOMEM (with /proc/sys/vm/overcommit_memory 0):
> MemFree is low. No rush to fix, I've other stuff to do.
>
> I don't get as far as that on the laptop, since the first run is OOM
> killed while swapping; but I can't vouch for the OOM-kill-correctness
> of the base tree I'm using, and this laptop has a history of OOMing
> rather too easily if all's not right.
Hm. I don't see the issue.
I tried to reproduce it in my VM with the following script:
#!/bin/sh -efu
swapon -a
ram="$(grep MemTotal /proc/meminfo | sed 's,[^0-9\]\+,,; s, kB,k,')"
usemem -w -f /dev/zero "$ram"
swapoff -a
swapon -a
usemem -w -f /dev/zero "$ram"
cat /proc/meminfo
grep thp /proc/vmstat
-----
usemem is a tool from this archive:
http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar
It works fine even if I double the size of the mapping.
Do you have a reproducer?
--
Kirill A. Shutemov
On Fri, 25 Mar 2016, Kirill A. Shutemov wrote:
> On Thu, Mar 24, 2016 at 12:08:55PM -0700, Hugh Dickins wrote:
> > On Thu, 24 Mar 2016, Kirill A. Shutemov wrote:
> > > On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> > > > The small files thing formed my first impression. My second
> > > > impression was similar, when I tried mmap(NULL, size_of_RAM,
> > > > PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> > > > cycled around the arena touching all the pages (which of
> > > > course has to push a little into swap): that soon OOMed.
> > > >
> > > > But there I think you probably just have some minor bug to be fixed:
> > > > I spent a little while trying to debug it, but then decided I'd
> > > > better get back to writing to you. I didn't really understand what
> > > > I was seeing, but when I hacked some stats into shrink_page_list(),
> > > > converting !is_page_cache_freeable(page) to page_cache_references(page)
> > > > to return the difference instead of the bool, a large proportion of
> > > > huge tmpfs pages seemed to have count 1 too high to be freeable at
> > > > that point (and one huge tmpfs page had a count of 3477).
> > >
> > > I'll reply to your other points later, but first I wanted to address this
> > > obvious bug.
> >
> > Thanks. That works better, but is not yet right: memory isn't freed
> > as it should be, so when I exit then try to run a second time, the
> > mmap() just gets ENOMEM (with /proc/sys/vm/overcommit_memory 0):
> > MemFree is low. No rush to fix, I've other stuff to do.
> >
> > I don't get as far as that on the laptop, since the first run is OOM
> > killed while swapping; but I can't vouch for the OOM-kill-correctness
> > of the base tree I'm using, and this laptop has a history of OOMing
> > rather too easily if all's not right.
>
> Hm. I don't see the issue.
>
> I tried to reproduce it in my VM with following script:
>
> #!/bin/sh -efu
>
> swapon -a
>
> ram="$(grep MemTotal /proc/meminfo | sed 's,[^0-9\]\+,,; s, kB,k,')"
>
> usemem -w -f /dev/zero "$ram"
>
> swapoff -a
> swapon -a
>
> usemem -w -f /dev/zero "$ram"
>
> cat /proc/meminfo
> grep thp /proc/vmstat
>
> -----
>
> usemem is a tool from this archive:
>
> http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar
>
> It works fine even if would double size of mapping.
>
> Do you have a reproducer?
Yes, my reproducer is simpler (just cycling twice around the arena,
touching each page in order); and I too did not see it running your
script using usemem above. It looks as if that invocation isn't doing
enough work with swap: if I add a "-r 2" to those usemem lines, then
I get "usemem: mmap failed: Cannot allocate memory" on the second.
I also added a "sleep 2" before the second call to usemem: I'm not sure
of the current state of vmstat, but historically it's slow to gather
back from each cpu to global, and I think it used to leave some cpu
counts stranded indefinitely once upon a time. In my own testing,
I have a /proc/sys/vm/stat_refresh to touch before checking meminfo
or vmstat - and I think the vm_enough_memory() check in mmap() may
need that same care, since it refers to NR_FREE_PAGES etc.
8GB is my ramsize, if that matters.
Hugh
On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> On Sat, 12 Mar 2016, Kirill A. Shutemov wrote:
...
> As I've said on several occasions, I am not interested in emulating
> the limitations of hugetlbfs inside tmpfs: there might one day be a
> case for such a project, but it's transparent hugepages that we want to
> have now - blocksize 4kB, but in 2MB extents when possible (on x86_64).
>
> I accept that supporting small files correctly is not the #1
> requirement for a "huge" tmpfs; and if it were impossible or
> unreasonable to ask for, I'd give up on it. But since we (Google)
> have been using a successful implementation for a year now, I see
> no reason to give up on it at all: it allows for much wider and
> easier adoption (and testing) of huge tmpfs - without "but"s.
I think, once we get collapse working, huge=within_size would provide
reasonable support for filesystems with mixed-size files.
Yes, we would still have overhead on sparse or punch-holed files.
For now, I don't see it being a show-stopper. The hugepage allocation
policy can be overridden with MADV/FADV_NOHUGEPAGE if an application
doesn't want to see huge pages in a particular case.
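For instance, a per-mapping opt-out could look roughly like the sketch
below (assumptions: the tmpfs path and length are arbitrary, and this is
only an illustration of the madvise() hint, not code from the patchset):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a tmpfs file but ask the kernel not to back it with huge pages. */
static void *map_without_huge_pages(const char *path, size_t len)
{
	void *p;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return NULL;
	madvise(p, len, MADV_NOHUGEPAGE);
	return p;
}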
...
> Hmm, I just went to repeat that test, to see if MemFree goes down and
> down on each run, but got a VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> in put_page_testzero() in find_get_entries() during shmem_evict_inode().
> So, something not quite right with the refcounting when under pressure.
I suspect some race with split_huge_page(), but I don't see it so far.
> My third impression was much better, when I just did a straight
> cp /dev/zero /tmpfs_twice_size_of_RAM: that went smoothly,
> pushed into swap just as it should, nice.
>
> >
> > The main difference with Hugh's approach[1] is that I continue with
> > compound pages, where Hugh invents new way couple pages: team pages.
> > I believe THP refcounting rework made team pages unnecessary: compound
> > page are flexible enough to serve needs of page cache.
>
> I have disagreed with you about the suitability of compound pages;
> but at some point on Sunday night, after reading your patchset,
> found myself looking at it differently, and now agreeing with you.
>
> They are not flexible enough yet, but I believe you have done a large
> part of the work (by diverting all those tail ops to the head), and it
> just needs a little more. I say "a little", but it might not be easy.
>
> What I realize now, is that you should be able to do almost everything
> I did with team pages, instead with your compound pages. Just abandon
> the idea that the whole compound page has to be initialized in one go:
> initialize (charge to memcg, add to page cache, fill, SetPageUptodate)
> each subpage as it is touched. (But of course, the whole extent has
> to be to be initialized before it can be pmd-mapped.)
I'm not convinced that doing such operations per 4k page is better.
All-at-once is simpler and has a natural performance benefit from batching.
I think we can switch some operations to a per-4k basis once the upside is
obvious. It's more on the optimization side.
> And the convenient thing about a compound page is that you have plenty
> of space to hold your metadata: where I have been squeezing everything
> into page->private (which makes it difficult then to extend the same
> design to miscellaneous filesystems), you're free to use (certain)
> fields of 511 more struct pages.
>
> I said above that you should be able to do almost everything: I think
> that "almost" involves dropping a few things that I have, that we can
> well do without. With team pages, it is possible for each member to
> be charged to a different memcg; but I don't have anyone calling out
> for that behaviour, it's more an awkward possibility that we have to
> be on guard against in a few places, and I'd be quite happy just to
> stop it (charging instead to whichever memcg first touched the extent).
>
> And it's been nice to entertain the idea of the individual team tails,
> travelling separately up and down their own lrus, subject to their own
> pagesize access patterns even while the whole is pmd-mapped. But in
> practice, we continue to throw them on the unevictable list once they
> get in the way, with no attempt to bring them back when accessed
> individually; so I think it's a nice idea that we can live without.
>
> And a compound page implementation would avoid the overhead I have,
> of putting all those tails on lru in the first place.
Exactly. LRU scanning is one of the places where this kind of batching is
beneficial.
> Another thing I did, that I think you can safely fail to emulate:
> I was careful to allow sparse population, and not insist that the
> head page be instantiated. That might turn out to be more of a
> problem for compound pages, and has been an awkward edge case for me.
> Although in principle there are sparse files which would be doubled
> in charge by insisting that the head be instantiated, I doubt they
> will hurt anyone in practice: perhaps best treated as a bug to be
> revisited if it actually comes up as an issue later.
Agreed.
> (Thinking about that last paragraph later, I'm not so sure.
> Forcing the instantiation of the head is okay when first allocating,
> but when trying to recover a fragmented extent into a hugepage later,
> it might prove to be seriously limiting to require that the head be
> migrated into place before anything else is done.)
Hm. I think you are implying some implementation details of your recovery
process here, aren't you?
> >
> > Many ideas and some patches were stolen from Hugh's patchset. Having this
> > patchset around was very helpful.
>
> People are probably wondering why I didn't repost it, and whether
> it's vapourware. No, it's worked out very well, better than we were
> expecting; and several of the TODOs that I felt needed attention,
> have not yet caused anyone trouble - though wider adoption is sure
> to meet new usecases and new bugs. I didn't repost it because, as I
> said at the time, I felt it wasn't ready for real use until recovery
> of huge pages after fragmentation and swapping (what khugepaged does
> for anon THP) was added. After that went in (and then quite a period
> of fixing the bugs, I admit!), other priorities intervened until now.
> Your patchset also lacks recovery at present, so equally that's a
> reason why I don't consider yours ready for merging.
I hoped we would be able to address this later, after the initial merge,
but since you insist...
I have started looking into collapsing. I have a week or so to catch up
with you on this. We'll see what I'll be able to achieve...
Not having your repair implementation public is a bummer -- I miss the
opportunity to rip off your ideas. :-P
> I'll repost mine, including recovery and a few minor additions, against
> v4.6-rc2. I still haven't got around to doing any NUMA work on it, but
> it does appear to be useful to many people even without that. I'd have
> preferred to spend longer rebasing it, it may turn out to be a little
> raw: most of the rebases from one release to another have been easy,
> but to v4.5 was messier - not particularly due to your THP refcounting
> work going in, just various different conflicts that I feel I've not
> properly digested yet. But v4.6-rc2 looks about as early and as late
> as I can do it, with LSF/MM coming up later in April; and should be a
> good base for people to try it out on. (I'll work on preparing it in
> advance, then I guess a last-minute edit of PAGE_CACHEs before posting.)
>
> I now believe that the right way forward is to get that team page
> implementation in as soon as we can, so users have the functionality;
> and then (if you're still interested) you work on converting it over
> from teams to compound pages, both as an optimization, and to satisfy
> your understandable preference for a single way of managing huge pages.
That might work. But a major switch of implementation is always risky.
A minor inconsistency in behaviour after the switch may be considered a
regression. So we need to be careful.
> (I would say "How does that sound to you?" but I'd prefer you to think
> it over, maybe wait to see what the new team patchset looks like. Though
> aside from the addition of recovery, it is very much like the one against
> v3.19 from February last year: no changes to the design at all, which
> turned out to fit recovery well; but quite a few bug fixes folded back
> in. Or you may prefer to wait to discuss the way forward at LSF/MM.)
>
> >
> > I will continue with code validation. I would expect mlock require some
> > more attention.
> >
> > Please, review and test the code.
> >
> > Git tree:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v4
> >
> > Rebased onto linux-next:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v4-next
>
> It wasn't obvious to me what tree your posted patchset was based on,
> so I made a patch out of the top your hugetmpfs/v4-next branch, and
> applied that on top of Andrew's 4.5-rc7-mm1: which appeared to be
> fine, just a trivial clash with mmotm's
> "do_shared_fault(): check that mmap_sem is held".
It was based on mhocko/mm.git since-4.4.
>
> [...]
>
> Some random thoughts on your patches FWIW:
>
> 01/25 mm: do not pass mm_struct into handle_mm_fault
> 02/25 mm: introduce fault_env
>
> These two inflate your patchset considerably, with seemingly unrelated
> changes: the first with arch files otherwise untouched, the second with
> lots of change to memory.c that makes it harder to compare our patchsets.
>
> I think it would be better to leave the handle_mm_fault(mm) removal to
> another patchset, and I'm ambivalent about the struct fault_env changes:
> I prefer to avoid such argument blocks (and think I saw Linus once say
> he detested my zap_details), which do make it harder to see what the
> real arguments are. I wonder what your 03/25 would look like without
> the fault_env, perhaps I'd agree it's too cumbersome that way (and
> perhaps I should look back at earlier versions of your patchset).
>
> But I do understand your desire to get some cleanups out of the way
> before basing the real work on the new structure: I'm going to have
> to reconsider the initial patches of my series in the light of this
> criticism of yours, not sure yet if I'll try to cut them out or not.
On my side, I would prefer to keep them unless it's a total no-go.
Cutting them out would be painful.
> 03/25 mm: postpone page table allocation until we have page to map
>
> I've not stopped to consider whether your prealloc_pte work (which
> perhaps was the trigger for fault_env) meets the case or not, but
> want to mention that we recently discovered a drawback to my delaying
> the pagetable allocation: I'm allocating the pagetable while holding
> the lock on the page from the filesystem (any filesystem, though the
> change was to suit tmpfs), which presented a deadlock to the OOM killer.
> The base kernel involved was not a recent one, and there have been
> rewrites in the OOMing area since, so I'm not sure whether it would
> be a deadlock today: but I am sure that it would be better to do the
> allocation while not holding the page lock. We have a fix in place,
> but I don't think it's quite the right fix, so I want to revisit it:
> unless I get a gift of time from somewhere, I'll post my series with
> the ordering here just as I had it before.
prealloc_pte was invented to avoid allocation under the ptl in the
faultaround case. But I think we can re-use it to pre-allocate the page
table before the ->fault request.
I wonder why it's a deadlock for OOM. Shouldn't the oom-killer use trylock
and bail out if it fails?
> You remark on yours: "This way we can postpone allocation even in
> faultaround case without moving do_fault_around() after __do_fault()."
> Was there something wrong or inefficient with my moving do_fault_around()
> after __do_fault()?
For the hot page cache case we can avoid the ->fault() call if
do_fault_around() managed to solve the fault. It also means we don't need
to lock the page again and take the ptl.
> I thought it was fine; but I did also think when
> I made the change, that it would be fixing a bug in VM_FAULT_MAJOR
> acccounting too; yet when I went to confirm that, couldn't find evidence
> for such a bug before; but I moved on before getting to the bottom of it.
Hm. do_fault_around() is not able to solve a major page fault by design.
Or am I missing something?
> 04/25 rmap: support file thp
> 05/25 mm: introduce do_set_pmd()
> 06/25 mm, rmap: account file thp pages
>
> A firm NAK on this one, for a simple reason: we have enough times
> regretted mixing up the swap-backed Shmem into File stats; so although
> you intend to support huge file pages by the same mechanism later,
> the ones you support in this patchset are Shmem: ShmemPmdMapped is
> the name I was using.
Okay. I'll rework that.
>
> 07/25 thp, vmstats: add counters for huge file pages
> 08/25 thp: support file pages in zap_huge_pmd()
>
> I see there's an ongoing discussion with Aneesh about whether you
> need to deposit/withdraw pagetables. I blindly assumed that I would
> have to participate in that mechanism, so my series should be safe
> in that regard, though not as nice as yours. (But I did get the
> sequence needed for PowerPC wrong somewhere in my original posting.)
>
> 09/25 thp: handle file pages in split_huge_pmd()
>
> That's neat, just unmapping the pmd for the shared file/shmem case,
> I didn't think to do it that way. Or did I have some other reason
> for converting the pmd to ptes? I think it is incorrect not to
> provide ptes when splitting when it's an mlock()ed area; but
> you've said yourself that mlock probably needs more work.
Since we don't deposit page tables for file mappings, there's no easy way
to provide ptes for mlock()ed areas. DAX has the same problem too.
A special-cased deposit for VM_LOCKED VMAs is going to be ugly and prone
to leaks.
Always depositing is too wasteful, especially for very large mappings
(consider DAX).
/me hates mlock.
> 10/25 thp: handle file COW faults
>
> We made the same decision (and Matthew for DAX too IIRC): don't
> bother to COW hugely, it's more code, and often not beneficial.
Actually, one small topic I wanted to discuss at LSF/MM is whether we want
to do the same for anon-THP. The new refcounting allows that.
Although it probably would require some work on the khugepaged side to
allow COW-break of a PTE-mapped THP to collapse.
> 11/25 thp: handle file pages in mremap()
>
> Right, I missed that from my v3.19 rebase, but added it later
> when I woke up to your v3.15 mod: I prefer my version of this patch,
> which adds take_rmap_locks(vma) and drop_rmap_locks(vma) helpers.
Okay, I'll check it out once it's public.
> 12/25 thp: skip file huge pmd on copy_huge_pmd()
>
> Oh, that looks good, I'd forgotten about that optimization.
> And I think in this case you don't have to worry about VM_LOCKED
> or deposit/withdraw, it's just fine to skip over it. I think I'll
> post my one as it is, and let you scoff at it then.
>
> 13/25 thp: prepare change_huge_pmd() for file thp
>
> You're very dedicated to inserting the vma_is_anonymous(),
> where I just delete the BUG_ON. Good for you.
Dave suggested this. :)
> 14/25 thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
>
> Interesting. I haven't spent long enough looking at this, just noted
> it as something to come back to: I haven't decided whether this is a
> good safe obvious change (always better to do something outside the
> lock if it's safe), or not. I don't have that issue of munlock()ing
> when splitting, so hadn't noticed the peculiar placing of
> vma_adjust_trans_huge() inside the one lock but outside the other.
We had the same issue back in my first attempt at huge page cache. Before
the refcounting rework, vma_adjust_trans_huge() implied split_huge_page(),
which does an rmap walk.
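The resulting ordering in vma_adjust() is roughly (simplified):

	/* Split/unmap any huge pmd straddling the new boundaries before
	 * i_mmap_rwsem is taken; with the old refcounting the implied
	 * split_huge_page() needed the rmap locks itself. */
	vma_adjust_trans_huge(vma, start, end, adjust_next);

	if (file) {
		mapping = file->f_mapping;
		i_mmap_lock_write(mapping);
		...
	}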
> 15/25 thp: file pages support for split_huge_page()
>
> I didn't go much beyond your very helpful commit comment on this one.
> I'd been worried by how you would freeze for the split, but it sounds
> right. Very interested to see your comment about dropping pages beyond
> i_size from the radix-tree: I think that's absolutely right, but never
> occurred to me to do - not something that tests or users notice, but
> needed for correctness.
>
> 16/25 thp, mlock: do not mlock PTE-mapped file huge pages
>
> I think we both wish there were no mlock(), it gave me more trouble
> than anything. And it wouldn't be a surprise if what I have and
> think works, actually turns out to be very broken. So I'm going to
> be careful about criticising your workarounds, they might be better.
>
> 17/25 vmscan: split file huge pages before paging them out
>
> As mentioned at the beginning, seemed to work well for the straight
> cp, but not when the huge pages were mapped.
>
> 18/25 page-flags: relax policy for PG_mappedtodisk and PG_reclaim
> 19/25 radix-tree: implement radix_tree_maybe_preload_order()
> 20/25 filemap: prepare find and delete operations for huge pages
> 21/25 truncate: handle file thp
> 22/25 shmem: prepare huge= mount option and sysfs knob
> 23/25 shmem: get_unmapped_area align huge page
> 24/25 shmem: add huge pages support
> 25/25 shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
>
> Not much to say on these at the moment, but clearly we'll need to
> settle the user interfaces of 22/25 and 25/25 before anything reaches
> Linus. For simplicity, I've carried on with my distinct mount options
> and knobs, but agree that some rapprochement with anon THP conventions
> will probably be required before finalizing.
>
> That's all for now - thanks,
> Hugh
>
--
Kirill A. Shutemov
On Fri, Mar 25, 2016 at 05:00:50PM -0700, Hugh Dickins wrote:
> On Fri, 25 Mar 2016, Kirill A. Shutemov wrote:
> > On Thu, Mar 24, 2016 at 12:08:55PM -0700, Hugh Dickins wrote:
> > > On Thu, 24 Mar 2016, Kirill A. Shutemov wrote:
> > > > On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> > > > > The small files thing formed my first impression. My second
> > > > > impression was similar, when I tried mmap(NULL, size_of_RAM,
> > > > > PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> > > > > cycled around the arena touching all the pages (which of
> > > > > course has to push a little into swap): that soon OOMed.
> > > > >
> > > > > But there I think you probably just have some minor bug to be fixed:
> > > > > I spent a little while trying to debug it, but then decided I'd
> > > > > better get back to writing to you. I didn't really understand what
> > > > > I was seeing, but when I hacked some stats into shrink_page_list(),
> > > > > converting !is_page_cache_freeable(page) to page_cache_references(page)
> > > > > to return the difference instead of the bool, a large proportion of
> > > > > huge tmpfs pages seemed to have count 1 too high to be freeable at
> > > > > that point (and one huge tmpfs page had a count of 3477).
> > > >
> > > > I'll reply to your other points later, but first I wanted to address this
> > > > obvious bug.
> > >
> > > Thanks. That works better, but is not yet right: memory isn't freed
> > > as it should be, so when I exit then try to run a second time, the
> > > mmap() just gets ENOMEM (with /proc/sys/vm/overcommit_memory 0):
> > > MemFree is low. No rush to fix, I've other stuff to do.
> > >
> > > I don't get as far as that on the laptop, since the first run is OOM
> > > killed while swapping; but I can't vouch for the OOM-kill-correctness
> > > of the base tree I'm using, and this laptop has a history of OOMing
> > > rather too easily if all's not right.
> >
> > Hm. I don't see the issue.
> >
> > I tried to reproduce it in my VM with the following script:
> >
> > #!/bin/sh -efu
> >
> > swapon -a
> >
> > ram="$(grep MemTotal /proc/meminfo | sed 's,[^0-9\]\+,,; s, kB,k,')"
> >
> > usemem -w -f /dev/zero "$ram"
> >
> > swapoff -a
> > swapon -a
> >
> > usemem -w -f /dev/zero "$ram"
> >
> > cat /proc/meminfo
> > grep thp /proc/vmstat
> >
> > -----
> >
> > usemem is a tool from this archive:
> >
> > http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar
> >
> > It works fine even if I double the size of the mapping.
> >
> > Do you have a reproducer?
>
> Yes, my reproducer is simpler (just cycling twice around the arena,
> touching each page in order); and I too did not see it running your
> script using usemem above. It looks as if that invocation isn't doing
> enough work with swap: if I add a "-r 2" to those usemem lines, then
> I get "usemem: mmap failed: Cannot allocate memory" on the second.
>
> I also added a "sleep 2" before the second call to usemem: I'm not sure
> of the current state of vmstat, but historically it's slow to gather
> back from each cpu to global, and I think it used to leave some cpu
> counts stranded indefinitely once upon a time. In my own testing,
> I have a /proc/sys/vm/stat_refresh to touch before checking meminfo
> or vmstat - and I think the vm_enough_memory() check in mmap() may
> need that same care, since it refers to NR_FREE_PAGES etc.
>
> 8GB is my ramsize, if that matters.
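Just to make sure we're talking about the same thing, my reading of the
reproducer is something like this (my reconstruction, not necessarily
Hugh's exact program):

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/sysinfo.h>

	int main(void)
	{
		struct sysinfo si;
		size_t size, off, page;
		volatile char *p;
		int pass;

		if (sysinfo(&si))
			return 1;
		size = (size_t)si.totalram * si.mem_unit;	/* size of RAM */
		page = sysconf(_SC_PAGESIZE);

		p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Two passes over the arena, touching each page in order;
		 * the first pushes a little into swap, the second refaults. */
		for (pass = 0; pass < 2; pass++)
			for (off = 0; off < size; off += page)
				p[off] = 1;

		return 0;
	}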
I think I found it. I had the refcounting screwed up in faultaround.
This should fix the problem:
diff --git a/mm/filemap.c b/mm/filemap.c
index 94c097ec08e7..1325bb4568d1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2292,19 +2292,18 @@ repeat:
if (fe->pte)
fe->pte += iter.index - last_pgoff;
last_pgoff = iter.index;
- alloc_set_pte(fe, NULL, page);
+ if (alloc_set_pte(fe, NULL, page))
+ goto unlock;
unlock_page(page);
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*fe->pmd))
- break;
- /* Failed to setup page table? */
- VM_BUG_ON(!fe->pte);
goto next;
unlock:
unlock_page(page);
skip:
page_cache_release(page);
next:
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*fe->pmd))
+ break;
if (iter.index == end_pgoff)
break;
}
--
Kirill A. Shutemov
On Mon, 28 Mar 2016, Kirill A. Shutemov wrote:
>
> I think I found it. I had the refcounting screwed up in faultaround.
>
> This should fix the problem:
Yes, this fixes it for me - thanks.
Hugh
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 94c097ec08e7..1325bb4568d1 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2292,19 +2292,18 @@ repeat:
> if (fe->pte)
> fe->pte += iter.index - last_pgoff;
> last_pgoff = iter.index;
> - alloc_set_pte(fe, NULL, page);
> + if (alloc_set_pte(fe, NULL, page))
> + goto unlock;
> unlock_page(page);
> - /* Huge page is mapped? No need to proceed. */
> - if (pmd_trans_huge(*fe->pmd))
> - break;
> - /* Failed to setup page table? */
> - VM_BUG_ON(!fe->pte);
> goto next;
> unlock:
> unlock_page(page);
> skip:
> page_cache_release(page);
> next:
> + /* Huge page is mapped? No need to proceed. */
> + if (pmd_trans_huge(*fe->pmd))
> + break;
> if (iter.index == end_pgoff)
> break;
> }
> --
> Kirill A. Shutemov