2016-03-03 16:52:36

by Kirill A. Shutemov

Subject: [PATCHv3 00/29] huge tmpfs implementation using compound pages

Here is an updated version of the huge page support implementation for
tmpfs/shmem.

I consider it feature-complete as an initial step toward upstream. I'll focus
on validation now; I'm working with Sasha on that.

Hugh, I would be glad to hear your opinion on this patchset.

The main difference from Hugh's approach[1] is that I continue with
compound pages, where Hugh invents a new way to couple pages: team pages.
I believe the THP refcounting rework made team pages unnecessary: compound
pages are flexible enough to serve the needs of the page cache.

Many ideas and some patches were stolen from Hugh's patchset. Having this
patchset around was very helpful.

I will continue with code validation. I expect mlock to require some
more attention.

Please, review and test the code.

[1] http://lkml.kernel.org/g/[email protected]

Git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v3

== Changelog ==

v3:
- the huge= mount option can now take the values always, within_size,
advice and never;

- the sysctl handle has been replaced with a sysfs knob;

- MADV_HUGEPAGE/MADV_NOHUGEPAGE are now respected on page allocation via
page fault;

- mlock() handling has been fixed;

- a bunch of smaller bugfixes and cleanups.

== Design overview ==

Huge pages are allocated by shmem when that's allowed (by mount option) and
there are no entries for the range in the radix-tree. A huge page is
represented by HPAGE_PMD_NR entries in the radix-tree.
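
As an aside, here is a minimal userspace sketch (not part of the patchset)
of requesting huge pages for a tmpfs mount from C via mount(2); the option
values come from the changelog above, the mount point and size are made up:

#include <sys/mount.h>

/* Mount a tmpfs instance with the new huge= option enabled. */
int mount_huge_tmpfs(void)
{
	/* huge= accepts always, within_size, advice or never */
	return mount("tmpfs", "/mnt/huge-tmpfs", "tmpfs", 0,
		     "huge=always,size=1G");
}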

The MM core maps a page with a PMD if ->fault() returns a huge page and the
VMA is suitable for huge pages (size, alignment). There's no need for two
requests to the filesystem: the filesystem returns a huge page if it can,
with graceful fallback to small pages otherwise.

As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we can
re-fault the page with PTEs later.

The basic scheme for split_huge_page() is the same as for anon THP.
A few differences:

- File pages are in the radix-tree, so head->_count is offset by
HPAGE_PMD_NR. The count gets distributed to the small pages during the split;

- mapping->tree_lock prevents non-lockless radix-tree access to pages
under split;

- lockless access is prevented by setting head->_count to 0 during the
split, so get_page_unless_zero() fails.
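
To make the pin accounting concrete, here is an illustrative helper (not
actual kernel code) expressing the racy "can we split?" check that
split_huge_page() performs for a file THP in patch 19/29 below:

static bool can_split_file_thp(struct page *head)
{
	/* each of the HPAGE_PMD_NR radix-tree slots holds a reference */
	int extra_pins = HPAGE_PMD_NR;

	/*
	 * The caller holds one extra reference on the head page; every
	 * remaining reference must come from mappings, which freeze_page()
	 * converts to migration entries before the split proceeds.
	 */
	return total_mapcount(head) == page_count(head) - extra_pins - 1;
}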

After the page is split, some of the small pages can be beyond i_size. We
cannot drop them right away in split_huge_page() due to the unsuitable
locking context -- one of these pages can be locked by the split_huge_page()
caller, which could lead to a deadlock in shmem_undo_range(). For now, I
leave these pages alone. I'll look into the problem more for v4.

COW mappings are handled on the PTE level. It's not clear how beneficial
allocation of huge pages on COW faults would be, and it would require some
code to make them work.

I think at some point we can consider teaching khugepaged to collapse
pages in COW mappings, but allocating huge pages on fault is probably overkill.

As with anon THP, we mlock a file huge page only if it is mapped with a PMD.
PTE-mapped THPs are never mlocked. This way we avoid all sorts of
scenarios where we could leak an mlocked page.

As with anon THP, we split the huge page on swap-out.

Truncate and punch-hole operations that cover only part of a THP range are
implemented by zeroing out that part of the THP.

This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour: as
we don't really create a hole in this case, lseek(SEEK_HOLE) may give
inconsistent results depending on which pages happened to be allocated.
I'm not sure whether this should be considered an ABI break.
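
A minimal userspace sketch (not from the patchset) of the behaviour in
question; the file descriptor setup is omitted and the offsets are arbitrary:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

off_t punch_and_probe(int fd, off_t offset, off_t len)
{
	/*
	 * Punch a hole that covers only part of a THP: the range is zeroed,
	 * but the backing huge page is not actually freed.
	 */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len);

	/*
	 * With small pages a hole typically shows up at the punched range;
	 * with a huge page backing it, the result may differ depending on
	 * what happened to be allocated.
	 */
	return lseek(fd, 0, SEEK_HOLE);
}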


== Patchset overview ==

[01-04/29]
These patches were posted separately. They simplify the
split_huge_page() code with a speed trade-off. I'm not sure if they
should go upstream, but they make my life easier for now.
The patches were slightly adjusted to handle file pages too.

[05-09/29]
Rework the fault path and rmap to handle file PMDs. Unlike DAX with
vm_ops->pmd_fault, we don't need to ask the filesystem twice -- first
for a huge page and then for small pages. If ->fault() happens to
return a huge page and the VMA is suitable for mapping it as huge, we
do so.
[10/29]
Add support for huge file pages in rmap;

[11-20/29]
Various preparations of the THP core for file pages.

[21-25/29]
Various preparations of the MM core for file pages.

[26-28/29]
And finally, bring huge pages into tmpfs/shmem.
Two of three patches came from Hugh's patchset. :)
[29/29]
Wire up existing madvise() hints for file THP.
We can implement fadvise() later.

Hugh Dickins (1):
shmem: get_unmapped_area align huge page

Kirill A. Shutemov (28):
rmap: introduce rmap_walk_locked()
rmap: extend try_to_unmap() to be usable by split_huge_page()
mm: make remove_migration_ptes() beyond mm/migration.c
thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers
mm: do not pass mm_struct into handle_mm_fault
mm: introduce fault_env
mm: postpone page table allocation until we have page to map
rmap: support file thp
mm: introduce do_set_pmd()
mm, rmap: account file thp pages
thp, vmstats: add counters for huge file pages
thp: support file pages in zap_huge_pmd()
thp: handle file pages in split_huge_pmd()
thp: handle file COW faults
thp: handle file pages in mremap()
thp: skip file huge pmd on copy_huge_pmd()
thp: prepare change_huge_pmd() for file thp
thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
thp: file pages support for split_huge_page()
thp, mlock: do not mlock PTE-mapped file huge pages
vmscan: split file huge pages before paging them out
page-flags: relax policy for PG_mappedtodisk and PG_reclaim
radix-tree: implement radix_tree_maybe_preload_order()
filemap: prepare find and delete operations for huge pages
truncate: handle file thp
shmem: prepare huge= mount option and sysfs knob
shmem: add huge pages support
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings

Documentation/filesystems/Locking | 10 +-
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/avr32/mm/fault.c | 2 +-
arch/cris/mm/fault.c | 2 +-
arch/frv/mm/fault.c | 2 +-
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m32r/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/metag/mm/fault.c | 2 +-
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/mn10300/mm/fault.c | 2 +-
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/mm/copro_fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/score/mm/fault.c | 2 +-
arch/sh/mm/fault.c | 2 +-
arch/sparc/mm/fault_32.c | 4 +-
arch/sparc/mm/fault_64.c | 2 +-
arch/tile/mm/fault.c | 2 +-
arch/um/kernel/trap.c | 2 +-
arch/unicore32/mm/fault.c | 2 +-
arch/x86/mm/fault.c | 2 +-
arch/xtensa/mm/fault.c | 2 +-
drivers/base/node.c | 10 +-
drivers/char/mem.c | 24 ++
drivers/iommu/amd_iommu_v2.c | 2 +-
drivers/iommu/intel-svm.c | 2 +-
fs/proc/meminfo.c | 5 +-
fs/userfaultfd.c | 22 +-
include/linux/huge_mm.h | 33 +-
include/linux/mm.h | 51 ++-
include/linux/mmzone.h | 3 +-
include/linux/page-flags.h | 19 +-
include/linux/radix-tree.h | 1 +
include/linux/rmap.h | 8 +-
include/linux/shmem_fs.h | 5 +-
include/linux/userfaultfd_k.h | 8 +-
include/linux/vm_event_item.h | 7 +
ipc/shm.c | 6 +-
lib/radix-tree.c | 68 ++-
mm/filemap.c | 220 +++++++---
mm/gup.c | 7 +-
mm/huge_memory.c | 763 +++++++++++++++-------------------
mm/internal.h | 4 +-
mm/ksm.c | 3 +-
mm/memory.c | 852 +++++++++++++++++++++-----------------
mm/mempolicy.c | 4 +-
mm/migrate.c | 20 +-
mm/mmap.c | 26 +-
mm/mremap.c | 22 +-
mm/nommu.c | 3 +-
mm/page-writeback.c | 1 +
mm/page_alloc.c | 2 +
mm/rmap.c | 141 +++++--
mm/shmem.c | 620 +++++++++++++++++++++++----
mm/swap.c | 2 +
mm/truncate.c | 22 +-
mm/util.c | 6 +
mm/vmscan.c | 15 +-
mm/vmstat.c | 3 +
68 files changed, 1944 insertions(+), 1138 deletions(-)

--
2.7.0


2016-03-03 16:52:41

by Kirill A. Shutemov

Subject: [PATCHv3 09/29] mm: introduce do_set_pmd()

With postponed page table allocation we have a chance to set up huge pages.
do_set_pte() calls do_set_pmd() if the following criteria are met:

- the page is compound;
- the pmd entry is pmd_none();
- the vma has suitable size and alignment.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 8 ------
mm/memory.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++-
mm/migrate.c | 3 +--
4 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a9ec30594a81..c958e3db0a0e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -150,6 +150,8 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)

struct page *get_huge_zero_page(void);

+#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e0f8e4d63ffd..2cade3851b7a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -771,14 +771,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}

-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
- pmd_t entry;
- entry = mk_pmd(page, prot);
- entry = pmd_mkhuge(entry);
- return entry;
-}
-
static inline struct list_head *page_deferred_list(struct page *page)
{
/*
diff --git a/mm/memory.c b/mm/memory.c
index f8c986df8c63..3bb51a550bbb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2840,6 +2840,66 @@ map_pte:
return 0;
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+ unsigned long haddr)
+{
+ if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+ return false;
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return false;
+ return true;
+}
+
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+ struct vm_area_struct *vma = fe->vma;
+ bool write = fe->flags & FAULT_FLAG_WRITE;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
+ pmd_t entry;
+ int i, ret;
+
+ if (!transhuge_vma_suitable(vma, haddr))
+ return VM_FAULT_FALLBACK;
+
+ ret = VM_FAULT_FALLBACK;
+ page = compound_head(page);
+
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd)))
+ goto out;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ flush_icache_page(vma, page + i);
+
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ if (write)
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+
+ add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR);
+ page_add_file_rmap(page, true);
+
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+
+ update_mmu_cache_pmd(vma, haddr, fe->pmd);
+
+ /* fault is handled */
+ ret = 0;
+out:
+ spin_unlock(fe->ptl);
+ return ret;
+}
+#else
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+ BUILD_BUG();
+ return 0;
+}
+#endif
+
/**
* alloc_set_pte - setup new PTE entry for given page and add reverse page
* mapping. If needed, the fucntion allocates page table or use pre-allocated.
@@ -2859,9 +2919,19 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
+ int ret;
+
+ if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
+ /* THP on COW? */
+ VM_BUG_ON_PAGE(memcg, page);
+
+ ret = do_set_pmd(fe, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }

if (!fe->pte) {
- int ret = pte_alloc_one_map(fe);
+ ret = pte_alloc_one_map(fe);
if (ret)
return ret;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index d20276fffce7..5c9cd90334ea 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1820,8 +1820,7 @@ fail_putback:
}

orig_entry = *pmd;
- entry = mk_pmd(new_page, vma->vm_page_prot);
- entry = pmd_mkhuge(entry);
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

/*
--
2.7.0

2016-03-03 16:52:47

by Kirill A. Shutemov

Subject: [PATCHv3 19/29] thp: file pages support for split_huge_page()

The basic scheme is the same as for anon THP.

Main differences:

- File pages are in the radix-tree, so head->_count is offset by
HPAGE_PMD_NR. The count gets distributed to the small pages during the split;

- mapping->tree_lock prevents non-lockless radix-tree access to pages
under split;

- lockless access is prevented by setting head->_count to 0 during the
split;

TODO: after the split, some pages can be beyond i_size. We need to unmap
them and drop them from the radix-tree.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/gup.c | 2 +
mm/huge_memory.c | 138 ++++++++++++++++++++++++++++++++++++++-----------------
mm/mempolicy.c | 2 +
3 files changed, 101 insertions(+), 41 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 60f422a0af8b..76148816c0cd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -285,6 +285,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
ret = split_huge_page(page);
unlock_page(page);
put_page(page);
+ if (pmd_none(*pmd))
+ return no_page_table(vma, flags);
}

return ret ? ERR_PTR(ret) :
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29a687cbef40..09c32bc7ff0b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -29,6 +29,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/page_idle.h>
#include <linux/swapops.h>
+#include <linux/shmem_fs.h>
#include <linux/debugfs.h>

#include <asm/tlb.h>
@@ -3098,7 +3099,7 @@ static void freeze_page(struct page *page)
ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
for (i = 1; !ret && i < HPAGE_PMD_NR; i++)
ret = try_to_unmap(page + i, ttu_flags);
- VM_BUG_ON(ret);
+ VM_BUG_ON_PAGE(ret, page + i - 1);
}

static void unfreeze_page(struct page *page)
@@ -3120,15 +3121,20 @@ static void __split_huge_page_tail(struct page *head, int tail,
/*
* tail_page->_count is zero and not changing from under us. But
* get_page_unless_zero() may be running from under us on the
- * tail_page. If we used atomic_set() below instead of atomic_inc(), we
- * would then run atomic_set() concurrently with
+ * tail_page. If we used atomic_set() below instead of atomic_inc() or
+ * atomic_add(), we would then run atomic_set() concurrently with
* get_page_unless_zero(), and atomic_set() is implemented in C not
* using locked ops. spin_unlock on x86 sometime uses locked ops
* because of PPro errata 66, 92, so unless somebody can guarantee
* atomic_set() here would be safe on all archs (and not only on x86),
- * it's safer to use atomic_inc().
+ * it's safer to use atomic_inc()/atomic_add().
*/
- page_ref_inc(page_tail);
+ if (PageAnon(head)) {
+ page_ref_inc(page_tail);
+ } else {
+ /* Additional pin to radix tree */
+ page_ref_add(page_tail, 2);
+ }

page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
page_tail->flags |= (head->flags &
@@ -3164,15 +3170,14 @@ static void __split_huge_page_tail(struct page *head, int tail,
lru_add_page_tail(head, page_tail, lruvec, list);
}

-static void __split_huge_page(struct page *page, struct list_head *list)
+static void __split_huge_page(struct page *page, struct list_head *list,
+ unsigned long flags)
{
struct page *head = compound_head(page);
struct zone *zone = page_zone(head);
struct lruvec *lruvec;
int i;

- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irq(&zone->lru_lock);
lruvec = mem_cgroup_page_lruvec(head, zone);

/* complete memcg works before add pages to LRU */
@@ -3182,7 +3187,16 @@ static void __split_huge_page(struct page *page, struct list_head *list)
__split_huge_page_tail(head, i, lruvec, list);

ClearPageCompound(head);
- spin_unlock_irq(&zone->lru_lock);
+ /* See comment in __split_huge_page_tail() */
+ if (PageAnon(head)) {
+ atomic_inc(&head->_count);
+ } else {
+ /* Additional pin to radix tree */
+ atomic_add(2, &head->_count);
+ spin_unlock(&head->mapping->tree_lock);
+ }
+
+ spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);

unfreeze_page(head);

@@ -3250,35 +3264,48 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
struct page *head = compound_head(page);
struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
struct anon_vma *anon_vma;
- int count, mapcount, ret;
+ int count, mapcount, extra_pins, ret;
bool mlocked;
unsigned long flags;

VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
- VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
VM_BUG_ON_PAGE(!PageCompound(page), page);

- /*
- * The caller does not necessarily hold an mmap_sem that would prevent
- * the anon_vma disappearing so we first we take a reference to it
- * and then lock the anon_vma for write. This is similar to
- * page_lock_anon_vma_read except the write lock is taken to serialise
- * against parallel split or collapse operations.
- */
- anon_vma = page_get_anon_vma(head);
- if (!anon_vma) {
- ret = -EBUSY;
- goto out;
+ if (PageAnon(head)) {
+ extra_pins = 0;
+ /*
+ * The caller does not necessarily hold an mmap_sem that would
+ * prevent the anon_vma disappearing so we first we take a
+ * reference to it and then lock the anon_vma for write. This
+ * is similar to page_lock_anon_vma_read except the write lock
+ * is taken to serialise against parallel split or collapse
+ * operations.
+ */
+ anon_vma = page_get_anon_vma(head);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
+ }
+ anon_vma_lock_write(anon_vma);
+ } else {
+ /* Truncated ? */
+ if (!head->mapping) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /* Addidional pins from radix tree */
+ extra_pins = HPAGE_PMD_NR;
+ i_mmap_lock_read(head->mapping);
+ anon_vma = NULL;
}
- anon_vma_lock_write(anon_vma);

/*
* Racy check if we can split the page, before freeze_page() will
* split PMDs
*/
- if (total_mapcount(head) != page_count(head) - 1) {
+ if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
ret = -EBUSY;
goto out_unlock;
}
@@ -3291,35 +3318,65 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (mlocked)
lru_add_drain();

+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock_irqsave(&page_zone(head)->lru_lock, flags);
+
+ if (!anon_vma) {
+ void **pslot;
+
+ spin_lock(&head->mapping->tree_lock);
+ pslot = radix_tree_lookup_slot(&head->mapping->page_tree,
+ page_index(head));
+ /*
+ * Check if the head page is present in radix tree.
+ * We assume all tail are present too, if head is there.
+ */
+ if (radix_tree_deref_slot_protected(pslot,
+ &head->mapping->tree_lock) != head)
+ goto fail;
+ }
+
/* Prevent deferred_split_scan() touching ->_count */
- spin_lock_irqsave(&pgdata->split_queue_lock, flags);
+ spin_lock(&pgdata->split_queue_lock);
count = page_count(head);
mapcount = total_mapcount(head);
- if (!mapcount && count == 1) {
+ if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
if (!list_empty(page_deferred_list(head))) {
pgdata->split_queue_len--;
list_del(page_deferred_list(head));
}
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- __split_huge_page(page, list);
+ spin_unlock(&pgdata->split_queue_lock);
+ __split_huge_page(page, list, flags);
ret = 0;
- } else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- pr_alert("total_mapcount: %u, page_count(): %u\n",
- mapcount, count);
- if (PageTail(page))
- dump_page(head, NULL);
- dump_page(page, "total_mapcount(head) > 0");
- BUG();
} else {
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
+ if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
+ pr_alert("total_mapcount: %u, page_count(): %u\n",
+ mapcount, count);
+ if (PageTail(page))
+ dump_page(head, NULL);
+ dump_page(page, "total_mapcount(head) > 0");
+ BUG();
+ }
+ spin_unlock(&pgdata->split_queue_lock);
+fail: if (!anon_vma)
+ spin_unlock(&head->mapping->tree_lock);
+ spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
unfreeze_page(head);
ret = -EBUSY;
}

out_unlock:
- anon_vma_unlock_write(anon_vma);
- put_anon_vma(anon_vma);
+ if (anon_vma) {
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+ } else {
+ i_mmap_unlock_read(head->mapping);
+
+ /*
+ * After split, some pages can be beyond i_size.
+ * TODO: We need to unmap them and drop from radix-tree.
+ */
+ }
out:
count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
return ret;
@@ -3442,8 +3499,7 @@ static int split_huge_pages_set(void *data, u64 val)
if (zone != page_zone(page))
goto next;

- if (!PageHead(page) || !PageAnon(page) ||
- PageHuge(page))
+ if (!PageHead(page) || PageHuge(page) || !PageLRU(page))
goto next;

total++;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8c5fd08c253c..a4c0e169517c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -515,6 +515,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
}
}

+ if (pmd_trans_unstable(pmd))
+ return 0;
retry:
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
--
2.7.0

2016-03-03 16:52:54

by Kirill A. Shutemov

Subject: [PATCHv3 04/29] thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers

The freeze_page() and unfreeze_page() helpers have evolved into rather
complex beasts. It would be nice to cut the complexity of this code.

This patch rewrites freeze_page() using standard try_to_unmap().
unfreeze_page() is rewritten with remove_migration_ptes().

The result is much simpler.

But the new variant is somewhat slower. The current helpers iterate over
the VMAs the compound page is mapped into, and then over the PTEs within
each VMA. The new helpers iterate over each small page, then over the VMAs
the small page is mapped into, and only then find the relevant PTE.

We've also lost the optimization that allowed splitting a PMD directly
into migration entries.

I don't think the slowdown is critical, considering how much simpler the
result is and that split_huge_page() is quite rare nowadays: it only
happens due to memory pressure or migration.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 214 +++++++------------------------------------------------
1 file changed, 24 insertions(+), 190 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 354464b484a7..a9921a485400 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2904,7 +2904,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
}

static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long haddr, bool freeze)
+ unsigned long haddr)
{
struct mm_struct *mm = vma->vm_mm;
struct page *page;
@@ -2946,18 +2946,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* transferred to avoid any possibility of altering
* permissions across VMAs.
*/
- if (freeze) {
- swp_entry_t swp_entry;
- swp_entry = make_migration_entry(page + i, write);
- entry = swp_entry_to_pte(swp_entry);
- } else {
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(entry, vma);
- if (!write)
- entry = pte_wrprotect(entry);
- if (!young)
- entry = pte_mkold(entry);
- }
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(entry, vma);
+ if (!write)
+ entry = pte_wrprotect(entry);
+ if (!young)
+ entry = pte_mkold(entry);
if (dirty)
SetPageDirty(page + i);
pte = pte_offset_map(&_pmd, haddr);
@@ -3010,13 +3004,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*/
pmdp_invalidate(vma, haddr, pmd);
pmd_populate(mm, pmd, pgtable);
-
- if (freeze) {
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- page_remove_rmap(page + i, false);
- put_page(page + i);
- }
- }
}

void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3037,7 +3024,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
page = NULL;
} else if (!pmd_devmap(*pmd))
goto out;
- __split_huge_pmd_locked(vma, pmd, haddr, false);
+ __split_huge_pmd_locked(vma, pmd, haddr);
out:
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
@@ -3114,180 +3101,27 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
}
}

-static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
- unsigned long address)
+static void freeze_page(struct page *page)
{
- unsigned long haddr = address & HPAGE_PMD_MASK;
- spinlock_t *ptl;
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
- int i, nr = HPAGE_PMD_NR;
-
- /* Skip pages which doesn't belong to the VMA */
- if (address < vma->vm_start) {
- int off = (vma->vm_start - address) >> PAGE_SHIFT;
- page += off;
- nr -= off;
- address = vma->vm_start;
- }
-
- pgd = pgd_offset(vma->vm_mm, address);
- if (!pgd_present(*pgd))
- return;
- pud = pud_offset(pgd, address);
- if (!pud_present(*pud))
- return;
- pmd = pmd_offset(pud, address);
- ptl = pmd_lock(vma->vm_mm, pmd);
- if (!pmd_present(*pmd)) {
- spin_unlock(ptl);
- return;
- }
- if (pmd_trans_huge(*pmd)) {
- if (page == pmd_page(*pmd))
- __split_huge_pmd_locked(vma, pmd, haddr, true);
- spin_unlock(ptl);
- return;
- }
- spin_unlock(ptl);
-
- pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
- for (i = 0; i < nr; i++, address += PAGE_SIZE, page++, pte++) {
- pte_t entry, swp_pte;
- swp_entry_t swp_entry;
-
- /*
- * We've just crossed page table boundary: need to map next one.
- * It can happen if THP was mremaped to non PMD-aligned address.
- */
- if (unlikely(address == haddr + HPAGE_PMD_SIZE)) {
- pte_unmap_unlock(pte - 1, ptl);
- pmd = mm_find_pmd(vma->vm_mm, address);
- if (!pmd)
- return;
- pte = pte_offset_map_lock(vma->vm_mm, pmd,
- address, &ptl);
- }
-
- if (!pte_present(*pte))
- continue;
- if (page_to_pfn(page) != pte_pfn(*pte))
- continue;
- flush_cache_page(vma, address, page_to_pfn(page));
- entry = ptep_clear_flush(vma, address, pte);
- if (pte_dirty(entry))
- SetPageDirty(page);
- swp_entry = make_migration_entry(page, pte_write(entry));
- swp_pte = swp_entry_to_pte(swp_entry);
- if (pte_soft_dirty(entry))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- set_pte_at(vma->vm_mm, address, pte, swp_pte);
- page_remove_rmap(page, false);
- put_page(page);
- }
- pte_unmap_unlock(pte - 1, ptl);
-}
-
-static void freeze_page(struct anon_vma *anon_vma, struct page *page)
-{
- struct anon_vma_chain *avc;
- pgoff_t pgoff = page_to_pgoff(page);
+ enum ttu_flags ttu_flags = TTU_MIGRATION | TTU_IGNORE_MLOCK |
+ TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED;
+ int i, ret;

VM_BUG_ON_PAGE(!PageHead(page), page);

- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
- pgoff + HPAGE_PMD_NR - 1) {
- unsigned long address = __vma_address(page, avc->vma);
-
- mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE);
- freeze_page_vma(avc->vma, page, address);
- mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE);
- }
-}
-
-static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
- unsigned long address)
-{
- spinlock_t *ptl;
- pmd_t *pmd;
- pte_t *pte, entry;
- swp_entry_t swp_entry;
- unsigned long haddr = address & HPAGE_PMD_MASK;
- int i, nr = HPAGE_PMD_NR;
-
- /* Skip pages which doesn't belong to the VMA */
- if (address < vma->vm_start) {
- int off = (vma->vm_start - address) >> PAGE_SHIFT;
- page += off;
- nr -= off;
- address = vma->vm_start;
- }
-
- pmd = mm_find_pmd(vma->vm_mm, address);
- if (!pmd)
- return;
-
- pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
- for (i = 0; i < nr; i++, address += PAGE_SIZE, page++, pte++) {
- /*
- * We've just crossed page table boundary: need to map next one.
- * It can happen if THP was mremaped to non-PMD aligned address.
- */
- if (unlikely(address == haddr + HPAGE_PMD_SIZE)) {
- pte_unmap_unlock(pte - 1, ptl);
- pmd = mm_find_pmd(vma->vm_mm, address);
- if (!pmd)
- return;
- pte = pte_offset_map_lock(vma->vm_mm, pmd,
- address, &ptl);
- }
-
- if (!is_swap_pte(*pte))
- continue;
-
- swp_entry = pte_to_swp_entry(*pte);
- if (!is_migration_entry(swp_entry))
- continue;
- if (migration_entry_to_page(swp_entry) != page)
- continue;
-
- get_page(page);
- page_add_anon_rmap(page, vma, address, false);
-
- entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
- if (PageDirty(page))
- entry = pte_mkdirty(entry);
- if (is_write_migration_entry(swp_entry))
- entry = maybe_mkwrite(entry, vma);
-
- flush_dcache_page(page);
- set_pte_at(vma->vm_mm, address, pte, entry);
-
- /* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, pte);
- }
- pte_unmap_unlock(pte - 1, ptl);
+ /* We only need TTU_SPLIT_HUGE_PMD once */
+ ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
+ for (i = 1; !ret && i < HPAGE_PMD_NR; i++)
+ ret = try_to_unmap(page + i, ttu_flags);
+ VM_BUG_ON(ret);
}

-static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
+static void unfreeze_page(struct page *page)
{
- struct anon_vma_chain *avc;
- pgoff_t pgoff = page_to_pgoff(page);
-
- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
- pgoff, pgoff + HPAGE_PMD_NR - 1) {
- unsigned long address = __vma_address(page, avc->vma);
+ int i;

- mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE);
- unfreeze_page_vma(avc->vma, page, address);
- mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
- address, address + HPAGE_PMD_SIZE);
- }
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ remove_migration_ptes(page + i, page + i, true);
}

static void __split_huge_page_tail(struct page *head, int tail,
@@ -3365,7 +3199,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
ClearPageCompound(head);
spin_unlock_irq(&zone->lru_lock);

- unfreeze_page(page_anon_vma(head), head);
+ unfreeze_page(head);

for (i = 0; i < HPAGE_PMD_NR; i++) {
struct page *subpage = head + i;
@@ -3461,7 +3295,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
}

mlocked = PageMlocked(page);
- freeze_page(anon_vma, head);
+ freeze_page(head);
VM_BUG_ON_PAGE(compound_mapcount(head), head);

/* Make sure the page is not on per-CPU pagevec as it takes pin */
@@ -3490,7 +3324,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
BUG();
} else {
spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- unfreeze_page(anon_vma, head);
+ unfreeze_page(head);
ret = -EBUSY;
}

--
2.7.0

2016-03-03 16:53:09

by Kirill A. Shutemov

Subject: [PATCHv3 17/29] thp: prepare change_huge_pmd() for file thp

change_huge_pmd() has an assert which is not relevant for file pages.
For a shared mapping it's perfectly fine to have the page table entry
writable, without an explicit mkwrite.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c0ffa6a7803a..29a687cbef40 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1756,7 +1756,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
entry = pmd_mkwrite(entry);
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(!preserve_write && pmd_write(entry));
+ BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
+ pmd_write(entry));
}
spin_unlock(ptl);
}
--
2.7.0

2016-03-03 16:52:51

by Kirill A. Shutemov

Subject: [PATCHv3 27/29] shmem: get_unmapped_area align huge page

From: Hugh Dickins <[email protected]>

Provide a shmem_get_unmapped_area method in file_operations, called
at mmap time to decide the mapping address. It could be conditional
on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by
making it unconditional.

shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout). Lots of conditions, and in most cases
it just goes with the address it chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back
to ask for a larger arena, within which to align the mapping suitably.

There have to be some direct calls to shmem_get_unmapped_area(),
not via the file_operations: because of the way shmem_zero_setup()
is called to create a shmem object late in the mmap sequence, when
MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero. Though
this only matters when /proc/sys/vm/shmem_huge has been set.

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/char/mem.c | 24 ++++++++++++
include/linux/shmem_fs.h | 2 +
ipc/shm.c | 6 ++-
mm/mmap.c | 16 +++++++-
mm/shmem.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 142 insertions(+), 4 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 6b1721f978c2..a4c3ce0c9ece 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -22,6 +22,7 @@
#include <linux/device.h>
#include <linux/highmem.h>
#include <linux/backing-dev.h>
+#include <linux/shmem_fs.h>
#include <linux/splice.h>
#include <linux/pfn.h>
#include <linux/export.h>
@@ -661,6 +662,28 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma)
return 0;
}

+static unsigned long get_unmapped_area_zero(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+#ifdef CONFIG_MMU
+ if (flags & MAP_SHARED) {
+ /*
+ * mmap_zero() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge;
+ * and pass NULL for file as in mmap.c's get_unmapped_area(),
+ * so as not to confuse shmem with our handle on "/dev/zero".
+ */
+ return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
+ }
+
+ /* Otherwise flags & MAP_PRIVATE: with no shmem object beneath it */
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+#else
+ return -ENOSYS;
+#endif
+}
+
static ssize_t write_full(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -768,6 +791,7 @@ static const struct file_operations zero_fops = {
.read_iter = read_iter_zero,
.write_iter = write_iter_zero,
.mmap = mmap_zero,
+ .get_unmapped_area = get_unmapped_area_zero,
#ifndef CONFIG_MMU
.mmap_capabilities = zero_mmap_capabilities,
#endif
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 03490d1554ba..25ac75b424d3 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -53,6 +53,8 @@ extern struct file *shmem_file_setup(const char *name,
extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
unsigned long flags);
extern int shmem_zero_setup(struct vm_area_struct *);
+extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
extern bool shmem_mapping(struct address_space *mapping);
extern void shmem_unlock_mapping(struct address_space *mapping);
diff --git a/ipc/shm.c b/ipc/shm.c
index 3174634ca4e5..b797a6e49d78 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -476,13 +476,15 @@ static const struct file_operations shm_file_operations = {
.mmap = shm_mmap,
.fsync = shm_fsync,
.release = shm_release,
-#ifndef CONFIG_MMU
.get_unmapped_area = shm_get_unmapped_area,
-#endif
.llseek = noop_llseek,
.fallocate = shm_fallocate,
};

+/*
+ * shm_file_operations_huge is now identical to shm_file_operations,
+ * but we keep it distinct for the sake of is_file_shm_hugepages().
+ */
static const struct file_operations shm_file_operations_huge = {
.mmap = shm_mmap,
.fsync = shm_fsync,
diff --git a/mm/mmap.c b/mm/mmap.c
index b6aaf3609121..33f6bbb80d1c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -25,6 +25,7 @@
#include <linux/personality.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
#include <linux/profile.h>
#include <linux/export.h>
#include <linux/mount.h>
@@ -1892,8 +1893,19 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
return -ENOMEM;

get_area = current->mm->get_unmapped_area;
- if (file && file->f_op->get_unmapped_area)
- get_area = file->f_op->get_unmapped_area;
+ if (file) {
+ if (file->f_op->get_unmapped_area)
+ get_area = file->f_op->get_unmapped_area;
+ } else if (flags & MAP_SHARED) {
+ /*
+ * mmap_region() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge.
+ * do_mmap_pgoff() will clear pgoff, so match alignment.
+ */
+ pgoff = 0;
+ get_area = shmem_get_unmapped_area;
+ }
+
addr = get_area(file, addr, len, pgoff, flags);
if (IS_ERR_VALUE(addr))
return addr;
diff --git a/mm/shmem.c b/mm/shmem.c
index 5f47a143bd9d..fcd9dfa0bf90 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,94 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return ret;
}

+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long uaddr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ unsigned long (*get_area)(struct file *,
+ unsigned long, unsigned long, unsigned long, unsigned long);
+ unsigned long addr;
+ unsigned long offset;
+ unsigned long inflated_len;
+ unsigned long inflated_addr;
+ unsigned long inflated_offset;
+
+ if (len > TASK_SIZE)
+ return -ENOMEM;
+
+ get_area = current->mm->get_unmapped_area;
+ addr = get_area(file, uaddr, len, pgoff, flags);
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return addr;
+ if (IS_ERR_VALUE(addr))
+ return addr;
+ if (addr & ~PAGE_MASK)
+ return addr;
+ if (addr > TASK_SIZE - len)
+ return addr;
+
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return addr;
+ if (len < HPAGE_PMD_SIZE)
+ return addr;
+ if (flags & MAP_FIXED)
+ return addr;
+ /*
+ * Our priority is to support MAP_SHARED mapped hugely;
+ * and support MAP_PRIVATE mapped hugely too, until it is COWed.
+ * But if caller specified an address hint, respect that as before.
+ */
+ if (uaddr)
+ return addr;
+
+ if (shmem_huge != SHMEM_HUGE_FORCE) {
+ struct super_block *sb;
+
+ if (file) {
+ VM_BUG_ON(file->f_op != &shmem_file_operations);
+ sb = file_inode(file)->i_sb;
+ } else {
+ /*
+ * Called directly from mm/mmap.c, or drivers/char/mem.c
+ * for "/dev/zero", to create a shared anonymous object.
+ */
+ if (IS_ERR(shm_mnt))
+ return addr;
+ sb = shm_mnt->mnt_sb;
+ }
+ if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER)
+ return addr;
+ }
+
+ offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
+ if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+ return addr;
+ if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+ return addr;
+
+ inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+ if (inflated_len > TASK_SIZE)
+ return addr;
+ if (inflated_len < len)
+ return addr;
+
+ inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+ if (IS_ERR_VALUE(inflated_addr))
+ return addr;
+ if (inflated_addr & ~PAGE_MASK)
+ return addr;
+
+ inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+ inflated_addr += offset - inflated_offset;
+ if (inflated_offset > offset)
+ inflated_addr += HPAGE_PMD_SIZE;
+
+ if (inflated_addr > TASK_SIZE - len)
+ return addr;
+ return inflated_addr;
+}
+
#ifdef CONFIG_NUMA
static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
{
@@ -3315,6 +3403,7 @@ static const struct address_space_operations shmem_aops = {

static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
+ .get_unmapped_area = shmem_get_unmapped_area,
#ifdef CONFIG_TMPFS
.llseek = shmem_file_llseek,
.read_iter = shmem_file_read_iter,
@@ -3541,6 +3630,15 @@ void shmem_unlock_mapping(struct address_space *mapping)
{
}

+#ifdef CONFIG_MMU
+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+#endif
+
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
truncate_inode_pages_range(inode->i_mapping, lstart, lend);
--
2.7.0

2016-03-03 16:52:58

by Kirill A. Shutemov

Subject: [PATCHv3 10/29] mm, rmap: account file thp pages

Let's add a FileHugeMapped field to meminfo. It indicates how many file
THPs are currently mapped with a PMD.

NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.
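
For illustration only, the new field renders in /proc/meminfo next to the
existing AnonHugePages line roughly like this (the values are made up; the
format string comes from the patch below):

AnonHugePages:     55296 kB
FileHugeMapped:    18432 kB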

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/base/node.c | 10 ++++++----
fs/proc/meminfo.c | 5 +++--
include/linux/mmzone.h | 3 ++-
mm/huge_memory.c | 2 +-
mm/rmap.c | 12 ++++++------
mm/vmstat.c | 1 +
6 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..9cc4e9dad47e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -113,6 +113,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d FileHugeMapped: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -131,10 +132,11 @@ static ssize_t node_read_meminfo(struct device *dev,
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
- , nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR));
+ nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+ nid, K(node_page_state(nid, NR_ANON_THPS) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_FILE_THP_MAPPED) *
+ HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
#endif
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..50666e987fbd 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -105,6 +105,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "FileHugeMapped: %8lu kB\n"
#endif
#ifdef CONFIG_CMA
"CmaTotal: %8lu kB\n"
@@ -162,8 +163,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- , K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR)
+ , K(global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_FILE_THP_MAPPED) * HPAGE_PMD_NR)
#endif
#ifdef CONFIG_CMA
, K(totalcma_pages)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c60df9257cc7..85fd4aac53a1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,7 +158,8 @@ enum zone_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
- NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_ANON_THPS,
+ NR_FILE_THP_MAPPED,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2cade3851b7a..79598ee8a3ff 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2946,7 +2946,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/* No need in mapcount reference anymore */
for (i = 0; i < HPAGE_PMD_NR; i++)
diff --git a/mm/rmap.c b/mm/rmap.c
index b550bf637ce3..765e001836dc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1227,10 +1227,8 @@ void do_page_add_anon_rmap(struct page *page,
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (compound) {
- __inc_zone_page_state(page,
- NR_ANON_TRANSPARENT_HUGEPAGES);
- }
+ if (compound)
+ __inc_zone_page_state(page, NR_ANON_THPS);
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
}
if (unlikely(PageKsm(page)))
@@ -1268,7 +1266,7 @@ void page_add_new_anon_rmap(struct page *page,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
/* increment count (starts at -1) */
atomic_set(compound_mapcount_ptr(page), 0);
- __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __inc_zone_page_state(page, NR_ANON_THPS);
} else {
/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1298,6 +1296,7 @@ void page_add_file_rmap(struct page *page, bool compound)
}
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
+ __inc_zone_page_state(page, NR_FILE_THP_MAPPED);
} else {
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
@@ -1330,6 +1329,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
goto out;
+ __dec_zone_page_state(page, NR_FILE_THP_MAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))
goto out;
@@ -1363,7 +1363,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return;

- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);

if (TestClearPageDoubleMap(page)) {
/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 74f8c918ac4b..943b37f17007 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -762,6 +762,7 @@ const char * const vmstat_text[] = {
"workingset_activate",
"workingset_nodereclaim",
"nr_anon_transparent_hugepages",
+ "nr_file_transparent_hugepages_mapped",
"nr_free_cma",

/* enum writeback_stat_item counters */
--
2.7.0

2016-03-03 16:53:07

by Kirill A. Shutemov

Subject: [PATCHv3 21/29] vmscan: split file huge pages before paging them out

This prepares vmscan for file huge pages. We cannot write out huge pages,
so we need to split them on the way out.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/vmscan.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 39e90e24caff..832a00523ffb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -473,12 +473,14 @@ void drop_slab(void)

static inline int is_page_cache_freeable(struct page *page)
{
+ int radix_tree_pins = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
+
/*
* A freeable page cache page is referenced only by the caller
* that isolated the page, the page cache radix tree and
* optional buffer heads at page->private.
*/
- return page_count(page) - page_has_private(page) == 2;
+ return page_count(page) - page_has_private(page) == 1 + radix_tree_pins;
}

static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
@@ -548,8 +550,6 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* swap_backing_dev_info is bust: it doesn't reflect the
* congestion state of the swapdevs. Easy to fix, if needed.
*/
- if (!is_page_cache_freeable(page))
- return PAGE_KEEP;
if (!mapping) {
/*
* Some data journaling orphaned pages can have
@@ -1112,6 +1112,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* starts and then write it out here.
*/
try_to_unmap_flush_dirty();
+
+ if (!is_page_cache_freeable(page))
+ goto keep_locked;
+
+ if (unlikely(PageTransHuge(page))) {
+ if (split_huge_page_to_list(page, page_list))
+ goto keep_locked;
+ }
+
switch (pageout(page, mapping, sc)) {
case PAGE_KEEP:
goto keep_locked;
--
2.7.0

2016-03-03 16:53:03

by Kirill A. Shutemov

Subject: [PATCHv3 24/29] filemap: prepare find and delete operations for huge pages

For now, we have HPAGE_PMD_NR entries in the radix tree for every huge
page. That's suboptimal, and it will be changed to use Matthew's
multi-order entries later.

The 'add' operation is not changed, because we don't need it to implement
huge tmpfs: shmem uses its own implementation.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 187 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 134 insertions(+), 53 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index de867f83dc0a..eb9dbdf5573d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,43 +110,18 @@
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/

-static void page_cache_tree_delete(struct address_space *mapping,
- struct page *page, void *shadow)
+static void __page_cache_tree_delete(struct address_space *mapping,
+ struct radix_tree_node *node, void **slot, unsigned long index,
+ void *shadow)
{
- struct radix_tree_node *node;
- unsigned long index;
- unsigned int offset;
unsigned int tag;
- void **slot;
-
- VM_BUG_ON(!PageLocked(page));

- __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
-
- if (shadow) {
- mapping->nrexceptional++;
- /*
- * Make sure the nrexceptional update is committed before
- * the nrpages update so that final truncate racing
- * with reclaim does not see both counters 0 at the
- * same time and miss a shadow entry.
- */
- smp_wmb();
- }
- mapping->nrpages--;
-
- if (!node) {
- /* Clear direct pointer tags in root node */
- mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
- radix_tree_replace_slot(slot, shadow);
- return;
- }
+ VM_BUG_ON(node == NULL);
+ VM_BUG_ON(*slot == NULL);

/* Clear tree tags for the removed page */
- index = page->index;
- offset = index & RADIX_TREE_MAP_MASK;
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
- if (test_bit(offset, node->tags[tag]))
+ if (test_bit(index & RADIX_TREE_MAP_MASK, node->tags[tag]))
radix_tree_tag_clear(&mapping->page_tree, index, tag);
}

@@ -173,6 +148,54 @@ static void page_cache_tree_delete(struct address_space *mapping,
}
}

+static void page_cache_tree_delete(struct address_space *mapping,
+ struct page *page, void *shadow)
+{
+ struct radix_tree_node *node;
+ unsigned long index;
+ void **slot;
+ int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+ if (shadow) {
+ mapping->nrexceptional += nr;
+ /*
+ * Make sure the nrexceptional update is committed before
+ * the nrpages update so that final truncate racing
+ * with reclaim does not see both counters 0 at the
+ * same time and miss a shadow entry.
+ */
+ smp_wmb();
+ }
+ mapping->nrpages -= nr;
+
+ if (!node) {
+ /* Clear direct pointer tags in root node */
+ mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+ VM_BUG_ON(nr != 1);
+ radix_tree_replace_slot(slot, shadow);
+ return;
+ }
+
+ index = page->index;
+ VM_BUG_ON_PAGE(index & (nr - 1), page);
+ for (i = 0; i < nr; i++) {
+ /* Cross node border */
+ if (i && ((index + i) & RADIX_TREE_MAP_MASK) == 0) {
+ __radix_tree_lookup(&mapping->page_tree,
+ page->index + i, &node, &slot);
+ }
+
+ __page_cache_tree_delete(mapping, node,
+ slot + (i & RADIX_TREE_MAP_MASK), index + i,
+ shadow);
+ }
+}
+
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
@@ -181,6 +204,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
void __delete_from_page_cache(struct page *page, void *shadow)
{
struct address_space *mapping = page->mapping;
+ int nr = hpage_nr_pages(page);

trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -200,9 +224,10 @@ void __delete_from_page_cache(struct page *page, void *shadow)

/* hugetlb pages do not participate in page cache accounting. */
if (!PageHuge(page))
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
+ VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(page_mapped(page), page);

/*
@@ -227,9 +252,8 @@ void __delete_from_page_cache(struct page *page, void *shadow)
*/
void delete_from_page_cache(struct page *page)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page_mapping(page);
unsigned long flags;
-
void (*freepage)(struct page *);

BUG_ON(!PageLocked(page));
@@ -242,7 +266,13 @@ void delete_from_page_cache(struct page *page)

if (freepage)
freepage(page);
- page_cache_release(page);
+
+ if (PageTransHuge(page) && !PageHuge(page)) {
+ atomic_sub(HPAGE_PMD_NR, &page->_count);
+ VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ } else {
+ page_cache_release(page);
+ }
}
EXPORT_SYMBOL(delete_from_page_cache);

@@ -1035,7 +1065,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
{
void **pagep;
- struct page *page;
+ struct page *head, *page;

rcu_read_lock();
repeat:
@@ -1055,9 +1085,17 @@ repeat:
*/
goto out;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;

+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
/*
* Has the page moved?
* This is part of the lockless pagecache protocol. See
@@ -1100,12 +1138,12 @@ repeat:
if (page && !radix_tree_exception(page)) {
lock_page(page);
/* Has the page been truncated? */
- if (unlikely(page->mapping != mapping)) {
+ if (unlikely(page_mapping(page) != mapping)) {
unlock_page(page);
page_cache_release(page);
goto repeat;
}
- VM_BUG_ON_PAGE(page->index != offset, page);
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
}
return page;
}
@@ -1238,7 +1276,7 @@ unsigned find_get_entries(struct address_space *mapping,
rcu_read_lock();
restart:
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1253,8 +1291,16 @@ repeat:
*/
goto export;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
goto repeat;
+ }

/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1300,7 +1346,7 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
rcu_read_lock();
restart:
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1324,9 +1370,16 @@ repeat:
continue;
}

- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;

+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
/* Has the page moved? */
if (unlikely(page != *slot)) {
page_cache_release(page);
@@ -1367,7 +1420,7 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
rcu_read_lock();
restart:
radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
/* The hole, there no reason to continue */
@@ -1391,8 +1444,14 @@ repeat:
break;
}

- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
+ goto repeat;
+ }

/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1405,7 +1464,7 @@ repeat:
* otherwise we can get both false positives and false
* negatives, which is just confusing to the caller.
*/
- if (page->mapping == NULL || page->index != iter.index) {
+ if (page->mapping == NULL || page_to_pgoff(page) != iter.index) {
page_cache_release(page);
break;
}
@@ -1444,7 +1503,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
restart:
radix_tree_for_each_tagged(slot, &mapping->page_tree,
&iter, *index, tag) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1473,8 +1532,15 @@ repeat:
continue;
}

- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
goto repeat;
+ }

/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1523,7 +1589,7 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
restart:
radix_tree_for_each_tagged(slot, &mapping->page_tree,
&iter, start, tag) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1545,9 +1611,17 @@ repeat:
*/
goto export;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;

+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
/* Has the page moved? */
if (unlikely(page != *slot)) {
page_cache_release(page);
@@ -2140,7 +2214,7 @@ void filemap_map_pages(struct fault_env *fe,
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
loff_t size;
- struct page *page;
+ struct page *head, *page;

rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
@@ -2158,8 +2232,15 @@ repeat:
goto next;
}

- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ page_cache_release(page);
goto repeat;
+ }

/* Has the page moved? */
if (unlikely(page != *slot)) {
--
2.7.0

2016-03-03 17:01:34

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 15/29] thp: handle file pages in mremap()

We need to mirror the logic in move_ptes() wrt need_rmap_locks to get
proper serialization for file THP.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/mremap.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 3fa0a467df66..88fa7ab1a8ce 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -192,17 +192,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
break;
if (pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE) {
+ struct address_space *mapping = NULL;
+ struct anon_vma *anon_vma = NULL;
bool moved;
- VM_BUG_ON_VMA(vma->vm_file || !vma->anon_vma,
- vma);
/* See comment in move_ptes() */
- if (need_rmap_locks)
- anon_vma_lock_write(vma->anon_vma);
+ if (need_rmap_locks) {
+ if (vma->vm_file) {
+ mapping = vma->vm_file->f_mapping;
+ i_mmap_lock_write(mapping);
+ }
+ if (vma->anon_vma) {
+ anon_vma = vma->anon_vma;
+ anon_vma_lock_write(anon_vma);
+ }
+ }
moved = move_huge_pmd(vma, new_vma, old_addr,
new_addr, old_end,
old_pmd, new_pmd);
- if (need_rmap_locks)
- anon_vma_unlock_write(vma->anon_vma);
+ if (anon_vma)
+ anon_vma_unlock_write(anon_vma);
+ if (mapping)
+ i_mmap_unlock_write(mapping);
if (moved) {
need_flush = true;
continue;
--
2.7.0

2016-03-03 17:01:33

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 14/29] thp: handle file COW faults

File COW for THP is handled on the pte level: just split the pmd.

It's not clear how beneficial allocation of huge pages on COW faults
would be. And it would require some code to make them work.

I think at some point we can consider teaching khugepaged to collapse
pages in COW mappings, but allocating huge on fault is probably
overkill.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index ab9acc4d83a2..0acdd33bfe16 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3369,6 +3369,11 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
if (fe->vma->vm_ops->pmd_fault)
return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
fe->flags);
+
+ /* COW handled on pte level: split pmd */
+ VM_BUG_ON_VMA(fe->vma->vm_flags & VM_SHARED, fe->vma);
+ split_huge_pmd(fe->vma, fe->pmd, fe->address);
+
return VM_FAULT_FALLBACK;
}

--
2.7.0

2016-03-03 17:01:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 12/29] thp: support file pages in zap_huge_pmd()

split_huge_pmd() for file mappings (and DAX too) is implemented by just
clearing the pmd entry, as we can re-fill this area from the page cache
on the pte level later.

This means we don't need to deposit page tables when file THP is mapped.
Therefore we shouldn't try to withdraw a page table when zap_huge_pmd()
hits a file THP PMD.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 79598ee8a3ff..15704484abf4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1654,10 +1654,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page = pmd_page(orig_pmd);
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON_PAGE(!PageHead(page), page);
- pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
- atomic_long_dec(&tlb->mm->nr_ptes);
+ if (PageAnon(page)) {
+ pgtable_t pgtable;
+ pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+ pte_free(tlb->mm, pgtable);
+ atomic_long_dec(&tlb->mm->nr_ptes);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ } else {
+ add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+ }
spin_unlock(ptl);
tlb_remove_page(tlb, page);
}
--
2.7.0

2016-03-03 17:02:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 28/29] shmem: add huge pages support

Here's a basic implementation of huge pages support for shmem/tmpfs.

It's all pretty straightforward:

- shmem_getpage() allocates a huge page if it can and tries to insert
it into the radix tree with shmem_add_to_page_cache();

- shmem_add_to_page_cache() puts the page onto the radix-tree if
there's space for it;

- shmem_undo_range() removes a huge page if it is fully within the
range. A partial truncate of a huge page zeroes out that part of
the THP.

This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
behaviour. As we don't really create a hole in this case,
lseek(SEEK_HOLE) may have inconsistent results depending on what
pages happened to be allocated (see the sketch after this list).

- no need to change shmem_fault(): core-mm will map a compound page
as huge if the VMA is suitable;
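
For illustration, here is a minimal userspace sketch of the
lseek(SEEK_HOLE) caveat above. It is not part of the patch; the mount
point /mnt/huge and the huge=always mount are assumptions made only for
the example. With a THP backing the range, the punched sub-range is
zeroed rather than freed, so the first hole reported may be at i_size
instead of at the punched offset.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int i, fd = open("/mnt/huge/f", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return 1;
	memset(buf, 0xaa, sizeof(buf));
	for (i = 0; i < 512; i++)	/* populate one 2M extent */
		if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
			return 1;

	/* punch 1M inside the (possibly huge) page */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 1 << 20);

	/* on a huge-backed file this tends to print 2097152 (i_size), not 0 */
	printf("first hole at %lld\n", (long long) lseek(fd, 0, SEEK_HOLE));
	return 0;
}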

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 +
mm/memory.c | 5 +-
mm/mempolicy.c | 2 +-
mm/page-writeback.c | 1 +
mm/shmem.c | 354 +++++++++++++++++++++++++++++++++++++-----------
mm/swap.c | 2 +
6 files changed, 281 insertions(+), 85 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ac6dc46dc65a..30382963c8f6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -163,6 +163,8 @@ struct page *get_huge_zero_page(void);

#define transparent_hugepage_enabled(__vma) 0

+static inline void prep_transhuge_page(struct page *page) {}
+
#define transparent_hugepage_flags 0UL
static inline int
split_huge_page_to_list(struct page *page, struct list_head *list)
diff --git a/mm/memory.c b/mm/memory.c
index 0acdd33bfe16..b6ba8e225996 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1093,7 +1093,7 @@ again:
* unmap shared but keep private pages.
*/
if (details->check_mapping &&
- details->check_mapping != page->mapping)
+ details->check_mapping != page_rmapping(page))
continue;
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -1184,7 +1184,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
- VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
+ VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
+ !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
split_huge_pmd(vma, pmd, addr);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a4c0e169517c..c7654344407d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -534,7 +534,7 @@ retry:
nid = page_to_nid(page);
if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
continue;
- if (PageTail(page) && PageAnon(page)) {
+ if (PageTransCompound(page)) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..2c8d5386665d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2554,6 +2554,7 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);

+ page = compound_head(page);
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index fcd9dfa0bf90..2de464f65e53 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -173,10 +173,13 @@ static inline int shmem_reacct_size(unsigned long flags,
* shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
* so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
*/
-static inline int shmem_acct_block(unsigned long flags)
+static inline int shmem_acct_block(unsigned long flags, long pages)
{
- return (flags & VM_NORESERVE) ?
- security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0;
+ if (!(flags & VM_NORESERVE))
+ return 0;
+
+ return security_vm_enough_memory_mm(current->mm,
+ pages * VM_ACCT(PAGE_CACHE_SIZE));
}

static inline void shmem_unacct_blocks(unsigned long flags, long pages)
@@ -376,30 +379,55 @@ static int shmem_add_to_page_cache(struct page *page,
struct address_space *mapping,
pgoff_t index, void *expected)
{
- int error;
+ int error, nr = hpage_nr_pages(page);

+ VM_BUG_ON_PAGE(PageTail(page), page);
+ VM_BUG_ON_PAGE(index != round_down(index, nr), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ VM_BUG_ON(expected && PageTransHuge(page));

- page_cache_get(page);
+ atomic_add(nr, &page->_count);
page->mapping = mapping;
page->index = index;

spin_lock_irq(&mapping->tree_lock);
- if (!expected)
+ if (PageTransHuge(page)) {
+ void __rcu **results;
+ pgoff_t idx;
+ int i;
+
+ error = 0;
+ if (radix_tree_gang_lookup_slot(&mapping->page_tree,
+ &results, &idx, index, 1) &&
+ idx < index + HPAGE_PMD_NR) {
+ error = -EEXIST;
+ }
+
+ if (!error) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ error = radix_tree_insert(&mapping->page_tree,
+ index + i, page + i);
+ VM_BUG_ON(error);
+ }
+ count_vm_event(THP_FILE_ALLOC);
+ }
+ } else if (!expected) {
error = radix_tree_insert(&mapping->page_tree, index, page);
- else
+ } else {
error = shmem_radix_tree_replace(mapping, index, expected,
page);
+ }
+
if (!error) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- __inc_zone_page_state(page, NR_SHMEM);
+ mapping->nrpages += nr;
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, nr);
spin_unlock_irq(&mapping->tree_lock);
} else {
page->mapping = NULL;
spin_unlock_irq(&mapping->tree_lock);
- page_cache_release(page);
+ atomic_sub(nr, &page->_count);
}
return error;
}
@@ -412,6 +440,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
struct address_space *mapping = page->mapping;
int error;

+ VM_BUG_ON_PAGE(PageCompound(page), page);
+
spin_lock_irq(&mapping->tree_lock);
error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
page->mapping = NULL;
@@ -587,6 +617,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index = indices[i];
if (index >= end)
break;
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page);

if (radix_tree_exceptional_entry(page)) {
if (unfalloc)
@@ -598,8 +629,29 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,

if (!trylock_page(page))
continue;
+
+ if (PageTransTail(page)) {
+ /* Middle of THP: zero out the page */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ } else if (PageTransHuge(page)) {
+ if (index == round_down(end, HPAGE_PMD_NR)) {
+ /*
+ * Range ends in the middle of THP:
+ * zero out the page
+ */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ }
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ }
+
if (!unfalloc || !PageUptodate(page)) {
- if (page->mapping == mapping) {
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (page_mapping(page) == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
truncate_inode_page(mapping, page);
}
@@ -675,8 +727,36 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}

lock_page(page);
+
+ if (PageTransTail(page)) {
+ /* Middle of THP: zero out the page */
+ clear_highpage(page);
+ unlock_page(page);
+ /*
+ * Partial thp truncate due 'start' in middle
+ * of THP: don't need to look on these pages
+ * again on !pvec.nr restart.
+ */
+ if (index != round_down(end, HPAGE_PMD_NR))
+ start++;
+ continue;
+ } else if (PageTransHuge(page)) {
+ if (index == round_down(end, HPAGE_PMD_NR)) {
+ /*
+ * Range ends in the middle of THP:
+ * zero out the page
+ */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ }
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ }
+
if (!unfalloc || !PageUptodate(page)) {
- if (page->mapping == mapping) {
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (page_mapping(page) == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
truncate_inode_page(mapping, page);
} else {
@@ -935,6 +1015,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
swp_entry_t swap;
pgoff_t index;

+ VM_BUG_ON_PAGE(PageCompound(page), page);
BUG_ON(!PageLocked(page));
mapping = page->mapping;
index = page->index;
@@ -1034,8 +1115,8 @@ redirty:
return 0;
}

-#ifdef CONFIG_NUMA
#ifdef CONFIG_TMPFS
+#ifdef CONFIG_NUMA
static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
char buffer[64];
@@ -1059,68 +1140,129 @@ static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
}
return mpol;
}
+
+#else
+
+static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
+{
+}
+#endif /* CONFIG_NUMA */
#endif /* CONFIG_TMPFS */

+static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
+ struct shmem_inode_info *info, pgoff_t index)
+{
+ /* Create a pseudo vma that just contains the policy */
+ vma->vm_start = 0;
+ /* Bias interleave by inode number to distribute better across nodes */
+ vma->vm_pgoff = index + info->vfs_inode.i_ino;
+ vma->vm_ops = NULL;
+
+#ifdef CONFIG_NUMA
+ vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+#endif /* CONFIG_NUMA */
+}
+
+static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_NUMA
+ /* Drop reference taken by mpol_shared_policy_lookup() */
+ mpol_cond_put(vma->vm_policy);
+#endif
+}
+
static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
struct vm_area_struct pvma;
struct page *page;

- /* Create a pseudo vma that just contains the policy */
- pvma.vm_start = 0;
- /* Bias interleave by inode number to distribute better across nodes */
- pvma.vm_pgoff = index + info->vfs_inode.i_ino;
- pvma.vm_ops = NULL;
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
-
+ shmem_pseudo_vma_init(&pvma, info, index);
page = swapin_readahead(swap, gfp, &pvma, 0);
-
- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(pvma.vm_policy);
+ shmem_pseudo_vma_destroy(&pvma);

return page;
}

-static struct page *shmem_alloc_page(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+static struct page *shmem_alloc_hugepage(gfp_t gfp,
+ struct shmem_inode_info *info, pgoff_t index)
{
struct vm_area_struct pvma;
+ struct inode *inode = &info->vfs_inode;
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t idx, hindex = round_down(index, HPAGE_PMD_NR);
+ void __rcu **results;
struct page *page;

- /* Create a pseudo vma that just contains the policy */
- pvma.vm_start = 0;
- /* Bias interleave by inode number to distribute better across nodes */
- pvma.vm_pgoff = index + info->vfs_inode.i_ino;
- pvma.vm_ops = NULL;
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
-
- page = alloc_page_vma(gfp, &pvma, 0);
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return NULL;

- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(pvma.vm_policy);
+ rcu_read_lock();
+ if (radix_tree_gang_lookup_slot(&mapping->page_tree, &results, &idx,
+ hindex, 1) && idx < hindex + HPAGE_PMD_NR) {
+ rcu_read_unlock();
+ return NULL;
+ }
+ rcu_read_unlock();

+ shmem_pseudo_vma_init(&pvma, info, hindex);
+ page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
+ HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+ shmem_pseudo_vma_destroy(&pvma);
+ if (page)
+ prep_transhuge_page(page);
return page;
}
-#else /* !CONFIG_NUMA */
-#ifdef CONFIG_TMPFS
-static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
-{
-}
-#endif /* CONFIG_TMPFS */

-static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
+static struct page *shmem_alloc_page(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
- return swapin_readahead(swap, gfp, NULL, 0);
+ struct vm_area_struct pvma;
+ struct page *page;
+
+ shmem_pseudo_vma_init(&pvma, info, index);
+ page = alloc_page_vma(gfp, &pvma, 0);
+ shmem_pseudo_vma_destroy(&pvma);
+
+ return page;
}

-static inline struct page *shmem_alloc_page(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
+ struct shmem_inode_info *info, struct shmem_sb_info *sbinfo,
+ pgoff_t index, bool huge)
{
- return alloc_page(gfp);
+ struct page *page;
+ int nr;
+ int err = -ENOSPC;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ huge = false;
+ nr = huge ? HPAGE_PMD_NR : 1;
+
+ if (shmem_acct_block(info->flags, nr))
+ goto failed;
+ if (sbinfo->max_blocks) {
+ if (percpu_counter_compare(&sbinfo->used_blocks,
+ sbinfo->max_blocks + nr) > 0)
+ goto unacct;
+ percpu_counter_add(&sbinfo->used_blocks, nr);
+ }
+
+ if (huge)
+ page = shmem_alloc_hugepage(gfp, info, index);
+ else
+ page = shmem_alloc_page(gfp, info, index);
+ if (page)
+ return page;
+
+ err = -ENOMEM;
+ if (sbinfo->max_blocks)
+ percpu_counter_add(&sbinfo->used_blocks, -nr);
+unacct:
+ shmem_unacct_blocks(info->flags, nr);
+failed:
+ return ERR_PTR(err);
}
-#endif /* CONFIG_NUMA */

#if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS)
static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
@@ -1228,6 +1370,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ pgoff_t hindex = index;
int error;
int once = 0;
int alloced = 0;
@@ -1344,50 +1487,75 @@ repeat:
swap_free(swap);

} else {
- if (shmem_acct_block(info->flags)) {
- error = -ENOSPC;
- goto failed;
- }
- if (sbinfo->max_blocks) {
- if (percpu_counter_compare(&sbinfo->used_blocks,
- sbinfo->max_blocks) >= 0) {
- error = -ENOSPC;
- goto unacct;
- }
- percpu_counter_inc(&sbinfo->used_blocks);
+ /* shmem_symlink() */
+ if (mapping->a_ops != &shmem_aops)
+ goto alloc_nohuge;
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ goto alloc_nohuge;
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ goto alloc_huge;
+ switch (sbinfo->huge) {
+ loff_t i_size;
+ pgoff_t off;
+ case SHMEM_HUGE_NEVER:
+ goto alloc_nohuge;
+ case SHMEM_HUGE_WITHIN_SIZE:
+ off = round_up(index, HPAGE_PMD_NR);
+ i_size = round_up(i_size_read(inode), PAGE_CACHE_SIZE);
+ if (i_size >> PAGE_CACHE_SHIFT >= off)
+ goto alloc_huge;
+ /* fallthrough */
+ case SHMEM_HUGE_ADVISE:
+ /* TODO: wire up fadvise()/madvise() */
+ goto alloc_nohuge;
}

- page = shmem_alloc_page(gfp, info, index);
- if (!page) {
- error = -ENOMEM;
- goto decused;
+alloc_huge:
+ page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+ index, true);
+ if (IS_ERR(page)) {
+alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+ index, false);
+ }
+ if (IS_ERR(page)) {
+ error = PTR_ERR(page);
+ page = NULL;
+ goto failed;
}

+ if (PageTransHuge(page))
+ hindex = round_down(index, HPAGE_PMD_NR);
+ else
+ hindex = index;
+
__SetPageSwapBacked(page);
__SetPageLocked(page);
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
- false);
+ PageTransHuge(page));
if (error)
- goto decused;
- error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
+ goto unacct;
+ error = radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
+ compound_order(page));
if (!error) {
- error = shmem_add_to_page_cache(page, mapping, index,
+ error = shmem_add_to_page_cache(page, mapping, hindex,
NULL);
radix_tree_preload_end();
}
if (error) {
- mem_cgroup_cancel_charge(page, memcg, false);
- goto decused;
+ mem_cgroup_cancel_charge(page, memcg,
+ PageTransHuge(page));
+ goto unacct;
}
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false,
+ PageTransHuge(page));
lru_cache_add_anon(page);

spin_lock(&info->lock);
- info->alloced++;
- inode->i_blocks += BLOCKS_PER_PAGE;
+ info->alloced += 1 << compound_order(page);
+ inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
shmem_recalc_inode(inode);
spin_unlock(&info->lock);
alloced = true;
@@ -1403,10 +1571,15 @@ clear:
* but SGP_FALLOC on a page fallocated earlier must initialize
* it now, lest undo on failure cancel our earlier guarantee.
*/
- if (sgp != SGP_WRITE) {
- clear_highpage(page);
- flush_dcache_page(page);
- SetPageUptodate(page);
+ if (sgp != SGP_WRITE && !PageUptodate(page)) {
+ struct page *head = compound_head(page);
+ int i;
+
+ for (i = 0; i < (1 << compound_order(head)); i++) {
+ clear_highpage(head + i);
+ flush_dcache_page(head + i);
+ }
+ SetPageUptodate(head);
}
if (sgp == SGP_DIRTY)
set_page_dirty(page);
@@ -1425,17 +1598,23 @@ clear:
error = -EINVAL;
goto unlock;
}
- *pagep = page;
+ *pagep = page + index - hindex;
return 0;

/*
* Error recovery.
*/
-decused:
- if (sbinfo->max_blocks)
- percpu_counter_add(&sbinfo->used_blocks, -1);
unacct:
- shmem_unacct_blocks(info->flags, 1);
+ if (sbinfo->max_blocks)
+ percpu_counter_add(&sbinfo->used_blocks,
+ 1 << compound_order(page));
+ shmem_unacct_blocks(info->flags, 1 << compound_order(page));
+
+ if (PageTransHuge(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto alloc_nohuge;
+ }
failed:
if (swap.val && !shmem_confirm_swap(mapping, index, swap))
error = -EEXIST;
@@ -1776,12 +1955,23 @@ shmem_write_end(struct file *file, struct address_space *mapping,
i_size_write(inode, pos + copied);

if (!PageUptodate(page)) {
+ struct page *head = compound_head(page);
+ if (PageTransCompound(page)) {
+ int i;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ if (head + i == page)
+ continue;
+ clear_highpage(head + i);
+ flush_dcache_page(head + i);
+ }
+ }
if (copied < PAGE_CACHE_SIZE) {
unsigned from = pos & (PAGE_CACHE_SIZE - 1);
zero_user_segments(page, 0, from,
from + copied, PAGE_CACHE_SIZE);
}
- SetPageUptodate(page);
+ SetPageUptodate(head);
}
set_page_dirty(page);
unlock_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e97714a..5ee5118f45d4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -291,6 +291,7 @@ static bool need_activate_page_drain(int cpu)

void activate_page(struct page *page)
{
+ page = compound_head(page);
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);

@@ -315,6 +316,7 @@ void activate_page(struct page *page)
{
struct zone *zone = page_zone(page);

+ page = compound_head(page);
spin_lock_irq(&zone->lru_lock);
__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
spin_unlock_irq(&zone->lru_lock);
--
2.7.0

2016-03-03 17:02:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 05/29] mm: do not pass mm_struct into handle_mm_fault

We always have vma->vm_mm around.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/avr32/mm/fault.c | 2 +-
arch/cris/mm/fault.c | 2 +-
arch/frv/mm/fault.c | 2 +-
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m32r/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/metag/mm/fault.c | 2 +-
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/mn10300/mm/fault.c | 2 +-
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/mm/copro_fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/score/mm/fault.c | 2 +-
arch/sh/mm/fault.c | 2 +-
arch/sparc/mm/fault_32.c | 4 ++--
arch/sparc/mm/fault_64.c | 2 +-
arch/tile/mm/fault.c | 2 +-
arch/um/kernel/trap.c | 2 +-
arch/unicore32/mm/fault.c | 2 +-
arch/x86/mm/fault.c | 2 +-
arch/xtensa/mm/fault.c | 2 +-
drivers/iommu/amd_iommu_v2.c | 2 +-
drivers/iommu/intel-svm.c | 2 +-
include/linux/mm.h | 9 ++++-----
mm/gup.c | 5 ++---
mm/ksm.c | 3 +--
mm/memory.c | 13 +++++++------
36 files changed, 47 insertions(+), 49 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 4a905bd667e2..83e9eee57a55 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -147,7 +147,7 @@ retry:
/* If for any reason at all we couldn't handle the fault,
make sure we exit gracefully rather than endlessly redo
the fault. */
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index af63f4a13e60..e94e5aa33985 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -137,7 +137,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
if (unlikely(fatal_signal_pending(current))) {
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index daafcf121ce0..7cd0d5b2ef50 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -243,7 +243,7 @@ good_area:
goto out;
}

- return handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+ return handle_mm_fault(vma, addr & PAGE_MASK, flags);

check_stack:
/* Don't allow expansion below FIRST_USER_ADDRESS */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 92ddac1e8ca2..a36c31adb087 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -183,7 +183,7 @@ good_area:
goto out;
}

- return handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
+ return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);

check_stack:
if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index c03533937a9f..a4b7edac8f10 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -134,7 +134,7 @@ good_area:
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 3066d40a6db1..112ef26c7f2e 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -168,7 +168,7 @@ retry:
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 61d99767fe16..614a46c413d2 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -164,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, ear0, flags);
+ fault = handle_mm_fault(vma, ear0, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 8704c9320032..bd7c251e2bce 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -101,7 +101,7 @@ good_area:
break;
}

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 70b40d1205a6..fa6ad95e992e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -159,7 +159,7 @@ retry:
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 8f9875b7933d..a3785d3644c2 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -196,7 +196,7 @@ good_area:
*/
addr = (address & PAGE_MASK);
set_thread_fault_code(error_code);
- fault = handle_mm_fault(mm, vma, addr, flags);
+ fault = handle_mm_fault(vma, addr, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 6a94cdd0c830..bd66a0b20c6b 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -136,7 +136,7 @@ good_area:
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
pr_debug("handle_mm_fault returns %d\n", fault);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index f57edca63609..372783a67dda 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -133,7 +133,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 177dfc003643..abb678ccde6f 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -216,7 +216,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 4b88fa031891..9560ad731120 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -153,7 +153,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 4a1d181ed32f..f23781d6bbb3 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -254,7 +254,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index b51878b0c6b8..affc4eb3f89e 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -131,7 +131,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 230ac20ae794..e94cd225e816 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -163,7 +163,7 @@ good_area:
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index a762864ec92e..87436cd7ea1a 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -243,7 +243,7 @@ good_area:
* fault.
*/

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 6527882ce05e..bb0354222b11 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -75,7 +75,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
}

ret = 0;
- *flt = handle_mm_fault(mm, vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
+ *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
if (unlikely(*flt & VM_FAULT_ERROR)) {
if (*flt & VM_FAULT_OOM) {
ret = -ENOMEM;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a67c6d781c52..a4db22f65021 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -429,7 +429,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
if (fault & VM_FAULT_SIGSEGV)
goto bad_area;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index ec1a30d0d11a..9579c92e35b9 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -455,7 +455,7 @@ retry:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
/* No reason to continue if interrupted by SIGKILL. */
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
fault = VM_FAULT_SIGNAL;
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 37a6c2e0e969..995b71e4db4b 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -111,7 +111,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 79d8276377d1..9bf876780cef 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -487,7 +487,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
if (mm_fault_error(regs, error_code, address, fault))
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index c399e7b3b035..3d2d6686d8e9 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -241,7 +241,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
@@ -411,7 +411,7 @@ good_area:
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
}
- switch (handle_mm_fault(mm, vma, address, flags)) {
+ switch (handle_mm_fault(vma, address, flags)) {
case VM_FAULT_SIGBUS:
case VM_FAULT_OOM:
goto do_sigbus;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a33da59..6c43b924a7a2 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -436,7 +436,7 @@ good_area:
goto bad_area;
}

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
goto exit_exception;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 13eac59bf16a..4be04f649396 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -435,7 +435,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 98783dd0fa2e..ad8f206ab5e8 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -73,7 +73,7 @@ good_area:
do {
int fault;

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
goto out_nosemaphore;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index afccef5529cc..43c84cb70fc7 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -194,7 +194,7 @@ good_area:
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the fault.
*/
- fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+ fault = handle_mm_fault(vma, addr & PAGE_MASK, flags);
return fault;

check_stack:
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..d01fd75adfcb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1235,7 +1235,7 @@ good_area:
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;

/*
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index c9784c1b18d8..551bba2c9ed8 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -110,7 +110,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 7caf2fa237f2..d2f978245ca6 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -539,7 +539,7 @@ static void do_fault(struct work_struct *work)
goto out;
}

- ret = handle_mm_fault(mm, vma, address, write);
+ ret = handle_mm_fault(vma, address, write);
if (ret & VM_FAULT_ERROR) {
/* failed to service fault */
up_read(&mm->mmap_sem);
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 50464833d0b8..48c0f18e7a84 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -559,7 +559,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (access_error(vma, req))
goto invalid;

- ret = handle_mm_fault(svm->mm, vma, address,
+ ret = handle_mm_fault(vma, address,
req->wr_req ? FAULT_FLAG_WRITE : 0);
if (ret & VM_FAULT_ERROR)
goto invalid;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 619564bc2fc0..82f7836686a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1195,15 +1195,14 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
int invalidate_inode_page(struct page *page);

#ifdef CONFIG_MMU
-extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags);
+extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags);
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
#else
-static inline int handle_mm_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- unsigned int flags)
+static inline int handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
/* should never happen if there's no MMU */
BUG();
diff --git a/mm/gup.c b/mm/gup.c
index 7bf19ffa2199..60f422a0af8b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -349,7 +349,6 @@ unmap:
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
unsigned long address, unsigned int *flags, int *nonblocking)
{
- struct mm_struct *mm = vma->vm_mm;
unsigned int fault_flags = 0;
int ret;

@@ -372,7 +371,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_TRIED;
}

- ret = handle_mm_fault(mm, vma, address, fault_flags);
+ ret = handle_mm_fault(vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return -ENOMEM;
@@ -659,7 +658,7 @@ retry:
if (!(vm_flags & vma->vm_flags))
return -EFAULT;

- ret = handle_mm_fault(mm, vma, address, fault_flags);
+ ret = handle_mm_fault(vma, address, fault_flags);
major |= ret & VM_FAULT_MAJOR;
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
diff --git a/mm/ksm.c b/mm/ksm.c
index 823d78b2a055..1bd92327bf01 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -440,8 +440,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
if (IS_ERR_OR_NULL(page))
break;
if (PageKsm(page))
- ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ ret = handle_mm_fault(vma, addr, FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index f3fa56e29c58..d8928f567193 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3344,9 +3344,10 @@ unlock:
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
{
+ struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -3428,15 +3429,15 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
{
int ret;

__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGFAULT);

/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
@@ -3448,7 +3449,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();

- ret = __handle_mm_fault(mm, vma, address, flags);
+ ret = __handle_mm_fault(vma, address, flags);

if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
--
2.7.0

2016-03-03 17:02:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 13/29] thp: handle file pages in split_huge_pmd()

Splitting a THP PMD is simple: just unmap it, as in the DAX case.
Unlike DAX, we also remove the page from rmap and drop the reference.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 15704484abf4..985041c453bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2900,10 +2900,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

count_vm_event(THP_SPLIT_PMD);

- if (vma_is_dax(vma)) {
- pmd_t _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ if (!vma_is_anonymous(vma)) {
+ _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
if (is_huge_zero_pmd(_pmd))
put_huge_zero_page();
+ if (vma_is_dax(vma))
+ return;
+ page = pmd_page(_pmd);
+ page_remove_rmap(page, true);
+ put_page(page);
+ add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
return;
} else if (is_huge_zero_pmd(*pmd)) {
return __split_huge_zero_page_pmd(vma, haddr, pmd);
--
2.7.0

2016-03-03 16:52:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 18/29] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem

vma_adjust_trans_huge() splits the pmd if it crosses a VMA boundary.
During the split we munlock the huge page, which requires an rmap walk.
The rmap walk wants to take the lock on its own.

Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/mmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index af6722385edf..2cb88eb252b8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -677,6 +677,8 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

+ vma_adjust_trans_huge(vma, start, end, adjust_next);
+
if (file) {
mapping = file->f_mapping;
root = &mapping->i_mmap;
@@ -697,8 +699,6 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

- vma_adjust_trans_huge(vma, start, end, adjust_next);
-
anon_vma = vma->anon_vma;
if (!anon_vma && adjust_next)
anon_vma = next->anon_vma;
--
2.7.0

2016-03-03 17:04:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 23/29] radix-tree: implement radix_tree_maybe_preload_order()

The new helper is similar to radix_tree_maybe_preload(), but tries to
preload the number of nodes required to insert (1 << order) contiguous,
naturally-aligned elements.

This is required to push huge pages into pagecache.
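
As a worked example (my arithmetic, assuming 64-bit longs and the
default RADIX_TREE_MAP_SHIFT of 6, i.e. RADIX_TREE_MAP_SIZE == 64 and
RADIX_TREE_MAX_PATH == 11), preloading for order == 9, one PMD-sized
THP, comes out as:

	nr_subtrees = 512 >> 6 = 8, subtree_height = 1

	nr_nodes = 11		(worst-case path to the 0-index item)
		 + (11 - 1)	(branch down to the subtrees)
		 - 1		(root node is shared)
		 + 8 * height_to_maxnodes[1]	(= 8 * 1, the subtrees)
		 = 28 nodes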

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/radix-tree.h | 1 +
lib/radix-tree.c | 68 ++++++++++++++++++++++++++++++++++++++++------
2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 32623d26b62a..20b626160430 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -288,6 +288,7 @@ unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
unsigned long first_index, unsigned int max_items);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 224b369f5a5e..84d417665ddc 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -42,6 +42,9 @@
*/
static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly;

+/* Number of nodes in fully populated tree of given height */
+static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
+
/*
* Radix tree node cache.
*/
@@ -261,7 +264,7 @@ radix_tree_node_free(struct radix_tree_node *node)
* To make use of this facility, the radix tree must be initialised without
* __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
*/
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload(gfp_t gfp_mask, int nr)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
@@ -269,14 +272,14 @@ static int __radix_tree_preload(gfp_t gfp_mask)

preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
- while (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+ while (rtp->nr < nr) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
- if (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+ if (rtp->nr < nr) {
node->private_data = rtp->nodes;
rtp->nodes = node;
rtp->nr++;
@@ -302,7 +305,7 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
}
EXPORT_SYMBOL(radix_tree_preload);

@@ -314,7 +317,7 @@ EXPORT_SYMBOL(radix_tree_preload);
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
if (gfpflags_allow_blocking(gfp_mask))
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
return 0;
@@ -322,6 +325,51 @@ int radix_tree_maybe_preload(gfp_t gfp_mask)
EXPORT_SYMBOL(radix_tree_maybe_preload);

/*
+ * The same as function above, but preload number of nodes required to insert
+ * (1 << order) continuous naturally-aligned elements.
+ */
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
+{
+ unsigned long nr_subtrees;
+ int nr_nodes, subtree_height;
+
+ /* Preloading doesn't help anything with this gfp mask, skip it */
+ if (!gfpflags_allow_blocking(gfp_mask)) {
+ preempt_disable();
+ return 0;
+ }
+
+ /*
+ * Calculate number and height of fully populated subtrees it takes to
+ * store (1 << order) elements.
+ */
+ nr_subtrees = 1 << order;
+ for (subtree_height = 0; nr_subtrees > RADIX_TREE_MAP_SIZE;
+ subtree_height++)
+ nr_subtrees >>= RADIX_TREE_MAP_SHIFT;
+
+ /*
+ * The worst case is zero height tree with a single item at index 0 and
+ * then inserting items starting at ULONG_MAX - (1 << order).
+ *
+ * This requires RADIX_TREE_MAX_PATH nodes to build branch from root to
+ * 0-index item.
+ */
+ nr_nodes = RADIX_TREE_MAX_PATH;
+
+ /* Plus branch to fully populated subtrees. */
+ nr_nodes += RADIX_TREE_MAX_PATH - subtree_height;
+
+ /* Root node is shared. */
+ nr_nodes--;
+
+ /* Plus nodes required to build subtrees. */
+ nr_nodes += nr_subtrees * height_to_maxnodes[subtree_height];
+
+ return __radix_tree_preload(gfp_mask, nr_nodes);
+}
+
+/*
* Return the maximum key which can be store into a
* radix tree with height HEIGHT.
*/
@@ -1472,12 +1520,16 @@ static __init unsigned long __maxindex(unsigned int height)
return ~0UL >> shift;
}

-static __init void radix_tree_init_maxindex(void)
+static __init void radix_tree_init_arrays(void)
{
- unsigned int i;
+ unsigned int i, j;

for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
height_to_maxindex[i] = __maxindex(i);
+ for (i = 0; i < ARRAY_SIZE(height_to_maxnodes); i++) {
+ for (j = i; j > 0; j--)
+ height_to_maxnodes[i] += height_to_maxindex[j - 1] + 1;
+ }
}

static int radix_tree_callback(struct notifier_block *nfb,
@@ -1507,6 +1559,6 @@ void __init radix_tree_init(void)
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
- radix_tree_init_maxindex();
+ radix_tree_init_arrays();
hotcpu_notifier(radix_tree_callback, 0);
}
--
2.7.0

2016-03-03 17:04:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 25/29] truncate: handle file thp

For shmem/tmpfs we only need to tweak truncate_inode_page() and
invalidate_mapping_pages().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/truncate.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 7598b552ae03..40d3730a8e62 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -157,10 +157,14 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)

int truncate_inode_page(struct address_space *mapping, struct page *page)
{
+ loff_t holelen;
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ holelen = PageTransHuge(page) ? HPAGE_PMD_SIZE : PAGE_CACHE_SIZE;
if (page_mapped(page)) {
unmap_mapping_range(mapping,
(loff_t)page->index << PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
+ holelen, 0);
}
return truncate_complete_page(mapping, page);
}
@@ -489,7 +493,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,

if (!trylock_page(page))
continue;
- WARN_ON(page->index != index);
+
+ WARN_ON(page_to_pgoff(page) != index);
+
+ /* Middle of THP: skip */
+ if (PageTransTail(page)) {
+ unlock_page(page);
+ continue;
+ } else if (PageTransHuge(page)) {
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ /* 'end' is in the middle of THP */
+ if (index == round_down(end, HPAGE_PMD_NR))
+ continue;
+ }
+
ret = invalidate_inode_page(page);
unlock_page(page);
/*
--
2.7.0

2016-03-03 17:04:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 22/29] page-flags: relax policy for PG_mappedtodisk and PG_reclaim

These flags are in use for file THP.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/page-flags.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 517707ae8cd1..9d518876dd6a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -292,11 +292,11 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
*/
TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
-PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_COMPOUND)
+PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)

/* PG_readahead is only used for reads; PG_reclaim is only for writes */
-PAGEFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
- TESTCLEARFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
+ TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)

--
2.7.0

2016-03-03 17:05:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 26/29] shmem: prepare huge= mount option and sysfs knob

This patch adds a new mount option, "huge=". It can have the following values:

- "always":
Attempt to allocate huge pages every time we need a new page;

- "never":
Do not allocate huge pages;

- "within_size":
Only allocate a huge page if it will be fully within i_size
(see the example after this list).
Also respect fadvise()/madvise() hints;

- "advise:
Only allocate huge pages if requested with fadvise()/madvise();
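
As a worked example of within_size (going by the shmem_getpage_gfp()
check in patch 28/29): a 5MB file on such a mount gets huge pages for
the first two 2MB extents, which lie entirely below i_size, while the
remaining 1MB tail is backed by small pages.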

Default is "never" for now.

"mount -o remount,huge= /mountpoint" works fine after mount: remounting
huge=never will not attempt to break up huge pages at all, just stop
more from being allocated.

No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
which is the appropriate option to protect those who don't want
the new bloat, and with which we shall share some pmd code.

Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
explicit).

Allow enabling THP only if the machine has_transparent_hugepage().

But what about Shmem with no user-visible mount? SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
DRM objects, Ashmem. Though unlikely to suit all usages, provide
sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled to
experiment with huge on those.

And allow shmem_enabled two further values:

- "deny":
For use in emergencies, to force the huge option off from
all mounts;
- "force":
Force the huge option on for all - very useful for testing;
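
For example, "mount -t tmpfs -o huge=within_size tmpfs /mnt" enables
huge pages for that mount only, while
"echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled"
turns them on for all shmem users during testing.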

Based on patch by Hugh Dickins.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 +
include/linux/shmem_fs.h | 3 +-
mm/huge_memory.c | 3 +
mm/shmem.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 159 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c958e3db0a0e..ac6dc46dc65a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -42,6 +42,8 @@ enum transparent_hugepage_flag {
#endif
};

+extern struct kobj_attribute shmem_enabled_attr;
+
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index a43f41cb3c43..03490d1554ba 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -31,9 +31,10 @@ struct shmem_sb_info {
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
+ umode_t mode; /* Mount mode for root directory */
+ unsigned char huge; /* Whether to try for hugepages */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
- umode_t mode; /* Mount mode for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
};

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 32439a6fdd4f..4eaf027f48b6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -432,6 +432,9 @@ static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
&use_zero_page_attr.attr,
+#ifdef CONFIG_SHMEM
+ &shmem_enabled_attr.attr,
+#endif
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
#endif
diff --git a/mm/shmem.c b/mm/shmem.c
index d60d6335a253..5f47a143bd9d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -289,6 +289,87 @@ static bool shmem_confirm_swap(struct address_space *mapping,
}

/*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
+ *
+ * SHMEM_HUGE_NEVER:
+ * disables huge pages for the mount;
+ * SHMEM_HUGE_ALWAYS:
+ * enables huge pages for the mount;
+ * SHMEM_HUGE_WITHIN_SIZE:
+ * only allocate huge pages if the page will be fully within i_size,
+ * also respect fadvise()/madvise() hints;
+ * SHMEM_HUGE_ADVISE:
+ * only allocate huge pages if requested with fadvise()/madvise();
+ */
+
+#define SHMEM_HUGE_NEVER 0
+#define SHMEM_HUGE_ALWAYS 1
+#define SHMEM_HUGE_WITHIN_SIZE 2
+#define SHMEM_HUGE_ADVISE 3
+
+/*
+ * Special values.
+ * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
+ *
+ * SHMEM_HUGE_DENY:
+ * disables huge on shm_mnt and all mounts, for emergency use;
+ * SHMEM_HUGE_FORCE:
+ * enables huge on shm_mnt and all mounts, w/o needing option, for testing;
+ *
+ */
+#define SHMEM_HUGE_DENY (-1)
+#define SHMEM_HUGE_FORCE (-2)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* ifdef here to avoid bloating shmem.o when not necessary */
+
+int shmem_huge __read_mostly;
+
+static int shmem_parse_huge(const char *str)
+{
+ if (!strcmp(str, "never"))
+ return SHMEM_HUGE_NEVER;
+ if (!strcmp(str, "always"))
+ return SHMEM_HUGE_ALWAYS;
+ if (!strcmp(str, "within_size"))
+ return SHMEM_HUGE_WITHIN_SIZE;
+ if (!strcmp(str, "advise"))
+ return SHMEM_HUGE_ADVISE;
+ if (!strcmp(str, "deny"))
+ return SHMEM_HUGE_DENY;
+ if (!strcmp(str, "force"))
+ return SHMEM_HUGE_FORCE;
+ return -EINVAL;
+}
+
+static const char *shmem_format_huge(int huge)
+{
+ switch (huge) {
+ case SHMEM_HUGE_NEVER:
+ return "never";
+ case SHMEM_HUGE_ALWAYS:
+ return "always";
+ case SHMEM_HUGE_WITHIN_SIZE:
+ return "within_size";
+ case SHMEM_HUGE_ADVISE:
+ return "advise";
+ case SHMEM_HUGE_DENY:
+ return "deny";
+ case SHMEM_HUGE_FORCE:
+ return "force";
+ default:
+ VM_BUG_ON(1);
+ return "bad_val";
+ }
+}
+
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+
+#define shmem_huge SHMEM_HUGE_DENY
+
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+/*
* Like add_to_page_cache_locked, but error if expected item has gone.
*/
static int shmem_add_to_page_cache(struct page *page,
@@ -2915,11 +2996,24 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
sbinfo->gid = make_kgid(current_user_ns(), gid);
if (!gid_valid(sbinfo->gid))
goto bad_val;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ } else if (!strcmp(this_char, "huge")) {
+ int huge;
+ huge = shmem_parse_huge(value);
+ if (huge < 0)
+ goto bad_val;
+ if (!has_transparent_hugepage() &&
+ huge != SHMEM_HUGE_NEVER)
+ goto bad_val;
+ sbinfo->huge = huge;
+#endif
+#ifdef CONFIG_NUMA
} else if (!strcmp(this_char,"mpol")) {
mpol_put(mpol);
mpol = NULL;
if (mpol_parse_str(value, &mpol))
goto bad_val;
+#endif
} else {
printk(KERN_ERR "tmpfs: Bad mount option %s\n",
this_char);
@@ -2966,6 +3060,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
goto out;

error = 0;
+ sbinfo->huge = config.huge;
sbinfo->max_blocks = config.max_blocks;
sbinfo->max_inodes = config.max_inodes;
sbinfo->free_inodes = config.max_inodes - inodes;
@@ -2999,6 +3094,11 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u",
from_kgid_munged(&init_user_ns, sbinfo->gid));
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
+ if (sbinfo->huge)
+ seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
+#endif
shmem_show_mpol(seq, sbinfo->mpol);
return 0;
}
@@ -3347,6 +3447,58 @@ out3:
return error;
}

+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS)
+static ssize_t shmem_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int values[] = {
+ SHMEM_HUGE_ALWAYS,
+ SHMEM_HUGE_WITHIN_SIZE,
+ SHMEM_HUGE_ADVISE,
+ SHMEM_HUGE_NEVER,
+ SHMEM_HUGE_DENY,
+ SHMEM_HUGE_FORCE,
+ };
+ int i, count;
+
+ for (i = 0, count = 0; i < ARRAY_SIZE(values); i++) {
+ const char *fmt = shmem_huge == values[i] ? "[%s] " : "%s ";
+
+ count += sprintf(buf + count, fmt,
+ shmem_format_huge(values[i]));
+ }
+ buf[count - 1] = '\n';
+ return count;
+}
+
+static ssize_t shmem_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ char tmp[16];
+ int huge;
+
+ if (count + 1 > sizeof(tmp))
+ return -EINVAL;
+ memcpy(tmp, buf, count);
+ tmp[count] = '\0';
+ if (count && tmp[count - 1] == '\n')
+ tmp[count - 1] = '\0';
+
+ huge = shmem_parse_huge(tmp);
+ if (huge == -EINVAL)
+ return -EINVAL;
+ if (!has_transparent_hugepage() &&
+ huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
+ return -EINVAL;
+
+ shmem_huge = huge;
+ return count;
+}
+
+struct kobj_attribute shmem_enabled_attr =
+ __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
+
#else /* !CONFIG_SHMEM */

/*
--
2.7.0

2016-03-03 17:05:17

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 16/29] thp: skip file huge pmd on copy_huge_pmd()

copy_page_range() has a check for "Don't copy ptes where a page fault
will fill them correctly." It works at the VMA level. We still copy all
page table entries from private mappings, even if they map page cache.

We can simplify copy_huge_pmd() a bit by skipping file PMDs.

We don't map file private pages with PMDs, so they can only map page
cache. It's safe to skip them as they can be re-faulted later.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 34 ++++++++++++++++------------------
1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 985041c453bf..c0ffa6a7803a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1052,14 +1052,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct page *src_page;
pmd_t pmd;
pgtable_t pgtable = NULL;
- int ret;
+ int ret = -ENOMEM;

- if (!vma_is_dax(vma)) {
- ret = -ENOMEM;
- pgtable = pte_alloc_one(dst_mm, addr);
- if (unlikely(!pgtable))
- goto out;
- }
+ /* Skip if it can be re-filled on fault */
+ if (!vma_is_anonymous(vma))
+ return 0;
+
+ pgtable = pte_alloc_one(dst_mm, addr);
+ if (unlikely(!pgtable))
+ goto out;

dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1067,7 +1068,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

ret = -EAGAIN;
pmd = *src_pmd;
- if (unlikely(!pmd_trans_huge(pmd) && !pmd_devmap(pmd))) {
+ if (unlikely(!pmd_trans_huge(pmd))) {
pte_free(dst_mm, pgtable);
goto out_unlock;
}
@@ -1090,16 +1091,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}

- if (!vma_is_dax(vma)) {
- /* thp accounting separate from pmd_devmap accounting */
- src_page = pmd_page(pmd);
- VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
- get_page(src_page);
- page_dup_rmap(src_page, true);
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
- atomic_long_inc(&dst_mm->nr_ptes);
- pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- }
+ src_page = pmd_page(pmd);
+ VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ get_page(src_page);
+ page_dup_rmap(src_page, true);
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ atomic_long_inc(&dst_mm->nr_ptes);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);

pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
--
2.7.0

2016-03-03 17:05:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 20/29] thp, mlock: do not mlock PTE-mapped file huge pages

As with anon THP, we only mlock file huge pages if we can prove that the
page is not mapped with PTEs. This way we avoid leaking mlocked pages
into non-mlocked VMAs on split.

We rely on PageDoubleMap() under lock_page() to check whether the page
may be PTE-mapped. PG_double_map is set by page_add_file_rmap() when the
page is mapped with PTEs.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/page-flags.h | 13 ++++++++++++-
mm/huge_memory.c | 33 +++++++++++++++++++++++++--------
mm/mmap.c | 6 ++++++
mm/page_alloc.c | 2 ++
mm/rmap.c | 16 ++++++++++++++--
5 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4ed4f1b0c77..517707ae8cd1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -544,6 +544,17 @@ static inline int PageDoubleMap(struct page *page)
return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
}

+static inline void SetPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ set_bit(PG_double_map, &page[1].flags);
+}
+
+static inline void ClearPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ clear_bit(PG_double_map, &page[1].flags);
+}
static inline int TestSetPageDoubleMap(struct page *page)
{
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -560,7 +571,7 @@ static inline int TestClearPageDoubleMap(struct page *page)
TESTPAGEFLAG_FALSE(TransHuge)
TESTPAGEFLAG_FALSE(TransCompound)
TESTPAGEFLAG_FALSE(TransTail)
-TESTPAGEFLAG_FALSE(DoubleMap)
+PAGEFLAG_FALSE(DoubleMap)
TESTSETFLAG_FALSE(DoubleMap)
TESTCLEARFLAG_FALSE(DoubleMap)
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 09c32bc7ff0b..32439a6fdd4f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1397,6 +1397,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
* We don't mlock() pte-mapped THPs. This way we can avoid
* leaking mlocked pages into non-VM_LOCKED VMAs.
*
+ * For anon THP:
+ *
* In most cases the pmd is the only mapping of the page as we
* break COW for the mlock() -- see gup_flags |= FOLL_WRITE for
* writable private mappings in populate_vma_page_range().
@@ -1404,15 +1406,26 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
* The only scenario when we have the page shared here is if we
* mlocking read-only mapping shared over fork(). We skip
* mlocking such pages.
+ *
+ * For file THP:
+ *
+ * We can expect PageDoubleMap() to be stable under page lock:
+ * for file pages we set it in page_add_file_rmap(), which
+ * requires page to be locked.
*/
- if (compound_mapcount(page) == 1 && !PageDoubleMap(page) &&
- page->mapping && trylock_page(page)) {
- lru_add_drain();
- if (page->mapping)
- mlock_vma_page(page);
- unlock_page(page);
- }
+
+ if (PageAnon(page) && compound_mapcount(page) != 1)
+ goto skip_mlock;
+ if (PageDoubleMap(page) || !page->mapping)
+ goto skip_mlock;
+ if (!trylock_page(page))
+ goto skip_mlock;
+ lru_add_drain();
+ if (page->mapping && !PageDoubleMap(page))
+ mlock_vma_page(page);
+ unlock_page(page);
}
+skip_mlock:
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page), page);
if (flags & FOLL_GET)
@@ -3004,7 +3017,11 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
ptl = pmd_lock(mm, pmd);
if (pmd_trans_huge(*pmd)) {
page = pmd_page(*pmd);
- if (PageMlocked(page))
+ /*
+ * File pages are munlocked on removing from rmap in
+ * __split_huge_pmd_locked().
+ */
+ if (PageMlocked(page) && PageAnon(page))
get_page(page);
else
page = NULL;
diff --git a/mm/mmap.c b/mm/mmap.c
index 2cb88eb252b8..b6aaf3609121 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2577,6 +2577,12 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
/* drop PG_Mlocked flag for over-mapped range */
for (tmp = vma; tmp->vm_start >= start + size;
tmp = tmp->vm_next) {
+ /*
+ * Split pmd and munlock page on the border
+ * of the range.
+ */
+ vma_adjust_trans_huge(tmp, start, start + size, 0);
+
munlock_vma_pages_range(tmp,
max(tmp->vm_start, start),
min(tmp->vm_end, start + size));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..e3e1c1c592f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1009,6 +1009,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)

if (PageAnon(page))
page->mapping = NULL;
+ if (compound)
+ ClearPageDoubleMap(page);
bad += free_pages_check(page);
for (i = 1; i < (1 << order); i++) {
if (compound)
diff --git a/mm/rmap.c b/mm/rmap.c
index 765e001836dc..bd79e75b0531 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1298,6 +1298,12 @@ void page_add_file_rmap(struct page *page, bool compound)
goto out;
__inc_zone_page_state(page, NR_FILE_THP_MAPPED);
} else {
+ if (PageTransCompound(page)) {
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ SetPageDoubleMap(compound_head(page));
+ if (PageMlocked(page))
+ clear_page_mlock(compound_head(page));
+ }
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
}
@@ -1466,8 +1472,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
- /* Holding pte lock, we do *not* need mmap_sem here */
- mlock_vma_page(page);
+ /* PTE-mapped THP are never mlocked */
+ if (!PageTransCompound(page)) {
+ /*
+ * Holding pte lock, we do *not* need
+ * mmap_sem here
+ */
+ mlock_vma_page(page);
+ }
ret = SWAP_MLOCK;
goto out_unmap;
}
--
2.7.0

2016-03-03 17:05:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 08/29] rmap: support file thp

Naive approach: on mapping/unmapping the page as compound, we update
->_mapcount on each 4k subpage. That's not efficient, but it's not
obvious how to optimize this; we can look into it later.

The PG_double_map optimization doesn't work for file pages since the
lifecycle of file pages differs from that of anon pages: a file page can
be mapped again at any time.
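
For illustration (my numbers, not from the patch): consider a file THP
that is PMD-mapped by two processes and, in addition, has one subpage
PTE-mapped in a third process, with HPAGE_PMD_NR == 512. Then
compound_mapcount() is 2 and, because compound mappings are also counted
in every subpage's ->_mapcount, the per-subpage sum is 2 * 512 + 1.
total_mapcount() therefore computes 2 + (2 * 512 + 1) - 2 * 512 = 3,
i.e. two PMD mappings plus one PTE mapping, as expected.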

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/rmap.h | 2 +-
mm/huge_memory.c | 10 +++++++---
mm/memory.c | 4 ++--
mm/migrate.c | 2 +-
mm/rmap.c | 48 +++++++++++++++++++++++++++++++++++-------------
mm/util.c | 6 ++++++
6 files changed, 52 insertions(+), 20 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 49eb4f8ebac9..5704f101b52e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -165,7 +165,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, int);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
-void page_add_file_rmap(struct page *);
+void page_add_file_rmap(struct page *, bool);
void page_remove_rmap(struct page *, bool);

void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a9a79a6de716..e0f8e4d63ffd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3202,18 +3202,22 @@ static void __split_huge_page(struct page *page, struct list_head *list)

int total_mapcount(struct page *page)
{
- int i, ret;
+ int i, compound, ret;

VM_BUG_ON_PAGE(PageTail(page), page);

if (likely(!PageCompound(page)))
return atomic_read(&page->_mapcount) + 1;

- ret = compound_mapcount(page);
+ compound = compound_mapcount(page);
if (PageHuge(page))
- return ret;
+ return compound;
+ ret = compound;
for (i = 0; i < HPAGE_PMD_NR; i++)
ret += atomic_read(&page[i]._mapcount) + 1;
+ /* File pages have compound_mapcount included in _mapcount */
+ if (!PageAnon(page))
+ return ret - compound * HPAGE_PMD_NR;
if (PageDoubleMap(page))
ret -= HPAGE_PMD_NR;
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index c90d9c47c240..f8c986df8c63 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1439,7 +1439,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
/* Ok, finally just insert the thing.. */
get_page(page);
inc_mm_counter_fast(mm, mm_counter_file(page));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, false);
set_pte_at(mm, addr, pte, mk_pte(page, prot));

retval = 0;
@@ -2882,7 +2882,7 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);

diff --git a/mm/migrate.c b/mm/migrate.c
index 6c822a7b27e0..d20276fffce7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -170,7 +170,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
} else if (PageAnon(new))
page_add_anon_rmap(new, vma, addr, false);
else
- page_add_file_rmap(new);
+ page_add_file_rmap(new, false);

if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
diff --git a/mm/rmap.c b/mm/rmap.c
index 945933a01010..b550bf637ce3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1285,18 +1285,34 @@ void page_add_new_anon_rmap(struct page *page,
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, bool compound)
{
+ int i, nr = 1;
+
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
lock_page_memcg(page);
- if (atomic_inc_and_test(&page->_mapcount)) {
- __inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+ if (compound && PageTransHuge(page)) {
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_inc_and_test(&page[i]._mapcount))
+ nr++;
+ }
+ if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
+ goto out;
+ } else {
+ if (!atomic_inc_and_test(&page->_mapcount))
+ goto out;
}
+ __mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, nr);
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+out:
unlock_page_memcg(page);
}

-static void page_remove_file_rmap(struct page *page)
+static void page_remove_file_rmap(struct page *page, bool compound)
{
+ int i, nr = 1;
+
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
lock_page_memcg(page);

/* Hugepages are not counted in NR_FILE_MAPPED for now. */
@@ -1307,15 +1323,24 @@ static void page_remove_file_rmap(struct page *page)
}

/* page still mapped by someone else? */
- if (!atomic_add_negative(-1, &page->_mapcount))
- goto out;
+ if (compound && PageTransHuge(page)) {
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_add_negative(-1, &page[i]._mapcount))
+ nr++;
+ }
+ if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+ goto out;
+ } else {
+ if (!atomic_add_negative(-1, &page->_mapcount))
+ goto out;
+ }

/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- __dec_zone_page_state(page, NR_FILE_MAPPED);
+ __mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, -nr);
mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);

if (unlikely(PageMlocked(page)))
@@ -1371,11 +1396,8 @@ static void page_remove_anon_compound_rmap(struct page *page)
*/
void page_remove_rmap(struct page *page, bool compound)
{
- if (!PageAnon(page)) {
- VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
- page_remove_file_rmap(page);
- return;
- }
+ if (!PageAnon(page))
+ return page_remove_file_rmap(page, compound);

if (compound)
return page_remove_anon_compound_rmap(page);
diff --git a/mm/util.c b/mm/util.c
index 362f4cd8ab3a..2e92f231796b 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,12 @@ int __page_mapcount(struct page *page)
int ret;

ret = atomic_read(&page->_mapcount) + 1;
+ /*
+ * For file THP, page->_mapcount contains the total number of mappings
+ * of the page: no need to look into compound_mapcount.
+ */
+ if (!PageAnon(page) && !PageHuge(page))
+ return ret;
page = compound_head(page);
ret += atomic_read(compound_mapcount_ptr(page)) + 1;
if (PageDoubleMap(page))
--
2.7.0

2016-03-03 17:05:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 29/29] shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings

Let's wire up the existing madvise() hugepage hints for file mappings.

MADV_HUGEPAGE advises shmem to allocate a huge page on page fault in the
VMA. It only has an effect if the filesystem is mounted with huge=advise
or huge=within_size.

MADV_NOHUGEPAGE prevents a huge page from being allocated on page fault
in the VMA. It doesn't prevent a huge page from being allocated by other
means, e.g. a page fault into a different mapping or write(2) into the
file.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 13 ++-----------
mm/shmem.c | 20 +++++++++++++++++---
2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4eaf027f48b6..dd4456bbe3cd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1814,11 +1814,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
if (mm_has_pgste(vma->vm_mm))
return 0;
#endif
- /*
- * Be somewhat over-protective like KSM for now!
- */
- if (*vm_flags & VM_NO_THP)
- return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
@@ -1826,15 +1821,11 @@ int hugepage_madvise(struct vm_area_struct *vma,
* register it here without waiting a page fault that
* may not happen any time soon.
*/
- if (unlikely(khugepaged_enter_vma_merge(vma, *vm_flags)))
+ if ((*vm_flags & VM_NO_THP) &&
+ khugepaged_enter_vma_merge(vma, *vm_flags))
return -ENOMEM;
break;
case MADV_NOHUGEPAGE:
- /*
- * Be somewhat over-protective like KSM for now!
- */
- if (*vm_flags & VM_NO_THP)
- return -EINVAL;
*vm_flags &= ~VM_HUGEPAGE;
*vm_flags |= VM_NOHUGEPAGE;
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 2de464f65e53..a60c90cff690 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -101,6 +101,8 @@ struct shmem_falloc {
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */
+ SGP_HUGE, /* like SGP_CACHE, huge pages preferred */
SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
@@ -1370,6 +1372,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ enum sgp_type sgp_huge = sgp;
pgoff_t hindex = index;
int error;
int once = 0;
@@ -1377,6 +1380,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,

if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
return -EFBIG;
+ if (sgp == SGP_NOHUGE || sgp == SGP_HUGE)
+ sgp = SGP_CACHE;
repeat:
swap.val = 0;
page = find_lock_entry(mapping, index);
@@ -1490,7 +1495,7 @@ repeat:
/* shmem_symlink() */
if (mapping->a_ops != &shmem_aops)
goto alloc_nohuge;
- if (shmem_huge == SHMEM_HUGE_DENY)
+ if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
goto alloc_nohuge;
if (shmem_huge == SHMEM_HUGE_FORCE)
goto alloc_huge;
@@ -1506,7 +1511,9 @@ repeat:
goto alloc_huge;
/* fallthrough */
case SHMEM_HUGE_ADVISE:
- /* TODO: wire up fadvise()/madvise() */
+ if (sgp_huge == SGP_HUGE)
+ goto alloc_huge;
+ /* TODO: implement fadvise() hints */
goto alloc_nohuge;
}

@@ -1638,6 +1645,7 @@ unlock:
static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct inode *inode = file_inode(vma->vm_file);
+ enum sgp_type sgp;
int error;
int ret = VM_FAULT_LOCKED;

@@ -1699,7 +1707,13 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
spin_unlock(&inode->i_lock);
}

- error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
+ sgp = SGP_CACHE;
+ if (vma->vm_flags & VM_HUGEPAGE)
+ sgp = SGP_HUGE;
+ else if (vma->vm_flags & VM_NOHUGEPAGE)
+ sgp = SGP_NOHUGE;
+
+ error = shmem_getpage(inode, vmf->pgoff, &vmf->page, sgp, &ret);
if (error)
return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);

--
2.7.0

2016-03-03 17:05:08

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 11/29] thp, vmstats: add counters for huge file pages

THP_FILE_ALLOC: how many times a huge page was allocated and added to
the page cache.

THP_FILE_MAPPED: how many times a file huge page was mapped.
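
Both counters are exported via /proc/vmstat. A small sketch of how they
can be read from userspace (illustration only):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* The two counters added by this patch. */
		if (!strncmp(line, "thp_file_alloc", 14) ||
		    !strncmp(line, "thp_file_mapped", 15))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}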

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 7 +++++++
mm/memory.c | 1 +
mm/vmstat.c | 2 ++
3 files changed, 10 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ec084321fe09..42604173f122 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -70,6 +70,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_FILE_ALLOC,
+ THP_FILE_MAPPED,
THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED,
THP_DEFERRED_SPLIT_PAGE,
@@ -100,4 +102,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NR_VM_EVENT_ITEMS
};

+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_FILE_MAPPED ({ BUILD_BUG(); 0; })
+#endif
+
#endif /* VM_EVENT_ITEM_H_INCLUDED */
diff --git a/mm/memory.c b/mm/memory.c
index 3bb51a550bbb..ab9acc4d83a2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2888,6 +2888,7 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)

/* fault is handled */
ret = 0;
+ count_vm_event(THP_FILE_MAPPED);
out:
spin_unlock(fe->ptl);
return ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 943b37f17007..78e94198b35f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -847,6 +847,8 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_file_alloc",
+ "thp_file_mapped",
"thp_split_page",
"thp_split_page_failed",
"thp_deferred_split_page",
--
2.7.0

2016-03-03 17:07:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 06/29] mm: introduce fault_env

The idea borrowed from Peter's patch from patchset on speculative page
faults[1]:

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context
without endless function signature changes.

The changes are mostly mechanical, with the exception of the faultaround
code: filemap_map_pages() got reworked a bit.

This patch is preparation for the next one.

[1] http://lkml.kernel.org/r/[email protected]

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
Documentation/filesystems/Locking | 10 +-
fs/userfaultfd.c | 22 +-
include/linux/huge_mm.h | 20 +-
include/linux/mm.h | 34 ++-
include/linux/userfaultfd_k.h | 8 +-
mm/filemap.c | 28 +-
mm/huge_memory.c | 280 +++++++++----------
mm/internal.h | 4 +-
mm/memory.c | 569 ++++++++++++++++++--------------------
mm/nommu.c | 3 +-
10 files changed, 468 insertions(+), 510 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 06d443450f21..0e499a7944a5 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -546,13 +546,13 @@ subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.

->map_pages() is called when VM asks to map easy accessible pages.
-Filesystem should find and map pages associated with offsets from "pgoff"
-till "max_pgoff". ->map_pages() is called with page table locked and must
+Filesystem should find and map pages associated with offsets from "start_pgoff"
+till "end_pgoff". ->map_pages() is called with page table locked and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
-page table entry. Pointer to entry associated with offset "pgoff" is
-passed in "pte" field in vm_fault structure. Pointers to entries for other
-offsets should be calculated relative to "pte".
+page table entry. Pointer to entry associated with the page is passed in
+"pte" field in fault_env structure. Pointers to entries for other offsets
+should be calculated relative to "pte".

->page_mkwrite() is called when a previously read-only pte is
about to become writeable. The filesystem again must ensure that there are
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 50311703135b..0a08143dbc87 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -257,10 +257,9 @@ out:
* fatal_signal_pending()s, and the mmap_sem must be released before
* returning it.
*/
-int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, unsigned long reason)
+int handle_userfault(struct fault_env *fe, unsigned long reason)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct mm_struct *mm = fe->vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
@@ -269,7 +268,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));

ret = VM_FAULT_SIGBUS;
- ctx = vma->vm_userfaultfd_ctx.ctx;
+ ctx = fe->vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;

@@ -296,17 +295,17 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* without first stopping userland access to the memory. For
* VM_UFFD_MISSING userfaults this is enough for now.
*/
- if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+ if (unlikely(!(fe->flags & FAULT_FLAG_ALLOW_RETRY))) {
/*
* Validate the invariant that nowait must allow retry
* to be sure not to return SIGBUS erroneously on
* nowait invocations.
*/
- BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+ BUG_ON(fe->flags & FAULT_FLAG_RETRY_NOWAIT);
#ifdef CONFIG_DEBUG_VM
if (printk_ratelimit()) {
printk(KERN_WARNING
- "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+ "FAULT_FLAG_ALLOW_RETRY missing %x\n", fe->flags);
dump_stack();
}
#endif
@@ -318,7 +317,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* and wait.
*/
ret = VM_FAULT_RETRY;
- if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ if (fe->flags & FAULT_FLAG_RETRY_NOWAIT)
goto out;

/* take the reference before dropping the mmap_sem */
@@ -326,10 +325,11 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,

init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
- uwq.msg = userfault_msg(address, flags, reason);
+ uwq.msg = userfault_msg(fe->address, fe->flags, reason);
uwq.ctx = ctx;

- return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+ return_to_userland =
+ (fe->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);

spin_lock(&ctx->fault_pending_wqh.lock);
@@ -347,7 +347,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
TASK_KILLABLE);
spin_unlock(&ctx->fault_pending_wqh.lock);

- must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
+ must_wait = userfaultfd_must_wait(ctx, fe->address, fe->flags, reason);
up_read(&mm->mmap_sem);

if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c47067151ffd..a9ec30594a81 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,20 +1,12 @@
#ifndef _LINUX_HUGE_MM_H
#define _LINUX_HUGE_MM_H

-extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags);
+extern int do_huge_pmd_anonymous_page(struct fault_env *fe);
extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma);
-extern void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd, int dirty);
-extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd);
+extern void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd);
+extern int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd);
extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
@@ -142,8 +134,7 @@ static inline int hpage_nr_pages(struct page *page)
return 1;
}

-extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd);

extern struct page *huge_zero_page;

@@ -203,8 +194,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
return NULL;
}

-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+static inline int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd)
{
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 82f7836686a3..58c43a94d5b0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -279,10 +279,27 @@ struct vm_fault {
* is set (which is also implied by
* VM_FAULT_ERROR).
*/
- /* for ->map_pages() only */
- pgoff_t max_pgoff; /* map pages for offset from pgoff till
- * max_pgoff inclusive */
- pte_t *pte; /* pte entry associated with ->pgoff */
+};
+
+/*
+ * Page fault context: passes though page fault handler instead of endless list
+ * of function arguments.
+ */
+struct fault_env {
+ struct vm_area_struct *vma; /* Target VMA */
+ unsigned long address; /* Faulting virtual address */
+ unsigned int flags; /* FAULT_FLAG_xxx flags */
+ pmd_t *pmd; /* Pointer to pmd entry matching
+ * the 'address'
+ */
+ pte_t *pte; /* Pointer to pte entry matching
+ * the 'address'. NULL if the page
+ * table hasn't been allocated.
+ */
+ spinlock_t *ptl; /* Page table lock.
+ * Protects pte page table if 'pte'
+ * is not NULL, otherwise pmd.
+ */
};

/*
@@ -297,7 +314,8 @@ struct vm_operations_struct {
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
pmd_t *, unsigned int flags);
- void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ void (*map_pages)(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -562,8 +580,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon);
+void do_set_pte(struct fault_env *fe, struct page *page);
#endif

/*
@@ -2042,7 +2059,8 @@ extern void truncate_inode_pages_final(struct address_space *);

/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
-extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff);
extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* mm/page-writeback.c */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480ad41b7..dd66a952e8cd 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -27,8 +27,7 @@
#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)

-extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, unsigned long reason);
+extern int handle_userfault(struct fault_env *fe, unsigned long reason);

extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len);
@@ -56,10 +55,7 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
#else /* CONFIG_USERFAULTFD */

/* mm helpers */
-static inline int handle_userfault(struct vm_area_struct *vma,
- unsigned long address,
- unsigned int flags,
- unsigned long reason)
+static inline int handle_userfault(struct fault_env *fe, unsigned long reason)
{
return VM_FAULT_SIGBUS;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b1c40cff762..3202ce17a515 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2131,22 +2131,27 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
{
struct radix_tree_iter iter;
void **slot;
- struct file *file = vma->vm_file;
+ struct file *file = fe->vma->vm_file;
struct address_space *mapping = file->f_mapping;
+ pgoff_t last_pgoff = start_pgoff;
loff_t size;
struct page *page;
- unsigned long address = (unsigned long) vmf->virtual_address;
- unsigned long addr;
- pte_t *pte;

rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
- if (iter.index > vmf->max_pgoff)
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+ start_pgoff) {
+ if (iter.index > end_pgoff)
break;
+ fe->pte += iter.index - last_pgoff;
+ fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ last_pgoff = iter.index;
+ if (!pte_none(*fe->pte))
+ goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2181,14 +2186,9 @@ repeat:
if (page->index >= size >> PAGE_CACHE_SHIFT)
goto unlock;

- pte = vmf->pte + page->index - vmf->pgoff;
- if (!pte_none(*pte))
- goto unlock;
-
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
- addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
- do_set_pte(vma, addr, page, pte, false, false);
+ do_set_pte(fe, page);
unlock_page(page);
goto next;
unlock:
@@ -2196,7 +2196,7 @@ unlock:
skip:
page_cache_release(page);
next:
- if (iter.index == vmf->max_pgoff)
+ if (iter.index == end_pgoff)
break;
}
rcu_read_unlock();
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a9921a485400..a9a79a6de716 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -800,26 +800,23 @@ void prep_transhuge_page(struct page *page)
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
}

-static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- struct page *page, gfp_t gfp,
- unsigned int flags)
+static int __do_huge_pmd_anonymous_page(struct fault_env *fe, struct page *page,
+ gfp_t gfp)
{
+ struct vm_area_struct *vma = fe->vma;
struct mem_cgroup *memcg;
pgtable_t pgtable;
- spinlock_t *ptl;
- unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}

- pgtable = pte_alloc_one(mm, haddr);
+ pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable)) {
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
@@ -834,12 +831,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
*/
__SetPageUptodate(page);

- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_none(*pmd))) {
- spin_unlock(ptl);
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd))) {
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
} else {
pmd_t entry;

@@ -847,12 +844,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
if (userfaultfd_missing(vma)) {
int ret;

- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
- pte_free(mm, pgtable);
- ret = handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ pte_free(vma->vm_mm, pgtable);
+ ret = handle_userfault(fe, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
return ret;
}
@@ -862,11 +858,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false, true);
lru_cache_add_active_or_unevictable(page, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, haddr, pmd, entry);
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
- atomic_long_inc(&mm->nr_ptes);
- spin_unlock(ptl);
+ pgtable_trans_huge_deposit(vma->vm_mm, fe->pmd, pgtable);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ spin_unlock(fe->ptl);
count_vm_event(THP_FAULT_ALLOC);
}

@@ -895,13 +891,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
return true;
}

-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags)
+int do_huge_pmd_anonymous_page(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
gfp_t gfp;
struct page *page;
- unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;

if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
return VM_FAULT_FALLBACK;
@@ -909,42 +904,40 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm) &&
+ if (!(fe->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm) &&
transparent_hugepage_use_zero_page()) {
- spinlock_t *ptl;
pgtable_t pgtable;
struct page *zero_page;
bool set;
int ret;
- pgtable = pte_alloc_one(mm, haddr);
+ pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable))
return VM_FAULT_OOM;
zero_page = get_huge_zero_page();
if (unlikely(!zero_page)) {
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- ptl = pmd_lock(mm, pmd);
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
ret = 0;
set = false;
- if (pmd_none(*pmd)) {
+ if (pmd_none(*fe->pmd)) {
if (userfaultfd_missing(vma)) {
- spin_unlock(ptl);
- ret = handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ spin_unlock(fe->ptl);
+ ret = handle_userfault(fe, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
} else {
- set_huge_zero_page(pgtable, mm, vma,
- haddr, pmd,
- zero_page);
- spin_unlock(ptl);
+ set_huge_zero_page(pgtable, vma->vm_mm, vma,
+ haddr, fe->pmd, zero_page);
+ spin_unlock(fe->ptl);
set = true;
}
} else
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
if (!set) {
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
put_huge_zero_page();
}
return ret;
@@ -956,8 +949,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
}
prep_transhuge_page(page);
- return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
- flags);
+ return __do_huge_pmd_anonymous_page(fe, page, gfp);
}

static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1129,38 +1121,31 @@ out:
return ret;
}

-void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmd, pmd_t orig_pmd,
- int dirty)
+void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd)
{
- spinlock_t *ptl;
pmd_t entry;
unsigned long haddr;

- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ fe->ptl = pmd_lock(fe->vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto unlock;

entry = pmd_mkyoung(orig_pmd);
- haddr = address & HPAGE_PMD_MASK;
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
- update_mmu_cache_pmd(vma, address, pmd);
+ haddr = fe->address & HPAGE_PMD_MASK;
+ if (pmdp_set_access_flags(fe->vma, haddr, fe->pmd, entry,
+ fe->flags & FAULT_FLAG_WRITE))
+ update_mmu_cache_pmd(fe->vma, fe->address, fe->pmd);

unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
}

-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmd, pmd_t orig_pmd,
- struct page *page,
- unsigned long haddr)
+static int do_huge_pmd_wp_page_fallback(struct fault_env *fe, pmd_t orig_pmd,
+ struct page *page)
{
+ struct vm_area_struct *vma = fe->vma;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
struct mem_cgroup *memcg;
- spinlock_t *ptl;
pgtable_t pgtable;
pmd_t _pmd;
int ret = 0, i;
@@ -1177,11 +1162,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

for (i = 0; i < HPAGE_PMD_NR; i++) {
pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE |
- __GFP_OTHER_NODE,
- vma, address, page_to_nid(page));
+ __GFP_OTHER_NODE, vma,
+ fe->address, page_to_nid(page));
if (unlikely(!pages[i] ||
- mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
- &memcg, false))) {
+ mem_cgroup_try_charge(pages[i], vma->vm_mm,
+ GFP_KERNEL, &memcg, false))) {
if (pages[i])
put_page(pages[i]);
while (--i >= 0) {
@@ -1207,41 +1192,41 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);

- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_free_pages;
VM_BUG_ON_PAGE(!PageHead(page), page);

- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
/* leave pmd empty until pte is filled */

- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
+ pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, fe->pmd);
+ pmd_populate(vma->vm_mm, &_pmd, pgtable);

for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
+ pte_t entry;
entry = mk_pte(pages[i], vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- page_add_new_anon_rmap(pages[i], vma, haddr, false);
+ page_add_new_anon_rmap(pages[i], fe->vma, haddr, false);
mem_cgroup_commit_charge(pages[i], memcg, false, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
- pte = pte_offset_map(&_pmd, haddr);
- VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
+ fe->pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*fe->pte));
+ set_pte_at(vma->vm_mm, haddr, fe->pte, entry);
+ pte_unmap(fe->pte);
}
kfree(pages);

smp_wmb(); /* make pte visible before pmd */
- pmd_populate(mm, pmd, pgtable);
+ pmd_populate(vma->vm_mm, fe->pmd, pgtable);
page_remove_rmap(page, true);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);

- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);

ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1250,8 +1235,8 @@ out:
return ret;

out_free_pages:
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ spin_unlock(fe->ptl);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1262,25 +1247,23 @@ out_free_pages:
goto out;
}

-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
{
- spinlock_t *ptl;
- int ret = 0;
+ struct vm_area_struct *vma = fe->vma;
struct page *page = NULL, *new_page;
struct mem_cgroup *memcg;
- unsigned long haddr;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
gfp_t huge_gfp; /* for allocation and charge */
+ int ret = 0;

- ptl = pmd_lockptr(mm, pmd);
+ fe->ptl = pmd_lockptr(vma->vm_mm, fe->pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
- haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
- spin_lock(ptl);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ spin_lock(fe->ptl);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_unlock;

page = pmd_page(orig_pmd);
@@ -1299,13 +1282,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
- update_mmu_cache_pmd(vma, address, pmd);
+ if (pmdp_set_access_flags(vma, haddr, fe->pmd, entry, 1))
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
ret |= VM_FAULT_WRITE;
goto out_unlock;
}
get_page(page);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow()) {
@@ -1318,13 +1301,12 @@ alloc:
prep_transhuge_page(new_page);
} else {
if (!page) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
ret |= VM_FAULT_FALLBACK;
} else {
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
+ ret = do_huge_pmd_wp_page_fallback(fe, orig_pmd, page);
if (ret & VM_FAULT_OOM) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
ret |= VM_FAULT_FALLBACK;
}
put_page(page);
@@ -1333,14 +1315,12 @@ alloc:
goto out;
}

- if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg,
- true))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
+ huge_gfp, &memcg, true))) {
put_page(new_page);
- if (page) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
+ if (page)
put_page(page);
- } else
- split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -1356,13 +1336,13 @@ alloc:

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);

- spin_lock(ptl);
+ spin_lock(fe->ptl);
if (page)
put_page(page);
- if (unlikely(!pmd_same(*pmd, orig_pmd))) {
- spin_unlock(ptl);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd))) {
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(new_page, memcg, true);
put_page(new_page);
goto out_mn;
@@ -1370,14 +1350,14 @@ alloc:
pmd_t entry;
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
- set_pmd_at(mm, haddr, pmd, entry);
- update_mmu_cache_pmd(vma, address, pmd);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
if (!page) {
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
put_huge_zero_page();
} else {
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -1386,13 +1366,13 @@ alloc:
}
ret |= VM_FAULT_WRITE;
}
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
out:
return ret;
out_unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
return ret;
}

@@ -1452,13 +1432,12 @@ out:
}

/* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t pmd)
{
- spinlock_t *ptl;
+ struct vm_area_struct *vma = fe->vma;
struct anon_vma *anon_vma = NULL;
struct page *page;
- unsigned long haddr = addr & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
int target_nid, last_cpupid = -1;
bool page_locked;
@@ -1469,8 +1448,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* A PROT_NONE fault should not end up here */
BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));

- ptl = pmd_lock(mm, pmdp);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(pmd, *fe->pmd)))
goto out_unlock;

/*
@@ -1478,9 +1457,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* without disrupting NUMA hinting information. Do not relock and
* check_same as the page may no longer be mapped.
*/
- if (unlikely(pmd_trans_migrating(*pmdp))) {
- page = pmd_page(*pmdp);
- spin_unlock(ptl);
+ if (unlikely(pmd_trans_migrating(*fe->pmd))) {
+ page = pmd_page(*fe->pmd);
+ spin_unlock(fe->ptl);
wait_on_page_locked(page);
goto out;
}
@@ -1513,7 +1492,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

/* Migration could have started since the pmd_trans_migrating check */
if (!page_locked) {
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
wait_on_page_locked(page);
page_nid = -1;
goto out;
@@ -1524,12 +1503,12 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* to serialises splits
*/
get_page(page);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
anon_vma = page_lock_anon_vma_read(page);

/* Confirm the PMD did not change while page_table_lock was released */
- spin_lock(ptl);
- if (unlikely(!pmd_same(pmd, *pmdp))) {
+ spin_lock(fe->ptl);
+ if (unlikely(!pmd_same(pmd, *fe->pmd))) {
unlock_page(page);
put_page(page);
page_nid = -1;
@@ -1547,9 +1526,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Migrate the THP to the requested node, returns with page unlocked
* and access rights restored.
*/
- spin_unlock(ptl);
- migrated = migrate_misplaced_transhuge_page(mm, vma,
- pmdp, pmd, addr, page, target_nid);
+ spin_unlock(fe->ptl);
+ migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
+ fe->pmd, pmd, fe->address, page, target_nid);
if (migrated) {
flags |= TNF_MIGRATED;
page_nid = target_nid;
@@ -1564,18 +1543,18 @@ clear_pmdnuma:
pmd = pmd_mkyoung(pmd);
if (was_writable)
pmd = pmd_mkwrite(pmd);
- set_pmd_at(mm, haddr, pmdp, pmd);
- update_mmu_cache_pmd(vma, addr, pmdp);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, pmd);
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
unlock_page(page);
out_unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);

out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);

if (page_nid != -1)
- task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
+ task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, fe->flags);

return 0;
}
@@ -2356,29 +2335,32 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd)
{
- unsigned long _address;
- pte_t *pte, pteval;
+ pte_t pteval;
int swapped_in = 0, ret = 0;
-
- pte = pte_offset_map(pmd, address);
- for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
- pte++, _address += PAGE_SIZE) {
- pteval = *pte;
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+ .pmd = pmd,
+ };
+
+ fe.pte = pte_offset_map(pmd, address);
+ for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
+ fe.pte++, fe.address += PAGE_SIZE) {
+ pteval = *fe.pte;
if (!is_swap_pte(pteval))
continue;
swapped_in++;
- ret = do_swap_page(mm, vma, _address, pte, pmd,
- FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
- pteval);
+ ret = do_swap_page(&fe, pteval);
if (ret & VM_FAULT_ERROR) {
trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
return;
}
/* pte is unmapped now, we need to map it */
- pte = pte_offset_map(pmd, _address);
+ fe.pte = pte_offset_map(pmd, fe.address);
}
- pte--;
- pte_unmap(pte);
+ fe.pte--;
+ pte_unmap(fe.pte);
trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
}

diff --git a/mm/internal.h b/mm/internal.h
index 72bbce3efc36..87e52346fdbc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -35,9 +35,7 @@
/* Do not use these with a slab allocator */
#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)

-extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte);
+int do_swap_page(struct fault_env *fe, pte_t orig_pte);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/memory.c b/mm/memory.c
index d8928f567193..463597eff6ed 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1993,13 +1993,11 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
* case, all we need to do here is to mark the page as writable and update
* any related book-keeping.
*/
-static inline int wp_page_reuse(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
- struct page *page, int page_mkwrite,
- int dirty_shared)
- __releases(ptl)
+static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
+ struct page *page, int page_mkwrite, int dirty_shared)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
pte_t entry;
/*
* Clear the pages cpupid information as the existing
@@ -2009,12 +2007,12 @@ static inline int wp_page_reuse(struct mm_struct *mm,
if (page)
page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);

- flush_cache_page(vma, address, pte_pfn(orig_pte));
+ flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (ptep_set_access_flags(vma, address, page_table, entry, 1))
- update_mmu_cache(vma, address, page_table);
- pte_unmap_unlock(page_table, ptl);
+ if (ptep_set_access_flags(vma, fe->address, fe->pte, entry, 1))
+ update_mmu_cache(vma, fe->address, fe->pte);
+ pte_unmap_unlock(fe->pte, fe->ptl);

if (dirty_shared) {
struct address_space *mapping;
@@ -2060,30 +2058,31 @@ static inline int wp_page_reuse(struct mm_struct *mm,
* held to the old page, as well as updating the rmap.
* - In any case, unlock the PTL and drop the reference we took to the old page.
*/
-static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- pte_t orig_pte, struct page *old_page)
+static int wp_page_copy(struct fault_env *fe, pte_t orig_pte,
+ struct page *old_page)
{
+ struct vm_area_struct *vma = fe->vma;
+ struct mm_struct *mm = vma->vm_mm;
struct page *new_page = NULL;
- spinlock_t *ptl = NULL;
pte_t entry;
int page_copied = 0;
- const unsigned long mmun_start = address & PAGE_MASK; /* For mmu_notifiers */
- const unsigned long mmun_end = mmun_start + PAGE_SIZE; /* For mmu_notifiers */
+ const unsigned long mmun_start = fe->address & PAGE_MASK;
+ const unsigned long mmun_end = mmun_start + PAGE_SIZE;
struct mem_cgroup *memcg;

if (unlikely(anon_vma_prepare(vma)))
goto oom;

if (is_zero_pfn(pte_pfn(orig_pte))) {
- new_page = alloc_zeroed_user_highpage_movable(vma, address);
+ new_page = alloc_zeroed_user_highpage_movable(vma, fe->address);
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
+ fe->address);
if (!new_page)
goto oom;
- cow_user_page(new_page, old_page, address, vma);
+ cow_user_page(new_page, old_page, fe->address, vma);
}

if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
@@ -2096,8 +2095,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Re-check the pte - we dropped the lock
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (likely(pte_same(*page_table, orig_pte))) {
+ fe->pte = pte_offset_map_lock(mm, fe->pmd, fe->address, &fe->ptl);
+ if (likely(pte_same(*fe->pte, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
dec_mm_counter_fast(mm,
@@ -2107,7 +2106,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
- flush_cache_page(vma, address, pte_pfn(orig_pte));
+ flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
/*
@@ -2116,8 +2115,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush_notify(vma, address, page_table);
- page_add_new_anon_rmap(new_page, vma, address, false);
+ ptep_clear_flush_notify(vma, fe->address, fe->pte);
+ page_add_new_anon_rmap(new_page, vma, fe->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
@@ -2125,8 +2124,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
- set_pte_at_notify(mm, address, page_table, entry);
- update_mmu_cache(vma, address, page_table);
+ set_pte_at_notify(mm, fe->address, fe->pte, entry);
+ update_mmu_cache(vma, fe->address, fe->pte);
if (old_page) {
/*
* Only after switching the pte to the new page may
@@ -2163,7 +2162,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
if (new_page)
page_cache_release(new_page);

- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
if (old_page) {
/*
@@ -2191,44 +2190,43 @@ oom:
* Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
* mapping
*/
-static int wp_pfn_shared(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
- pmd_t *pmd)
+static int wp_pfn_shared(struct fault_env *fe, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
+
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
struct vm_fault vmf = {
.page = NULL,
- .pgoff = linear_page_index(vma, address),
- .virtual_address = (void __user *)(address & PAGE_MASK),
+ .pgoff = linear_page_index(vma, fe->address),
+ .virtual_address =
+ (void __user *)(fe->address & PAGE_MASK),
.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
};
int ret;

- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
if (ret & VM_FAULT_ERROR)
return ret;
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
/*
* We might have raced with another page fault while we
* released the pte_offset_map_lock.
*/
- if (!pte_same(*page_table, orig_pte)) {
- pte_unmap_unlock(page_table, ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
}
- return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
- NULL, 0, 0);
+ return wp_page_reuse(fe, orig_pte, NULL, 0, 0);
}

-static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table,
- pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
- struct page *old_page)
- __releases(ptl)
+static int wp_page_shared(struct fault_env *fe, pte_t orig_pte,
+ struct page *old_page)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
int page_mkwrite = 0;

page_cache_get(old_page);
@@ -2236,8 +2234,8 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
int tmp;

- pte_unmap_unlock(page_table, ptl);
- tmp = do_page_mkwrite(vma, old_page, address);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ tmp = do_page_mkwrite(vma, old_page, fe->address);
if (unlikely(!tmp || (tmp &
(VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
page_cache_release(old_page);
@@ -2249,19 +2247,18 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* they did, we just return, as we can count on the
* MMU to tell us if they didn't also make it writable.
*/
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
page_cache_release(old_page);
return 0;
}
page_mkwrite = 1;
}

- return wp_page_reuse(mm, vma, address, page_table, ptl,
- orig_pte, old_page, page_mkwrite, 1);
+ return wp_page_reuse(fe, orig_pte, old_page, page_mkwrite, 1);
}

/*
@@ -2282,14 +2279,13 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* but allow concurrent faults), with pte both mapped and locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- spinlock_t *ptl, pte_t orig_pte)
- __releases(ptl)
+static int do_wp_page(struct fault_env *fe, pte_t orig_pte)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *old_page;

- old_page = vm_normal_page(vma, address, orig_pte);
+ old_page = vm_normal_page(vma, fe->address, orig_pte);
if (!old_page) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -2300,12 +2296,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))
- return wp_pfn_shared(mm, vma, address, page_table, ptl,
- orig_pte, pmd);
+ return wp_pfn_shared(fe, orig_pte);

- pte_unmap_unlock(page_table, ptl);
- return wp_page_copy(mm, vma, address, page_table, pmd,
- orig_pte, old_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return wp_page_copy(fe, orig_pte, old_page);
}

/*
@@ -2315,13 +2309,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (PageAnon(old_page) && !PageKsm(old_page)) {
if (!trylock_page(old_page)) {
page_cache_get(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
lock_page(old_page);
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+ fe->address, &fe->ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
page_cache_release(old_page);
return 0;
}
@@ -2333,16 +2327,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
* the rmap code will not search our parent or siblings.
* Protected against the rmap code by the page lock.
*/
- page_move_anon_rmap(old_page, vma, address);
+ page_move_anon_rmap(old_page, vma, fe->address);
unlock_page(old_page);
- return wp_page_reuse(mm, vma, address, page_table, ptl,
- orig_pte, old_page, 0, 0);
+ return wp_page_reuse(fe, orig_pte, old_page, 0, 0);
}
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
- return wp_page_shared(mm, vma, address, page_table, pmd,
- ptl, orig_pte, old_page);
+ return wp_page_shared(fe, orig_pte, old_page);
}

/*
@@ -2350,9 +2342,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
page_cache_get(old_page);

- pte_unmap_unlock(page_table, ptl);
- return wp_page_copy(mm, vma, address, page_table, pmd,
- orig_pte, old_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return wp_page_copy(fe, orig_pte, old_page);
}

static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -2443,11 +2434,9 @@ EXPORT_SYMBOL(unmap_mapping_range);
* We return with the mmap_sem locked or unlocked in the same cases
* as does filemap_fault().
*/
-int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+int do_swap_page(struct fault_env *fe, pte_t orig_pte)
{
- spinlock_t *ptl;
+ struct vm_area_struct *vma = fe->vma;
struct page *page, *swapcache;
struct mem_cgroup *memcg;
swp_entry_t entry;
@@ -2456,17 +2445,17 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
int exclusive = 0;
int ret = 0;

- if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
+ if (!pte_unmap_same(vma->vm_mm, fe->pmd, fe->pte, orig_pte))
goto out;

entry = pte_to_swp_entry(orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else {
- print_bad_pte(vma, address, orig_pte, NULL);
+ print_bad_pte(vma, fe->address, orig_pte, NULL);
ret = VM_FAULT_SIGBUS;
}
goto out;
@@ -2475,14 +2464,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = lookup_swap_cache(entry);
if (!page) {
page = swapin_readahead(entry,
- GFP_HIGHUSER_MOVABLE, vma, address);
+ GFP_HIGHUSER_MOVABLE, vma, fe->address);
if (!page) {
/*
* Back out if somebody else faulted in this pte
* while we released the pte lock.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (likely(pte_same(*page_table, orig_pte)))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+ fe->address, &fe->ptl);
+ if (likely(pte_same(*fe->pte, orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto unlock;
@@ -2491,7 +2481,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
- mem_cgroup_count_vm_event(mm, PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
} else if (PageHWPoison(page)) {
/*
* hwpoisoned dirty swapcache pages are kept for killing
@@ -2504,7 +2494,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

swapcache = page;
- locked = lock_page_or_retry(page, mm, flags);
+ locked = lock_page_or_retry(page, vma->vm_mm, fe->flags);

delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
if (!locked) {
@@ -2521,14 +2511,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
goto out_page;

- page = ksm_might_need_to_copy(page, vma, address);
+ page = ksm_might_need_to_copy(page, vma, fe->address);
if (unlikely(!page)) {
ret = VM_FAULT_OOM;
page = swapcache;
goto out_page;
}

- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
ret = VM_FAULT_OOM;
goto out_page;
}
@@ -2536,8 +2527,9 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Back out if somebody else already faulted in this pte.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*page_table, orig_pte)))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte)))
goto out_nomap;

if (unlikely(!PageUptodate(page))) {
@@ -2555,24 +2547,24 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
* must be called after the swap_free(), or it will never succeed.
*/

- inc_mm_counter_fast(mm, MM_ANONPAGES);
- dec_mm_counter_fast(mm, MM_SWAPENTS);
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
- if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+ if ((fe->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
- flags &= ~FAULT_FLAG_WRITE;
+ fe->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
- set_pte_at(mm, address, page_table, pte);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
if (page == swapcache) {
- do_page_add_anon_rmap(page, vma, address, exclusive);
+ do_page_add_anon_rmap(page, vma, fe->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, address, false);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}
@@ -2595,22 +2587,22 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_cache_release(swapcache);
}

- if (flags & FAULT_FLAG_WRITE) {
- ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+ if (fe->flags & FAULT_FLAG_WRITE) {
+ ret |= do_wp_page(fe, pte);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
goto out;
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, page_table);
+ update_mmu_cache(vma, fe->address, fe->pte);
unlock:
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
out:
return ret;
out_nomap:
mem_cgroup_cancel_charge(page, memcg, false);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
out_page:
unlock_page(page);
out_release:
@@ -2661,37 +2653,36 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags)
+static int do_anonymous_page(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
struct mem_cgroup *memcg;
struct page *page;
- spinlock_t *ptl;
pte_t entry;

- pte_unmap(page_table);
+ pte_unmap(fe->pte);

/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;

/* Check if we need to add a guard page to the stack */
- if (check_stack_guard_page(vma, address) < 0)
+ if (check_stack_guard_page(vma, fe->address) < 0)
return VM_FAULT_SIGSEGV;

/* Use the zero-page for reads */
- if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
- entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
+ if (!(fe->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm)) {
+ entry = pte_mkspecial(pfn_pte(my_zero_pfn(fe->address),
vma->vm_page_prot));
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_none(*page_table))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_none(*fe->pte))
goto unlock;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(page_table, ptl);
- return handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return handle_userfault(fe, VM_UFFD_MISSING);
}
goto setpte;
}
@@ -2699,11 +2690,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_zeroed_user_highpage_movable(vma, address);
+ page = alloc_zeroed_user_highpage_movable(vma, fe->address);
if (!page)
goto oom;

- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
goto oom_free_page;

/*
@@ -2717,30 +2708,30 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));

- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_none(*page_table))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_none(*fe->pte))
goto release;

/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
mem_cgroup_cancel_charge(page, memcg, false);
page_cache_release(page);
- return handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ return handle_userfault(fe, VM_UFFD_MISSING);
}

- inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address, false);
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
- set_pte_at(mm, address, page_table, entry);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);

/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, page_table);
+ update_mmu_cache(vma, fe->address, fe->pte);
unlock:
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
release:
mem_cgroup_cancel_charge(page, memcg, false);
@@ -2757,16 +2748,16 @@ oom:
* released depending on flags and vma->vm_ops->fault() return value.
* See filemap_fault() and __lock_page_retry().
*/
-static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags,
- struct page *cow_page, struct page **page)
+static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
+ struct page *cow_page, struct page **page)
{
+ struct vm_area_struct *vma = fe->vma;
struct vm_fault vmf;
int ret;

- vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.virtual_address = (void __user *)(fe->address & PAGE_MASK);
vmf.pgoff = pgoff;
- vmf.flags = flags;
+ vmf.flags = fe->flags;
vmf.page = NULL;
vmf.gfp_mask = __get_fault_gfp_mask(vma);
vmf.cow_page = cow_page;
@@ -2797,38 +2788,36 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
/**
* do_set_pte - setup new PTE entry for given page and add reverse page mapping.
*
- * @vma: virtual memory area
- * @address: user virtual address
+ * @fe: fault environment
* @page: page to map
- * @pte: pointer to target page table entry
- * @write: true, if new entry is writable
- * @anon: true, if it's anonymous page
*
- * Caller must hold page table lock relevant for @pte.
+ * Caller must hold page table lock relevant for @fe->pte.
*
* Target users are page handler itself and implementations of
* vm_ops->map_pages.
*/
-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon)
+void do_set_pte(struct fault_env *fe, struct page *page)
{
+ struct vm_area_struct *vma = fe->vma;
+ bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (anon) {
+ /* copy-on-write page */
+ if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address, false);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page);
}
- set_pte_at(vma->vm_mm, address, pte, entry);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);

/* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, address, pte);
+ update_mmu_cache(vma, fe->address, fe->pte);
}

static unsigned long fault_around_bytes __read_mostly =
@@ -2895,57 +2884,53 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pgoff_t pgoff, unsigned int flags)
+static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
{
- unsigned long start_addr, nr_pages, mask;
- pgoff_t max_pgoff;
- struct vm_fault vmf;
+ unsigned long address = fe->address, start_addr, nr_pages, mask;
+ pte_t *pte = fe->pte;
+ pgoff_t end_pgoff;
int off;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;

- start_addr = max(address & mask, vma->vm_start);
- off = ((address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- pte -= off;
- pgoff -= off;
+ start_addr = max(fe->address & mask, fe->vma->vm_start);
+ off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+ fe->pte -= off;
+ start_pgoff -= off;

/*
- * max_pgoff is either end of page table or end of vma
- * or fault_around_pages() from pgoff, depending what is nearest.
+ * end_pgoff is either end of page table or end of vma
+ * or fault_around_pages() from start_pgoff, depending what is nearest.
*/
- max_pgoff = pgoff - ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ end_pgoff = start_pgoff -
+ ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
- max_pgoff = min3(max_pgoff, vma_pages(vma) + vma->vm_pgoff - 1,
- pgoff + nr_pages - 1);
+ end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
+ start_pgoff + nr_pages - 1);

/* Check if it makes any sense to call ->map_pages */
- while (!pte_none(*pte)) {
- if (++pgoff > max_pgoff)
- return;
- start_addr += PAGE_SIZE;
- if (start_addr >= vma->vm_end)
- return;
- pte++;
+ fe->address = start_addr;
+ while (!pte_none(*fe->pte)) {
+ if (++start_pgoff > end_pgoff)
+ goto out;
+ fe->address += PAGE_SIZE;
+ if (fe->address >= fe->vma->vm_end)
+ goto out;
+ fe->pte++;
}

- vmf.virtual_address = (void __user *) start_addr;
- vmf.pte = pte;
- vmf.pgoff = pgoff;
- vmf.max_pgoff = max_pgoff;
- vmf.flags = flags;
- vmf.gfp_mask = __get_fault_gfp_mask(vma);
- vma->vm_ops->map_pages(vma, &vmf);
+ fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+out:
+ /* restore fault_env */
+ fe->pte = pte;
+ fe->address = address;
}

-static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
- spinlock_t *ptl;
- pte_t *pte;
int ret = 0;

/*
@@ -2954,64 +2939,64 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- do_fault_around(vma, address, pte, pgoff, flags);
- if (!pte_same(*pte, orig_pte))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ do_fault_around(fe, pgoff);
+ if (!pte_same(*fe->pte, orig_pte))
goto unlock_out;
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
}

- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
page_cache_release(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, false, false);
+ do_set_pte(fe, fault_page);
unlock_page(fault_page);
unlock_out:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return ret;
}

-static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
struct mem_cgroup *memcg;
- spinlock_t *ptl;
- pte_t *pte;
int ret;

if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;

- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, fe->address);
if (!new_page)
return VM_FAULT_OOM;

- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
+ if (mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
page_cache_release(new_page);
return VM_FAULT_OOM;
}

- ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+ ret = __do_fault(fe, pgoff, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

if (fault_page)
- copy_user_highpage(new_page, fault_page, address, vma);
+ copy_user_highpage(new_page, fault_page, fe->address, vma);
__SetPageUptodate(new_page);

- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
page_cache_release(fault_page);
@@ -3024,10 +3009,10 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
goto uncharge_out;
}
- do_set_pte(vma, address, new_page, pte, true, true);
+ do_set_pte(fe, new_page);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
page_cache_release(fault_page);
@@ -3045,18 +3030,15 @@ uncharge_out:
return ret;
}

-static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
struct address_space *mapping;
- spinlock_t *ptl;
- pte_t *pte;
int dirtied = 0;
int ret, tmp;

- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

@@ -3066,7 +3048,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (vma->vm_ops->page_mkwrite) {
unlock_page(fault_page);
- tmp = do_page_mkwrite(vma, fault_page, address);
+ tmp = do_page_mkwrite(vma, fault_page, fe->address);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
page_cache_release(fault_page);
@@ -3074,15 +3056,16 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
}

- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
page_cache_release(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, true, false);
- pte_unmap_unlock(pte, ptl);
+ do_set_pte(fe, fault_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);

if (set_page_dirty(fault_page))
dirtied = 1;
@@ -3114,23 +3097,20 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+static int do_fault(struct fault_env *fe, pte_t orig_pte)
{
- pgoff_t pgoff = linear_page_index(vma, address);
+ struct vm_area_struct *vma = fe->vma;
+ pgoff_t pgoff = linear_page_index(vma, fe->address);

- pte_unmap(page_table);
+ pte_unmap(fe->pte);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
- if (!(flags & FAULT_FLAG_WRITE))
- return do_read_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
+ if (!(fe->flags & FAULT_FLAG_WRITE))
+ return do_read_fault(fe, pgoff, orig_pte);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
- return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+ return do_cow_fault(fe, pgoff, orig_pte);
+ return do_shared_fault(fe, pgoff, orig_pte);
}

static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3148,11 +3128,10 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
return mpol_misplaced(page, vma, addr);
}

-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static int do_numa_page(struct fault_env *fe, pte_t pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *page = NULL;
- spinlock_t *ptl;
int page_nid = -1;
int last_cpupid;
int target_nid;
@@ -3172,10 +3151,10 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* page table entry is not accessible, so there would be no
* concurrent hardware modifications to the PTE.
*/
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*ptep, pte))) {
- pte_unmap_unlock(ptep, ptl);
+ fe->ptl = pte_lockptr(vma->vm_mm, fe->pmd);
+ spin_lock(fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
goto out;
}

@@ -3184,18 +3163,18 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
- set_pte_at(mm, addr, ptep, pte);
- update_mmu_cache(vma, addr, ptep);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
+ update_mmu_cache(vma, fe->address, fe->pte);

- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_page(vma, fe->address, pte);
if (!page) {
- pte_unmap_unlock(ptep, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}

/* TODO: handle PTE-mapped THP */
if (PageCompound(page)) {
- pte_unmap_unlock(ptep, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}

@@ -3219,8 +3198,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
- pte_unmap_unlock(ptep, ptl);
+ target_nid = numa_migrate_prep(page, vma, fe->address, page_nid,
+ &flags);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (target_nid == -1) {
put_page(page);
goto out;
@@ -3240,24 +3220,24 @@ out:
return 0;
}

-static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+static int create_huge_pmd(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
if (vma_is_anonymous(vma))
- return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
+ return do_huge_pmd_anonymous_page(fe);
if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ return vma->vm_ops->pmd_fault(vma, fe->address, fe->pmd,
+ fe->flags);
return VM_FAULT_FALLBACK;
}

-static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, pmd_t orig_pmd,
- unsigned int flags)
+static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
{
- if (vma_is_anonymous(vma))
- return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
- if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ if (vma_is_anonymous(fe->vma))
+ return do_huge_pmd_wp_page(fe, orig_pmd);
+ if (fe->vma->vm_ops->pmd_fault)
+ return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
+ fe->flags);
return VM_FAULT_FALLBACK;
}

@@ -3277,12 +3257,9 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;
- spinlock_t *ptl;

/*
* some architectures can have larger ptes than wordsize,
@@ -3292,37 +3269,34 @@ static int handle_pte_fault(struct mm_struct *mm,
* we later double check anyway with the ptl lock held. So here
* a barrier will do.
*/
- entry = *pte;
+ entry = *fe->pte;
barrier();
if (!pte_present(entry)) {
if (pte_none(entry)) {
- if (vma_is_anonymous(vma))
- return do_anonymous_page(mm, vma, address,
- pte, pmd, flags);
+ if (vma_is_anonymous(fe->vma))
+ return do_anonymous_page(fe);
else
- return do_fault(mm, vma, address, pte, pmd,
- flags, entry);
+ return do_fault(fe, entry);
}
- return do_swap_page(mm, vma, address,
- pte, pmd, flags, entry);
+ return do_swap_page(fe, entry);
}

if (pte_protnone(entry))
- return do_numa_page(mm, vma, address, entry, pte, pmd);
+ return do_numa_page(fe, entry);

- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*pte, entry)))
+ fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
+ spin_lock(fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, entry)))
goto unlock;
- if (flags & FAULT_FLAG_WRITE) {
+ if (fe->flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
- return do_wp_page(mm, vma, address,
- pte, pmd, ptl, entry);
+ return do_wp_page(fe, entry);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
- if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
- update_mmu_cache(vma, address, pte);
+ if (ptep_set_access_flags(fe->vma, fe->address, fe->pte, entry,
+ fe->flags & FAULT_FLAG_WRITE)) {
+ update_mmu_cache(fe->vma, fe->address, fe->pte);
} else {
/*
* This is needed only for protection faults but the arch code
@@ -3330,11 +3304,11 @@ static int handle_pte_fault(struct mm_struct *mm,
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
- if (flags & FAULT_FLAG_WRITE)
- flush_tlb_fix_spurious_fault(vma, address);
+ if (fe->flags & FAULT_FLAG_WRITE)
+ flush_tlb_fix_spurious_fault(fe->vma, fe->address);
}
unlock:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}

@@ -3347,46 +3321,42 @@ unlock:
static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
{
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = flags,
+ };
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- if (unlikely(is_vm_hugetlb_page(vma)))
- return hugetlb_fault(mm, vma, address, flags);

pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
if (!pud)
return VM_FAULT_OOM;
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
+ fe.pmd = pmd_alloc(mm, pud, address);
+ if (!fe.pmd)
return VM_FAULT_OOM;
- if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
- int ret = create_huge_pmd(mm, vma, address, pmd, flags);
+ if (pmd_none(*fe.pmd) && transparent_hugepage_enabled(vma)) {
+ int ret = create_huge_pmd(&fe);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- pmd_t orig_pmd = *pmd;
+ pmd_t orig_pmd = *fe.pmd;
int ret;

barrier();
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
- unsigned int dirty = flags & FAULT_FLAG_WRITE;
-
if (pmd_protnone(orig_pmd))
- return do_huge_pmd_numa_page(mm, vma, address,
- orig_pmd, pmd);
+ return do_huge_pmd_numa_page(&fe, orig_pmd);

- if (dirty && !pmd_write(orig_pmd)) {
- ret = wp_huge_pmd(mm, vma, address, pmd,
- orig_pmd, flags);
+ if ((fe.flags & FAULT_FLAG_WRITE) &&
+ !pmd_write(orig_pmd)) {
+ ret = wp_huge_pmd(&fe, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- huge_pmd_set_accessed(mm, vma, address, pmd,
- orig_pmd, dirty);
+ huge_pmd_set_accessed(&fe, orig_pmd);
return 0;
}
}
@@ -3397,7 +3367,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* run pte_offset_map on the pmd, if an huge pmd could
* materialize from under us from a different thread.
*/
- if (unlikely(pte_alloc(mm, pmd, address)))
+ if (unlikely(pte_alloc(fe.vma->vm_mm, fe.pmd, fe.address)))
return VM_FAULT_OOM;
/*
* If a huge pmd materialized under us just retry later. Use
@@ -3410,7 +3380,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* through an atomic read in C, which is what pmd_trans_unstable()
* provides.
*/
- if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
+ if (unlikely(pmd_trans_unstable(fe.pmd) || pmd_devmap(*fe.pmd)))
return 0;
/*
* A regular pmd is established and it can't morph into a huge pmd
@@ -3418,9 +3388,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* read mode and khugepaged takes it in write mode. So now it's
* safe to run pte_offset_map().
*/
- pte = pte_offset_map(pmd, address);
+ fe.pte = pte_offset_map(fe.pmd, fe.address);

- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+ return handle_pte_fault(&fe);
}

/*
@@ -3449,7 +3419,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();

- ret = __handle_mm_fault(vma, address, flags);
+ if (unlikely(is_vm_hugetlb_page(vma)))
+ ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+ else
+ ret = __handle_mm_fault(vma, address, flags);

if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
diff --git a/mm/nommu.c b/mm/nommu.c
index 6402f2715d48..06c374d73797 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1813,7 +1813,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
{
BUG();
}
--
2.7.0

2016-03-03 17:07:03

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 07/29] mm: postpone page table allocation until we have page to map

The idea (and most of the code) is again borrowed from Hugh's patchset on
huge tmpfs[1].

Instead of allocating the pte page table upfront, we postpone this until we
have the page to map in hand. This approach opens the possibility of mapping
the page as huge if the filesystem supports it.

Compared to Hugh's patch, I've pushed page table allocation a bit further:
into do_set_pte(). This way we can postpone allocation even in the
faultaround case without moving do_fault_around() after __do_fault().

do_set_pte() got renamed to alloc_set_pte() as it can allocate a page
table if required.

[1] http://lkml.kernel.org/r/[email protected]
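
For reference, a condensed sketch of what the new helper boils down to
(illustration only, not part of the patch: error paths, memcg charging and
rmap/counter bookkeeping are omitted, and the *_sketch name is made up --
see the real alloc_set_pte() in the hunks below):

static int alloc_set_pte_sketch(struct fault_env *fe, struct page *page)
{
        pte_t entry;

        /* No pte page table mapped yet: allocate one (or reuse the
         * pre-allocated table) and map it, taking fe->ptl. */
        if (!fe->pte) {
                int ret = pte_alloc_one_map(fe);

                if (ret)
                        return ret;
        }

        /* Re-check under ptl: somebody may have installed a pte meanwhile */
        if (unlikely(!pte_none(*fe->pte)))
                return VM_FAULT_NOPAGE;

        entry = mk_pte(page, fe->vma->vm_page_prot);
        if (fe->flags & FAULT_FLAG_WRITE)
                entry = maybe_mkwrite(pte_mkdirty(entry), fe->vma);
        set_pte_at(fe->vma->vm_mm, fe->address, fe->pte, entry);

        /* No need to invalidate: a not-present page won't be cached */
        update_mmu_cache(fe->vma, fe->address, fe->pte);
        return 0;
}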

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 10 +-
mm/filemap.c | 17 +--
mm/memory.c | 297 +++++++++++++++++++++++++++++++----------------------
3 files changed, 197 insertions(+), 127 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 58c43a94d5b0..9abe8bf3016a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -300,6 +300,13 @@ struct fault_env {
* Protects pte page table if 'pte'
* is not NULL, otherwise pmd.
*/
+ pgtable_t prealloc_pte; /* Pre-allocated pte page table.
+ * vm_ops->map_pages() calls
+ * alloc_set_pte() from atomic context.
+ * do_fault_around() pre-allocates
+ * page table to avoid allocation from
+ * atomic context.
+ */
};

/*
@@ -580,7 +587,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-void do_set_pte(struct fault_env *fe, struct page *page);
+int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+ struct page *page);
#endif

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 3202ce17a515..de867f83dc0a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2147,11 +2147,6 @@ void filemap_map_pages(struct fault_env *fe,
start_pgoff) {
if (iter.index > end_pgoff)
break;
- fe->pte += iter.index - last_pgoff;
- fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
- last_pgoff = iter.index;
- if (!pte_none(*fe->pte))
- goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2188,8 +2183,18 @@ repeat:

if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
- do_set_pte(fe, page);
+
+ fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ if (fe->pte)
+ fe->pte += iter.index - last_pgoff;
+ last_pgoff = iter.index;
+ alloc_set_pte(fe, NULL, page);
unlock_page(page);
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*fe->pmd))
+ break;
+ /* Failed to setup page table? */
+ VM_BUG_ON(!fe->pte);
goto next;
unlock:
unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 463597eff6ed..c90d9c47c240 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2660,8 +2660,6 @@ static int do_anonymous_page(struct fault_env *fe)
struct page *page;
pte_t entry;

- pte_unmap(fe->pte);
-
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
@@ -2670,6 +2668,23 @@ static int do_anonymous_page(struct fault_env *fe)
if (check_stack_guard_page(vma, fe->address) < 0)
return VM_FAULT_SIGSEGV;

+ /*
+ * Use pte_alloc() instead of pte_alloc_map(). We can't run
+ * pte_offset_map() on pmds where a huge pmd might be created
+ * from a different thread.
+ *
+ * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+ * parallel threads are excluded by other means.
+ *
+ * Here we only have down_read(mmap_sem).
+ */
+ if (pte_alloc(vma->vm_mm, fe->pmd, fe->address))
+ return VM_FAULT_OOM;
+
+ /* See the comment in pte_alloc_one_map() */
+ if (unlikely(pmd_trans_unstable(fe->pmd)))
+ return 0;
+
/* Use the zero-page for reads */
if (!(fe->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
@@ -2785,23 +2800,76 @@ static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
return ret;
}

+static int pte_alloc_one_map(struct fault_env *fe)
+{
+ struct vm_area_struct *vma = fe->vma;
+
+ if (!pmd_none(*fe->pmd))
+ goto map_pte;
+ if (fe->prealloc_pte) {
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd))) {
+ spin_unlock(fe->ptl);
+ goto map_pte;
+ }
+
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
+ spin_unlock(fe->ptl);
+ fe->prealloc_pte = 0;
+ } else if (unlikely(pte_alloc(vma->vm_mm, fe->pmd, fe->address))) {
+ return VM_FAULT_OOM;
+ }
+map_pte:
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+ * didn't become pmd_trans_huge under us and then back to pmd_none, as
+ * a result of MADV_DONTNEED running immediately after a huge pmd fault
+ * in a different thread of this mm, in turn leading to a misleading
+ * pmd_trans_huge() retval. All we have to ensure is that it is a
+ * regular pmd that we can walk with pte_offset_map() and we can do that
+ * through an atomic read in C, which is what pmd_trans_unstable()
+ * provides.
+ */
+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+ return VM_FAULT_NOPAGE;
+
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ return 0;
+}
+
/**
- * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
+ * alloc_set_pte - setup new PTE entry for given page and add reverse page
+ * mapping. If needed, the function allocates page table or use pre-allocated.
*
* @fe: fault environment
+ * @memcg: memcg to charge page (only for private mappings)
* @page: page to map
*
- * Caller must hold page table lock relevant for @fe->pte.
+ * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
*
* Target users are page handler itself and implementations of
* vm_ops->map_pages.
*/
-void do_set_pte(struct fault_env *fe, struct page *page)
+int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+ struct page *page)
{
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;

+ if (!fe->pte) {
+ int ret = pte_alloc_one_map(fe);
+ if (ret)
+ return ret;
+ }
+
+ /* Re-check under ptl */
+ if (unlikely(!pte_none(*fe->pte)))
+ return VM_FAULT_NOPAGE;
+
flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
if (write)
@@ -2810,6 +2878,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, fe->address, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
+ lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page);
@@ -2818,6 +2888,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)

/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, fe->address, fe->pte);
+
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -2884,19 +2956,17 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
+static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
{
- unsigned long address = fe->address, start_addr, nr_pages, mask;
- pte_t *pte = fe->pte;
+ unsigned long address = fe->address, nr_pages, mask;
pgoff_t end_pgoff;
- int off;
+ int off, ret = 0;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;

- start_addr = max(fe->address & mask, fe->vma->vm_start);
- off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- fe->pte -= off;
+ fe->address = max(address & mask, fe->vma->vm_start);
+ off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
start_pgoff -= off;

/*
@@ -2904,30 +2974,45 @@ static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
* or fault_around_pages() from start_pgoff, depending what is nearest.
*/
end_pgoff = start_pgoff -
- ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ ((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
start_pgoff + nr_pages - 1);

- /* Check if it makes any sense to call ->map_pages */
- fe->address = start_addr;
- while (!pte_none(*fe->pte)) {
- if (++start_pgoff > end_pgoff)
- goto out;
- fe->address += PAGE_SIZE;
- if (fe->address >= fe->vma->vm_end)
- goto out;
- fe->pte++;
+ if (pmd_none(*fe->pmd)) {
+ fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
+ smp_wmb(); /* See comment in __pte_alloc() */
}

fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+
+ /* preallocated pagetable is unused: free it */
+ if (fe->prealloc_pte) {
+ pte_free(fe->vma->vm_mm, fe->prealloc_pte);
+ fe->prealloc_pte = 0;
+ }
+ /* Huge page is mapped? Page fault is solved */
+ if (pmd_trans_huge(*fe->pmd)) {
+ ret = VM_FAULT_NOPAGE;
+ goto out;
+ }
+
+ /* ->map_pages() haven't done anything useful. Cold page cache? */
+ if (!fe->pte)
+ goto out;
+
+ /* check if the page fault is solved */
+ fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
+ if (!pte_none(*fe->pte))
+ ret = VM_FAULT_NOPAGE;
+ pte_unmap_unlock(fe->pte, fe->ptl);
out:
- /* restore fault_env */
- fe->pte = pte;
fe->address = address;
+ fe->pte = NULL;
+ return ret;
}

-static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
@@ -2939,33 +3024,25 @@ static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- do_fault_around(fe, pgoff);
- if (!pte_same(*fe->pte, orig_pte))
- goto unlock_out;
- pte_unmap_unlock(fe->pte, fe->ptl);
+ ret = do_fault_around(fe, pgoff);
+ if (ret)
+ return ret;
}

ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, NULL, fault_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
- unlock_page(fault_page);
- page_cache_release(fault_page);
- return ret;
- }
- do_set_pte(fe, fault_page);
unlock_page(fault_page);
-unlock_out:
- pte_unmap_unlock(fe->pte, fe->ptl);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ page_cache_release(fault_page);
return ret;
}

-static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
@@ -2993,26 +3070,9 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
copy_user_highpage(new_page, fault_page, fe->address, vma);
__SetPageUptodate(new_page);

- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, memcg, new_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
- if (fault_page) {
- unlock_page(fault_page);
- page_cache_release(fault_page);
- } else {
- /*
- * The fault handler has no page to lock, so it holds
- * i_mmap_lock for read to protect against truncate.
- */
- i_mmap_unlock_read(vma->vm_file->f_mapping);
- }
- goto uncharge_out;
- }
- do_set_pte(fe, new_page);
- mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
- pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
page_cache_release(fault_page);
@@ -3023,6 +3083,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
*/
i_mmap_unlock_read(vma->vm_file->f_mapping);
}
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ goto uncharge_out;
return ret;
uncharge_out:
mem_cgroup_cancel_charge(new_page, memcg, false);
@@ -3030,7 +3092,7 @@ uncharge_out:
return ret;
}

-static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
@@ -3056,16 +3118,15 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
}
}

- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, NULL, fault_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+ VM_FAULT_RETRY))) {
unlock_page(fault_page);
page_cache_release(fault_page);
return ret;
}
- do_set_pte(fe, fault_page);
- pte_unmap_unlock(fe->pte, fe->ptl);

if (set_page_dirty(fault_page))
dirtied = 1;
@@ -3097,20 +3158,19 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int do_fault(struct fault_env *fe, pte_t orig_pte)
+static int do_fault(struct fault_env *fe)
{
struct vm_area_struct *vma = fe->vma;
pgoff_t pgoff = linear_page_index(vma, fe->address);

- pte_unmap(fe->pte);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
if (!(fe->flags & FAULT_FLAG_WRITE))
- return do_read_fault(fe, pgoff, orig_pte);
+ return do_read_fault(fe, pgoff);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(fe, pgoff, orig_pte);
- return do_shared_fault(fe, pgoff, orig_pte);
+ return do_cow_fault(fe, pgoff);
+ return do_shared_fault(fe, pgoff);
}

static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3250,37 +3310,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with pte unmapped and unlocked.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
+ * concurrent faults).
*
- * The mmap_sem may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * The mmap_sem may have been released depending on flags and our return value.
+ * See filemap_fault() and __lock_page_or_retry().
*/
static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;

- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
- * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
- * The code below just needs a consistent view for the ifs and
- * we later double check anyway with the ptl lock held. So here
- * a barrier will do.
- */
- entry = *fe->pte;
- barrier();
- if (!pte_present(entry)) {
+ if (unlikely(pmd_none(*fe->pmd))) {
+ /*
+ * Leave __pte_alloc() until later: because vm_ops->fault may
+ * want to allocate huge page, and if we expose page table
+ * for an instant, it will be difficult to retract from
+ * concurrent faults and from rmap lookups.
+ */
+ } else {
+ /* See comment in pte_alloc_one_map() */
+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+ return 0;
+ /*
+ * A regular pmd is established and it can't morph into a huge
+ * pmd from under us anymore at this point because we hold the
+ * mmap_sem read mode and khugepaged takes it in write mode.
+ * So now it's safe to run pte_offset_map().
+ */
+ fe->pte = pte_offset_map(fe->pmd, fe->address);
+
+ entry = *fe->pte;
+
+ /*
+ * some architectures can have larger ptes than wordsize,
+ * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
+ * CONFIG_32BIT=y, so READ_ONCE or ACCESS_ONCE cannot guarantee
+ * atomic accesses. The code below just needs a consistent
+ * view for the ifs and we later double check anyway with the
+ * ptl lock held. So here a barrier will do.
+ */
+ barrier();
if (pte_none(entry)) {
- if (vma_is_anonymous(fe->vma))
- return do_anonymous_page(fe);
- else
- return do_fault(fe, entry);
+ pte_unmap(fe->pte);
+ fe->pte = NULL;
}
- return do_swap_page(fe, entry);
}

+ if (!fe->pte) {
+ if (vma_is_anonymous(fe->vma))
+ return do_anonymous_page(fe);
+ else
+ return do_fault(fe);
+ }
+
+ if (!pte_present(entry))
+ return do_swap_page(fe, entry);
+
if (pte_protnone(entry))
return do_numa_page(fe, entry);

@@ -3362,34 +3447,6 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
}

- /*
- * Use pte_alloc() instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pte_alloc(fe.vma->vm_mm, fe.pmd, fe.address)))
- return VM_FAULT_OOM;
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
- * didn't become pmd_trans_huge under us and then back to pmd_none, as
- * a result of MADV_DONTNEED running immediately after a huge pmd fault
- * in a different thread of this mm, in turn leading to a misleading
- * pmd_trans_huge() retval. All we have to ensure is that it is a
- * regular pmd that we can walk with pte_offset_map() and we can do that
- * through an atomic read in C, which is what pmd_trans_unstable()
- * provides.
- */
- if (unlikely(pmd_trans_unstable(fe.pmd) || pmd_devmap(*fe.pmd)))
- return 0;
- /*
- * A regular pmd is established and it can't morph into a huge pmd
- * from under us anymore at this point because we hold the mmap_sem
- * read mode and khugepaged takes it in write mode. So now it's
- * safe to run pte_offset_map().
- */
- fe.pte = pte_offset_map(fe.pmd, fe.address);
-
return handle_pte_fault(&fe);
}

--
2.7.0

2016-03-03 16:52:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 03/29] mm: make remove_migration_ptes() available beyond mm/migrate.c

The patch makes remove_migration_ptes() available for use in
split_huge_page().

A new parameter 'locked' is added: as with try_to_unmap(), we need a way
to indicate that the caller holds the rmap lock.

We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP
pages are never mlocked.
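
For illustration, a sketch of the intended callers (illustration only, not
code from this patch; split_huge_page() only grows such a call later in
the series):

	/* ordinary migration path: take the rmap lock ourselves */
	remove_migration_ptes(page, newpage, false);

	/* split_huge_page() path: rmap lock already held by the caller */
	remove_migration_ptes(page, page, true);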

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/rmap.h | 2 ++
mm/migrate.c | 15 +++++++++------
2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3d975e2252d4..49eb4f8ebac9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -243,6 +243,8 @@ int page_mkclean(struct page *);
*/
int try_to_munlock(struct page *);

+void remove_migration_ptes(struct page *old, struct page *new, bool locked);
+
/*
* Called by memory-failure.c to kill processes.
*/
diff --git a/mm/migrate.c b/mm/migrate.c
index 577c94b8e959..6c822a7b27e0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -172,7 +172,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
else
page_add_file_rmap(new);

- if (vma->vm_flags & VM_LOCKED)
+ if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);

/* No need to invalidate - it was non-present before */
@@ -187,14 +187,17 @@ out:
* Get rid of all migration entries and replace them by
* references to the indicated page.
*/
-static void remove_migration_ptes(struct page *old, struct page *new)
+void remove_migration_ptes(struct page *old, struct page *new, bool locked)
{
struct rmap_walk_control rwc = {
.rmap_one = remove_migration_pte,
.arg = old,
};

- rmap_walk(new, &rwc);
+ if (locked)
+ rmap_walk_locked(new, &rwc);
+ else
+ rmap_walk(new, &rwc);
}

/*
@@ -702,7 +705,7 @@ static int writeout(struct address_space *mapping, struct page *page)
* At this point we know that the migration attempt cannot
* be successful.
*/
- remove_migration_ptes(page, page);
+ remove_migration_ptes(page, page, false);

rc = mapping->a_ops->writepage(page, &wbc);

@@ -900,7 +903,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,

if (page_was_mapped)
remove_migration_ptes(page,
- rc == MIGRATEPAGE_SUCCESS ? newpage : page);
+ rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);

out_unlock_both:
unlock_page(newpage);
@@ -1070,7 +1073,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,

if (page_was_mapped)
remove_migration_ptes(hpage,
- rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage);
+ rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false);

unlock_page(new_hpage);

--
2.7.0

2016-03-03 17:08:06

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 02/29] rmap: extend try_to_unmap() to be usable by split_huge_page()

The patch adds support for two ttu_flags:

- TTU_SPLIT_HUGE_PMD would split the PMD if it's there, before trying to
unmap the page;

- TTU_RMAP_LOCKED indicates that the caller holds the relevant rmap lock;

Apart from these flags, the patch changes rwc->done to !page_mapcount()
instead of !page_mapped(). try_to_unmap() works on the pte level, so we
are really interested in whether this small page is mapped, not the
compound page it's part of.
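
For illustration, a sketch of how a future caller such as split_huge_page()
could combine the new flags (not code from this patch):

	/* rmap lock already held by the caller; split any PMD mapping first */
	ret = try_to_unmap(page, TTU_MIGRATION | TTU_SPLIT_HUGE_PMD |
				 TTU_RMAP_LOCKED);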

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 7 +++++++
include/linux/rmap.h | 3 +++
mm/huge_memory.c | 5 +----
mm/rmap.c | 24 ++++++++++++++++--------
4 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 459fd25b378e..c47067151ffd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -111,6 +111,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
__split_huge_pmd(__vma, __pmd, __address); \
} while (0)

+
+void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address);
+
#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -178,6 +181,10 @@ static inline int split_huge_page(struct page *page)
static inline void deferred_split_huge_page(struct page *page) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
+
+static inline void split_huge_pmd_address(struct vm_area_struct *vma,
+ unsigned long address) {}
+
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a5875e9b4a27..3d975e2252d4 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,6 +86,7 @@ enum ttu_flags {
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
TTU_LZFREE = 8, /* lazy free mode */
+ TTU_SPLIT_HUGE_PMD = 16, /* split huge PMD if any */

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
@@ -93,6 +94,8 @@ enum ttu_flags {
TTU_BATCH_FLUSH = (1 << 11), /* Batch TLB flushes where possible
* and caller guarantees they will
* do a final flush if necessary */
+ TTU_RMAP_LOCKED = (1 << 12) /* do not grab rmap lock:
+ * caller holds it */
};

#ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 09de368c742b..354464b484a7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3049,15 +3049,12 @@ out:
}
}

-static void split_huge_pmd_address(struct vm_area_struct *vma,
- unsigned long address)
+void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;

- VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
-
pgd = pgd_offset(vma->vm_mm, address);
if (!pgd_present(*pgd))
return;
diff --git a/mm/rmap.c b/mm/rmap.c
index 30b739ce0ffa..945933a01010 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1431,6 +1431,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
goto out;

+ if (flags & TTU_SPLIT_HUGE_PMD)
+ split_huge_pmd_address(vma, address);
pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
goto out;
@@ -1576,10 +1578,10 @@ static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
return is_vma_temporary_stack(vma);
}

-static int page_not_mapped(struct page *page)
+static int page_mapcount_is_zero(struct page *page)
{
- return !page_mapped(page);
-};
+ return !page_mapcount(page);
+}

/**
* try_to_unmap - try to remove all page table mappings to a page
@@ -1606,12 +1608,10 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
struct rmap_walk_control rwc = {
.rmap_one = try_to_unmap_one,
.arg = &rp,
- .done = page_not_mapped,
+ .done = page_mapcount_is_zero,
.anon_lock = page_lock_anon_vma_read,
};

- VM_BUG_ON_PAGE(!PageHuge(page) && PageTransHuge(page), page);
-
/*
* During exec, a temporary VMA is setup and later moved.
* The VMA is moved under the anon_vma lock but not the
@@ -1623,9 +1623,12 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
rwc.invalid_vma = invalid_migration_vma;

- ret = rmap_walk(page, &rwc);
+ if (flags & TTU_RMAP_LOCKED)
+ ret = rmap_walk_locked(page, &rwc);
+ else
+ ret = rmap_walk(page, &rwc);

- if (ret != SWAP_MLOCK && !page_mapped(page)) {
+ if (ret != SWAP_MLOCK && !page_mapcount(page)) {
ret = SWAP_SUCCESS;
if (rp.lazyfreed && !PageDirty(page))
ret = SWAP_LZFREE;
@@ -1633,6 +1636,11 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
return ret;
}

+static int page_not_mapped(struct page *page)
+{
+ return !page_mapped(page);
+};
+
/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
--
2.7.0

2016-03-03 17:08:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 01/29] rmap: introduce rmap_walk_locked()

rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
of the relevant rmap lock.

It's preparation for switching THP splitting from the custom rmap walk in
freeze_page()/unfreeze_page() to the generic one.

No support for KSM pages for now: it's not clear which lock is implied.
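
A minimal usage sketch (assuming the caller already took the lock; the
callback name is made up for illustration):

	struct rmap_walk_control rwc = {
		.rmap_one = my_rmap_one,	/* hypothetical callback */
		.arg = arg,
	};

	/* anon page: anon_vma lock held; file page: i_mmap_rwsem held */
	rmap_walk_locked(page, &rwc);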

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/rmap.h | 1 +
mm/rmap.c | 41 ++++++++++++++++++++++++++++++++---------
2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a07f42bedda3..a5875e9b4a27 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -266,6 +266,7 @@ struct rmap_walk_control {
};

int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
+int rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc);

#else /* !CONFIG_MMU */

diff --git a/mm/rmap.c b/mm/rmap.c
index 02f0bfc3c80a..30b739ce0ffa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1715,14 +1715,21 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* LOCKED.
*/
-static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
+static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
+ bool locked)
{
struct anon_vma *anon_vma;
pgoff_t pgoff;
struct anon_vma_chain *avc;
int ret = SWAP_AGAIN;

- anon_vma = rmap_walk_anon_lock(page, rwc);
+ if (locked) {
+ anon_vma = page_anon_vma(page);
+ /* anon_vma disappear under us? */
+ VM_BUG_ON_PAGE(!anon_vma, page);
+ } else {
+ anon_vma = rmap_walk_anon_lock(page, rwc);
+ }
if (!anon_vma)
return ret;

@@ -1742,7 +1749,9 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
if (rwc->done && rwc->done(page))
break;
}
- anon_vma_unlock_read(anon_vma);
+
+ if (!locked)
+ anon_vma_unlock_read(anon_vma);
return ret;
}

@@ -1759,9 +1768,10 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* LOCKED.
*/
-static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
+static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
+ bool locked)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page_mapping(page);
pgoff_t pgoff;
struct vm_area_struct *vma;
int ret = SWAP_AGAIN;
@@ -1778,7 +1788,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
return ret;

pgoff = page_to_pgoff(page);
- i_mmap_lock_read(mapping);
+ if (!locked)
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);

@@ -1795,7 +1806,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
}

done:
- i_mmap_unlock_read(mapping);
+ if (!locked)
+ i_mmap_unlock_read(mapping);
return ret;
}

@@ -1804,9 +1816,20 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc)
if (unlikely(PageKsm(page)))
return rmap_walk_ksm(page, rwc);
else if (PageAnon(page))
- return rmap_walk_anon(page, rwc);
+ return rmap_walk_anon(page, rwc, false);
+ else
+ return rmap_walk_file(page, rwc, false);
+}
+
+/* Like rmap_walk, but caller holds relevant rmap lock */
+int rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc)
+{
+ /* no ksm support for now */
+ VM_BUG_ON_PAGE(PageKsm(page), page);
+ if (PageAnon(page))
+ return rmap_walk_anon(page, rwc, true);
else
- return rmap_walk_file(page, rwc);
+ return rmap_walk_file(page, rwc, true);
}

#ifdef CONFIG_HUGETLB_PAGE
--
2.7.0

2016-03-04 04:20:23

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCHv3 00/29] huge tmpfs implementation using compound pages

On 03/03/2016 11:51 AM, Kirill A. Shutemov wrote:
> I consider it feature complete for initial step into upstream. I'll focus
> on validation now. I work with Sasha on that.

Hey Kirill,

I see the following two (separate) issues. I have never hit them before, so
I suspect that while they seem unrelated, they are somehow caused by this series.

First:

[ 1386.011801] ==================================================================

[ 1386.011901] BUG: KASAN: use-after-free in __fget+0x4fa/0x540 at addr ffff8801afe43b34

[ 1386.011922] Read of size 4 by task syz-executor/22976

[ 1386.011939] =============================================================================

[ 1386.011959] BUG filp (Not tainted): kasan: bad access detected

[ 1386.011969] -----------------------------------------------------------------------------

[ 1386.011969]

[ 1386.011976] Disabling lock debugging due to kernel taint

[ 1386.012005] INFO: Slab 0xffffea0006bf9000 objects=19 used=16 fp=0xffff8801afe40040 flags=0x2fffff80004080

[ 1386.012027] INFO: Object 0xffff8801afe43a80 @offset=14976 fp=0xbbbbbbbbbbbbbbbb

[ 1386.012027]

[ 1386.012061] Redzone ffff8801afe43a40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012087] Redzone ffff8801afe43a50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012112] Redzone ffff8801afe43a60: 02 00 00 00 68 30 00 00 3b 55 0e 00 01 00 00 00 ....h0..;U......

[ 1386.012133] Redzone ffff8801afe43a70: 00 00 00 00 00 00 00 00 40 aa 90 ac ff ff ff ff ........@.......

[ 1386.012156] Object ffff8801afe43a80: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................

[ 1386.012181] Object ffff8801afe43a90: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................

[ 1386.012206] Object ffff8801afe43aa0: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................

[ 1386.012230] Object ffff8801afe43ab0: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................

[ 1386.012251] Object ffff8801afe43ac0: 00 00 00 00 00 00 00 00 70 03 8c a1 ff ff ff ff ........p.......

[ 1386.012278] Object ffff8801afe43ad0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012298] Object ffff8801afe43ae0: 00 00 00 00 00 00 00 00 c0 65 94 ac ff ff ff ff .........e......

[ 1386.012317] Object ffff8801afe43af0: 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........

[ 1386.012333] Object ffff8801afe43b00: ff ff ff ff ff ff ff ff 00 e0 57 bc ff ff ff ff ..........W.....

[ 1386.012351] Object ffff8801afe43b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012367] Object ffff8801afe43b20: e0 b6 93 ac ff ff ff ff 00 00 00 00 00 00 00 00 ................

[ 1386.012382] Object ffff8801afe43b30: 01 80 00 00 1e 00 04 00 01 00 00 00 00 00 00 00 ................

[ 1386.012394] Object ffff8801afe43b40: 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........

[ 1386.012405] Object ffff8801afe43b50: ff ff ff ff ff ff ff ff 40 35 6a bb ff ff ff ff ........@5j.....

[ 1386.012416] Object ffff8801afe43b60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012427] Object ffff8801afe43b70: c0 c4 8c ac ff ff ff ff 78 3b e4 af 01 88 ff ff ........x;......

[ 1386.012438] Object ffff8801afe43b80: 78 3b e4 af 01 88 ff ff 00 00 00 00 00 00 00 00 x;..............

[ 1386.012450] Object ffff8801afe43b90: 38 3b e4 af 01 88 ff ff c0 df 57 bc ff ff ff ff 8;........W.....

[ 1386.012461] Object ffff8801afe43ba0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012472] Object ffff8801afe43bb0: 20 b7 93 ac ff ff ff ff 00 00 00 00 00 00 00 00 ...............

[ 1386.012483] Object ffff8801afe43bc0: 00 00 00 00 00 00 00 00 ed 1e af de ff ff ff ff ................

[ 1386.012494] Object ffff8801afe43bd0: ff ff ff ff ff ff ff ff 40 e0 57 bc ff ff ff ff [email protected].....

[ 1386.012506] Object ffff8801afe43be0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012517] Object ffff8801afe43bf0: a0 b6 93 ac ff ff ff ff 00 00 00 00 00 00 00 00 ................

[ 1386.012528] Object ffff8801afe43c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012539] Object ffff8801afe43c10: 40 b6 d5 b2 00 88 ff ff 00 00 00 00 00 00 00 00 @...............

[ 1386.012550] Object ffff8801afe43c20: 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 ........ .......

[ 1386.012561] Object ffff8801afe43c30: ff ff ff ff ff ff ff ff ........

[ 1386.012572] Redzone ffff8801afe43c38: 00 00 00 00 00 00 00 00 ........

[ 1386.012583] Padding ffff8801afe43d70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

[ 1386.012607] CPU: 1 PID: 22976 Comm: syz-executor Tainted: G B 4.5.0-rc6-next-20160301-sasha-00054-g4c13c38-dirty #2987

[ 1386.012636] 0000000000000000 ffff8800b2097d08 ffffffffa33db57d ffffffff00000001

[ 1386.012651] fffffbfff5e6cc08 0000000041b58ab3 ffffffffaecc1ee9 ffffffffa33db3e5

[ 1386.012666] 000000002e90934f ffff8801b1744000 ffffffffaecdeceb ffff8801afe43a80

[ 1386.012669] Call Trace:

[ 1386.012713] dump_stack (lib/dump_stack.c:53)
[ 1386.012731] ? arch_local_irq_restore (init/do_mounts.h:17)
[ 1386.012749] ? print_section (./arch/x86/include/asm/current.h:14 include/linux/kasan.h:35 mm/slub.c:481 mm/slub.c:512)
[ 1386.012763] print_trailer (mm/slub.c:670)
[ 1386.012778] object_err (mm/slub.c:677)
[ 1386.012794] kasan_report_error (include/linux/kasan.h:28 mm/kasan/report.c:170 mm/kasan/report.c:237)
[ 1386.012920] __asan_report_load4_noabort (mm/kasan/report.c:279)
[ 1386.012946] __fget (fs/file.c:707)
[ 1386.012996] __fget_light (fs/file.c:757)
[ 1386.013009] __fdget (fs/file.c:765)
[ 1386.013030] SyS_ioctl (include/linux/file.h:55 fs/ioctl.c:683 fs/ioctl.c:680)
[ 1386.013051] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:200)
[ 1386.013056] Memory state around the buggy address:

[ 1386.013069] ffff8801afe43a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

[ 1386.013080] ffff8801afe43a80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb

[ 1386.013090] >ffff8801afe43b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

[ 1386.013094] ^

[ 1386.013105] ffff8801afe43b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

[ 1386.013115] ffff8801afe43c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc

[ 1386.013119] ==================================================================

And second:

[ 2328.415149] ------------[ cut here ]------------

[ 2328.417960] WARNING: CPU: 2 PID: 13358 at arch/x86/mm/pat.c:986 untrack_pfn+0x24e/0x2d0

[ 2328.418852] Modules linked in:

[ 2328.419257] CPU: 2 PID: 13358 Comm: syz-executor Not tainted 4.5.0-rc6-next-20160301-sasha-00054-g4c13c38-dirty #2987

[ 2328.420445] 0000000000000000 ffff88000cad77c0 ffffffffa43db57d ffffffff00000002

[ 2328.421392] fffffbfff606cc08 0000000041b58ab3 ffffffffafcc1ee9 ffffffffa43db3e5

[ 2328.422295] ffffffffa2598b50 0000000020000000 0000000041b58ab3 ffffffffafcddf50

[ 2328.423234] Call Trace:

[ 2328.423583] dump_stack (lib/dump_stack.c:53)
[ 2328.424184] ? arch_local_irq_restore (init/do_mounts.h:17)
[ 2328.424917] ? is_module_text_address (kernel/module.c:4033)
[ 2328.425668] ? vm_insert_mixed (mm/memory.c:3737)
[ 2328.426385] ? untrack_pfn (arch/x86/mm/pat.c:986 (discriminator 3))
[ 2328.427036] __warn (kernel/panic.c:492)
[ 2328.439719] warn_slowpath_null (kernel/panic.c:528)
[ 2328.440792] untrack_pfn (arch/x86/mm/pat.c:986 (discriminator 3))
[ 2328.441606] ? track_pfn_insert (arch/x86/mm/pat.c:975)
[ 2328.442418] ? do_wp_page (mm/memory.c:1235)
[ 2328.443099] unmap_single_vma (mm/memory.c:1270)
[ 2328.443975] unmap_vmas (mm/memory.c:1320 (discriminator 3))
[ 2328.452015] exit_mmap (mm/mmap.c:2769)
[ 2328.452868] ? SyS_munmap (mm/mmap.c:2739)
[ 2328.453798] ? do_raw_spin_unlock (kernel/locking/spinlock_debug.c:160)
[ 2328.454498] ? __might_sleep (kernel/sched/core.c:7736 (discriminator 14))
[ 2328.455113] mmput (kernel/fork.c:715)
[ 2328.455666] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:92 kernel/exit.c:437 kernel/exit.c:735)
[ 2328.456251] ? mm_update_next_owner (kernel/exit.c:653)
[ 2328.456952] ? __dequeue_signal (kernel/signal.c:546)
[ 2328.457643] ? do_sigaltstack (kernel/signal.c:546)
[ 2328.458351] ? _raw_spin_unlock_irq (./arch/x86/include/asm/paravirt.h:801 include/linux/spinlock_api_smp.h:170 kernel/locking/spinlock.c:199)
[ 2328.459063] do_group_exit (include/linux/sched.h:815 kernel/exit.c:861)
[ 2328.459698] get_signal (kernel/signal.c:2327)
[ 2328.460363] do_signal (arch/x86/kernel/signal.c:784)
[ 2328.460956] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2540 kernel/locking/lockdep.c:2587)
[ 2328.461750] ? trace_hardirqs_on (kernel/locking/lockdep.c:2595)
[ 2328.462469] ? setup_sigcontext (arch/x86/kernel/signal.c:781)
[ 2328.463134] ? finish_task_switch (./arch/x86/include/asm/current.h:14 kernel/sched/core.c:2746)
[ 2328.469533] ? finish_task_switch (kernel/sched/sched.h:1101 kernel/sched/core.c:2743)
[ 2328.470303] ? rcu_read_unlock (kernel/sched/core.c:2706)
[ 2328.470966] ? SyS_futex (kernel/futex.c:3182)
[ 2328.471615] ? exit_to_usermode_loop (./arch/x86/include/asm/paravirt.h:801 arch/x86/entry/common.c:238)
[ 2328.473334] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2540 kernel/locking/lockdep.c:2587)
[ 2328.474135] exit_to_usermode_loop (arch/x86/entry/common.c:248)
[ 2328.476051] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2540 kernel/locking/lockdep.c:2587)
[ 2328.476827] syscall_return_slowpath (arch/x86/entry/common.c:283 arch/x86/entry/common.c:348)
[ 2328.478821] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:232)
[ 2328.486429] ---[ end trace be1dc5a23ab2ebe4 ]---


Thanks,
Sasha

2016-03-04 11:26:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> Truncate and punch hole that only cover part of THP range is implemented
> by zero out this part of THP.
>
> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> inconsistent results depending what pages happened to be allocated.
> Not sure if it should be considered ABI break or not.

Looks like this shouldn't be a problem. man 2 fallocate:

Within the specified range, partial filesystem blocks are zeroed,
and whole filesystem blocks are removed from the file. After a
successful call, subsequent reads from this range will return
zeroes.

It means we effectively have a 2M filesystem block size.

And I don't see any guarantee about subsequent lseek(SEEK_HOLE) behaviour.
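
A quick userspace check of what SEEK_HOLE reports (a sketch, assuming a
populated file on a huge-page-enabled tmpfs mount; headers and error
handling omitted, path is hypothetical):

	int fd = open("/mnt/tmpfs/file", O_RDWR);

	/* punch a 4k hole inside what may be a 2M extent */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  4096, 4096);

	/*
	 * With small pages this reports 4096; with a THP backing the
	 * range the hole may be reported elsewhere, although reads of
	 * the punched range still return zeroes.
	 */
	off_t hole = lseek(fd, 0, SEEK_HOLE);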

--
Kirill A. Shutemov

2016-03-04 17:40:25

by Dave Hansen

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
>> Truncate and punch hole that only cover part of THP range is implemented
>> by zero out this part of THP.
>>
>> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
>> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
>> inconsistent results depending what pages happened to be allocated.
>> Not sure if it should be considered ABI break or not.
>
> Looks like this shouldn't be a problem. man 2 fallocate:
>
> Within the specified range, partial filesystem blocks are zeroed,
> and whole filesystem blocks are removed from the file. After a
> successful call, subsequent reads from this range will return
> zeroes.
>
> It means we effectively have 2M filesystem block size.

The question is still whether this will cause problems for apps.

Isn't 2MB a quite unusual block size? Wouldn't some files on a tmpfs
filesystem act like they have a 2M blocksize and others like they have
4k? Would that confuse apps?

2016-03-04 19:38:59

by Hugh Dickins

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Fri, 4 Mar 2016, Dave Hansen wrote:
> On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> >> Truncate and punch hole that only cover part of THP range is implemented
> >> by zero out this part of THP.
> >>
> >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> >> inconsistent results depending what pages happened to be allocated.
> >> Not sure if it should be considered ABI break or not.
> >
> > Looks like this shouldn't be a problem. man 2 fallocate:
> >
> > Within the specified range, partial filesystem blocks are zeroed,
> > and whole filesystem blocks are removed from the file. After a
> > successful call, subsequent reads from this range will return
> > zeroes.
> >
> > It means we effectively have 2M filesystem block size.
>
> The question is still whether this will case problems for apps.
>
> Isn't 2MB a quote unusual block size? Wouldn't some files on a tmpfs
> filesystem act like they have a 2M blocksize and others like they have
> 4k? Would that confuse apps?

At risk of addressing the tip of an iceberg, before diving down to
scope out the rest of the iceberg...

So far as the behaviour of lseek(,,SEEK_HOLE) goes, I agree with Kirill:
I don't think it matters to anyone if it skips some zeroed small pages
within a hugepage. It may cause some artificial tests of holepunch and
SEEK_HOLE to fail, and it ought to be documented as a limitation from
choosing to enable THP (Kirill's way) on a filesystem, but I don't think
it's an ABI break to worry about: anyone who cares just shouldn't enable.

(Though in the case of my huge tmpfs, it's the reverse: the small hole
punch splits the hugepage; but it's natural that Kirill's way would try
to hold on to its compound pages for longer than I do, and that's fine
so long as it's all consistent.)

But I may disagree with "we effectively have 2M filesystem block size",
beyond the SEEK_HOLE case. If we're emulating hugetlbfs in tmpfs, sure,
we would have 2M filesystem block size. But if we're enabling THP
(emphasis on T for Transparent) in tmpfs (or another filesystem), then
when it matters it must act as if the block size is the 4k (or whatever)
it usually is. When it matters? Approaching memcg limit or ENOSPC
spring to mind.

Ah, but suppose someone holepunches out most of each 2M page: they would
expect the memcg not to be charged for those holes (just as when they
munmap most of an anonymous THP) - that does suggest splitting is needed.

Hugh

2016-03-04 22:48:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> On Fri, 4 Mar 2016, Dave Hansen wrote:
> > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > >> Truncate and punch hole that only cover part of THP range is implemented
> > >> by zero out this part of THP.
> > >>
> > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > >> inconsistent results depending what pages happened to be allocated.
> > >> Not sure if it should be considered ABI break or not.
> > >
> > > Looks like this shouldn't be a problem. man 2 fallocate:
> > >
> > > Within the specified range, partial filesystem blocks are zeroed,
> > > and whole filesystem blocks are removed from the file. After a
> > > successful call, subsequent reads from this range will return
> > > zeroes.
> > >
> > > It means we effectively have 2M filesystem block size.
> >
> > The question is still whether this will case problems for apps.
> >
> > Isn't 2MB a quote unusual block size? Wouldn't some files on a tmpfs
> > filesystem act like they have a 2M blocksize and others like they have
> > 4k? Would that confuse apps?
>
> At risk of addressing the tip of an iceberg, before diving down to
> scope out the rest of the iceberg...
>
> So far as the behaviour of lseek(,,SEEK_HOLE) goes, I agree with Kirill:
> I don't think it matters to anyone if it skips some zeroed small pages
> within a hugepage. It may cause some artificial tests of holepunch and
> SEEK_HOLE to fail, and it ought to be documented as a limitation from
> choosing to enable THP (Kirill's way) on a filesystem, but I don't think
> it's an ABI break to worry about: anyone who cares just shouldn't enable.
>
> (Though in the case of my huge tmpfs, it's the reverse: the small hole
> punch splits the hugepage; but it's natural that Kirill's way would try
> to hold on to its compound pages for longer than I do, and that's fine
> so long as it's all consistent.)
>
> But I may disagree with "we effectively have 2M filesystem block size",
> beyond the SEEK_HOLE case. If we're emulating hugetlbfs in tmpfs, sure,
> we would have 2M filesystem block size. But if we're enabling THP
> (emphasis on T for Transparent) in tmpfs (or another filesystem), then
> when it matters it must act as if the block size is the 4k (or whatever)
> it usually is. When it matters? Approaching memcg limit or ENOSPC
> spring to mind.
>
> Ah, but suppose someone holepunches out most of each 2M page: they would
> expect the memcg not to be charged for those holes (just as when they
> munmap most of an anonymous THP) - that does suggest splitting is needed.

Hmm.. As split_huge_page() can fail, we would need to propagate this
error to userspace. This potentially triggers some other user-visible
effect: EBUSY is not on the list of fallocate(2) error codes.

I think we can invent a way to track whether a THP has hole-punched
subpages and prevent the compound page from being mapped with a PMD, or
prevent mapping these subpages.

But I'm reluctant to do it upfront until real users emerge.

I would propose to wait and see what the user demands will be. Maybe we
are overthinking the situation.

--
Kirill A. Shutemov

2016-03-04 22:53:16

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3 00/29] huge tmpfs implementation using compound pages

On Thu, Mar 03, 2016 at 11:20:00PM -0500, Sasha Levin wrote:
> On 03/03/2016 11:51 AM, Kirill A. Shutemov wrote:
> > I consider it feature complete for initial step into upstream. I'll focus
> > on validation now. I work with Sasha on that.
>
> Hey Kirill,
>
> I see the following two (separate) issues. I haven't hit them ever before, so
> I suspect that while they seem unrelated, they are somehow caused by this series.

As I said in private, I failed to derive any useful information from these
crashes. I don't see anything related to the patchset.

Of course, it can be related to the patchset, but there is not enough
information to connect the dots.

--
Kirill A. Shutemov

2016-03-04 23:05:54

by Dave Chinner

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> On Fri, 4 Mar 2016, Dave Hansen wrote:
> > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > >> Truncate and punch hole that only cover part of THP range is implemented
> > >> by zero out this part of THP.
> > >>
> > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > >> inconsistent results depending what pages happened to be allocated.
> > >> Not sure if it should be considered ABI break or not.
> > >
> > > Looks like this shouldn't be a problem. man 2 fallocate:
> > >
> > > Within the specified range, partial filesystem blocks are zeroed,
> > > and whole filesystem blocks are removed from the file. After a
> > > successful call, subsequent reads from this range will return
> > > zeroes.
> > >
> > > It means we effectively have 2M filesystem block size.
> >
> > The question is still whether this will case problems for apps.
> >
> > Isn't 2MB a quote unusual block size? Wouldn't some files on a tmpfs
> > filesystem act like they have a 2M blocksize and others like they have
> > 4k? Would that confuse apps?
>
> At risk of addressing the tip of an iceberg, before diving down to
> scope out the rest of the iceberg...
....

> (Though in the case of my huge tmpfs, it's the reverse: the small hole
> punch splits the hugepage; but it's natural that Kirill's way would try
> to hold on to its compound pages for longer than I do, and that's fine
> so long as it's all consistent.)
....
> Ah, but suppose someone holepunches out most of each 2M page: they would
> expect the memcg not to be charged for those holes (just as when they
> munmap most of an anonymous THP) - that does suggest splitting is needed.

I think filesystems will expect splitting to happen. They call
truncate_pagecache_range() on the region that the hole is being
punched out of, and they expect page cache pages over this range to
be unmapped, invalidated and then removed from the mapping tree as a
result. Also, most filesystems think the page cache only contains
PAGE_CACHE_SIZE mappings, so they are completely unaware of the
limitations THP might have when it comes to invalidation.

IOWs, if this range is not aligned to huge page boundaries, then it
implies the huge page is either split into PAGE_SIZE mappings and
then the range is invalidated as expected, or it is completely
invalidated and then refaulted on future accesses which determine if
THP or normal pages are used for the page being faulted....

Just to complicate things, keep in mind that some filesystems may
have a PAGE_SIZE block size, but can be convinced to only
allocate/punch/truncate/etc extents on larger alignments on a
per-inode basis. IOWs, THP vs hole punch behaviour is not actually
a filesystem type specific behaviour - it's per-inode specific...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-04 23:24:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote:
> On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> > On Fri, 4 Mar 2016, Dave Hansen wrote:
> > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > >> Truncate and punch hole that only cover part of THP range is implemented
> > > >> by zero out this part of THP.
> > > >>
> > > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > > >> inconsistent results depending what pages happened to be allocated.
> > > >> Not sure if it should be considered ABI break or not.
> > > >
> > > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > >
> > > > Within the specified range, partial filesystem blocks are zeroed,
> > > > and whole filesystem blocks are removed from the file. After a
> > > > successful call, subsequent reads from this range will return
> > > > zeroes.
> > > >
> > > > It means we effectively have 2M filesystem block size.
> > >
> > > The question is still whether this will case problems for apps.
> > >
> > > Isn't 2MB a quote unusual block size? Wouldn't some files on a tmpfs
> > > filesystem act like they have a 2M blocksize and others like they have
> > > 4k? Would that confuse apps?
> >
> > At risk of addressing the tip of an iceberg, before diving down to
> > scope out the rest of the iceberg...
> ....
>
> > (Though in the case of my huge tmpfs, it's the reverse: the small hole
> > punch splits the hugepage; but it's natural that Kirill's way would try
> > to hold on to its compound pages for longer than I do, and that's fine
> > so long as it's all consistent.)
> ....
> > Ah, but suppose someone holepunches out most of each 2M page: they would
> > expect the memcg not to be charged for those holes (just as when they
> > munmap most of an anonymous THP) - that does suggest splitting is needed.
>
> I think filesystems will expect splitting to happen. They call
> truncate_pagecache_range() on the region that the hole is being
> punched out of, and they expect page cache pages over this range to
> be unmapped, invalidated and then removed from the mapping tree as a
> result. Also, most filesystems think the page cache only contains
> PAGE_CACHE_SIZE mappings, so they are completely unaware of the
> limitations THP might have when it comes to invalidation.
>
> IOWs, if this range is not aligned to huge page boundaries, then it
> implies the huge page is either split into PAGE_SIZE mappings and
> then the range is invalidated as expected, or it is completely
> invalidated and then refaulted on future accesses which determine if
> THP or normal pages are used for the page being faulted....

The filesystem in question is tmpfs and complete invalidation is not
always an option. For other filesystems it can also be unavailable
immediately if the page is dirty (the dirty flag is tracked on a per-THP
basis at the moment).

Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
-EBUSY (or another errno of your choice) if we cannot split the page
right away?

> Just to complicate things, keep in mind that some filesystems may
> have a PAGE_SIZE block size, but can be convinced to only
> allocate/punch/truncate/etc extents on larger alignments on a
> per-inode basis. IOWs, THP vs hole punch behaviour is not actually
> a filesystem type specific behaviour - it's per-inode specific...

There is also a similar question about THP vs. i_size vs. SIGBUS.

For small pages, an application will not get SIGBUS on an mmap()ed file
unless it tries to access beyond round_up(i_size, PAGE_CACHE_SIZE) - 1.

For THP it would be round_up(i_size, HPAGE_PMD_SIZE) - 1.

Is it a problem?
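
To illustrate the question (a sketch only; assumes fd is an open
descriptor for a file whose i_size is 4096 on a THP-enabled mount):

	/* i_size == 4096; map a PMD-aligned 2M range */
	char *p = mmap(NULL, 2UL << 20, PROT_READ, MAP_SHARED, fd, 0);

	/*
	 * With small pages, touching p[4096] raises SIGBUS; if the file
	 * is backed by a THP, accesses up to
	 * round_up(i_size, HPAGE_PMD_SIZE) - 1 would read zeroes instead.
	 */
	char c = p[4096];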

--
Kirill A. Shutemov

2016-03-05 22:38:27

by Dave Chinner

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote:
> On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote:
> > On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> > > On Fri, 4 Mar 2016, Dave Hansen wrote:
> > > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > > >> Truncate and punch hole that only cover part of THP range is implemented
> > > > >> by zero out this part of THP.
> > > > >>
> > > > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > > > >> inconsistent results depending what pages happened to be allocated.
> > > > >> Not sure if it should be considered ABI break or not.
> > > > >
> > > > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > > >
> > > > > Within the specified range, partial filesystem blocks are zeroed,
> > > > > and whole filesystem blocks are removed from the file. After a
> > > > > successful call, subsequent reads from this range will return
> > > > > zeroes.
> > > > >
> > > > > It means we effectively have 2M filesystem block size.
> > > >
> > > > The question is still whether this will case problems for apps.
> > > >
> > > > Isn't 2MB a quote unusual block size? Wouldn't some files on a tmpfs
> > > > filesystem act like they have a 2M blocksize and others like they have
> > > > 4k? Would that confuse apps?
> > >
> > > At risk of addressing the tip of an iceberg, before diving down to
> > > scope out the rest of the iceberg...
> > ....
> >
> > > (Though in the case of my huge tmpfs, it's the reverse: the small hole
> > > punch splits the hugepage; but it's natural that Kirill's way would try
> > > to hold on to its compound pages for longer than I do, and that's fine
> > > so long as it's all consistent.)
> > ....
> > > Ah, but suppose someone holepunches out most of each 2M page: they would
> > > expect the memcg not to be charged for those holes (just as when they
> > > munmap most of an anonymous THP) - that does suggest splitting is needed.
> >
> > I think filesystems will expect splitting to happen. They call
> > truncate_pagecache_range() on the region that the hole is being
> > punched out of, and they expect page cache pages over this range to
> > be unmapped, invalidated and then removed from the mapping tree as a
> > result. Also, most filesystems think the page cache only contains
> > PAGE_CACHE_SIZE mappings, so they are completely unaware of the
> > limitations THP might have when it comes to invalidation.
> >
> > IOWs, if this range is not aligned to huge page boundaries, then it
> > implies the huge page is either split into PAGE_SIZE mappings and
> > then the range is invalidated as expected, or it is completely
> > invalidated and then refaulted on future accesses which determine if
> > THP or normal pages are used for the page being faulted....
>
> The filesystem in question is tmpfs and complete invalidation is not
> always an option.

Then your two options are: splitting the page and rerunning the hole
punch, or simply zeroing the sections of the THP rather than trying
to punch out the backing store.

> For other filesystems it also can be unavailable
> immediately if the page is dirty (the dirty flag is tracked on per-THP
> basis at the moment).

Filesystems with persistent storage flush the range being punched
first to ensure that partial blocks are correctly written before we
start freeing the backing store. This is needed on XFS to ensure
hole punch plays nicely with delayed allocation and other extent
based operations. Hence we know that we have clean pages over the
hole we are about to punch and so there is no reason the
invalidation should *ever* fail.

tmpfs is a special snowflake when it comes to these fallocate based
filesystem layout manipulation functions - it does not have
persistent storage, so you have to do things very differently to
ensure that data is not lost.

> Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
> -EBUSY (or other errno on your choice), if we cannot split the page
> right away?

Which means THP are not transparent any more. What does an
application do when it gets an EBUSY, anyway? It needs to punch a
hole, and failure to do so could result in data corruption or stale
data exposure if the hole isn't punched and the data purged from the
range.

And it's not just hole punching that has this problem. Direct IO is
going to have the same issue with invalidation of the mapped ranges
over the IO being done. XFS already WARNs when page cache
invalidation fails with EBUSY in direct IO, because that is
indicative of an application with a potential data corruption vector
and there's nothing we can do in the kernel code to prevent it.

I think the same issues also exist with DAX using huge (and giant)
pages. Hence it seems like we need to think about these interactions
carefully, because they will no longer be isolated to tmpfs and
THP...

> > Just to complicate things, keep in mind that some filesystems may
> > have a PAGE_SIZE block size, but can be convinced to only
> > allocate/punch/truncate/etc extents on larger alignments on a
> > per-inode basis. IOWs, THP vs hole punch behaviour is not actually
> > a filesystem type specific behaviour - it's per-inode specific...
>
> There is also similar question about THP vs. i_size vs. SIGBUS.
>
> For small pages an application will not get SIGBUS on mmap()ed file, until
> it wouldn't try to access beyond round_up(i_size, PAGE_CACHE_SIZE) - 1.
>
> For THP it would be round_up(i_size, HPAGE_PMD_SIZE) - 1.
>
> Is it a problem?

No idea. I'm guessing that there may be significant stale data
exposure issues here as filesystems do not guarantee that blocks
completely beyond EOF contain zeros.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-06 00:30:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Sun, Mar 06, 2016 at 09:38:11AM +1100, Dave Chinner wrote:
> On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote:
> > On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> > > > On Fri, 4 Mar 2016, Dave Hansen wrote:
> > > > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > > > >> Truncate and punch hole that only cover part of THP range is implemented
> > > > > >> by zero out this part of THP.
> > > > > >>
> > > > > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > > > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > > > > >> inconsistent results depending what pages happened to be allocated.
> > > > > >> Not sure if it should be considered ABI break or not.
> > > > > >
> > > > > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > > > >
> > > > > > Within the specified range, partial filesystem blocks are zeroed,
> > > > > > and whole filesystem blocks are removed from the file. After a
> > > > > > successful call, subsequent reads from this range will return
> > > > > > zeroes.
> > > > > >
> > > > > > It means we effectively have 2M filesystem block size.
> > > > >
> > > > > The question is still whether this will case problems for apps.
> > > > >
> > > > > Isn't 2MB a quote unusual block size? Wouldn't some files on a tmpfs
> > > > > filesystem act like they have a 2M blocksize and others like they have
> > > > > 4k? Would that confuse apps?
> > > >
> > > > At risk of addressing the tip of an iceberg, before diving down to
> > > > scope out the rest of the iceberg...
> > > ....
> > >
> > > > (Though in the case of my huge tmpfs, it's the reverse: the small hole
> > > > punch splits the hugepage; but it's natural that Kirill's way would try
> > > > to hold on to its compound pages for longer than I do, and that's fine
> > > > so long as it's all consistent.)
> > > ....
> > > > Ah, but suppose someone holepunches out most of each 2M page: they would
> > > > expect the memcg not to be charged for those holes (just as when they
> > > > munmap most of an anonymous THP) - that does suggest splitting is needed.
> > >
> > > I think filesystems will expect splitting to happen. They call
> > > truncate_pagecache_range() on the region that the hole is being
> > > punched out of, and they expect page cache pages over this range to
> > > be unmapped, invalidated and then removed from the mapping tree as a
> > > result. Also, most filesystems think the page cache only contains
> > > PAGE_CACHE_SIZE mappings, so they are completely unaware of the
> > > limitations THP might have when it comes to invalidation.
> > >
> > > IOWs, if this range is not aligned to huge page boundaries, then it
> > > implies the huge page is either split into PAGE_SIZE mappings and
> > > then the range is invalidated as expected, or it is completely
> > > invalidated and then refaulted on future accesses which determine if
> > > THP or normal pages are used for the page being faulted....
> >
> > The filesystem in question is tmpfs and complete invalidation is not
> > always an option.
>
> Then your two options are: splitting the page and rerunning the hole
> punch, or simply zeroing the sections of the THP rather than trying
> to punch out the backing store.

The second option is what is implemented at the moment, as splitting can fail.

> > For other filesystems it also can be unavailable
> > immediately if the page is dirty (the dirty flag is tracked on per-THP
> > basis at the moment).
>
> Filesystems with persistent storage flush the range being punched
> first to ensure that partial blocks are correctly written before we
> start freeing the backing store. This is needed on XFS to ensure
> hole punch plays nicely with delayed allocation and other extent
> based operations. Hence we know that we have clean pages over the
> hole we are about to punch and so there is no reason the
> invalidation should *ever* fail.

Okay. It means we have another option to consider when enabling THP for a
filesystem with persistent storage.

> tmpfs is a special snowflake when it comes to these fallocate based
> filesystem layout manipulation functions - it does not have
> persistent storage, so you have to do things very differently to
> ensure that data is not lost.
>
> > Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
> > -EBUSY (or other errno on your choice), if we cannot split the page
> > right away?
>
> Which means THP are not transparent any more. What does an
> application do when it gets an EBUSY, anyway?

I guess it's reasonable to expect an application to handle EOPNOTSUPP,
as FALLOC_FL_PUNCH_HOLE is not supported by some filesystems.
Although an inconsistent result from the same fd can be confusing.

For now, I would stick with clearing part of the THP on a partial
truncate or hole punch.

We can consider other options, like a deferred split, later.

> And it's not just hole punching that has this problem. Direct IO is
> going to have the same issue with invalidation of the mapped ranges
> over the IO being done. XFS already WARNs when page cache
> invalidation fails with EBUSY in direct IO, because that is
> indicative of an application with a potential data corruption vector
> and there's nothing we can do in the kernel code to prevent it.

My current understanding is that for filesystems with persistent storage,
in order to make THP useful at all, we would need to implement writeback
without splitting the huge page.
At the moment, I have no idea how hard that would be.

But it means we shouldn't be required to split a page in order to
invalidate the page cache, and therefore there is no risk of split failure.

> I think the same issues also exist with DAX using huge (and giant)
> pages. Hence it seems like we need to think about these interactions
> carefully, because they will no longer are isolated to tmpfs and
> THP...
>
> > > Just to complicate things, keep in mind that some filesystems may
> > > have a PAGE_SIZE block size, but can be convinced to only
> > > allocate/punch/truncate/etc extents on larger alignments on a
> > > per-inode basis. IOWs, THP vs hole punch behaviour is not actually
> > > a filesystem type specific behaviour - it's per-inode specific...
> >
> > There is also similar question about THP vs. i_size vs. SIGBUS.
> >
> > For small pages an application will not get SIGBUS on mmap()ed file, until
> > it wouldn't try to access beyond round_up(i_size, PAGE_CACHE_SIZE) - 1.
> >
> > For THP it would be round_up(i_size, HPAGE_PMD_SIZE) - 1.
> >
> > Is it a problem?
>
> No idea. I'm guessing that there may be significant stale data
> exposure issues here as filesystems do not guarantee that blocks
> completely beyond EOF contain zeros.

I don't know about blocks, but I think we can provide this guarantee at
the page cache level, as we do for small pages.

--
Kirill A. Shutemov

2016-03-06 23:03:57

by Dave Chinner

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Sun, Mar 06, 2016 at 03:30:34AM +0300, Kirill A. Shutemov wrote:
> On Sun, Mar 06, 2016 at 09:38:11AM +1100, Dave Chinner wrote:
> > On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote:
> > > Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
> > > -EBUSY (or other errno on your choice), if we cannot split the page
> > > right away?
> >
> > Which means THP are not transparent any more. What does an
> > application do when it gets an EBUSY, anyway?
>
> I guess it's reasonable to expect from an application to handle EOPNOTSUPP
> as FALLOC_FL_PUNCH_HOLE is not supported by some filesystems.

Yes, but this is usually done as a check at the program
initialisation to determine whether to issue hole punches at all.
It's not supposed to be a dynamic error.

> Although, non-consistent result from the same fd can be confusing.

Exactly.

> > And it's not just hole punching that has this problem. Direct IO is
> > going to have the same issue with invalidation of the mapped ranges
> > over the IO being done. XFS already WARNs when page cache
> > invalidation fails with EBUSY in direct IO, because that is
> > indicative of an application with a potential data corruption vector
> > and there's nothing we can do in the kernel code to prevent it.
>
> My current understanding is that for filesystems with persistent storage,
> in order to make THP any useful, we would need to implement writeback
> without splitting the huge page.

Algorithmically it is no different to filesystem block size < page
size writeback.

> At the moment, I have no idea how hard it would be..

THP support would effectively require us to remove PAGE_CACHE_SIZE
assumptions from all of the filesystem and buffer code. That's a
large chunk of work, e.g. fs/buffer.c and any filesystem that uses
bufferheads for tracking filesystem block state through the page
cache.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-06 23:33:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

On Mon, Mar 07, 2016 at 10:03:36AM +1100, Dave Chinner wrote:
> On Sun, Mar 06, 2016 at 03:30:34AM +0300, Kirill A. Shutemov wrote:
> > On Sun, Mar 06, 2016 at 09:38:11AM +1100, Dave Chinner wrote:
> > > And it's not just hole punching that has this problem. Direct IO is
> > > going to have the same issue with invalidation of the mapped ranges
> > > over the IO being done. XFS already WARNs when page cache
> > > invalidation fails with EBUSY in direct IO, because that is
> > > indicative of an application with a potential data corruption vector
> > > and there's nothing we can do in the kernel code to prevent it.
> >
> > My current understanding is that for filesystems with persistent storage,
> > in order to make THP any useful, we would need to implement writeback
> > without splitting the huge page.
>
> Algorithmically it is no different to filesytem block size < page
> size writeback.
>
> > At the moment, I have no idea how hard it would be..
>
> THP support would effectively require us to remove PAGE_CACHE_SIZE
> assumptions from all of the filesystem and buffer code. That's a
> large chunk of work e.g. fs/buffer.c and any filesystem that uses
> bufferheads for tracking filesystem block state through the page
> cache.

I'll try to learn more about the code before the summit.
I guess it's something worth discussing in person.

--
Kirill A. Shutemov