2016-04-16 23:27:12

by Hugh Dickins

Subject: [PATCH mmotm 1/5] huge tmpfs: try to allocate huge pages split into a team fix

Please replace the
huge-tmpfs-try-to-allocate-huge-pages-split-into-a-team-fix.patch
you added to your tree by this one: nothing wrong with Stephen's,
but in this case I think the source is better off if we simply
remove that BUILD_BUG() instead of adding an IS_ENABLED():
fixes build problem seen on arm when putting together linux-next.

Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 1 -
1 file changed, 1 deletion(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1744,7 +1744,6 @@ static inline struct page *shmem_hugetea

static inline void shmem_disband_hugeteam(struct page *page)
{
- BUILD_BUG();
}

static inline void shmem_added_to_hugeteam(struct page *page,


2016-04-16 23:29:54

by Hugh Dickins

Subject: [PATCH mmotm 2/5] huge tmpfs: fix mlocked meminfo track huge unhuge mlocks fix

Please add this fix after
huge-tmpfs-fix-mlocked-meminfo-track-huge-unhuge-mlocks.patch
for later merging into it. I expect this to fix a build problem found
by robot on an x86_64 randconfig. I was not able to reproduce the error,
but I'm growing to realize that different optimizers behave differently.

Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/rmap.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1445,8 +1445,12 @@ static int try_to_unmap_one(struct page
*/
if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
+ int nr_pages = 1;
+
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !pte)
+ nr_pages = HPAGE_PMD_NR;
/* Holding pte lock, we do *not* need mmap_sem here */
- mlock_vma_pages(page, pte ? 1 : HPAGE_PMD_NR);
+ mlock_vma_pages(page, nr_pages);
ret = SWAP_MLOCK;
goto out_unmap;
}

2016-04-16 23:33:12

by Hugh Dickins

Subject: [PATCH mmotm 3/5] huge tmpfs recovery: tweak shmem_getpage_gfp to fill team fix

Please add this fix after my 27/31, your
huge-tmpfs-recovery-tweak-shmem_getpage_gfp-to-fill-team.patch
for later merging into it. Great catch by Mika Penttila, a bug which
prevented some unusual cases from being recovered into huge pages as
intended: an initially sparse head would be set PageTeam only after
this check. But the check is guarding against a racing disband, which
cannot happen before the head is published as PageTeam, plus we have
an additional reference on the head which keeps it safe throughout:
so very easily fixed.

Reported-by: Mika Penttila <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2938,7 +2938,7 @@ repeat:
page = *pagep;
lock_page(page);
head = page - (index & (HPAGE_PMD_NR-1));
- if (!PageTeam(head)) {
+ if (!PageTeam(head) && page != head) {
error = -ENOENT;
goto decused;
}

2016-04-16 23:38:20

by Hugh Dickins

Subject: [PATCH mmotm 4/5] huge tmpfs: avoid premature exposure of new pagetable revert

This patch reverts all of my 09/31, your
huge-tmpfs-avoid-premature-exposure-of-new-pagetable.patch
and also the mm/memory.c changes from the patch after it,
huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes.patch

I've diffed this against the top of the tree, but it may be better to
throw this and huge-tmpfs-avoid-premature-exposure-of-new-pagetable.patch
away, and just delete the mm/memory.c part of the patch after it.

This is in preparation for 5/5, which replaces what was done here.
Why? Numerous reasons. Kirill was concerned that my movement of
map_pages from before to after fault would show performance regression.
Robot reported vm-scalability.throughput -5.5% regression, bisected to
the avoid premature exposure patch. Andrew was concerned about bloat
in mm/memory.o. Google had seen (on an earlier kernel) an OOM deadlock
from pagetable allocations being done while holding pagecache pagelock.

I thought I could deal with those later on, but the clincher came from
Xiong Zhou's report that it had broken binary execution from DAX mount.
Silly little oversight, but not as easily fixed as first appears, because
DAX now uses the i_mmap_rwsem to guard an extent from truncation: which
would be open to deadlock if pagetable allocation goes down to reclaim
(both are using only the read lock, but in danger of an rwr sandwich).

I've considered various alternative approaches, and what can be done
to get both DAX and huge tmpfs working again quickly. Eventually
arrived at the obvious: shmem should use the new pmd_fault().

Reported-by: kernel test robot <[email protected]>
Reported-by: Xiong Zhou <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/filemap.c | 10 --
mm/memory.c | 225 +++++++++++++++++++++----------------------------
2 files changed, 101 insertions(+), 134 deletions(-)

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2151,10 +2151,6 @@ void filemap_map_pages(struct vm_area_st
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
if (iter.index > vmf->max_pgoff)
break;
-
- pte = vmf->pte + iter.index - vmf->pgoff;
- if (!pte_none(*pte))
- goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2176,8 +2172,6 @@ repeat:
goto repeat;
}

- VM_BUG_ON_PAGE(page->index != iter.index, page);
-
if (!PageUptodate(page) ||
PageReadahead(page) ||
PageHWPoison(page))
@@ -2192,6 +2186,10 @@ repeat:
if (page->index >= size >> PAGE_SHIFT)
goto unlock;

+ pte = vmf->pte + page->index - vmf->pgoff;
+ if (!pte_none(*pte))
+ goto unlock;
+
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -45,7 +45,6 @@
#include <linux/swap.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
-#include <linux/pageteam.h>
#include <linux/ksm.h>
#include <linux/rmap.h>
#include <linux/export.h>
@@ -2718,17 +2717,20 @@ static inline int check_stack_guard_page

/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults). We return with mmap_sem still held.
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ unsigned int flags)
{
struct mem_cgroup *memcg;
- pte_t *page_table;
struct page *page;
spinlock_t *ptl;
pte_t entry;

+ pte_unmap(page_table);
+
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
@@ -2737,27 +2739,6 @@ static int do_anonymous_page(struct mm_s
if (check_stack_guard_page(vma, address) < 0)
return VM_FAULT_SIGSEGV;

- /*
- * Use pte_alloc instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pte_alloc(mm, pmd, address)))
- return VM_FAULT_OOM;
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
- * didn't become pmd_trans_huge under us and then back to pmd_none, as
- * a result of MADV_DONTNEED running immediately after a huge pmd fault
- * in a different thread of this mm, in turn leading to a misleading
- * pmd_trans_huge() retval. All we have to ensure is that it is a
- * regular pmd that we can walk with pte_offset_map() and we can do that
- * through an atomic read in C, which is what pmd_trans_unstable()
- * provides.
- */
- if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
- return 0;
-
/* Use the zero-page for reads */
if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
@@ -2836,8 +2817,8 @@ oom:
* See filemap_fault() and __lock_page_retry().
*/
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, pgoff_t pgoff, unsigned int flags,
- struct page *cow_page, struct page **page)
+ pgoff_t pgoff, unsigned int flags,
+ struct page *cow_page, struct page **page)
{
struct vm_fault vmf;
int ret;
@@ -2849,20 +2830,17 @@ static int __do_fault(struct vm_area_str
vmf.gfp_mask = __get_fault_gfp_mask(vma);
vmf.cow_page = cow_page;

- /*
- * Give huge pmd a chance before allocating pte or trying fault around.
- */
- if (unlikely(pmd_none(*pmd)))
- vmf.flags |= FAULT_FLAG_MAY_HUGE;
-
ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
if (!vmf.page)
goto out;
- if (unlikely(ret & VM_FAULT_HUGE)) {
- ret |= map_team_by_pmd(vma, address, pmd, vmf.page);
- return ret;
+
+ if (unlikely(PageHWPoison(vmf.page))) {
+ if (ret & VM_FAULT_LOCKED)
+ unlock_page(vmf.page);
+ put_page(vmf.page);
+ return VM_FAULT_HWPOISON;
}

if (unlikely(!(ret & VM_FAULT_LOCKED)))
@@ -2870,35 +2848,9 @@ static int __do_fault(struct vm_area_str
else
VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);

- if (unlikely(PageHWPoison(vmf.page))) {
- ret = VM_FAULT_HWPOISON;
- goto err;
- }
-
- /*
- * Use pte_alloc instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pte_alloc(vma->vm_mm, pmd, address))) {
- ret = VM_FAULT_OOM;
- goto err;
- }
- /*
- * If a huge pmd materialized under us just retry later. Allow for
- * a racing transition of huge pmd to none to huge pmd or pagetable.
- */
- if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd))) {
- ret = VM_FAULT_NOPAGE;
- goto err;
- }
out:
*page = vmf.page;
return ret;
-err:
- unlock_page(vmf.page);
- put_page(vmf.page);
- return ret;
}

/**
@@ -3048,19 +3000,32 @@ static void do_fault_around(struct vm_ar

static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags)
+ pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
struct page *fault_page;
spinlock_t *ptl;
pte_t *pte;
- int ret;
+ int ret = 0;

- ret = __do_fault(vma, address, pmd, pgoff, flags, NULL, &fault_page);
+ /*
+ * Let's call ->map_pages() first and use ->fault() as fallback
+ * if page by the offset is not ready to be mapped (cold cache or
+ * something).
+ */
+ if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ do_fault_around(vma, address, pte, pgoff, flags);
+ if (!pte_same(*pte, orig_pte))
+ goto unlock_out;
+ pte_unmap_unlock(pte, ptl);
+ }
+
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_none(*pte))) {
+ if (unlikely(!pte_same(*pte, orig_pte))) {
pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
put_page(fault_page);
@@ -3068,20 +3033,14 @@ static int do_read_fault(struct mm_struc
}
do_set_pte(vma, address, fault_page, pte, false, false);
unlock_page(fault_page);
-
- /*
- * Finally call ->map_pages() to fault around the pte we just set.
- */
- if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1)
- do_fault_around(vma, address, pte, pgoff, flags);
-
+unlock_out:
pte_unmap_unlock(pte, ptl);
return ret;
}

static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags)
+ pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
struct page *fault_page, *new_page;
struct mem_cgroup *memcg;
@@ -3101,7 +3060,7 @@ static int do_cow_fault(struct mm_struct
return VM_FAULT_OOM;
}

- ret = __do_fault(vma, address, pmd, pgoff, flags, new_page, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

@@ -3110,7 +3069,7 @@ static int do_cow_fault(struct mm_struct
__SetPageUptodate(new_page);

pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_none(*pte))) {
+ if (unlikely(!pte_same(*pte, orig_pte))) {
pte_unmap_unlock(pte, ptl);
if (fault_page) {
unlock_page(fault_page);
@@ -3147,7 +3106,7 @@ uncharge_out:

static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags)
+ pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
struct page *fault_page;
struct address_space *mapping;
@@ -3156,7 +3115,7 @@ static int do_shared_fault(struct mm_str
int dirtied = 0;
int ret, tmp;

- ret = __do_fault(vma, address, pmd, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

@@ -3175,7 +3134,7 @@ static int do_shared_fault(struct mm_str
}

pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_none(*pte))) {
+ if (unlikely(!pte_same(*pte, orig_pte))) {
pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
put_page(fault_page);
@@ -3215,18 +3174,22 @@ static int do_shared_fault(struct mm_str
* return value. See filemap_fault() and __lock_page_or_retry().
*/
static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ unsigned int flags, pte_t orig_pte)
{
pgoff_t pgoff = linear_page_index(vma, address);

+ pte_unmap(page_table);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
if (!(flags & FAULT_FLAG_WRITE))
- return do_read_fault(mm, vma, address, pmd, pgoff, flags);
+ return do_read_fault(mm, vma, address, pmd, pgoff, flags,
+ orig_pte);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(mm, vma, address, pmd, pgoff, flags);
- return do_shared_fault(mm, vma, address, pmd, pgoff, flags);
+ return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
+ orig_pte);
+ return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3354,7 +3317,6 @@ static int wp_huge_pmd(struct mm_struct
return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
if (vma->vm_ops->pmd_fault)
return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
- remap_team_by_ptes(vma, address, pmd);
return VM_FAULT_FALLBACK;
}

@@ -3367,49 +3329,20 @@ static int wp_huge_pmd(struct mm_struct
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * We enter with non-exclusive mmap_sem
- * (to exclude vma changes, but allow concurrent faults).
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with pte unmapped and unlocked.
+ *
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+static int handle_pte_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pmd_t *pmd, unsigned int flags)
{
- pmd_t pmdval;
- pte_t *pte;
pte_t entry;
spinlock_t *ptl;

- /* If a huge pmd materialized under us just retry later */
- pmdval = *pmd;
- barrier();
- if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
- return 0;
-
- if (unlikely(pmd_none(pmdval))) {
- /*
- * Leave pte_alloc() until later: because huge tmpfs may
- * want to map_team_by_pmd(), and if we expose page table
- * for an instant, it will be difficult to retract from
- * concurrent faults and from rmap lookups.
- */
- pte = NULL;
- } else {
- /*
- * A regular pmd is established and it can't morph into a huge
- * pmd from under us anymore at this point because we hold the
- * mmap_sem read mode and khugepaged takes it in write mode.
- * So now it's safe to run pte_offset_map().
- */
- pte = pte_offset_map(pmd, address);
- entry = *pte;
- barrier();
- if (pte_none(entry)) {
- pte_unmap(pte);
- pte = NULL;
- }
- }
-
/*
* some architectures can have larger ptes than wordsize,
* e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
@@ -3418,14 +3351,21 @@ static int handle_pte_fault(struct mm_st
* we later double check anyway with the ptl lock held. So here
* a barrier will do.
*/
-
- if (!pte) {
- if (!vma_is_anonymous(vma))
- return do_fault(mm, vma, address, pmd, flags);
- return do_anonymous_page(mm, vma, address, pmd, flags);
+ entry = *pte;
+ barrier();
+ if (!pte_present(entry)) {
+ if (pte_none(entry)) {
+ if (vma_is_anonymous(vma))
+ return do_anonymous_page(mm, vma, address,
+ pte, pmd, flags);
+ else
+ return do_fault(mm, vma, address, pte, pmd,
+ flags, entry);
+ }
+ return do_swap_page(mm, vma, address,
+ pte, pmd, flags, entry);
}
- if (!pte_present(entry))
- return do_swap_page(mm, vma, address, pte, pmd, flags, entry);
+
if (pte_protnone(entry))
return do_numa_page(mm, vma, address, entry, pte, pmd);

@@ -3469,6 +3409,7 @@ static int __handle_mm_fault(struct mm_s
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
+ pte_t *pte;

if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
@@ -3514,7 +3455,35 @@ static int __handle_mm_fault(struct mm_s
}
}

- return handle_pte_fault(mm, vma, address, pmd, flags);
+ /*
+ * Use pte_alloc() instead of pte_alloc_map, because we can't
+ * run pte_offset_map on the pmd, if an huge pmd could
+ * materialize from under us from a different thread.
+ */
+ if (unlikely(pte_alloc(mm, pmd, address)))
+ return VM_FAULT_OOM;
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+ * didn't become pmd_trans_huge under us and then back to pmd_none, as
+ * a result of MADV_DONTNEED running immediately after a huge pmd fault
+ * in a different thread of this mm, in turn leading to a misleading
+ * pmd_trans_huge() retval. All we have to ensure is that it is a
+ * regular pmd that we can walk with pte_offset_map() and we can do that
+ * through an atomic read in C, which is what pmd_trans_unstable()
+ * provides.
+ */
+ if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
+ return 0;
+ /*
+ * A regular pmd is established and it can't morph into a huge pmd
+ * from under us anymore at this point because we hold the mmap_sem
+ * read mode and khugepaged takes it in write mode. So now it's
+ * safe to run pte_offset_map().
+ */
+ pte = pte_offset_map(pmd, address);
+
+ return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

/*

2016-04-16 23:41:38

by Hugh Dickins

Subject: [PATCH mmotm 5/5] huge tmpfs: add shmem_pmd_fault()

The pmd_fault() method gives the filesystem an opportunity to place
a trans huge pmd entry at *pmd, before any pagetable is exposed (and
an opportunity to split it on COW fault): now use it for huge tmpfs.

This patch is a little raw: with more time before LSF/MM, I would
probably want to dress it up better - the shmem_mapping() calls look
a bit ugly; it's odd to want FAULT_FLAG_MAY_HUGE and VM_FAULT_HUGE just
for a private conversation between shmem_fault() and shmem_pmd_fault();
and there might be a better distribution of work between those two, but
prising apart that series of huge tests is not to be done in a hurry.

Good for now, presents the new way, but might be improved later.

This patch still leaves the huge tmpfs map_team_by_pmd() allocating a
pagetable while holding page lock, but other filesystems are no longer
doing so; and we've not yet settled whether huge tmpfs should (like anon
THP) or should not (like DAX) participate in deposit/withdraw protocol.

Signed-off-by: Hugh Dickins <[email protected]>
---
I've been testing with this applied on top of mmotm plus 1-4/5,
but I suppose the right place for it is immediately after
huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes.patch
with a view to perhaps merging it into that in the future.

mm/huge_memory.c | 4 ++--
mm/memory.c | 13 +++++++++----
mm/shmem.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 44 insertions(+), 6 deletions(-)

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3084,7 +3084,7 @@ void __split_huge_pmd(struct vm_area_str
struct mm_struct *mm = vma->vm_mm;
unsigned long haddr = address & HPAGE_PMD_MASK;

- if (!vma_is_anonymous(vma) && !vma->vm_ops->pmd_fault) {
+ if (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)) {
remap_team_by_ptes(vma, address, pmd);
return;
}
@@ -3622,7 +3622,7 @@ int map_team_by_pmd(struct vm_area_struc
pgtable_t pgtable;
spinlock_t *pml;
pmd_t pmdval;
- int ret = VM_FAULT_NOPAGE;
+ int ret = 0;

/*
* Another task may have mapped it in just ahead of us; but we
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3410,6 +3410,7 @@ static int __handle_mm_fault(struct mm_s
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ int ret = 0;

if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
@@ -3426,13 +3427,16 @@ static int __handle_mm_fault(struct mm_s
pmd = pmd_alloc(mm, pud, address);
if (!pmd)
return VM_FAULT_OOM;
- if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
- int ret = create_huge_pmd(mm, vma, address, pmd, flags);
+
+ if (pmd_none(*pmd) &&
+ (transparent_hugepage_enabled(vma) ||
+ (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)))) {
+ ret = create_huge_pmd(mm, vma, address, pmd, flags);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
+ ret &= VM_FAULT_MAJOR;
} else {
pmd_t orig_pmd = *pmd;
- int ret;

barrier();
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
@@ -3447,6 +3451,7 @@ static int __handle_mm_fault(struct mm_s
orig_pmd, flags);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
+ ret = 0;
} else {
huge_pmd_set_accessed(mm, vma, address, pmd,
orig_pmd, dirty);
@@ -3483,7 +3488,7 @@ static int __handle_mm_fault(struct mm_s
*/
pte = pte_offset_map(pmd, address);

- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+ return ret | handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

/*
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3223,6 +3223,36 @@ single:
return ret | VM_FAULT_LOCKED | VM_FAULT_HUGE;
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int shmem_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned int flags)
+{
+ struct vm_fault vmf;
+ int ret;
+
+ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+ /* Copy On Write: don't insert huge pmd; or split if already */
+ if (pmd_trans_huge(*pmd))
+ remap_team_by_ptes(vma, address, pmd);
+ return VM_FAULT_FALLBACK;
+ }
+
+ vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.pgoff = linear_page_index(vma, address);
+ vmf.flags = flags | FAULT_FLAG_MAY_HUGE;
+
+ ret = shmem_fault(vma, &vmf);
+ if (ret & VM_FAULT_HUGE)
+ return ret | map_team_by_pmd(vma, address, pmd, vmf.page);
+ if (ret & VM_FAULT_ERROR)
+ return ret;
+
+ unlock_page(vmf.page);
+ put_page(vmf.page);
+ return ret | VM_FAULT_FALLBACK;
+}
+#endif
+
unsigned long shmem_get_unmapped_area(struct file *file,
unsigned long uaddr, unsigned long len,
unsigned long pgoff, unsigned long flags)
@@ -5129,6 +5159,9 @@ static const struct super_operations shm

static const struct vm_operations_struct shmem_vm_ops = {
.fault = shmem_fault,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ .pmd_fault = shmem_pmd_fault,
+#endif
.map_pages = filemap_map_pages,
#ifdef CONFIG_NUMA
.set_policy = shmem_set_policy,

2016-04-17 00:46:32

by Kirill A. Shutemov

Subject: Re: [PATCH mmotm 5/5] huge tmpfs: add shmem_pmd_fault()

On Sat, Apr 16, 2016 at 04:41:33PM -0700, Hugh Dickins wrote:
> The pmd_fault() method gives the filesystem an opportunity to place
> a trans huge pmd entry at *pmd, before any pagetable is exposed (and
> an opportunity to split it on COW fault): now use it for huge tmpfs.
>
> This patch is a little raw: with more time before LSF/MM, I would
> probably want to dress it up better - the shmem_mapping() calls look
> a bit ugly; it's odd to want FAULT_FLAG_MAY_HUGE and VM_FAULT_HUGE just
> for a private conversation between shmem_fault() and shmem_pmd_fault();
> and there might be a better distribution of work between those two, but
> prising apart that series of huge tests is not to be done in a hurry.
>
> Good for now, presents the new way, but might be improved later.
>
> This patch still leaves the huge tmpfs map_team_by_pmd() allocating a
> pagetable while holding page lock, but other filesystems are no longer
> doing so; and we've not yet settled whether huge tmpfs should (like anon
> THP) or should not (like DAX) participate in deposit/withdraw protocol.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Just for the record: I don't like the ->pmd_fault() approach because it
results in two requests to the filesystem (two shmem_fault() calls in this
case) if we don't have a huge page to map: one for the huge page (which
fails) and then one for a small page. I think this case will be rather
common: all mounts without huge pages enabled. I expect a performance
regression from this too.

--
Kirill A. Shutemov

2016-04-17 01:21:48

by Hugh Dickins

Subject: Re: [PATCH mmotm 5/5] huge tmpfs: add shmem_pmd_fault()

On Sun, 17 Apr 2016, Kirill A. Shutemov wrote:
> On Sat, Apr 16, 2016 at 04:41:33PM -0700, Hugh Dickins wrote:
> > The pmd_fault() method gives the filesystem an opportunity to place
> > a trans huge pmd entry at *pmd, before any pagetable is exposed (and
> > an opportunity to split it on COW fault): now use it for huge tmpfs.
> >
> > This patch is a little raw: with more time before LSF/MM, I would
> > probably want to dress it up better - the shmem_mapping() calls look
> > a bit ugly; it's odd to want FAULT_FLAG_MAY_HUGE and VM_FAULT_HUGE just
> > for a private conversation between shmem_fault() and shmem_pmd_fault();
> > and there might be a better distribution of work between those two, but
> > prising apart that series of huge tests is not to be done in a hurry.
> >
> > Good for now, presents the new way, but might be improved later.
> >
> > This patch still leaves the huge tmpfs map_team_by_pmd() allocating a
> > pagetable while holding page lock, but other filesystems are no longer
> > doing so; and we've not yet settled whether huge tmpfs should (like anon
> > THP) or should not (like DAX) participate in deposit/withdraw protocol.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> Just for record: I don't like ->pmd_fault() approach because it results in
> two requests to file system (two shmem_fault() in this case) if we don't
> have a huge page to map: one for huge page (failed) and then one for small.
> I think this case should be rather common: all mounts without huge pages
> enabled. I expect performance regression from this too.

Yes, I did consider that when making the switchover. But it's only
when pmd_none(*pmd), not the other 511 times; and the caches have been
primed for the pte fallback. So I didn't expect it to matter, and to be
outweighed by having map_pages() back in its old position. Ah, you'll
point out that map_pages() makes it a smaller ratio than 511:1.

But if someone speeds up pmd_fault(), or replaces it by a better strategy,
so much the better - I found it a little odd, doing two very different
things, one of which (splitting) must be done in a non-fault context too.

Anyway, I await judgement from the robot.

And note your point about regressing mounts without huge pages enabled:
maybe I should add an early VM_FAULT_FALLBACK for that case, or perhaps
it will end up in the vma flags instead of my shmem_mapping() check.

Hugh

2016-04-20 23:45:07

by Stephen Rothwell

Subject: Re: [PATCH mmotm 1/5] huge tmpfs: try to allocate huge pages split into a team fix

Hi Hugh,

On Sat, 16 Apr 2016 16:27:02 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>
> Please replace the
> huge-tmpfs-try-to-allocate-huge-pages-split-into-a-team-fix.patch
> you added to your tree by this one: nothing wrong with Stephen's,
> but in this case I think the source is better off if we simply
> remove that BUILD_BUG() instead of adding an IS_ENABLED():
> fixes build problem seen on arm when putting together linux-next.
>
> Reported-by: Stephen Rothwell <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/shmem.c | 1 -
> 1 file changed, 1 deletion(-)
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1744,7 +1744,6 @@ static inline struct page *shmem_hugetea
>
> static inline void shmem_disband_hugeteam(struct page *page)
> {
> - BUILD_BUG();
> }
>
> static inline void shmem_added_to_hugeteam(struct page *page,

I have replaced my fix with the above in today's linux-next.

--
Cheers,
Stephen Rothwell

2016-04-20 23:48:14

by Stephen Rothwell

Subject: Re: [PATCH mmotm 2/5] huge tmpfs: fix mlocked meminfo track huge unhuge mlocks fix

Hi Hugh,

On Sat, 16 Apr 2016 16:29:44 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>
> Please add this fix after
> huge-tmpfs-fix-mlocked-meminfo-track-huge-unhuge-mlocks.patch
> for later merging into it. I expect this to fix a build problem found
> by robot on an x86_64 randconfig. I was not able to reproduce the error,
> but I'm growing to realize that different optimizers behave differently.
>
> Reported-by: kbuild test robot <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/rmap.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1445,8 +1445,12 @@ static int try_to_unmap_one(struct page
> */
> if (!(flags & TTU_IGNORE_MLOCK)) {
> if (vma->vm_flags & VM_LOCKED) {
> + int nr_pages = 1;
> +
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !pte)
> + nr_pages = HPAGE_PMD_NR;
> /* Holding pte lock, we do *not* need mmap_sem here */
> - mlock_vma_pages(page, pte ? 1 : HPAGE_PMD_NR);
> + mlock_vma_pages(page, nr_pages);
> ret = SWAP_MLOCK;
> goto out_unmap;
> }

Added to linux-next today.

--
Cheers,
Stephen Rothwell

2016-04-20 23:50:24

by Stephen Rothwell

Subject: Re: [PATCH mmotm 3/5] huge tmpfs recovery: tweak shmem_getpage_gfp to fill team fix

Hi Hugh,

On Sat, 16 Apr 2016 16:33:07 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>
> Please add this fix after my 27/31, your
> huge-tmpfs-recovery-tweak-shmem_getpage_gfp-to-fill-team.patch
> for later merging into it. Great catch by Mika Penttila, a bug which
> prevented some unusual cases from being recovered into huge pages as
> intended: an initially sparse head would be set PageTeam only after
> this check. But the check is guarding against a racing disband, which
> cannot happen before the head is published as PageTeam, plus we have
> an additional reference on the head which keeps it safe throughout:
> so very easily fixed.
>
> Reported-by: Mika Penttila <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/shmem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2938,7 +2938,7 @@ repeat:
> page = *pagep;
> lock_page(page);
> head = page - (index & (HPAGE_PMD_NR-1));
> - if (!PageTeam(head)) {
> + if (!PageTeam(head) && page != head) {
> error = -ENOENT;
> goto decused;
> }

Added to linux-next today.

--
Cheers,
Stephen Rothwell

2016-04-20 23:55:59

by Stephen Rothwell

Subject: Re: [PATCH mmotm 4/5] huge tmpfs: avoid premature exposure of new pagetable revert

Hi Hugh,

On Sat, 16 Apr 2016 16:38:15 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>
> This patch reverts all of my 09/31, your
> huge-tmpfs-avoid-premature-exposure-of-new-pagetable.patch
> and also the mm/memory.c changes from the patch after it,
> huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes.patch
>
> I've diffed this against the top of the tree, but it may be better to
> throw this and huge-tmpfs-avoid-premature-exposure-of-new-pagetable.patch
> away, and just delete the mm/memory.c part of the patch after it.
>
> This is in preparation for 5/5, which replaces what was done here.
> Why? Numerous reasons. Kirill was concerned that my movement of
> map_pages from before to after fault would show performance regression.
> Robot reported vm-scalability.throughput -5.5% regression, bisected to
> the avoid premature exposure patch. Andrew was concerned about bloat
> in mm/memory.o. Google had seen (on an earlier kernel) an OOM deadlock
> from pagetable allocations being done while holding pagecache pagelock.
>
> I thought I could deal with those later on, but the clincher came from
> Xiong Zhou's report that it had broken binary execution from DAX mount.
> Silly little oversight, but not as easily fixed as first appears, because
> DAX now uses the i_mmap_rwsem to guard an extent from truncation: which
> would be open to deadlock if pagetable allocation goes down to reclaim
> (both are using only the read lock, but in danger of an rwr sandwich).
>
> I've considered various alternative approaches, and what can be done
> to get both DAX and huge tmpfs working again quickly. Eventually
> arrived at the obvious: shmem should use the new pmd_fault().
>
> Reported-by: kernel test robot <[email protected]>
> Reported-by: Xiong Zhou <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/filemap.c | 10 --
> mm/memory.c | 225 +++++++++++++++++++++----------------------------
> 2 files changed, 101 insertions(+), 134 deletions(-)

I added this at the end of mmotm in linux-next today. I will leave
Andrew to sort it out later.

--
Cheers,
Stephen Rothwell

2016-04-20 23:56:36

by Stephen Rothwell

Subject: Re: [PATCH mmotm 5/5] huge tmpfs: add shmem_pmd_fault()

Hi Hugh,

On Sat, 16 Apr 2016 16:41:33 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>
> The pmd_fault() method gives the filesystem an opportunity to place
> a trans huge pmd entry at *pmd, before any pagetable is exposed (and
> an opportunity to split it on COW fault): now use it for huge tmpfs.
>
> This patch is a little raw: with more time before LSF/MM, I would
> probably want to dress it up better - the shmem_mapping() calls look
> a bit ugly; it's odd to want FAULT_FLAG_MAY_HUGE and VM_FAULT_HUGE just
> for a private conversation between shmem_fault() and shmem_pmd_fault();
> and there might be a better distribution of work between those two, but
> prising apart that series of huge tests is not to be done in a hurry.
>
> Good for now, presents the new way, but might be improved later.
>
> This patch still leaves the huge tmpfs map_team_by_pmd() allocating a
> pagetable while holding page lock, but other filesystems are no longer
> doing so; and we've not yet settled whether huge tmpfs should (like anon
> THP) or should not (like DAX) participate in deposit/withdraw protocol.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> I've been testing with this applied on top of mmotm plus 1-4/5,
> but I suppose the right place for it is immediately after
> huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes.patch
> with a view to perhaps merging it into that in the future.
>
> mm/huge_memory.c | 4 ++--
> mm/memory.c | 13 +++++++++----
> mm/shmem.c | 33 +++++++++++++++++++++++++++++++++
> 3 files changed, 44 insertions(+), 6 deletions(-)

I added this to the end of mmotm in linux-next today.

--
Cheers,
Stephen Rothwell