2016-11-07 08:35:32

by Aneesh Kumar K.V

Subject: [PATCH 1/2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

Architectures like ppc64 want to use the page table deposit/withdraw
mechanism even with huge pmd dax entries. Allow the arch to override the
vma_is_anonymous() check by moving it into the pmd_move_must_withdraw()
function.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/asm-generic/pgtable.h | 12 ------------
mm/huge_memory.c | 17 +++++++++++++++--
2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index c4f8fd2fd384..324990273ad2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -653,18 +653,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
}
#endif

-#ifndef pmd_move_must_withdraw
-static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
- spinlock_t *old_pmd_ptl)
-{
- /*
- * With split pmd lock we also need to move preallocated
- * PTE page table if new_pmd is on different PMD page table.
- */
- return new_pmd_ptl != old_pmd_ptl;
-}
-#endif
-
/*
* This function is meant to be used by sites walking pagetables with
* the mmap_sem hold in read mode to protect against MADV_DONTNEED and
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25cb30fe..1ac1b0ca63c4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1424,6 +1424,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return 1;
}

+#ifndef pmd_move_must_withdraw
+static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
+ spinlock_t *old_pmd_ptl)
+{
+ /*
+ * With split pmd lock we also need to move preallocated
+ * PTE page table if new_pmd is on different PMD page table.
+ *
+ * We also don't deposit and withdraw tables for file pages.
+ */
+ return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+}
+#endif
+
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd)
@@ -1458,8 +1472,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
VM_BUG_ON(!pmd_none(*new_pmd));

- if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
- vma_is_anonymous(vma)) {
+ if (pmd_move_must_withdraw(new_ptl, old_ptl)) {
pgtable_t pgtable;
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
--
2.10.2


2016-11-07 08:35:37

by Aneesh Kumar K.V

Subject: [PATCH 2/2] mm: THP page cache support for ppc64

Add an arch-specific callback in the generic THP page cache code that will
deposit and withdraw the preallocated page table. Archs like ppc64 use
this preallocated table to store the hash pte slot information.

Testing:
A kernel build with the patch series applied, run on tmpfs mounted with the option huge=always.

The related thp stat:
thp_fault_alloc 72939
thp_fault_fallback 60547
thp_collapse_alloc 603
thp_collapse_alloc_failed 0
thp_file_alloc 253763
thp_file_mapped 4251
thp_split_page 51518
thp_split_page_failed 1
thp_deferred_split_page 73566
thp_split_pmd 665
thp_zero_page_alloc 3
thp_zero_page_alloc_failed 0

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
NOTE: I decided not to hide the arch_needs_pgtable_deposit() check
in zap_deposited_table(), because IMHO that hides the new code path
and makes following this special case confusing.

arch/powerpc/include/asm/book3s/64/pgtable.h | 10 +++++
include/asm-generic/pgtable.h | 3 ++
mm/Kconfig | 6 +--
mm/huge_memory.c | 17 +++++++++
mm/khugepaged.c | 21 ++++++++++-
mm/memory.c | 56 +++++++++++++++++++++++-----
6 files changed, 96 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 9fd77f8794a0..9eed9d66d24f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1020,6 +1020,16 @@ static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
*/
return true;
}
+
+
+#define arch_needs_pgtable_deposit arch_needs_pgtable_deposit
+static inline bool arch_needs_pgtable_deposit(void)
+{
+ if (radix_enabled())
+ return false;
+ return true;
+}
+
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 324990273ad2..e00e3b7cf6a8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -653,6 +653,9 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
}
#endif

+#ifndef arch_needs_pgtable_deposit
+#define arch_needs_pgtable_deposit() (false)
+#endif
/*
* This function is meant to be used by sites walking pagetables with
* the mmap_sem hold in read mode to protect against MADV_DONTNEED and
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11fa0d9..0a279d399722 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -447,13 +447,9 @@ choice
benefit.
endchoice

-#
-# We don't deposit page tables on file THP mapping,
-# but Power makes use of them to address MMU quirk.
-#
config TRANSPARENT_HUGE_PAGECACHE
def_bool y
- depends on TRANSPARENT_HUGEPAGE && !PPC
+ depends on TRANSPARENT_HUGEPAGE

#
# UP and nommu archs use km based percpu allocator
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1ac1b0ca63c4..00d2d8b8a00d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,6 +1377,15 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return ret;
}

+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+ pgtable_t pgtable;
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pte_free(mm, pgtable);
+ atomic_long_dec(&mm->nr_ptes);
+}
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
@@ -1416,6 +1425,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
atomic_long_dec(&tlb->mm->nr_ptes);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
} else {
+ if (arch_needs_pgtable_deposit())
+ zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
}
spin_unlock(ptl);
@@ -1594,6 +1605,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

if (!vma_is_anonymous(vma)) {
_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ /*
+ * We are going to unmap this huge page. So
+ * just go ahead and zap it
+ */
+ if (arch_needs_pgtable_deposit())
+ zap_deposited_table(mm, pmd);
if (vma_is_dax(vma))
return;
page = pmd_page(_pmd);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 728d7790dc2d..9fb7b275cb63 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1240,6 +1240,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
struct vm_area_struct *vma;
unsigned long addr;
pmd_t *pmd, _pmd;
+ bool deposited = false;

i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1264,10 +1265,26 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
/* assume page table is clear */
_pmd = pmdp_collapse_flush(vma, addr, pmd);
+ /*
+ * now deposit the pgtable for arch that need it
+ * otherwise free it.
+ */
+ if (arch_needs_pgtable_deposit()) {
+ /*
+ * The deposit should be visibile only after
+ * collapse is seen by others.
+ */
+ smp_wmb();
+ pgtable_trans_huge_deposit(vma->vm_mm, pmd,
+ pmd_pgtable(_pmd));
+ deposited = true;
+ }
spin_unlock(ptl);
up_write(&vma->vm_mm->mmap_sem);
- atomic_long_dec(&vma->vm_mm->nr_ptes);
- pte_free(vma->vm_mm, pmd_pgtable(_pmd));
+ if (!deposited) {
+ atomic_long_dec(&vma->vm_mm->nr_ptes);
+ pte_free(vma->vm_mm, pmd_pgtable(_pmd));
+ }
}
}
i_mmap_unlock_write(mapping);
diff --git a/mm/memory.c b/mm/memory.c
index e18c57bdc75c..6659e8f439c9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2939,6 +2939,19 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
return true;
}

+static void deposit_prealloc_pte(struct fault_env *fe)
+{
+ struct vm_area_struct *vma = fe->vma;
+
+ pgtable_trans_huge_deposit(vma->vm_mm, fe->pmd, fe->prealloc_pte);
+ /*
+ * We are going to consume the prealloc table,
+ * count that as nr_ptes.
+ */
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ fe->prealloc_pte = 0;
+}
+
static int do_set_pmd(struct fault_env *fe, struct page *page)
{
struct vm_area_struct *vma = fe->vma;
@@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)
ret = VM_FAULT_FALLBACK;
page = compound_head(page);

+ /*
+ * Archs like ppc64 need additonal space to store information
+ * related to pte entry. Use the preallocated table for that.
+ */
+ if (arch_needs_pgtable_deposit() && !fe->prealloc_pte)
+ fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
+
fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
if (unlikely(!pmd_none(*fe->pmd)))
goto out;
@@ -2966,6 +2986,11 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)

add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR);
page_add_file_rmap(page, true);
+ /*
+ * deposit and withdraw with pmd lock held
+ */
+ if (arch_needs_pgtable_deposit())
+ deposit_prealloc_pte(fe);

set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);

@@ -2975,6 +3000,13 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)
ret = 0;
count_vm_event(THP_FILE_MAPPED);
out:
+ /*
+ * If we are going to fallback to pte mapping, do a
+ * withdraw with pmd lock held.
+ */
+ if (arch_needs_pgtable_deposit() && (ret == VM_FAULT_FALLBACK))
+ fe->prealloc_pte = pgtable_trans_huge_withdraw(vma->vm_mm,
+ fe->pmd);
spin_unlock(fe->ptl);
return ret;
}
@@ -3014,18 +3046,20 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,

ret = do_set_pmd(fe, page);
if (ret != VM_FAULT_FALLBACK)
- return ret;
+ goto fault_handled;
}

if (!fe->pte) {
ret = pte_alloc_one_map(fe);
if (ret)
- return ret;
+ goto fault_handled;
}

/* Re-check under ptl */
- if (unlikely(!pte_none(*fe->pte)))
- return VM_FAULT_NOPAGE;
+ if (unlikely(!pte_none(*fe->pte))) {
+ ret = VM_FAULT_NOPAGE;
+ goto fault_handled;
+ }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3045,8 +3079,15 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,

/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, fe->address, fe->pte);
+ ret = 0;

- return 0;
+fault_handled:
+ /* preallocated pagetable is unused: free it */
+ if (fe->prealloc_pte) {
+ pte_free(fe->vma->vm_mm, fe->prealloc_pte);
+ fe->prealloc_pte = 0;
+ }
+ return ret;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3145,11 +3186,6 @@ static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)

fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);

- /* preallocated pagetable is unused: free it */
- if (fe->prealloc_pte) {
- pte_free(fe->vma->vm_mm, fe->prealloc_pte);
- fe->prealloc_pte = 0;
- }
/* Huge page is mapped? Page fault is solved */
if (pmd_trans_huge(*fe->pmd)) {
ret = VM_FAULT_NOPAGE;
--
2.10.2

2016-11-07 09:14:22

by Hillf Danton

Subject: Re: [PATCH 1/2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

On Monday, November 07, 2016 4:35 PM Aneesh Kumar K.V
>
> Architectures like ppc64 want to use page table deposit/withraw
> even with huge pmd dax entries. Allow arch to override the
> vma_is_anonymous check by moving that to pmd_move_must_withdraw
> function
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> include/asm-generic/pgtable.h | 12 ------------
> mm/huge_memory.c | 17 +++++++++++++++--
> 2 files changed, 15 insertions(+), 14 deletions(-)
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index c4f8fd2fd384..324990273ad2 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -653,18 +653,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> }
> #endif
>
> -#ifndef pmd_move_must_withdraw
> -static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
> - spinlock_t *old_pmd_ptl)
> -{
> - /*
> - * With split pmd lock we also need to move preallocated
> - * PTE page table if new_pmd is on different PMD page table.
> - */
> - return new_pmd_ptl != old_pmd_ptl;
> -}
> -#endif
> -
> /*
> * This function is meant to be used by sites walking pagetables with
> * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cdcd25cb30fe..1ac1b0ca63c4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1424,6 +1424,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> return 1;
> }
>
> +#ifndef pmd_move_must_withdraw
> +static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
> + spinlock_t *old_pmd_ptl)
> +{
> + /*
> + * With split pmd lock we also need to move preallocated
> + * PTE page table if new_pmd is on different PMD page table.
> + *
> + * We also don't deposit and withdraw tables for file pages.
> + */
> + return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);

Stray git merge?

> +}
> +#endif
> +
> bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> unsigned long new_addr, unsigned long old_end,
> pmd_t *old_pmd, pmd_t *new_pmd)
> @@ -1458,8 +1472,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
> VM_BUG_ON(!pmd_none(*new_pmd));
>
> - if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
> - vma_is_anonymous(vma)) {
> + if (pmd_move_must_withdraw(new_ptl, old_ptl)) {
> pgtable_t pgtable;
> pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> --
> 2.10.2
>

2016-11-07 09:29:25

by kernel test robot

Subject: Re: [PATCH 1/2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

Hi Aneesh,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9-rc4 next-20161028]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Aneesh-Kumar-K-V/mm-move-vma_is_anonymous-check-within-pmd_move_must_withdraw/20161107-164033
base: git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x006-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All error/warnings (new ones prefixed by >>):

mm/huge_memory.c: In function 'pmd_move_must_withdraw':
>> mm/huge_memory.c:1441:58: error: 'vma' undeclared (first use in this function)
return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
^~~
mm/huge_memory.c:1441:58: note: each undeclared identifier is reported only once for each function it appears in
>> mm/huge_memory.c:1442:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^

vim +/vma +1441 mm/huge_memory.c

1435 /*
1436 * With split pmd lock we also need to move preallocated
1437 * PTE page table if new_pmd is on different PMD page table.
1438 *
1439 * We also don't deposit and withdraw tables for file pages.
1440 */
> 1441 return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
> 1442 }
1443 #endif
1444
1445 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2016-11-07 11:40:54

by Aneesh Kumar K.V

Subject: Re: [PATCH 1/2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

kbuild test robot <[email protected]> writes:

> Hi Aneesh,
>
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.9-rc4 next-20161028]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Aneesh-Kumar-K-V/mm-move-vma_is_anonymous-check-within-pmd_move_must_withdraw/20161107-164033
> base: git://git.cmpxchg.org/linux-mmotm.git master
> config: i386-randconfig-x006-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All error/warnings (new ones prefixed by >>):
>
> mm/huge_memory.c: In function 'pmd_move_must_withdraw':
>>> mm/huge_memory.c:1441:58: error: 'vma' undeclared (first use in this function)
> return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
> ^~~
> mm/huge_memory.c:1441:58: note: each undeclared identifier is reported only once for each function it appears in
>>> mm/huge_memory.c:1442:1: warning: control reaches end of non-void function [-Wreturn-type]
> }
> ^
>
> vim +/vma +1441 mm/huge_memory.c
>
> 1435 /*
> 1436 * With split pmd lock we also need to move preallocated
> 1437 * PTE page table if new_pmd is on different PMD page table.
> 1438 *
> 1439 * We also don't deposit and withdraw tables for file pages.
> 1440 */
>> 1441 return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
>> 1442 }
> 1443 #endif
> 1444
> 1445 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>

My bad, I didn't test with hugepage enabled for x86. Will fixup in the
next update.

-aneesh

2016-11-07 14:47:00

by Aneesh Kumar K.V

Subject: [PATCH V2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

Architectures like ppc64 want to use the page table deposit/withdraw
mechanism even with huge pmd dax entries. Allow the arch to override the
vma_is_anonymous() check by moving it into the pmd_move_must_withdraw()
function.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
* Changes from V1:
Fix build error with x86 config

arch/powerpc/include/asm/book3s/64/pgtable.h | 3 ++-
include/asm-generic/pgtable.h | 12 ------------
mm/huge_memory.c | 18 ++++++++++++++++--
3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 9fd77f8794a0..700301bc5190 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1009,7 +1009,8 @@ static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
#define pmd_move_must_withdraw pmd_move_must_withdraw
struct spinlock;
static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
- struct spinlock *old_pmd_ptl)
+ struct spinlock *old_pmd_ptl,
+ struct vm_area_struct *vma)
{
if (radix_enabled())
return false;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index c4f8fd2fd384..324990273ad2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -653,18 +653,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
}
#endif

-#ifndef pmd_move_must_withdraw
-static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
- spinlock_t *old_pmd_ptl)
-{
- /*
- * With split pmd lock we also need to move preallocated
- * PTE page table if new_pmd is on different PMD page table.
- */
- return new_pmd_ptl != old_pmd_ptl;
-}
-#endif
-
/*
* This function is meant to be used by sites walking pagetables with
* the mmap_sem hold in read mode to protect against MADV_DONTNEED and
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25cb30fe..54f265ec902e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1424,6 +1424,21 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return 1;
}

+#ifndef pmd_move_must_withdraw
+static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
+ spinlock_t *old_pmd_ptl,
+ struct vm_area_struct *vma)
+{
+ /*
+ * With split pmd lock we also need to move preallocated
+ * PTE page table if new_pmd is on different PMD page table.
+ *
+ * We also don't deposit and withdraw tables for file pages.
+ */
+ return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+}
+#endif
+
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd)
@@ -1458,8 +1473,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
VM_BUG_ON(!pmd_none(*new_pmd));

- if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
- vma_is_anonymous(vma)) {
+ if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
pgtable_t pgtable;
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
--
2.10.2

2016-11-11 10:06:49

by Kirill A. Shutemov

Subject: Re: [PATCH V2] mm: move vma_is_anonymous check within pmd_move_must_withdraw

On Mon, Nov 07, 2016 at 08:16:39PM +0530, Aneesh Kumar K.V wrote:
> Architectures like ppc64 want to use page table deposit/withraw
> even with huge pmd dax entries. Allow arch to override the
> vma_is_anonymous check by moving that to pmd_move_must_withdraw
> function
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2016-11-11 10:15:22

by Kirill A. Shutemov

Subject: Re: [PATCH 2/2] mm: THP page cache support for ppc64

On Mon, Nov 07, 2016 at 02:04:41PM +0530, Aneesh Kumar K.V wrote:
> @@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)
> ret = VM_FAULT_FALLBACK;
> page = compound_head(page);
>
> + /*
> + * Archs like ppc64 need additonal space to store information
> + * related to pte entry. Use the preallocated table for that.
> + */
> + if (arch_needs_pgtable_deposit() && !fe->prealloc_pte)
> + fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
> +

-ENOMEM handling?

I think we should do this way before this point. Maybe in do_fault() or
something.

--
Kirill A. Shutemov

2016-11-11 12:12:29

by Aneesh Kumar K.V

Subject: Re: [PATCH 2/2] mm: THP page cache support for ppc64

"Kirill A. Shutemov" <[email protected]> writes:

> On Mon, Nov 07, 2016 at 02:04:41PM +0530, Aneesh Kumar K.V wrote:
>> @@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)
>> ret = VM_FAULT_FALLBACK;
>> page = compound_head(page);
>>
>> + /*
>> + * Archs like ppc64 need additonal space to store information
>> + * related to pte entry. Use the preallocated table for that.
>> + */
>> + if (arch_needs_pgtable_deposit() && !fe->prealloc_pte)
>> + fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
>> +
>
> -ENOMEM handling?

How about

if (arch_needs_pgtable_deposit() && !fe->prealloc_pte) {
fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
if (!fe->prealloc_pte)
return VM_FAULT_OOM;
}



>
> I think we should do this way before this point. Maybe in do_fault() or
> something.

Doing this in do_set_pmd() keeps it closer to where we set the pmd. Any
reason you think we should move it higher up the stack? We already do
pte_alloc() at the same level for the non-transhuge case in
alloc_set_pte().

-aneesh

2016-11-11 16:29:14

by Kirill A. Shutemov

Subject: Re: [PATCH 2/2] mm: THP page cache support for ppc64

On Fri, Nov 11, 2016 at 05:42:11PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > On Mon, Nov 07, 2016 at 02:04:41PM +0530, Aneesh Kumar K.V wrote:
> >> @@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)
> >> ret = VM_FAULT_FALLBACK;
> >> page = compound_head(page);
> >>
> >> + /*
> >> + * Archs like ppc64 need additonal space to store information
> >> + * related to pte entry. Use the preallocated table for that.
> >> + */
> >> + if (arch_needs_pgtable_deposit() && !fe->prealloc_pte)
> >> + fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
> >> +
> >
> > -ENOMEM handling?
>
> How about
>
> if (arch_needs_pgtable_deposit() && !fe->prealloc_pte) {
> fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
> if (!fe->prealloc_pte)
> return VM_FAULT_OOM;
> }
>
>
>
> >
> > I think we should do this way before this point. Maybe in do_fault() or
> > something.
>
> doing this in do_set_pmd keeps this closer to where we set the pmd. Any
> reason you thing we should move it higher up the stack. We already do
> pte_alloc() at the same level for a non transhuge case in
> alloc_set_pte().

I vaguely remember Hugh mentioned deadlock of allocation under page-lock vs.
OOM-killer (or something else?).

If the deadlock is still there it would be matter of making preallocation
unconditional to fix the issue.

But what you propose above doesn't make the situation any worse. I'm fine
with that.

--
Kirill A. Shutemov

2016-11-12 01:37:13

by Hugh Dickins

Subject: Re: [PATCH 2/2] mm: THP page cache support for ppc64

On Fri, 11 Nov 2016, Kirill A. Shutemov wrote:
> On Fri, Nov 11, 2016 at 05:42:11PM +0530, Aneesh Kumar K.V wrote:
> >
> > doing this in do_set_pmd keeps this closer to where we set the pmd. Any
> > reason you thing we should move it higher up the stack. We already do
> > pte_alloc() at the same level for a non transhuge case in
> > alloc_set_pte().
>
> I vaguely remember Hugh mentioned deadlock of allocation under page-lock vs.
> OOM-killer (or something else?).

You remember well. It was indeed the OOM killer, but in particular due
to the way it used to wait for a current victim to exit, and that exit
could be delayed forever by the way munlock_vma_pages_all() goes to lock
each page in a VM_LOCKED area - a pity if one of them is the page we
hold locked while servicing a fault and need to allocate a pagetable.

>
> If the deadlock is still there it would be matter of making preallocation
> unconditional to fix the issue.

I think enough has changed at the OOM killer end that the deadlock is
no longer there. I haven't kept up with all the changes made recently,
but I think we no longer wait for a unique victim to exit before trying
another (reaped mms set MMF_OOM_SKIP); and the OOM reaper skips over
VM_LOCKED areas to avoid just such a deadlock.

It's still silly that munlock_vma_pages_all() should require page lock
on each of those pages; but neither Michal nor I have had time to
revisit our attempts to relieve that requirement - mlock.c is not easy.

>
> But what you propose about doesn't make situation any worse. I'm fine with
> that.

Yes, I think that's right: if there is a problem, then it would already
have been a problem since alloc_set_pte() was created; but we've seen no reports.

Hugh