2022-04-05 01:56:46

by Yang Shi

[permalink] [raw]
Subject: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent


Changelog
v3: * Register mm to khugepaged in common mmap path instead of touching
filesystem code (patch 8/8) suggested by Ted.
* New patch 7/8 cleaned up and renamed khugepaged_enter_vma_merge()
to khugepaged_enter_vma().
* Collected acked-by from Song Liu for patch 1 ~ 6.
* Rebased on top of 5.18-rc1.
* Excluded the linux-xfs and linux-ext4 lists since the series doesn't
touch fs code anymore, but kept linux-fsdevel posted.
v2: * Collected reviewed-by tags from Miaohe Lin.
* Fixed build error for patch 4/8.

The readonly FS THP relies on khugepaged to collapse THP for suitable
vmas. But it is kind of "random luck" for khugepaged to see the
readonly FS vmas (see report: https://lore.kernel.org/linux-mm/[email protected]/)
since currently the vmas are registered to khugepaged when:
- Anon huge pmd page fault
- VMA merge
- MADV_HUGEPAGE
- Shmem mmap

If the above conditions are not met, even though khugepaged is enabled
it won't see readonly FS vmas at all. MADV_HUGEPAGE could be specified
explicitly to tell khugepaged to collapse this area, but when khugepaged
mode is "always" it should scan suitable vmas as long as VM_NOHUGEPAGE
is not set.

So make sure readonly FS vmas are registered to khugepaged to make the
behavior more consistent.

Registering suitable vmas in the common mmap path covers both readonly
FS vmas and shmem vmas, so the khugepaged calls in shmem.c are removed.
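
As a sketch of the idea (not the literal patch 8/8 diff; the exact
placement inside mmap_region() is an assumption based on this cover
letter), the common-mmap-path registration boils down to one call once
the vma has been fully set up:

  /* mm/mmap.c, in the common mmap path, after the vma is set up */

  /*
   * Register the mm with khugepaged for any suitable vma (anon,
   * shmem, or readonly file), so readonly FS vmas no longer depend
   * on the anon fault / vma merge / MADV_HUGEPAGE / shmem mmap
   * paths to be noticed by khugepaged.  khugepaged_enter_vma() is
   * the helper from patch 7/8; it calls hugepage_vma_check()
   * internally and does nothing for unsuitable vmas.
   */
  khugepaged_enter_vma(vma, vma->vm_flags);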

Patches 1 ~ 7 are minor bug fixes, cleanups and preparation patches.
Patch 8 is the real meat.


Tested with the khugepaged test in selftests and the testcase provided by
Vlastimil Babka in https://lore.kernel.org/lkml/[email protected]/
with the MADV_HUGEPAGE call commented out.


Yang Shi (8):
sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE
mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGED
mm: khugepaged: skip DAX vma
mm: thp: only regular file could be THP eligible
mm: khugepaged: make khugepaged_enter() void function
mm: khugepaged: move some khugepaged_* functions to khugepaged.c
mm: khugepaged: introduce khugepaged_enter_vma() helper
mm: mmap: register suitable readonly file vmas for khugepaged

include/linux/huge_mm.h | 14 ++++++++++++
include/linux/khugepaged.h | 59 ++++++++++++---------------------------------------
include/linux/sched/coredump.h | 3 ++-
kernel/fork.c | 4 +---
mm/huge_memory.c | 15 ++++---------
mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
mm/mmap.c | 14 ++++++++----
mm/shmem.c | 12 -----------
8 files changed, 88 insertions(+), 109 deletions(-)



2022-04-05 02:19:21

by Yang Shi

[permalink] [raw]
Subject: [v3 PATCH 1/8] sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE

MMF_VM_HUGEPAGE is set as long as the mm is available for khugepaged by
khugepaged_enter(), not only when VM_HUGEPAGE is set on vma. Correct
the comment to avoid confusion.

Reviewed-by: Miaohe Lin <[email protected]>
Acked-by: Song Liu <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/sched/coredump.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 4d9e3a656875..4d0a5be28b70 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -57,7 +57,8 @@ static inline int get_dumpable(struct mm_struct *mm)
#endif
/* leave room for more dump flags */
#define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
-#define MMF_VM_HUGEPAGE 17 /* set when VM_HUGEPAGE is set on vma */
+#define MMF_VM_HUGEPAGE 17 /* set when mm is available for
+ khugepaged */
/*
* This one-shot flag is dropped due to necessity of changing exe once again
* on NFS restore
--
2.26.3

2022-04-05 02:48:52

by Yang Shi

[permalink] [raw]
Subject: [v3 PATCH 6/8] mm: khugepaged: move some khugepaged_* functions to khugepaged.c

We want to reuse hugepage_vma_check() for khugepaged_enter() so that we
can remove some duplicate code. But moving hugepage_vma_check() to
khugepaged.h would require including huge_mm.h in it, and it seems not
worth bloating khugepaged.h.

The khugepaged_* functions are actually wrappers around some non-inline
functions, so there seems to be little benefit in keeping them inline.

So move the khugepaged_* functions to khugepaged.c; callers just need
to include khugepaged.h, which is quite small. For example, the
following patches will call khugepaged_enter() in the filemap page fault
path for regular filesystems to make readonly FS THP collapse more
consistent. filemap.c just needs to include khugepaged.h.

Acked-by: Song Liu <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/khugepaged.h | 37 ++++++-------------------------------
mm/khugepaged.c | 20 ++++++++++++++++++++
2 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 0423d3619f26..6acf9701151e 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -2,10 +2,6 @@
#ifndef _LINUX_KHUGEPAGED_H
#define _LINUX_KHUGEPAGED_H

-#include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
-#include <linux/shmem_fs.h>
-
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern struct attribute_group khugepaged_attr_group;

@@ -16,6 +12,12 @@ extern void __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
extern void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
unsigned long vm_flags);
+extern void khugepaged_fork(struct mm_struct *mm,
+ struct mm_struct *oldmm);
+extern void khugepaged_exit(struct mm_struct *mm);
+extern void khugepaged_enter(struct vm_area_struct *vma,
+ unsigned long vm_flags);
+
extern void khugepaged_min_free_kbytes_update(void);
#ifdef CONFIG_SHMEM
extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
@@ -33,36 +35,9 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
#define khugepaged_always() \
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_FLAG))
-#define khugepaged_req_madv() \
- (transparent_hugepage_flags & \
- (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
#define khugepaged_defrag() \
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
-
-static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
-{
- if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
- __khugepaged_enter(mm);
-}
-
-static inline void khugepaged_exit(struct mm_struct *mm)
-{
- if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
- __khugepaged_exit(mm);
-}
-
-static inline void khugepaged_enter(struct vm_area_struct *vma,
- unsigned long vm_flags)
-{
- if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
- if ((khugepaged_always() ||
- (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) ||
- (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
- !(vm_flags & VM_NOHUGEPAGE) &&
- !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
- __khugepaged_enter(vma->vm_mm);
-}
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b69eda934d70..ec5b0a691d87 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -556,6 +556,26 @@ void __khugepaged_exit(struct mm_struct *mm)
}
}

+void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+ if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
+ __khugepaged_enter(mm);
+}
+
+void khugepaged_exit(struct mm_struct *mm)
+{
+ if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
+ __khugepaged_exit(mm);
+}
+
+void khugepaged_enter(struct vm_area_struct *vma, unsigned long vm_flags)
+{
+ if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
+ khugepaged_enabled())
+ if (hugepage_vma_check(vma, vm_flags))
+ __khugepaged_enter(vma->vm_mm);
+}
+
static void release_pte_page(struct page *page)
{
mod_node_page_state(page_pgdat(page),
--
2.26.3

2022-04-05 03:09:15

by Yang Shi

[permalink] [raw]
Subject: [v3 PATCH 5/8] mm: khugepaged: make khugepaged_enter() void function

Most callers of khugepaged_enter() don't care about the return value.
Only dup_mmap(), the anonymous THP page fault and MADV_HUGEPAGE handle
the error by returning -ENOMEM. Actually it is not harmful for them to
ignore the error case either. It also seems like overkill to fail fork()
and page fault early due to a khugepaged_enter() error, and MADV_HUGEPAGE
does set the VM_HUGEPAGE flag regardless of the error.

Acked-by: Song Liu <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/khugepaged.h | 30 ++++++++++++------------------
kernel/fork.c | 4 +---
mm/huge_memory.c | 4 ++--
mm/khugepaged.c | 18 +++++++-----------
4 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 2fcc01891b47..0423d3619f26 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -12,10 +12,10 @@ extern struct attribute_group khugepaged_attr_group;
extern int khugepaged_init(void);
extern void khugepaged_destroy(void);
extern int start_stop_khugepaged(void);
-extern int __khugepaged_enter(struct mm_struct *mm);
+extern void __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
-extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
- unsigned long vm_flags);
+extern void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+ unsigned long vm_flags);
extern void khugepaged_min_free_kbytes_update(void);
#ifdef CONFIG_SHMEM
extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
@@ -40,11 +40,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))

-static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
- return __khugepaged_enter(mm);
- return 0;
+ __khugepaged_enter(mm);
}

static inline void khugepaged_exit(struct mm_struct *mm)
@@ -53,7 +52,7 @@ static inline void khugepaged_exit(struct mm_struct *mm)
__khugepaged_exit(mm);
}

-static inline int khugepaged_enter(struct vm_area_struct *vma,
+static inline void khugepaged_enter(struct vm_area_struct *vma,
unsigned long vm_flags)
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
@@ -62,27 +61,22 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
(khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
!(vm_flags & VM_NOHUGEPAGE) &&
!test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
- if (__khugepaged_enter(vma->vm_mm))
- return -ENOMEM;
- return 0;
+ __khugepaged_enter(vma->vm_mm);
}
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
- return 0;
}
static inline void khugepaged_exit(struct mm_struct *mm)
{
}
-static inline int khugepaged_enter(struct vm_area_struct *vma,
- unsigned long vm_flags)
+static inline void khugepaged_enter(struct vm_area_struct *vma,
+ unsigned long vm_flags)
{
- return 0;
}
-static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
- unsigned long vm_flags)
+static inline void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+ unsigned long vm_flags)
{
- return 0;
}
static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
unsigned long addr)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9796897560ab..0d13baf86650 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -612,9 +612,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
retval = ksm_fork(mm, oldmm);
if (retval)
goto out;
- retval = khugepaged_fork(mm, oldmm);
- if (retval)
- goto out;
+ khugepaged_fork(mm, oldmm);

prev = NULL;
for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 183b793fd28e..4fd5a6a79d44 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,8 +725,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
return VM_FAULT_FALLBACK;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
- if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
- return VM_FAULT_OOM;
+ khugepaged_enter(vma, vma->vm_flags);
+
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm) &&
transparent_hugepage_use_zero_page()) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 609c1bc0a027..b69eda934d70 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -365,8 +365,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
* register it here without waiting a page fault that
* may not happen any time soon.
*/
- if (khugepaged_enter_vma_merge(vma, *vm_flags))
- return -ENOMEM;
+ khugepaged_enter_vma_merge(vma, *vm_flags);
break;
case MADV_NOHUGEPAGE:
*vm_flags &= ~VM_HUGEPAGE;
@@ -475,20 +474,20 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
return true;
}

-int __khugepaged_enter(struct mm_struct *mm)
+void __khugepaged_enter(struct mm_struct *mm)
{
struct mm_slot *mm_slot;
int wakeup;

mm_slot = alloc_mm_slot();
if (!mm_slot)
- return -ENOMEM;
+ return;

/* __khugepaged_exit() must not run from under us */
VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
free_mm_slot(mm_slot);
- return 0;
+ return;
}

spin_lock(&khugepaged_mm_lock);
@@ -504,11 +503,9 @@ int __khugepaged_enter(struct mm_struct *mm)
mmgrab(mm);
if (wakeup)
wake_up_interruptible(&khugepaged_wait);
-
- return 0;
}

-int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
unsigned long vm_flags)
{
unsigned long hstart, hend;
@@ -519,13 +516,12 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
* file-private shmem THP is not supported.
*/
if (!hugepage_vma_check(vma, vm_flags))
- return 0;
+ return;

hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
hend = vma->vm_end & HPAGE_PMD_MASK;
if (hstart < hend)
- return khugepaged_enter(vma, vm_flags);
- return 0;
+ khugepaged_enter(vma, vm_flags);
}

void __khugepaged_exit(struct mm_struct *mm)
--
2.26.3

2022-04-05 03:17:22

by Yang Shi

[permalink] [raw]
Subject: [v3 PATCH 4/8] mm: thp: only regular file could be THP eligible

Since commit a4aeaa06d45e ("mm: khugepaged: skip huge page collapse for
special files"), khugepaged only collapses THP for regular files, which
is the intended use case for readonly FS THP. Accordingly, only show
regular files as THP eligible.

Also make file_thp_enabled() available to khugepaged in order to remove
duplicate code.

Acked-by: Song Liu <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/huge_mm.h | 14 ++++++++++++++
mm/huge_memory.c | 11 ++---------
mm/khugepaged.c | 9 ++-------
3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2999190adc22..62a6f667850d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -172,6 +172,20 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
return false;
}

+static inline bool file_thp_enabled(struct vm_area_struct *vma)
+{
+ struct inode *inode;
+
+ if (!vma->vm_file)
+ return false;
+
+ inode = vma->vm_file->f_inode;
+
+ return (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) &&
+ (vma->vm_flags & VM_EXEC) &&
+ !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
+}
+
bool transparent_hugepage_active(struct vm_area_struct *vma);

#define transparent_hugepage_use_zero_page() \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2fe38212e07c..183b793fd28e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -68,13 +68,6 @@ static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;
unsigned long huge_zero_pfn __read_mostly = ~0UL;

-static inline bool file_thp_enabled(struct vm_area_struct *vma)
-{
- return transhuge_vma_enabled(vma, vma->vm_flags) && vma->vm_file &&
- !inode_is_open_for_write(vma->vm_file->f_inode) &&
- (vma->vm_flags & VM_EXEC);
-}
-
bool transparent_hugepage_active(struct vm_area_struct *vma)
{
/* The addr is used to check if the vma size fits */
@@ -86,8 +79,8 @@ bool transparent_hugepage_active(struct vm_area_struct *vma)
return __transparent_hugepage_enabled(vma);
if (vma_is_shmem(vma))
return shmem_huge_enabled(vma);
- if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
- return file_thp_enabled(vma);
+ if (transhuge_vma_enabled(vma, vma->vm_flags) && file_thp_enabled(vma))
+ return true;

return false;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 964a4d2c942a..609c1bc0a027 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -464,13 +464,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
return false;

/* Only regular file is valid */
- if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
- (vm_flags & VM_EXEC)) {
- struct inode *inode = vma->vm_file->f_inode;
-
- return !inode_is_open_for_write(inode) &&
- S_ISREG(inode->i_mode);
- }
+ if (file_thp_enabled(vma))
+ return true;

if (!vma->anon_vma || vma->vm_ops)
return false;
--
2.26.3

2022-04-05 03:40:41

by Yang Shi

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On Mon, Apr 4, 2022 at 5:16 PM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Apr 04, 2022 at 01:02:42PM -0700, Yang Shi wrote:
> > The readonly FS THP relies on khugepaged to collapse THP for suitable
> > vmas. But it is kind of "random luck" for khugepaged to see the
> > readonly FS vmas (see report: https://lore.kernel.org/linux-mm/[email protected]/) since currently the vmas are registered to khugepaged when:
>
> I still don't see the point. The effort should be put into
> supporting large folios, not in making this hack work better.

The series makes sense even if the hack is replaced by large folios,
IMHO. The problem is that file VMAs may not be registered with
khugepaged consistently for some THP modes, for example "always",
regardless of whether they are readonly or whether the hack is gone.
IIUC even if the hack is replaced by large folios, we would still have
khugepaged collapse pmd-mappable huge pages for both anonymous vmas
and file vmas, right? Or are you thinking about killing khugepaged
soon along with supporting large folios?

Anyway it may make things clearer if the cover letter is rephrased to:

When khugepaged collapses file THPs, its behavior is not consistent.
It is kind of "random luck" for khugepaged to see the file vmas (see
report: https://lore.kernel.org/linux-mm/[email protected]/)
since currently the vmas are registered to khugepaged when:
- Anon huge pmd page fault
- VMA merge
- MADV_HUGEPAGE
- Shmem mmap

If the above conditions are not met, even though khugepaged is enabled
it won't see any file vma at all. MADV_HUGEPAGE could be specified
explicitly to tell khugepaged to collapse this area, but when
khugepaged mode is "always" it should scan suitable vmas as long as
VM_NOHUGEPAGE is not set.

So make sure file vmas are registered to khugepaged to make the
behavior more consistent.

2022-04-05 03:41:54

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On Mon, Apr 04, 2022 at 01:02:42PM -0700, Yang Shi wrote:
> The readonly FS THP relies on khugepaged to collapse THP for suitable
> vmas. But it is kind of "random luck" for khugepaged to see the
> readonly FS vmas (see report: https://lore.kernel.org/linux-mm/[email protected]/) since currently the vmas are registered to khugepaged when:

I still don't see the point. The effort should be put into
supporting large folios, not in making this hack work better.

2022-04-27 22:52:57

by Yang Shi

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On Wed, Apr 27, 2022 at 1:59 PM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> > When khugepaged collapses file THPs, its behavior is not consistent.
> > It is kind of "random luck" for khugepaged to see the file vmas (see
> > report: https://lore.kernel.org/linux-mm/[email protected]/)
> > since currently the vmas are registered to khugepaged when:
> > - Anon huge pmd page fault
> > - VMA merge
> > - MADV_HUGEPAGE
> > - Shmem mmap
> >
> > If the above conditions are not met, even though khugepaged is enabled
> > it won't see any file vma at all. MADV_HUGEPAGE could be specified
> > explicitly to tell khugepaged to collapse this area, but when
> > khugepaged mode is "always" it should scan suitable vmas as long as
> > VM_NOHUGEPAGE is not set.
>
> I don't see that as being true at all. The point of this hack was that
> applications which really knew what they were doing could enable it.
> It makes no sense to me that setting "always" by the sysadmin for shmem
> also force-enables ROTHP, even for applications which aren't aware of it.
>
> Most telling, I think, is that Song Liu hasn't weighed in on this at
> all. It's clearly not important to the original author.

I tend to agree that MADV_HUGEPAGE was what the original author had in
mind when this feature (or hack) was designed in the first place. And
"madvise" is definitely the recommended way to use THP, but I don't
think that means we should not care about "always" and assume nobody
uses it; otherwise the issue would not have been reported.

2022-04-27 23:03:25

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> When khugepaged collapses file THPs, its behavior is not consistent.
> It is kind of "random luck" for khugepaged to see the file vmas (see
> report: https://lore.kernel.org/linux-mm/[email protected]/)
> since currently the vmas are registered to khugepaged when:
> - Anon huge pmd page fault
> - VMA merge
> - MADV_HUGEPAGE
> - Shmem mmap
>
> If the above conditions are not met, even though khugepaged is enabled
> it won't see any file vma at all. MADV_HUGEPAGE could be specified
> explicitly to tell khugepaged to collapse this area, but when
> khugepaged mode is "always" it should scan suitable vmas as long as
> VM_NOHUGEPAGE is not set.

I don't see that as being true at all. The point of this hack was that
applications which really knew what they were doing could enable it.
It makes no sense to me that setting "always" by the sysadmin for shmem
also force-enables ROTHP, even for applications which aren't aware of it.

Most telling, I think, is that Song Liu hasn't weighed in on this at
all. It's clearly not important to the original author.

2022-04-29 01:04:08

by Yang Shi

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On Wed, Apr 27, 2022 at 1:59 PM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Apr 04, 2022 at 05:48:49PM -0700, Yang Shi wrote:
> > When khugepaged collapses file THPs, its behavior is not consistent.
> > It is kind of "random luck" for khugepaged to see the file vmas (see
> > report: https://lore.kernel.org/linux-mm/[email protected]/)
> > since currently the vmas are registered to khugepaged when:
> > - Anon huge pmd page fault
> > - VMA merge
> > - MADV_HUGEPAGE
> > - Shmem mmap
> >
> > If the above conditions are not met, even though khugepaged is enabled
> > it won't see any file vma at all. MADV_HUGEPAGE could be specified
> > explicitly to tell khugepaged to collapse this area, but when
> > khugepaged mode is "always" it should scan suitable vmas as long as
> > VM_NOHUGEPAGE is not set.
>
> I don't see that as being true at all. The point of this hack was that
> applications which really knew what they were doing could enable it.
> It makes no sense to me that setting "always" by the sysadmin for shmem
> also force-enables ROTHP, even for applications which aren't aware of it.
>
> Most telling, I think, is that Song Liu hasn't weighed in on this at
> all. It's clearly not important to the original author.

Song Liu already acked the series; please see
https://lore.kernel.org/linux-mm/[email protected]/

2022-05-09 12:43:10

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [v3 PATCH 1/8] sched: coredump.h: clarify the use of MMF_VM_HUGEPAGE

On 4/4/22 22:02, Yang Shi wrote:
> MMF_VM_HUGEPAGE is set as long as the mm is available for khugepaged by
> khugepaged_enter(), not only when VM_HUGEPAGE is set on vma. Correct
> the comment to avoid confusion.
>
> Reviewed-by: Miaohe Lin <[email protected]>
> Acked-by: Song Liu <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> include/linux/sched/coredump.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 4d9e3a656875..4d0a5be28b70 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -57,7 +57,8 @@ static inline int get_dumpable(struct mm_struct *mm)
> #endif
> /* leave room for more dump flags */
> #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
> -#define MMF_VM_HUGEPAGE 17 /* set when VM_HUGEPAGE is set on vma */
> +#define MMF_VM_HUGEPAGE 17 /* set when mm is available for
> + khugepaged */
> /*
> * This one-shot flag is dropped due to necessity of changing exe once again
> * on NFS restore


2022-05-09 13:55:04

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [v3 PATCH 4/8] mm: thp: only regular file could be THP eligible

On 4/4/22 22:02, Yang Shi wrote:
> Since commit a4aeaa06d45e ("mm: khugepaged: skip huge page collapse for
> special files"), khugepaged only collapses THP for regular files, which
> is the intended use case for readonly FS THP. Accordingly, only show
> regular files as THP eligible.
>
> Also make file_thp_enabled() available to khugepaged in order to remove
> duplicate code.
>
> Acked-by: Song Liu <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> include/linux/huge_mm.h | 14 ++++++++++++++
> mm/huge_memory.c | 11 ++---------
> mm/khugepaged.c | 9 ++-------
> 3 files changed, 18 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2999190adc22..62a6f667850d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -172,6 +172,20 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
> return false;
> }
>
> +static inline bool file_thp_enabled(struct vm_area_struct *vma)
> +{
> + struct inode *inode;
> +
> + if (!vma->vm_file)
> + return false;
> +
> + inode = vma->vm_file->f_inode;
> +
> + return (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) &&
> + (vma->vm_flags & VM_EXEC) &&
> + !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> +}
> +
> bool transparent_hugepage_active(struct vm_area_struct *vma);
>
> #define transparent_hugepage_use_zero_page() \
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2fe38212e07c..183b793fd28e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -68,13 +68,6 @@ static atomic_t huge_zero_refcount;
> struct page *huge_zero_page __read_mostly;
> unsigned long huge_zero_pfn __read_mostly = ~0UL;
>
> -static inline bool file_thp_enabled(struct vm_area_struct *vma)
> -{
> - return transhuge_vma_enabled(vma, vma->vm_flags) && vma->vm_file &&
> - !inode_is_open_for_write(vma->vm_file->f_inode) &&
> - (vma->vm_flags & VM_EXEC);
> -}
> -
> bool transparent_hugepage_active(struct vm_area_struct *vma)
> {
> /* The addr is used to check if the vma size fits */
> @@ -86,8 +79,8 @@ bool transparent_hugepage_active(struct vm_area_struct *vma)
> return __transparent_hugepage_enabled(vma);
> if (vma_is_shmem(vma))
> return shmem_huge_enabled(vma);
> - if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
> - return file_thp_enabled(vma);
> + if (transhuge_vma_enabled(vma, vma->vm_flags) && file_thp_enabled(vma))
> + return true;
>
> return false;
> }
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 964a4d2c942a..609c1bc0a027 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -464,13 +464,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
> return false;
>
> /* Only regular file is valid */
> - if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
> - (vm_flags & VM_EXEC)) {
> - struct inode *inode = vma->vm_file->f_inode;
> -
> - return !inode_is_open_for_write(inode) &&
> - S_ISREG(inode->i_mode);
> - }
> + if (file_thp_enabled(vma))
> + return true;
>
> if (!vma->anon_vma || vma->vm_ops)
> return false;


2022-05-09 13:59:33

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [v3 PATCH 5/8] mm: khugepaged: make khugepaged_enter() void function

On 4/4/22 22:02, Yang Shi wrote:
> Most callers of khugepaged_enter() don't care about the return value.
> Only dup_mmap(), the anonymous THP page fault and MADV_HUGEPAGE handle
> the error by returning -ENOMEM. Actually it is not harmful for them to
> ignore the error case either. It also seems like overkill to fail fork()
> and page fault early due to a khugepaged_enter() error, and MADV_HUGEPAGE
> does set the VM_HUGEPAGE flag regardless of the error.
>
> Acked-by: Song Liu <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2022-05-09 15:42:03

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [v3 PATCH 6/8] mm: khugepaged: move some khugepaged_* functions to khugepaged.c

On 4/4/22 22:02, Yang Shi wrote:
> We want to reuse hugepage_vma_check() for khugepaged_enter() so that we
> can remove some duplicate code. But moving hugepage_vma_check() to
> khugepaged.h would require including huge_mm.h in it, and it seems not
> worth bloating khugepaged.h.
>
> The khugepaged_* functions are actually wrappers around some non-inline
> functions, so there seems to be little benefit in keeping them inline.
>
> So move the khugepaged_* functions to khugepaged.c; callers just need
> to include khugepaged.h, which is quite small. For example, the
> following patches will call khugepaged_enter() in the filemap page fault
> path for regular filesystems to make readonly FS THP collapse more
> consistent. filemap.c just needs to include khugepaged.h.

This last part is inaccurate in v3?

> Acked-by: Song Liu <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

I think moving the tiny wrappers is unnecessary.

How about just making hugepage_vma_check() not static and declare it in
khugepaged.h, then it can be used from khugepaged_enter() in the same file
and AFAICS no need to include huge_mm.h there?
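
For illustration, a minimal sketch of that alternative (an assumption
about what v4 could look like, not code from this series) might be:

  /* include/linux/khugepaged.h */
  /* hugepage_vma_check() made non-static in mm/khugepaged.c */
  extern bool hugepage_vma_check(struct vm_area_struct *vma,
                                 unsigned long vm_flags);

  static inline void khugepaged_enter(struct vm_area_struct *vma,
                                      unsigned long vm_flags)
  {
          if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
              khugepaged_enabled() &&
              hugepage_vma_check(vma, vm_flags))
                  __khugepaged_enter(vma->vm_mm);
  }

khugepaged.h would keep its linux/sched/coredump.h include for the
MMF_* bits and test_bit(), but would not need anything from huge_mm.h.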

> ---
> include/linux/khugepaged.h | 37 ++++++-------------------------------
> mm/khugepaged.c | 20 ++++++++++++++++++++
> 2 files changed, 26 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 0423d3619f26..6acf9701151e 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -2,10 +2,6 @@
> #ifndef _LINUX_KHUGEPAGED_H
> #define _LINUX_KHUGEPAGED_H
>
> -#include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
> -#include <linux/shmem_fs.h>
> -
> -
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> extern struct attribute_group khugepaged_attr_group;
>
> @@ -16,6 +12,12 @@ extern void __khugepaged_enter(struct mm_struct *mm);
> extern void __khugepaged_exit(struct mm_struct *mm);
> extern void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
> unsigned long vm_flags);
> +extern void khugepaged_fork(struct mm_struct *mm,
> + struct mm_struct *oldmm);
> +extern void khugepaged_exit(struct mm_struct *mm);
> +extern void khugepaged_enter(struct vm_area_struct *vma,
> + unsigned long vm_flags);
> +
> extern void khugepaged_min_free_kbytes_update(void);
> #ifdef CONFIG_SHMEM
> extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> @@ -33,36 +35,9 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> #define khugepaged_always() \
> (transparent_hugepage_flags & \
> (1<<TRANSPARENT_HUGEPAGE_FLAG))
> -#define khugepaged_req_madv() \
> - (transparent_hugepage_flags & \
> - (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
> #define khugepaged_defrag() \
> (transparent_hugepage_flags & \
> (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
> -
> -static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> -{
> - if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> - __khugepaged_enter(mm);
> -}
> -
> -static inline void khugepaged_exit(struct mm_struct *mm)
> -{
> - if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
> - __khugepaged_exit(mm);
> -}
> -
> -static inline void khugepaged_enter(struct vm_area_struct *vma,
> - unsigned long vm_flags)
> -{
> - if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
> - if ((khugepaged_always() ||
> - (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) ||
> - (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
> - !(vm_flags & VM_NOHUGEPAGE) &&
> - !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> - __khugepaged_enter(vma->vm_mm);
> -}
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> {
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b69eda934d70..ec5b0a691d87 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -556,6 +556,26 @@ void __khugepaged_exit(struct mm_struct *mm)
> }
> }
>
> +void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> +{
> + if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> + __khugepaged_enter(mm);
> +}
> +
> +void khugepaged_exit(struct mm_struct *mm)
> +{
> + if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
> + __khugepaged_exit(mm);
> +}
> +
> +void khugepaged_enter(struct vm_area_struct *vma, unsigned long vm_flags)
> +{
> + if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
> + khugepaged_enabled())
> + if (hugepage_vma_check(vma, vm_flags))
> + __khugepaged_enter(vma->vm_mm);
> +}
> +
> static void release_pte_page(struct page *page)
> {
> mod_node_page_state(page_pgdat(page),


2022-05-09 16:10:27

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On 4/4/22 22:02, Yang Shi wrote:
> include/linux/huge_mm.h | 14 ++++++++++++
> include/linux/khugepaged.h | 59 ++++++++++++---------------------------------------
> include/linux/sched/coredump.h | 3 ++-
> kernel/fork.c | 4 +---
> mm/huge_memory.c | 15 ++++---------
> mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
> mm/mmap.c | 14 ++++++++----
> mm/shmem.c | 12 -----------
> 8 files changed, 88 insertions(+), 109 deletions(-)

Resending my general feedback from mm-commits thread to include the
public ML's:

There are modestly fewer lines in the end, some duplicate code removed,
special casing in shmem.c removed, that's all good as it is. Also patch 8/8
became quite boring in v3, no need to change individual filesystems and also
no hook in the fault path, just the common mmap path. So I would just handle
patch 6 differently as I just replied to it, and acked the rest.

That said, it's still unfortunately rather a mess of functions that have
similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
So maybe still some space for further cleanups. But the series is fine as it
is so we don't have to wait for it now.

We could also consider that the tracking of which mm is to be scanned is
modelled after ksm which has its own madvise flag, but also no "always"
mode. What if for THP we only tracked actual THP madvised mm's, and in
"always" mode just scanned all mm's, would that allow ripping out some code
perhaps, while not adding too many unnecessary scans? If some processes are
being scanned without any effect, maybe track success separately, and scan
them less frequently etc. That could be ultimately more efficient than
painfully tracking just *eligibility* for scanning in "always" mode?

Even more radical thing to consider (maybe that's a LSF/MM level topic, too
bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
in MGLRU, and I probably forgot something else. Maybe time to think about
unifying those scanners?



2022-05-09 20:44:07

by Yang Shi

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <[email protected]> wrote:
>
> On 4/4/22 22:02, Yang Shi wrote:
> > include/linux/huge_mm.h | 14 ++++++++++++
> > include/linux/khugepaged.h | 59 ++++++++++++---------------------------------------
> > include/linux/sched/coredump.h | 3 ++-
> > kernel/fork.c | 4 +---
> > mm/huge_memory.c | 15 ++++---------
> > mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
> > mm/mmap.c | 14 ++++++++----
> > mm/shmem.c | 12 -----------
> > 8 files changed, 88 insertions(+), 109 deletions(-)
>
> Resending my general feedback from mm-commits thread to include the
> public ML's:
>
> There are modestly fewer lines in the end, some duplicate code removed,
> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
> became quite boring in v3, no need to change individual filesystems and also
> no hook in the fault path, just the common mmap path. So I would just handle
> patch 6 differently as I just replied to it, and acked the rest.
>
> That said, it's still unfortunately rather a mess of functions that have
> similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
> So maybe still some space for further cleanups. But the series is fine as it
> is so we don't have to wait for it now.

Yeah, I agree that we do have a lot of THP checks. Will find some time to
look into it deeper later.

>
> We could also consider that the tracking of which mm is to be scanned is
> modelled after ksm which has its own madvise flag, but also no "always"
> mode. What if for THP we only tracked actual THP madvised mm's, and in
> "always" mode just scanned all vm's, would that allow ripping out some code
> perhaps, while not adding too many unnecessary scans? If some processes are

Do you mean add all mm(s) to the scan list unconditionally? I don't
think it will scale.

> being scanned without any effect, maybe track success separately, and scan
> them less frequently etc. That could be ultimately more efficient than
> painfully tracking just *eligibility* for scanning in "always" mode?

Sounds like we need a couple of different lists, for example, inactive
and active? And promote or demote mm(s) between the two lists? TBH I
don't see too many benefits at the moment. Or did I misunderstand you?

>
> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
> in MGLRU, and I probably forgot something else. Maybe time to think about
> unifying those scanners?

We do have pagewalk (walk_page_range()) which is used by a couple of
mm features, for example mlock, mempolicy, mprotect, etc. I'm not sure
whether it is feasible for khugepaged, ksm, etc, or not since I didn't
look that hard. But I agree it should be worth looking at.
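
For reference, the walk_page_range() pattern being referred to looks
roughly like the sketch below (the callback and its purpose here are
hypothetical; the mm_walk_ops API is as of 5.18):

  #include <linux/pagewalk.h>

  /* hypothetical callback: count present pmd-level entries */
  static int count_pmd_entry(pmd_t *pmd, unsigned long addr,
                             unsigned long next, struct mm_walk *walk)
  {
          unsigned long *count = walk->private;

          if (pmd_present(*pmd))
                  (*count)++;
          return 0;       /* non-zero would abort the walk */
  }

  static const struct mm_walk_ops count_ops = {
          .pmd_entry = count_pmd_entry,
  };

  /* caller must hold mmap_lock */
  static unsigned long count_pmds(struct vm_area_struct *vma)
  {
          unsigned long count = 0;

          walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
                          &count_ops, &count);
          return count;
  }

Whether a scanner like khugepaged could be expressed on top of this is
exactly the open question above.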

>
>

2022-05-10 03:28:14

by Yang Shi

[permalink] [raw]
Subject: Re: [v3 PATCH 6/8] mm: khugepaged: move some khugepaged_* functions to khugepaged.c

On Mon, May 9, 2022 at 8:31 AM Vlastimil Babka <[email protected]> wrote:
>
> On 4/4/22 22:02, Yang Shi wrote:
> > We want to reuse hugepage_vma_check() for khugepaged_enter() so that we
> > can remove some duplicate code. But moving hugepage_vma_check() to
> > khugepaged.h would require including huge_mm.h in it, and it seems not
> > worth bloating khugepaged.h.
> >
> > The khugepaged_* functions are actually wrappers around some non-inline
> > functions, so there seems to be little benefit in keeping them inline.
> >
> > So move the khugepaged_* functions to khugepaged.c; callers just need
> > to include khugepaged.h, which is quite small. For example, the
> > following patches will call khugepaged_enter() in the filemap page fault
> > path for regular filesystems to make readonly FS THP collapse more
> > consistent. filemap.c just needs to include khugepaged.h.
>
> This last part is inaccurate in v3?

Yeah, thanks for catching this. Since the patch will be reworked, so
the commit log will be reworked as well.

>
> > Acked-by: Song Liu <[email protected]>
> > Signed-off-by: Yang Shi <[email protected]>
>
> I think moving the tiny wrappers is unnecessary.
>
> How about just making hugepage_vma_check() not static and declare it in
> khugepaged.h, then it can be used from khugepaged_enter() in the same file
> and AFAICS no need to include huge_mm.h there?

Sounds good to me, will fix it in v4. Thanks for reviewing and acking
the series.

>
> > ---
> > include/linux/khugepaged.h | 37 ++++++-------------------------------
> > mm/khugepaged.c | 20 ++++++++++++++++++++
> > 2 files changed, 26 insertions(+), 31 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index 0423d3619f26..6acf9701151e 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -2,10 +2,6 @@
> > #ifndef _LINUX_KHUGEPAGED_H
> > #define _LINUX_KHUGEPAGED_H
> >
> > -#include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
> > -#include <linux/shmem_fs.h>
> > -
> > -
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > extern struct attribute_group khugepaged_attr_group;
> >
> > @@ -16,6 +12,12 @@ extern void __khugepaged_enter(struct mm_struct *mm);
> > extern void __khugepaged_exit(struct mm_struct *mm);
> > extern void khugepaged_enter_vma_merge(struct vm_area_struct *vma,
> > unsigned long vm_flags);
> > +extern void khugepaged_fork(struct mm_struct *mm,
> > + struct mm_struct *oldmm);
> > +extern void khugepaged_exit(struct mm_struct *mm);
> > +extern void khugepaged_enter(struct vm_area_struct *vma,
> > + unsigned long vm_flags);
> > +
> > extern void khugepaged_min_free_kbytes_update(void);
> > #ifdef CONFIG_SHMEM
> > extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> > @@ -33,36 +35,9 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> > #define khugepaged_always() \
> > (transparent_hugepage_flags & \
> > (1<<TRANSPARENT_HUGEPAGE_FLAG))
> > -#define khugepaged_req_madv() \
> > - (transparent_hugepage_flags & \
> > - (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
> > #define khugepaged_defrag() \
> > (transparent_hugepage_flags & \
> > (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
> > -
> > -static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > -{
> > - if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> > - __khugepaged_enter(mm);
> > -}
> > -
> > -static inline void khugepaged_exit(struct mm_struct *mm)
> > -{
> > - if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
> > - __khugepaged_exit(mm);
> > -}
> > -
> > -static inline void khugepaged_enter(struct vm_area_struct *vma,
> > - unsigned long vm_flags)
> > -{
> > - if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
> > - if ((khugepaged_always() ||
> > - (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) ||
> > - (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
> > - !(vm_flags & VM_NOHUGEPAGE) &&
> > - !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> > - __khugepaged_enter(vma->vm_mm);
> > -}
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > {
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b69eda934d70..ec5b0a691d87 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -556,6 +556,26 @@ void __khugepaged_exit(struct mm_struct *mm)
> > }
> > }
> >
> > +void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > +{
> > + if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> > + __khugepaged_enter(mm);
> > +}
> > +
> > +void khugepaged_exit(struct mm_struct *mm)
> > +{
> > + if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
> > + __khugepaged_exit(mm);
> > +}
> > +
> > +void khugepaged_enter(struct vm_area_struct *vma, unsigned long vm_flags)
> > +{
> > + if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
> > + khugepaged_enabled())
> > + if (hugepage_vma_check(vma, vm_flags))
> > + __khugepaged_enter(vma->vm_mm);
> > +}
> > +
> > static void release_pte_page(struct page *page)
> > {
> > mod_node_page_state(page_pgdat(page),
>

2022-05-10 10:48:59

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On 5/9/22 22:34, Yang Shi wrote:
> On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <[email protected]> wrote:
>>
>> On 4/4/22 22:02, Yang Shi wrote:
>> > include/linux/huge_mm.h | 14 ++++++++++++
>> > include/linux/khugepaged.h | 59 ++++++++++++---------------------------------------
>> > include/linux/sched/coredump.h | 3 ++-
>> > kernel/fork.c | 4 +---
>> > mm/huge_memory.c | 15 ++++---------
>> > mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
>> > mm/mmap.c | 14 ++++++++----
>> > mm/shmem.c | 12 -----------
>> > 8 files changed, 88 insertions(+), 109 deletions(-)
>>
>> Resending my general feedback from mm-commits thread to include the
>> public ML's:
>>
>> There are modestly fewer lines in the end, some duplicate code removed,
>> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
>> became quite boring in v3, no need to change individual filesystems and also
>> no hook in the fault path, just the common mmap path. So I would just handle
>> patch 6 differently as I just replied to it, and acked the rest.
>>
>> That said, it's still unfortunately rather a mess of functions that have
>> similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
>> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
>> So maybe still some space for further cleanups. But the series is fine as it
>> is so we don't have to wait for it now.
>
> Yeah, I agree that we do have a lot of THP checks. Will find some time to
> look into it deeper later.

Thanks.

>>
>> We could also consider that the tracking of which mm is to be scanned is
>> modelled after ksm which has its own madvise flag, but also no "always"
>> mode. What if for THP we only tracked actual THP madvised mm's, and in
>> "always" mode just scanned all vm's, would that allow ripping out some code
>> perhaps, while not adding too many unnecessary scans? If some processes are
>
> Do you mean add all mm(s) to the scan list unconditionally? I don't
> think it will scale.

It might be interesting to find out how many mm's (percentage of all mm's)
are typically in the list with "always" enabled. I wouldn't be surprised if
it was nearly all of them. Having at least one large enough anonymous area
sounds like something all processes would have these days?

>> being scanned without any effect, maybe track success separately, and scan
>> them less frequently etc. That could be ultimately more efficient than
>> painfully tracking just *eligibility* for scanning in "always" mode?
>
> Sounds like we need a couple of different lists, for example, inactive
> and active? And promote or demote mm(s) between the two lists? TBH I
> don't see too many benefits at the moment. Or did I misunderstand you?

Yeah, something like that. It would of course require finding out whether
khugepaged is consuming too much cpu uselessly these days while not
processing fast enough mm's where it succeeds more.

>>
>> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
>> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
>> in MGLRU, and I probably forgot something else. Maybe time to think about
>> unifying those scanners?
>
> We do have pagewalk (walk_page_range()) which is used by a couple of
> mm features, for example mlock, mempolicy, mprotect, etc. I'm not sure
> whether it is feasible for khugepaged, ksm, etc, or not since I didn't
> look that hard. But I agree it should be worth looking at.

pagewalk is a framework to simplify writing code that processes page tables
for a given one-off task, yeah. But this would be something a bit different,
e.g. a kernel thread that does the sum of what khugepaged/ksm/etc do. Numa
balancing uses task_work instead of a kthread, so that would require
consideration of which mechanism the unified daemon would use.

>>
>>


2022-05-10 22:53:21

by Yang Shi

[permalink] [raw]
Subject: Re: [v3 PATCH 0/8] Make khugepaged collapse readonly FS THP more consistent

On Tue, May 10, 2022 at 12:35 AM Vlastimil Babka <[email protected]> wrote:
>
> On 5/9/22 22:34, Yang Shi wrote:
> > On Mon, May 9, 2022 at 9:05 AM Vlastimil Babka <[email protected]> wrote:
> >>
> >> On 4/4/22 22:02, Yang Shi wrote:
> >> > include/linux/huge_mm.h | 14 ++++++++++++
> >> > include/linux/khugepaged.h | 59 ++++++++++++---------------------------------------
> >> > include/linux/sched/coredump.h | 3 ++-
> >> > kernel/fork.c | 4 +---
> >> > mm/huge_memory.c | 15 ++++---------
> >> > mm/khugepaged.c | 76 +++++++++++++++++++++++++++++++++++++-----------------------------
> >> > mm/mmap.c | 14 ++++++++----
> >> > mm/shmem.c | 12 -----------
> >> > 8 files changed, 88 insertions(+), 109 deletions(-)
> >>
> >> Resending my general feedback from mm-commits thread to include the
> >> public ML's:
> >>
> >> There are modestly fewer lines in the end, some duplicate code removed,
> >> special casing in shmem.c removed, that's all good as it is. Also patch 8/8
> >> became quite boring in v3, no need to change individual filesystems and also
> >> no hook in the fault path, just the common mmap path. So I would just handle
> >> patch 6 differently as I just replied to it, and acked the rest.
> >>
> >> That said, it's still unfortunately rather a mess of functions that have
> >> similar names: transhuge_vma_enabled(vma), hugepage_vma_check(vma),
> >> transparent_hugepage_active(vma), transhuge_vma_suitable(vma, addr)?
> >> So maybe still some space for further cleanups. But the series is fine as it
> >> is so we don't have to wait for it now.
> >
> > Yeah, I agree that we do have a lot of THP checks. Will find some time to
> > look into it deeper later.
>
> Thanks.
>
> >>
> >> We could also consider that the tracking of which mm is to be scanned is
> >> modelled after ksm which has its own madvise flag, but also no "always"
> >> mode. What if for THP we only tracked actual THP madvised mm's, and in
> >> "always" mode just scanned all vm's, would that allow ripping out some code
> >> perhaps, while not adding too many unnecessary scans? If some processes are
> >
> > Do you mean add all mm(s) to the scan list unconditionally? I don't
> > think it will scale.
>
> It might be interesting to find out how many mm's (percentage of all mm's)
> are typically in the list with "always" enabled. I wouldn't be surprised if
> it was nearly all of them. Having at least one large enough anonymous area
> sounds like something all processes would have these days?

Just did a simple test with "always" on my Fedora VM, which is almost idle.

The number of user processes (excluding kernel threads) is:
[yangs@localhost ~]$ ps -aux --no-headers | awk '{if ($5 > 0) print $5}' | wc -l
113

The number of mm(s) on the khugepaged scan list counted by a simple drgn
script is 56. The drgn script is below (list_for_each() is drgn's Linux
list helper, available by default in the drgn CLI):
>>> i = 0
>>> mm_list = prog['khugepaged_scan'].mm_head.address_of_()
>>> for mm in list_for_each(mm_list):
...     i = i + 1
...
>>> print(i)
56

So ~50% of processes are on the list. For busy machines, the percentage
may be higher. And most processes with large enough anon mappings
should be on the list.

>
> >> being scanned without any effect, maybe track success separately, and scan
> >> them less frequently etc. That could be ultimately more efficient than
> >> painfully tracking just *eligibility* for scanning in "always" mode?
> >
> > Sounds like we need a couple of different lists, for example, inactive
> > and active? And promote or demote mm(s) between the two lists? TBH I
> > don't see too many benefits at the moment. Or did I misunderstand you?
>
> Yeah, something like that. It would of course require finding out whether
> khugepaged is consuming too much cpu uselessly these days while not
> processing fast enough mm's where it succeeds more.

Yeah, currently we don't have such statistics at mm or vma
granularity. But we may be able to get some stats by post-processing
trace events.

>
> >>
> >> Even more radical thing to consider (maybe that's a LSF/MM level topic, too
> >> bad :) is that we scan pagetables in ksm, khugepaged, numa balancing, soon
> >> in MGLRU, and I probably forgot something else. Maybe time to think about
> >> unifying those scanners?
> >
> > We do have pagewalk (walk_page_range()) which is used by a couple of
> > mm features, for example mlock, mempolicy, mprotect, etc. I'm not sure
> > whether it is feasible for khugepaged, ksm, etc, or not since I didn't
> > look that hard. But I agree it should be worth looking at.
>
> pagewalk is a framework to simplify writing code that processes page tables
> for a given one-off task, yeah. But this would be something a bit different,
> e.g. a kernel thread that does the sum of what khugepaged/ksm/etc do. Numa
> balancing uses task_work instead of a kthread, so that would require
> consideration of which mechanism the unified daemon would use.

Aha, OK, you mean consolidating khugepaged, ksmd, etc. into one kthread.
TBH I don't know whether that will work out or not. Maybe the first
step is to use the same page table walk framework for all of them?

>
> >>
> >>
>