2023-07-12 06:07:21

by Yin, Fengwei

Subject: [RFC PATCH v2 0/3] support large folio for mlock

Yu mentioned at [1] that mlock() can't be applied to large folios.

I learned the related code and here is my understanding:
- For RLIMIT_MEMLOCK, there is no problem, because the RLIMIT_MEMLOCK
accounting is not tied to the underlying pages. Whether the underlying
pages are mlocked or munlocked doesn't affect the RLIMIT_MEMLOCK
accounting, which is always correct.

- For keeping the pages in RAM, there is no problem either. At least,
during try_to_unmap_one(), once the VMA is detected to have the
VM_LOCKED bit set in vm_flags, the folio is kept regardless of
whether it is mlocked or not.
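
Roughly, the relevant check looks like this (a simplified sketch, not a
verbatim quote of mm/rmap.c):

	if (!(flags & TTU_IGNORE_MLOCK) &&
	    (vma->vm_flags & VM_LOCKED)) {
		/* Restore the mlock which got missed */
		...
		page_vma_mapped_walk_done(&pvmw);
		ret = false;	/* don't unmap; the folio stays in RAM */
		break;
	}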

So mlock for large folios is functional. But it's not optimal, because
page reclaim still needs to scan these large folios and may split
them.

This series classifies large folios for mlock into two types:
- The large folio is fully within a VM_LOCKED VMA range
- The large folio crosses a VM_LOCKED VMA boundary

For the first type, we mlock the large folio so page reclaim will skip it.
For the second type, we don't mlock the large folio. It's allowed to be
picked by page reclaim and split, so the pages not in the VM_LOCKED VMA
range can be reclaimed/released.
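
In pseudo-C, the intended handling is roughly the following (a minimal
sketch only; the real helpers are introduced by the patches below):

	if (folio_within_vma(folio, vma)) {
		/* first type: fully inside the VM_LOCKED VMA */
		mlock_folio(folio);	/* page reclaim will skip it */
	} else {
		/*
		 * second type: crosses the VMA boundary. Leave it
		 * un-mlocked so page reclaim may split it and reclaim
		 * the pages outside the VM_LOCKED VMA range.
		 */
	}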

patch1 introduces an API to check whether a large folio is in a VMA range.
patch2 makes page reclaim/mlock_vma_folio()/munlock_vma_folio() support
large folio mlock/munlock.
patch3 makes the mlock/munlock syscalls support large folios.

testing done:
- kernel selftests: no extra failures introduced

Matthew commented on v1 that the large folio should be split if it
crosses VMA boundaries. But there is no obviously correct way to
handle split failure, and it's a common issue for mprotect, mlock,
mremap, munmap, etc.

So I keep the v1 behavior (don't split the folio if it crosses VMA
boundaries) in v2.

[1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/

Changes from v1:
patch1:
- Add new function folio_within_vma() based on folio_in_range()
per Yu's suggestion

patch2:
- Update folio_referenced_one() to skip the entries which are
out of the VM_LOCKED VMA range if the large folio crosses VMA
boundaries, per Yu's suggestion

patch3:
- Simplify the changes in mlock_pte_range() by introducing two
helper functions, should_mlock_folio() and get_folio_mlock_step(),
per Yu's suggestion


Yin Fengwei (3):
mm: add functions folio_in_range() and folio_within_vma()
mm: handle large folio when large folio in VM_LOCKED VMA range
mm: mlock: update mlock_pte_range to handle large folio

mm/internal.h | 43 +++++++++++++++++++--
mm/mlock.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++---
mm/rmap.c | 34 +++++++++++++----
3 files changed, 166 insertions(+), 15 deletions(-)

--
2.39.2



2023-07-12 06:07:30

by Yin, Fengwei

Subject: [RFC PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma()

folio_in_range() will be used to check whether a folio is mapped to a
specific VMA and whether the mapping address of the folio is in the
given range.

Also add a helper function folio_within_vma(), based on folio_in_range(),
to check whether a folio is within the range of a VMA.
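
A minimal usage sketch (the call sites here are illustrative only; the
real callers are added in patch2 and patch3):

	/* Is the folio mapped entirely inside [start, end) of this VMA? */
	if (folio_in_range(folio, vma, start, end))
		/* the whole folio is covered by the range */;

	/* Most callers only care about the whole VMA */
	if (folio_within_vma(folio, vma))
		mlock_folio(folio);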

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/internal.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index 483add0bfb289..c7dd15d8de3ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -585,6 +585,38 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
bool write, int *locked);
extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
unsigned long bytes);
+
+static inline bool
+folio_in_range(struct folio *folio, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ pgoff_t pgoff, addr;
+ unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+ VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
+ if (start < vma->vm_start)
+ start = vma->vm_start;
+
+ if (end > vma->vm_end)
+ end = vma->vm_end;
+
+ pgoff = folio_pgoff(folio);
+
+ /* if folio start address is not in vma range */
+ if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
+ return false;
+
+ addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+
+ return ((addr >= start) && (addr + folio_size(folio) <= end));
+}
+
+static inline bool
+folio_within_vma(struct folio *folio, struct vm_area_struct *vma)
+{
+ return folio_in_range(folio, vma, vma->vm_start, vma->vm_end);
+}
+
/*
* mlock_vma_folio() and munlock_vma_folio():
* should be called with vma's mmap_lock held for read or write,
--
2.39.2


2023-07-12 06:07:34

by Yin, Fengwei

Subject: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

The current kernel only mlocks base-size folios during the mlock
syscall. Add large folio support with the following rules:
- Only mlock a large folio when it's fully within the VM_LOCKED VMA
range.

- If there is a COW folio, mlock the COW folio as the COW folio is
also in the VM_LOCKED VMA range.

- munlock applies to a large folio which is within the VMA range or
crosses the VMA boundary.

The last rule handles the case where the large folio is mlocked, the
VMA is later split in the middle of the large folio, and the large
folio now crosses a VMA boundary.
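
A hypothetical userspace sequence showing that case (the sizes and the
mprotect() split point are made up for illustration):

	#include <sys/mman.h>

	char *p = mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mlock(p, 4UL << 20);		/* large folios here get mlocked */
	/* split the VMA, possibly in the middle of a large folio */
	mprotect(p + (1UL << 20), 4096, PROT_READ);
	munlock(p, 4UL << 20);		/* must still munlock such folios */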

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 99 insertions(+), 5 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c214..f49e079066870 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
local_unlock(&mlock_fbatch.lock);
}

+static inline bool should_mlock_folio(struct folio *folio,
+ struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_LOCKED)
+ return (!folio_test_large(folio) ||
+ folio_within_vma(folio, vma));
+
+ /*
+ * For unlock, allow munlock large folio which is partially
+ * mapped to VMA. As it's possible that large folio is
+ * mlocked and VMA is split later.
+ *
+ * During memory pressure, such kind of large folio can
+ * be split. And the pages are not in VM_LOCKed VMA
+ * can be reclaimed.
+ */
+
+ return true;
+}
+
+static inline unsigned int get_folio_mlock_step(struct folio *folio,
+ pte_t pte, unsigned long addr, unsigned long end)
+{
+ unsigned int nr;
+
+ nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
+ return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
+}
+
+void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
+ pte_t *pte, unsigned long addr, unsigned int nr)
+{
+ struct folio *cow_folio;
+ unsigned int step = 1;
+
+ mlock_folio(folio);
+ if (nr == 1)
+ return;
+
+ for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
+ pte_t ptent;
+
+ step = 1;
+ ptent = ptep_get(pte);
+
+ if (!pte_present(ptent))
+ continue;
+
+ cow_folio = vm_normal_folio(vma, addr, ptent);
+ if (!cow_folio || cow_folio == folio) {
+ continue;
+ }
+
+ mlock_folio(cow_folio);
+ step = get_folio_mlock_step(folio, ptent,
+ addr, addr + (nr << PAGE_SHIFT));
+ }
+}
+
+void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
+ pte_t *pte, unsigned long addr, unsigned int nr)
+{
+ struct folio *cow_folio;
+ unsigned int step = 1;
+
+ munlock_folio(folio);
+ if (nr == 1)
+ return;
+
+ for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
+ pte_t ptent;
+
+ step = 1;
+ ptent = ptep_get(pte);
+
+ if (!pte_present(ptent))
+ continue;
+
+ cow_folio = vm_normal_folio(vma, addr, ptent);
+ if (!cow_folio || cow_folio == folio) {
+ continue;
+ }
+
+ munlock_folio(cow_folio);
+ step = get_folio_mlock_step(folio, ptent,
+ addr, addr + (nr << PAGE_SHIFT));
+ }
+}
+
static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)

@@ -314,6 +403,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
pte_t *start_pte, *pte;
pte_t ptent;
struct folio *folio;
+ unsigned int step = 1;

ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
@@ -329,24 +419,28 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
goto out;
}

- start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ pte = start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (!start_pte) {
walk->action = ACTION_AGAIN;
return 0;
}
- for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
+
+ for (; addr != end; pte += step, addr += (step << PAGE_SHIFT)) {
+ step = 1;
ptent = ptep_get(pte);
if (!pte_present(ptent))
continue;
folio = vm_normal_folio(vma, addr, ptent);
if (!folio || folio_is_zone_device(folio))
continue;
- if (folio_test_large(folio))
+ if (!should_mlock_folio(folio, vma))
continue;
+
+ step = get_folio_mlock_step(folio, ptent, addr, end);
if (vma->vm_flags & VM_LOCKED)
- mlock_folio(folio);
+ mlock_folio_range(folio, vma, pte, addr, step);
else
- munlock_folio(folio);
+ munlock_folio_range(folio, vma, pte, addr, step);
}
pte_unmap(start_pte);
out:
--
2.39.2


2023-07-12 06:22:34

by Yin, Fengwei

Subject: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

If a large folio is within the range of a VM_LOCKED VMA, it should be
mlocked to avoid being picked by page reclaim, which may split the
large folio and then mlock each page again.

Mlock this kind of large folio to prevent it from being picked by
page reclaim.

For a large folio which crosses the boundary of a VM_LOCKED VMA,
we'd better not mlock it. Then, if the system is under memory
pressure, this kind of large folio will be split and the pages
outside the VM_LOCKED VMA can be reclaimed.

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/internal.h | 11 ++++++++---
mm/rmap.c | 34 +++++++++++++++++++++++++++-------
2 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c7dd15d8de3ef..776141de2797a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
* still be set while VM_SPECIAL bits are added: so ignore it then.
*/
if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
- (compound || !folio_test_large(folio)))
+ (compound || !folio_test_large(folio) ||
+ folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
mlock_folio(folio);
}

@@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
static inline void munlock_vma_folio(struct folio *folio,
struct vm_area_struct *vma, bool compound)
{
- if (unlikely(vma->vm_flags & VM_LOCKED) &&
- (compound || !folio_test_large(folio)))
+ /*
+ * To handle the case that a mlocked large folio is unmapped from VMA
+ * piece by piece, allow munlock the large folio which is partially
+ * mapped to VMA.
+ */
+ if (unlikely(vma->vm_flags & VM_LOCKED))
munlock_folio(folio);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index 2668f5ea35342..455f415d8d9ca 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -803,6 +803,14 @@ struct folio_referenced_arg {
unsigned long vm_flags;
struct mem_cgroup *memcg;
};
+
+static inline bool should_restore_mlock(struct folio *folio,
+ struct vm_area_struct *vma, bool pmd_mapped)
+{
+ return !folio_test_large(folio) ||
+ pmd_mapped || folio_within_vma(folio, vma);
+}
+
/*
* arg: folio_referenced_arg will be passed
*/
@@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
while (page_vma_mapped_walk(&pvmw)) {
address = pvmw.address;

- if ((vma->vm_flags & VM_LOCKED) &&
- (!folio_test_large(folio) || !pvmw.pte)) {
- /* Restore the mlock which got missed */
- mlock_vma_folio(folio, vma, !pvmw.pte);
- page_vma_mapped_walk_done(&pvmw);
- pra->vm_flags |= VM_LOCKED;
- return false; /* To break the loop */
+ if (vma->vm_flags & VM_LOCKED) {
+ if (should_restore_mlock(folio, vma, !pvmw.pte)) {
+ /* Restore the mlock which got missed */
+ mlock_vma_folio(folio, vma, !pvmw.pte);
+ page_vma_mapped_walk_done(&pvmw);
+ pra->vm_flags |= VM_LOCKED;
+ return false; /* To break the loop */
+ } else {
+ /*
+ * For large folio cross VMA boundaries, it's
+ * expected to be picked by page reclaim. But
+ * should skip reference of pages which are in
+ * the range of VM_LOCKED vma. As page reclaim
+ * should just count the reference of pages out
+ * the range of VM_LOCKED vma.
+ */
+ pra->mapcount--;
+ continue;
+ }
}

if (pvmw.pte) {
--
2.39.2


2023-07-12 06:30:57

by Yu Zhao

Subject: Re: [RFC PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma()

On Wed, Jul 12, 2023 at 12:01 AM Yin Fengwei <[email protected]> wrote:
>
> It will be used to check whether the folio is mapped to specific
> VMA and whether the mapping address of folio is in the range.
>
> Also a helper function folio_within_vma() to check whether folio
> is in the range of vma based on folio_in_range().
>
> Signed-off-by: Yin Fengwei <[email protected]>

Reviewed-by: Yu Zhao <[email protected]>

2023-07-12 06:32:35

by Yu Zhao

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>
> If large folio is in the range of VM_LOCKED VMA, it should be
> mlocked to avoid being picked by page reclaim. Which may split
> the large folio and then mlock each pages again.
>
> Mlock this kind of large folio to prevent them being picked by
> page reclaim.
>
> For the large folio which cross the boundary of VM_LOCKED VMA,
> we'd better not to mlock it. So if the system is under memory
> pressure, this kind of large folio will be split and the pages
> ouf of VM_LOCKED VMA can be reclaimed.
>
> Signed-off-by: Yin Fengwei <[email protected]>
> ---
> mm/internal.h | 11 ++++++++---
> mm/rmap.c | 34 +++++++++++++++++++++++++++-------
> 2 files changed, 35 insertions(+), 10 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c7dd15d8de3ef..776141de2797a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
> * still be set while VM_SPECIAL bits are added: so ignore it then.
> */
> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> - (compound || !folio_test_large(folio)))
> + (compound || !folio_test_large(folio) ||
> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
> mlock_folio(folio);
> }

This can be simplified:
1. remove the compound parameter
2. make the if
if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
folio_within_vma())
mlock_folio(folio);

> @@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
> static inline void munlock_vma_folio(struct folio *folio,
> struct vm_area_struct *vma, bool compound)

Remove the compound parameter here too.

> {
> - if (unlikely(vma->vm_flags & VM_LOCKED) &&
> - (compound || !folio_test_large(folio)))
> + /*
> + * To handle the case that a mlocked large folio is unmapped from VMA
> + * piece by piece, allow munlock the large folio which is partially
> + * mapped to VMA.
> + */
> + if (unlikely(vma->vm_flags & VM_LOCKED))
> munlock_folio(folio);
> }
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2668f5ea35342..455f415d8d9ca 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -803,6 +803,14 @@ struct folio_referenced_arg {
> unsigned long vm_flags;
> struct mem_cgroup *memcg;
> };
> +
> +static inline bool should_restore_mlock(struct folio *folio,
> + struct vm_area_struct *vma, bool pmd_mapped)
> +{
> + return !folio_test_large(folio) ||
> + pmd_mapped || folio_within_vma(folio, vma);
> +}

This is just folio_within_vma() :)

> /*
> * arg: folio_referenced_arg will be passed
> */
> @@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
> while (page_vma_mapped_walk(&pvmw)) {
> address = pvmw.address;
>
> - if ((vma->vm_flags & VM_LOCKED) &&
> - (!folio_test_large(folio) || !pvmw.pte)) {
> - /* Restore the mlock which got missed */
> - mlock_vma_folio(folio, vma, !pvmw.pte);
> - page_vma_mapped_walk_done(&pvmw);
> - pra->vm_flags |= VM_LOCKED;
> - return false; /* To break the loop */
> + if (vma->vm_flags & VM_LOCKED) {
> + if (should_restore_mlock(folio, vma, !pvmw.pte)) {
> + /* Restore the mlock which got missed */
> + mlock_vma_folio(folio, vma, !pvmw.pte);
> + page_vma_mapped_walk_done(&pvmw);
> + pra->vm_flags |= VM_LOCKED;
> + return false; /* To break the loop */
> + } else {

There is no need for "else", or just

if (!folio_within_vma())
goto dec_pra_mapcount;

> + /*
> + * For large folio cross VMA boundaries, it's
> + * expected to be picked by page reclaim. But
> + * should skip reference of pages which are in
> + * the range of VM_LOCKED vma. As page reclaim
> + * should just count the reference of pages out
> + * the range of VM_LOCKED vma.
> + */
> + pra->mapcount--;
> + continue;
> + }
> }

2023-07-12 06:51:25

by Yu Zhao

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>
> Current kernel only lock base size folio during mlock syscall.
> Add large folio support with following rules:
> - Only mlock large folio when it's in VM_LOCKED VMA range
>
> - If there is cow folio, mlock the cow folio as cow folio
> is also in VM_LOCKED VMA range.
>
> - munlock will apply to the large folio which is in VMA range
> or cross the VMA boundary.
>
> The last rule is used to handle the case that the large folio is
> mlocked, later the VMA is split in the middle of large folio
> and this large folio become cross VMA boundary.
>
> Signed-off-by: Yin Fengwei <[email protected]>
> ---
> mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 99 insertions(+), 5 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0a0c996c5c214..f49e079066870 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
> local_unlock(&mlock_fbatch.lock);
> }
>
> +static inline bool should_mlock_folio(struct folio *folio,
> + struct vm_area_struct *vma)
> +{
> + if (vma->vm_flags & VM_LOCKED)
> + return (!folio_test_large(folio) ||
> + folio_within_vma(folio, vma));
> +
> + /*
> + * For unlock, allow munlock large folio which is partially
> + * mapped to VMA. As it's possible that large folio is
> + * mlocked and VMA is split later.
> + *
> + * During memory pressure, such kind of large folio can
> + * be split. And the pages are not in VM_LOCKed VMA
> + * can be reclaimed.
> + */
> +
> + return true;

Looks good, or just

should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
{
return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
}

> +}
> +
> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
> + pte_t pte, unsigned long addr, unsigned long end)
> +{
> + unsigned int nr;
> +
> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> +}
> +
> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> + pte_t *pte, unsigned long addr, unsigned int nr)
> +{
> + struct folio *cow_folio;
> + unsigned int step = 1;
> +
> + mlock_folio(folio);
> + if (nr == 1)
> + return;
> +
> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> + pte_t ptent;
> +
> + step = 1;
> + ptent = ptep_get(pte);
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + cow_folio = vm_normal_folio(vma, addr, ptent);
> + if (!cow_folio || cow_folio == folio) {
> + continue;
> + }
> +
> + mlock_folio(cow_folio);
> + step = get_folio_mlock_step(folio, ptent,
> + addr, addr + (nr << PAGE_SHIFT));
> + }
> +}
> +
> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> + pte_t *pte, unsigned long addr, unsigned int nr)
> +{
> + struct folio *cow_folio;
> + unsigned int step = 1;
> +
> + munlock_folio(folio);
> + if (nr == 1)
> + return;
> +
> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> + pte_t ptent;
> +
> + step = 1;
> + ptent = ptep_get(pte);
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + cow_folio = vm_normal_folio(vma, addr, ptent);
> + if (!cow_folio || cow_folio == folio) {
> + continue;
> + }
> +
> + munlock_folio(cow_folio);
> + step = get_folio_mlock_step(folio, ptent,
> + addr, addr + (nr << PAGE_SHIFT));
> + }
> +}

I'll finish the above later.

> static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end, struct mm_walk *walk)
>
> @@ -314,6 +403,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> pte_t *start_pte, *pte;
> pte_t ptent;
> struct folio *folio;
> + unsigned int step = 1;
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> @@ -329,24 +419,28 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> goto out;
> }
>
> - start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> + pte = start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> if (!start_pte) {
> walk->action = ACTION_AGAIN;
> return 0;
> }
> - for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> +
> + for (; addr != end; pte += step, addr += (step << PAGE_SHIFT)) {
> + step = 1;
> ptent = ptep_get(pte);
> if (!pte_present(ptent))
> continue;
> folio = vm_normal_folio(vma, addr, ptent);
> if (!folio || folio_is_zone_device(folio))
> continue;
> - if (folio_test_large(folio))
> + if (!should_mlock_folio(folio, vma))
> continue;
> +
> + step = get_folio_mlock_step(folio, ptent, addr, end);
> if (vma->vm_flags & VM_LOCKED)
> - mlock_folio(folio);
> + mlock_folio_range(folio, vma, pte, addr, step);
> else
> - munlock_folio(folio);
> + munlock_folio_range(folio, vma, pte, addr, step);
> }
> pte_unmap(start_pte);
> out:

Looks good.

2023-07-12 07:24:32

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range



On 7/12/23 14:23, Yu Zhao wrote:
> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>
>> If large folio is in the range of VM_LOCKED VMA, it should be
>> mlocked to avoid being picked by page reclaim. Which may split
>> the large folio and then mlock each pages again.
>>
>> Mlock this kind of large folio to prevent them being picked by
>> page reclaim.
>>
>> For the large folio which cross the boundary of VM_LOCKED VMA,
>> we'd better not to mlock it. So if the system is under memory
>> pressure, this kind of large folio will be split and the pages
>> ouf of VM_LOCKED VMA can be reclaimed.
>>
>> Signed-off-by: Yin Fengwei <[email protected]>
>> ---
>> mm/internal.h | 11 ++++++++---
>> mm/rmap.c | 34 +++++++++++++++++++++++++++-------
>> 2 files changed, 35 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index c7dd15d8de3ef..776141de2797a 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
>> * still be set while VM_SPECIAL bits are added: so ignore it then.
>> */
>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>> - (compound || !folio_test_large(folio)))
>> + (compound || !folio_test_large(folio) ||
>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
>> mlock_folio(folio);
>> }
>
> This can be simplified:
> 1. remove the compound parameter
Yes. There is no difference here between PMD mappings of THPs and PTE mappings
of THPs if the only condition that needs checking is whether the folio is
within the VMA range or not.

But let me add Hugh for confirmation.


> 2. make the if
> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> folio_within_vma())
> mlock_folio(folio);
!folio_test_large(folio) was kept here on purpose. For a normal 4K page, there
is no need to call folio_within_vma(), which is heavy for a normal 4K page.


>
>> @@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
>> static inline void munlock_vma_folio(struct folio *folio,
>> struct vm_area_struct *vma, bool compound)
>
> Remove the compound parameter here too.
>
>> {
>> - if (unlikely(vma->vm_flags & VM_LOCKED) &&
>> - (compound || !folio_test_large(folio)))
>> + /*
>> + * To handle the case that a mlocked large folio is unmapped from VMA
>> + * piece by piece, allow munlock the large folio which is partially
>> + * mapped to VMA.
>> + */
>> + if (unlikely(vma->vm_flags & VM_LOCKED))
>> munlock_folio(folio);
>> }
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2668f5ea35342..455f415d8d9ca 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -803,6 +803,14 @@ struct folio_referenced_arg {
>> unsigned long vm_flags;
>> struct mem_cgroup *memcg;
>> };
>> +
>> +static inline bool should_restore_mlock(struct folio *folio,
>> + struct vm_area_struct *vma, bool pmd_mapped)
>> +{
>> + return !folio_test_large(folio) ||
>> + pmd_mapped || folio_within_vma(folio, vma);
>> +}
>
> This is just folio_within_vma() :)
>
>> /*
>> * arg: folio_referenced_arg will be passed
>> */
>> @@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
>> while (page_vma_mapped_walk(&pvmw)) {
>> address = pvmw.address;
>>
>> - if ((vma->vm_flags & VM_LOCKED) &&
>> - (!folio_test_large(folio) || !pvmw.pte)) {
>> - /* Restore the mlock which got missed */
>> - mlock_vma_folio(folio, vma, !pvmw.pte);
>> - page_vma_mapped_walk_done(&pvmw);
>> - pra->vm_flags |= VM_LOCKED;
>> - return false; /* To break the loop */
>> + if (vma->vm_flags & VM_LOCKED) {
>> + if (should_restore_mlock(folio, vma, !pvmw.pte)) {
>> + /* Restore the mlock which got missed */
>> + mlock_vma_folio(folio, vma, !pvmw.pte);
>> + page_vma_mapped_walk_done(&pvmw);
>> + pra->vm_flags |= VM_LOCKED;
>> + return false; /* To break the loop */
>> + } else {
>
> There is no need for "else", or just
>
> if (!folio_within_vma())
> goto dec_pra_mapcount;
I tried not to use goto as much as possible. I suppose you mean:

if (!should_restore_lock())
goto dec_pra_mapcount; (I may use continue here. :)).

mlock_vma_folio();
page_vma_mapped_walk_done()
...

Right?


Regards
Yin, Fengwei

>
>> + /*
>> + * For large folio cross VMA boundaries, it's
>> + * expected to be picked by page reclaim. But
>> + * should skip reference of pages which are in
>> + * the range of VM_LOCKED vma. As page reclaim
>> + * should just count the reference of pages out
>> + * the range of VM_LOCKED vma.
>> + */
>> + pra->mapcount--;
>> + continue;
>> + }
>> }

2023-07-12 17:45:34

by Yu Zhao

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

On Wed, Jul 12, 2023 at 12:44 AM Yin Fengwei <[email protected]> wrote:
>
>
> On 7/12/23 14:23, Yu Zhao wrote:
> > On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >>
> >> If large folio is in the range of VM_LOCKED VMA, it should be
> >> mlocked to avoid being picked by page reclaim. Which may split
> >> the large folio and then mlock each pages again.
> >>
> >> Mlock this kind of large folio to prevent them being picked by
> >> page reclaim.
> >>
> >> For the large folio which cross the boundary of VM_LOCKED VMA,
> >> we'd better not to mlock it. So if the system is under memory
> >> pressure, this kind of large folio will be split and the pages
> >> ouf of VM_LOCKED VMA can be reclaimed.
> >>
> >> Signed-off-by: Yin Fengwei <[email protected]>
> >> ---
> >> mm/internal.h | 11 ++++++++---
> >> mm/rmap.c | 34 +++++++++++++++++++++++++++-------
> >> 2 files changed, 35 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index c7dd15d8de3ef..776141de2797a 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
> >> * still be set while VM_SPECIAL bits are added: so ignore it then.
> >> */
> >> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> >> - (compound || !folio_test_large(folio)))
> >> + (compound || !folio_test_large(folio) ||
> >> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
> >> mlock_folio(folio);
> >> }
> >
> > This can be simplified:
> > 1. remove the compound parameter
> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
> if the only condition need check is whether the folio is within VMA range or not.
>
> But let me add Huge for confirmation.
>
>
> > 2. make the if
> > if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> > folio_within_vma())
> > mlock_folio(folio);
> !folio_test_large(folio) was kept here by purpose. For normal 4K page, don't need
> to call folio_within_vma() which is heavy for normal 4K page.

I suspected you would think so -- I don't think it would make any
measurable difference (for systems with mostly large folios, it would
actually be extra work). Since we have many places like this one,
probably we could wrap folio_test_large() into folio_within_vma() and
call it large_folio_within_vma(), if you feel it's necessary.

> >> @@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
> >> static inline void munlock_vma_folio(struct folio *folio,
> >> struct vm_area_struct *vma, bool compound)
> >
> > Remove the compound parameter here too.
> >
> >> {
> >> - if (unlikely(vma->vm_flags & VM_LOCKED) &&
> >> - (compound || !folio_test_large(folio)))
> >> + /*
> >> + * To handle the case that a mlocked large folio is unmapped from VMA
> >> + * piece by piece, allow munlock the large folio which is partially
> >> + * mapped to VMA.
> >> + */
> >> + if (unlikely(vma->vm_flags & VM_LOCKED))
> >> munlock_folio(folio);
> >> }
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 2668f5ea35342..455f415d8d9ca 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -803,6 +803,14 @@ struct folio_referenced_arg {
> >> unsigned long vm_flags;
> >> struct mem_cgroup *memcg;
> >> };
> >> +
> >> +static inline bool should_restore_mlock(struct folio *folio,
> >> + struct vm_area_struct *vma, bool pmd_mapped)
> >> +{
> >> + return !folio_test_large(folio) ||
> >> + pmd_mapped || folio_within_vma(folio, vma);
> >> +}
> >
> > This is just folio_within_vma() :)
> >
> >> /*
> >> * arg: folio_referenced_arg will be passed
> >> */
> >> @@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
> >> while (page_vma_mapped_walk(&pvmw)) {
> >> address = pvmw.address;
> >>
> >> - if ((vma->vm_flags & VM_LOCKED) &&
> >> - (!folio_test_large(folio) || !pvmw.pte)) {
> >> - /* Restore the mlock which got missed */
> >> - mlock_vma_folio(folio, vma, !pvmw.pte);
> >> - page_vma_mapped_walk_done(&pvmw);
> >> - pra->vm_flags |= VM_LOCKED;
> >> - return false; /* To break the loop */
> >> + if (vma->vm_flags & VM_LOCKED) {
> >> + if (should_restore_mlock(folio, vma, !pvmw.pte)) {
> >> + /* Restore the mlock which got missed */
> >> + mlock_vma_folio(folio, vma, !pvmw.pte);
> >> + page_vma_mapped_walk_done(&pvmw);
> >> + pra->vm_flags |= VM_LOCKED;
> >> + return false; /* To break the loop */
> >> + } else {
> >
> > There is no need for "else", or just
> >
> > if (!folio_within_vma())
> > goto dec_pra_mapcount;
> I tried not to use goto as much as possible. I suppose you mean:
>
> if (!should_restore_lock())
> goto dec_pra_mapcount; (I may use continue here. :)).

should_restore_lock() is just folio_within_vma() -- see the comment
above. "continue" looks good to me too (prefer not to add more indents
to the functions below).

> mlock_vma_folio();
> page_vma_mapped_walk_done()
> ...
>
> Right?

Right.

2023-07-13 02:15:52

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range



On 7/13/23 01:03, Yu Zhao wrote:
> On Wed, Jul 12, 2023 at 12:44 AM Yin Fengwei <[email protected]> wrote:
>>
>>
>> On 7/12/23 14:23, Yu Zhao wrote:
>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>>
>>>> If large folio is in the range of VM_LOCKED VMA, it should be
>>>> mlocked to avoid being picked by page reclaim. Which may split
>>>> the large folio and then mlock each pages again.
>>>>
>>>> Mlock this kind of large folio to prevent them being picked by
>>>> page reclaim.
>>>>
>>>> For the large folio which cross the boundary of VM_LOCKED VMA,
>>>> we'd better not to mlock it. So if the system is under memory
>>>> pressure, this kind of large folio will be split and the pages
>>>> ouf of VM_LOCKED VMA can be reclaimed.
>>>>
>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>> ---
>>>> mm/internal.h | 11 ++++++++---
>>>> mm/rmap.c | 34 +++++++++++++++++++++++++++-------
>>>> 2 files changed, 35 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index c7dd15d8de3ef..776141de2797a 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
>>>> * still be set while VM_SPECIAL bits are added: so ignore it then.
>>>> */
>>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>>>> - (compound || !folio_test_large(folio)))
>>>> + (compound || !folio_test_large(folio) ||
>>>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
>>>> mlock_folio(folio);
>>>> }
>>>
>>> This can be simplified:
>>> 1. remove the compound parameter
>> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
>> if the only condition need check is whether the folio is within VMA range or not.
>>
>> But let me add Huge for confirmation.
>>
>>
>>> 2. make the if
>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>>> folio_within_vma())
>>> mlock_folio(folio);
>> !folio_test_large(folio) was kept here by purpose. For normal 4K page, don't need
>> to call folio_within_vma() which is heavy for normal 4K page.
>
> I suspected you would think so -- I don't think it would make any
> measurable (for systems with mostly large folios, it would actually be
> an extra work). Since we have many places like this once, probably we
> could wrap folio_test_large() into folio_within_vma() and call it
> large_folio_within_vma(), if you feel it's necessary.
I thought about moving folio_test_large() into folio_in_range(). But I gave
it up because of the check of the folio address against the VMA range.

But with the new folio_within_vma(), we could do that. I will move
folio_test_large() into folio_within_vma() (keeping the current naming) and
make it like:

return !folio_test_large() || folio_in_range();

>
>>>> @@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
>>>> static inline void munlock_vma_folio(struct folio *folio,
>>>> struct vm_area_struct *vma, bool compound)
>>>
>>> Remove the compound parameter here too.
>>>
>>>> {
>>>> - if (unlikely(vma->vm_flags & VM_LOCKED) &&
>>>> - (compound || !folio_test_large(folio)))
>>>> + /*
>>>> + * To handle the case that a mlocked large folio is unmapped from VMA
>>>> + * piece by piece, allow munlock the large folio which is partially
>>>> + * mapped to VMA.
>>>> + */
>>>> + if (unlikely(vma->vm_flags & VM_LOCKED))
>>>> munlock_folio(folio);
>>>> }
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 2668f5ea35342..455f415d8d9ca 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -803,6 +803,14 @@ struct folio_referenced_arg {
>>>> unsigned long vm_flags;
>>>> struct mem_cgroup *memcg;
>>>> };
>>>> +
>>>> +static inline bool should_restore_mlock(struct folio *folio,
>>>> + struct vm_area_struct *vma, bool pmd_mapped)
>>>> +{
>>>> + return !folio_test_large(folio) ||
>>>> + pmd_mapped || folio_within_vma(folio, vma);
>>>> +}
>>>
>>> This is just folio_within_vma() :)
>>>
>>>> /*
>>>> * arg: folio_referenced_arg will be passed
>>>> */
>>>> @@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>> address = pvmw.address;
>>>>
>>>> - if ((vma->vm_flags & VM_LOCKED) &&
>>>> - (!folio_test_large(folio) || !pvmw.pte)) {
>>>> - /* Restore the mlock which got missed */
>>>> - mlock_vma_folio(folio, vma, !pvmw.pte);
>>>> - page_vma_mapped_walk_done(&pvmw);
>>>> - pra->vm_flags |= VM_LOCKED;
>>>> - return false; /* To break the loop */
>>>> + if (vma->vm_flags & VM_LOCKED) {
>>>> + if (should_restore_mlock(folio, vma, !pvmw.pte)) {
>>>> + /* Restore the mlock which got missed */
>>>> + mlock_vma_folio(folio, vma, !pvmw.pte);
>>>> + page_vma_mapped_walk_done(&pvmw);
>>>> + pra->vm_flags |= VM_LOCKED;
>>>> + return false; /* To break the loop */
>>>> + } else {
>>>
>>> There is no need for "else", or just
>>>
>>> if (!folio_within_vma())
>>> goto dec_pra_mapcount;
>> I tried not to use goto as much as possible. I suppose you mean:
>>
>> if (!should_restore_lock())
>> goto dec_pra_mapcount; (I may use continue here. :)).
>
> should_restore_lock() is just folio_within_vma() -- see the comment
> above. "continue" looks good to me too (prefer not to add more indents
> to the functions below).
Yes.

>
>> mlock_vma_folio();
>> page_vma_mapped_walk_done()
>> ...
>>
>> Right?
>
> Right.
This is a very good suggestion. Will update v3 accordingly after waiting
for a while in case there are other comments. Thanks.


Regards
Yin, Fengwei


2023-07-14 02:33:38

by Hugh Dickins

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

On Wed, 12 Jul 2023, Yin Fengwei wrote:
> On 7/12/23 14:23, Yu Zhao wrote:
> > On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
> >> * still be set while VM_SPECIAL bits are added: so ignore it then.
> >> */
> >> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> >> - (compound || !folio_test_large(folio)))
> >> + (compound || !folio_test_large(folio) ||
> >> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
> >> mlock_folio(folio);
> >> }
> >
> > This can be simplified:
> > 1. remove the compound parameter
> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
> if the only condition need check is whether the folio is within VMA range or not.
>
> But let me add Huge for confirmation.

I'm not sure what it is that you need me to confirm: if the folio fits
within the vma, then the folio fits within the vma, pmd-mapped or not.

(And I agree with Yu that it's better to drop the folio_test_large()
check too.)

This idea, of counting the folio as mlocked according to whether the
whole folio fits within the vma, does seem a good idea to me: worth
pursuing. But whether the implementation adds up and works out, I
have not checked. It was always difficult to arrive at a satisfactory
compromise in mlocking compound pages: I hope this way does work out.

Hugh

2023-07-14 03:11:14

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range



On 7/14/2023 10:21 AM, Hugh Dickins wrote:
> On Wed, 12 Jul 2023, Yin Fengwei wrote:
>> On 7/12/23 14:23, Yu Zhao wrote:
>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>> --- a/
>>>> +++ b/mm/internal.h
>>>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
>>>> * still be set while VM_SPECIAL bits are added: so ignore it then.
>>>> */
>>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>>>> - (compound || !folio_test_large(folio)))
>>>> + (compound || !folio_test_large(folio) ||
>>>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
>>>> mlock_folio(folio);
>>>> }
>>>
>>> This can be simplified:
>>> 1. remove the compound parameter
>> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
>> if the only condition need check is whether the folio is within VMA range or not.
>>
>> But let me add Huge for confirmation.
>
> I'm not sure what it is that you need me to confirm: if the folio fits
> within the vma, then the folio fits within the vma, pmd-mapped or not.
Sorry. My bad. I should have spelled out what I want your confirmation on:
whether we can remove the compound parameter and instead use whether the
folio is within the VMA.

I suppose your answer is yes.

>
> (And I agree with Yu that it's better to drop the folio_test_large()
> check too.)
My argument was that folio_test_large() can filter normal 4K pages out so
that they don't need to call folio_in_range(), which looks a little bit
heavy to me for a normal 4K page. And the deal was to move folio_test_large()
into a folio_in_range()-like function to simplify the code on the caller side.

>
> This idea, of counting the folio as mlocked according to whether the
> whole folio fits within the vma, does seem a good idea to me: worth
> pursuing. But whether the implementation adds up and works out, I
> have not checked. It was always difficult to arrive at a satisfactory
> compromise in mlocking compound pages: I hope this way does work out.
This is the purpose of this patch. :). Thanks.


Regards
Yin, Fengwei

>
> Hugh

2023-07-14 03:53:37

by Hugh Dickins

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

On Fri, 14 Jul 2023, Yin, Fengwei wrote:
> On 7/14/2023 10:21 AM, Hugh Dickins wrote:
> > On Wed, 12 Jul 2023, Yin Fengwei wrote:
> >> On 7/12/23 14:23, Yu Zhao wrote:
> >>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >>>> --- a/
> >>>> +++ b/mm/internal.h
> >>>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
> >>>> * still be set while VM_SPECIAL bits are added: so ignore it then.
> >>>> */
> >>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> >>>> - (compound || !folio_test_large(folio)))
> >>>> + (compound || !folio_test_large(folio) ||
> >>>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
> >>>> mlock_folio(folio);
> >>>> }
> >>>
> >>> This can be simplified:
> >>> 1. remove the compound parameter
> >> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
> >> if the only condition need check is whether the folio is within VMA range or not.
> >>
> >> But let me add Huge for confirmation.
> >
> > I'm not sure what it is that you need me to confirm: if the folio fits
> > within the vma, then the folio fits within the vma, pmd-mapped or not.
> Sorry. My bad. I should speak it out for what I want your confirmation:
> Whether we can remove the compound and use whether folio is within
> VMA instead.
>
> I suppose you answer is Yes.

Yes (if it all works out going that way).

>
> >
> > (And I agree with Yu that it's better to drop the folio_test_large()
> > check too.)
> My argument was folio_test_large() can filter the normal 4K page out so
> it doesn't need to call folio_in_range() which looks to me a little bit
> heavy for normal 4K page. And the deal was move folio_test_large()
> to folio_in_range() like function so simplify the code in caller side.

I realized that, but agree with Yu.

It looked a little over-engineered to me, but I didn't spend long enough
looking to understand why there's folio_within_vma() distinct from
folio_in_range(), when everyone(?) calls folio_in_range() with the same
arguments vma->vm_start, vma->vm_end.

>
> >
> > This idea, of counting the folio as mlocked according to whether the
> > whole folio fits within the vma, does seem a good idea to me: worth
> > pursuing. But whether the implementation adds up and works out, I
> > have not checked. It was always difficult to arrive at a satisfactory
> > compromise in mlocking compound pages: I hope this way does work out.
> This is the purpose of this patch. :). Thanks.
>
>
> Regards
> Yin, Fengwei
>
> >
> > Hugh

2023-07-14 05:59:01

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range



On 7/14/2023 11:41 AM, Hugh Dickins wrote:
> On Fri, 14 Jul 2023, Yin, Fengwei wrote:
>> On 7/14/2023 10:21 AM, Hugh Dickins wrote:
>>> On Wed, 12 Jul 2023, Yin Fengwei wrote:
>>>> On 7/12/23 14:23, Yu Zhao wrote:
>>>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>>>> --- a/
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
>>>>>> * still be set while VM_SPECIAL bits are added: so ignore it then.
>>>>>> */
>>>>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>>>>>> - (compound || !folio_test_large(folio)))
>>>>>> + (compound || !folio_test_large(folio) ||
>>>>>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
>>>>>> mlock_folio(folio);
>>>>>> }
>>>>>
>>>>> This can be simplified:
>>>>> 1. remove the compound parameter
>>>> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
>>>> if the only condition need check is whether the folio is within VMA range or not.
>>>>
>>>> But let me add Huge for confirmation.
>>>
>>> I'm not sure what it is that you need me to confirm: if the folio fits
>>> within the vma, then the folio fits within the vma, pmd-mapped or not.
>> Sorry. My bad. I should speak it out for what I want your confirmation:
>> Whether we can remove the compound and use whether folio is within
>> VMA instead.
>>
>> I suppose you answer is Yes.
>
> Yes (if it all works out going that way).
>
>>
>>>
>>> (And I agree with Yu that it's better to drop the folio_test_large()
>>> check too.)
>> My argument was folio_test_large() can filter the normal 4K page out so
>> it doesn't need to call folio_in_range() which looks to me a little bit
>> heavy for normal 4K page. And the deal was move folio_test_large()
>> to folio_in_range() like function so simplify the code in caller side.
>
> I realized that, but agree with Yu.
OK. I will rethink this as both Yu and you suggested the same thing.

>
> It looked a little over-engineered to me, but I didn't spend long enough
> looking to understand why there's folio_within_vma() distinct from
> folio_in_range(), when everyone(?) calls folio_in_range() with the same
> arguments vma->vm_start, vma->vm_end.
madvise could call folio_in_range() with a start/end from user space instead
of using the VMA range.


Regards
Yin, Fengwei

>
>>
>>>
>>> This idea, of counting the folio as mlocked according to whether the
>>> whole folio fits within the vma, does seem a good idea to me: worth
>>> pursuing. But whether the implementation adds up and works out, I
>>> have not checked. It was always difficult to arrive at a satisfactory
>>> compromise in mlocking compound pages: I hope this way does work out.
>> This is the purpose of this patch. :). Thanks.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> Hugh

2023-07-15 07:57:12

by Yu Zhao

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <[email protected]> wrote:
>
> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >
> > Current kernel only lock base size folio during mlock syscall.
> > Add large folio support with following rules:
> > - Only mlock large folio when it's in VM_LOCKED VMA range
> >
> > - If there is cow folio, mlock the cow folio as cow folio
> > is also in VM_LOCKED VMA range.
> >
> > - munlock will apply to the large folio which is in VMA range
> > or cross the VMA boundary.
> >
> > The last rule is used to handle the case that the large folio is
> > mlocked, later the VMA is split in the middle of large folio
> > and this large folio become cross VMA boundary.
> >
> > Signed-off-by: Yin Fengwei <[email protected]>
> > ---
> > mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 99 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/mlock.c b/mm/mlock.c
> > index 0a0c996c5c214..f49e079066870 100644
> > --- a/mm/mlock.c
> > +++ b/mm/mlock.c
> > @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
> > local_unlock(&mlock_fbatch.lock);
> > }
> >
> > +static inline bool should_mlock_folio(struct folio *folio,
> > + struct vm_area_struct *vma)
> > +{
> > + if (vma->vm_flags & VM_LOCKED)
> > + return (!folio_test_large(folio) ||
> > + folio_within_vma(folio, vma));
> > +
> > + /*
> > + * For unlock, allow munlock large folio which is partially
> > + * mapped to VMA. As it's possible that large folio is
> > + * mlocked and VMA is split later.
> > + *
> > + * During memory pressure, such kind of large folio can
> > + * be split. And the pages are not in VM_LOCKed VMA
> > + * can be reclaimed.
> > + */
> > +
> > + return true;
>
> Looks good, or just
>
> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
> {
> return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
> }
>
> > +}
> > +
> > +static inline unsigned int get_folio_mlock_step(struct folio *folio,
> > + pte_t pte, unsigned long addr, unsigned long end)
> > +{
> > + unsigned int nr;
> > +
> > + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> > + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> > +}
> > +
> > +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> > + pte_t *pte, unsigned long addr, unsigned int nr)
> > +{
> > + struct folio *cow_folio;
> > + unsigned int step = 1;
> > +
> > + mlock_folio(folio);
> > + if (nr == 1)
> > + return;
> > +
> > + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> > + pte_t ptent;
> > +
> > + step = 1;
> > + ptent = ptep_get(pte);
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + cow_folio = vm_normal_folio(vma, addr, ptent);
> > + if (!cow_folio || cow_folio == folio) {
> > + continue;
> > + }
> > +
> > + mlock_folio(cow_folio);
> > + step = get_folio_mlock_step(folio, ptent,
> > + addr, addr + (nr << PAGE_SHIFT));
> > + }
> > +}
> > +
> > +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> > + pte_t *pte, unsigned long addr, unsigned int nr)
> > +{
> > + struct folio *cow_folio;
> > + unsigned int step = 1;
> > +
> > + munlock_folio(folio);
> > + if (nr == 1)
> > + return;
> > +
> > + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> > + pte_t ptent;
> > +
> > + step = 1;
> > + ptent = ptep_get(pte);
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + cow_folio = vm_normal_folio(vma, addr, ptent);
> > + if (!cow_folio || cow_folio == folio) {
> > + continue;
> > + }
> > +
> > + munlock_folio(cow_folio);
> > + step = get_folio_mlock_step(folio, ptent,
> > + addr, addr + (nr << PAGE_SHIFT));
> > + }
> > +}
>
> I'll finish the above later.

There is a problem here that I didn't have time to elaborate on: we
can't mlock() a folio that is within the range but not fully mapped
because this folio can be on the deferred split queue. When the split
happens, those unmapped folios (not mapped by this vma but are mapped
into other vmas) will be stranded on the unevictable lru.

For that matter, we can't mlock any large folios that are being
shared, unless you want to overengineer it by checking whether all
sharing vmas are also mlocked -- mlock is cleared during fork. So the
condition for mlocking large anon folios is 1) within range 2) fully
mapped 3) not shared (mapcount is 1). The final patch should look
something like this:

- if (folio_test_large(folio))
+ if (folio_pfn(folio) != pte_pfn(ptent))
+ continue;
+ if (!the_aforementioned_condition())

There is another corner case I forgot to mention: for example, what if
a folio spans two (the only two) adjacent mlocked vmas? No need to
worry about this since it's not worth optimizing.
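
For illustration only, a rough sketch of what such a condition might look
like (the helper name is a placeholder, not code from this series, and the
mapcount test is only an approximation of 2) and 3) for a PTE-mapped anon
folio):

	static bool can_mlock_large_anon_folio(struct folio *folio,
					       struct vm_area_struct *vma)
	{
		/* 1) within the VM_LOCKED VMA range */
		if (!folio_within_vma(folio, vma))
			return false;
		/* 2) fully mapped and 3) not shared: every page mapped once */
		return folio_mapcount(folio) == folio_nr_pages(folio);
	}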

2023-07-17 00:21:20

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/15/2023 2:06 PM, Yu Zhao wrote:
> There is a problem here that I didn't have the time to elaborate: we
> can't mlock() a folio that is within the range but not fully mapped
> because this folio can be on the deferred split queue. When the split
> happens, those unmapped folios (not mapped by this vma but are mapped
> into other vmas) will be stranded on the unevictable lru.
This should be fine unless I missed something. During a large folio split,
unmap_folio() will migrate (anon) / unmap (file) the folio. The folio will be
munlocked in unmap_folio(), so the head/tail pages will always be evictable.

>
> For that matter, we can't mlock any large folios that are being
> shared, unless you want to overengineer it by checking whether all
> sharing vmas are also mlocked -- mlock is cleared during fork. So the
> condition for mlocking large anon folios is 1) within range 2) fully
> mapped 3) not shared (mapcount is 1). The final patch should look like
> something like this:
>
> - if (folio_test_large(folio))
> + if (folio_pfn(folio) != pte_pfn(ptent))
> + continue;
> + if (!the_aforementioned_condition())
>
> There is another corner case I forgot to mention: for example, what if
> a folio spans two (the only two) adjacent mlocked vmas? No need to
> worry about this since it's not worth optimizing.
Yes. The behavior will depend on whether the folio is mlocked or not.
But the worst case is that the folio is split and each page is mlocked
again during the next scan.


Regards
Yin, Fengwei

2023-07-17 01:07:42

by Yu Zhao

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>
> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> > There is a problem here that I didn't have the time to elaborate: we
> > can't mlock() a folio that is within the range but not fully mapped
> > because this folio can be on the deferred split queue. When the split
> > happens, those unmapped folios (not mapped by this vma but are mapped
> > into other vmas) will be stranded on the unevictable lru.
>
> This should be fine unless I missed something. During large folio split,
> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> munlocked in unmap_folio(). So the head/tail pages will be evictable always.

It's close but not entirely accurate: munlock can fail on isolated folios.

2023-07-17 02:06:17

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/17/23 08:35, Yu Zhao wrote:
> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>
>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>> There is a problem here that I didn't have the time to elaborate: we
>>> can't mlock() a folio that is within the range but not fully mapped
>>> because this folio can be on the deferred split queue. When the split
>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>> into other vmas) will be stranded on the unevictable lru.
>>
>> This should be fine unless I missed something. During large folio split,
>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>
> It's close but not entirely accurate: munlock can fail on isolated folios.
Yes. The munlock just clears the PG_mlocked bit but leaves PG_unevictable set.

Could this also happen to a normal 4K page? I mean, when the user tries to
munlock a normal 4K page and this 4K page is isolated, does it become an
unevictable page?


Regards
Yin, Fengwei

2023-07-17 08:42:23

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/17/23 08:35, Yu Zhao wrote:
> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>
>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>> There is a problem here that I didn't have the time to elaborate: we
>>> can't mlock() a folio that is within the range but not fully mapped
>>> because this folio can be on the deferred split queue. When the split
>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>> into other vmas) will be stranded on the unevictable lru.
>>
>> This should be fine unless I missed something. During large folio split,
>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>
> It's close but not entirely accurate: munlock can fail on isolated folios.

I suppose a normal 4K page can hit this problem too, and the following patch
could fix it:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568bb..839b8398aa613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2498,7 +2498,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,

                 VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
                 list_del(&folio->lru);
-                if (unlikely(!folio_evictable(folio))) {
+                if (unlikely(!folio_evictable(folio) || folio_test_unevictable(folio))) {
                         spin_unlock_irq(&lruvec->lru_lock);
                         folio_putback_lru(folio);
                         spin_lock_irq(&lruvec->lru_lock);
@@ -2723,7 +2723,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
                 folio = lru_to_folio(&l_hold);
                 list_del(&folio->lru);

-                if (unlikely(!folio_evictable(folio))) {
+                if (unlikely(!folio_evictable(folio) || folio_test_unevictable(folio))) {
                         folio_putback_lru(folio);
                         continue;
                 }
@@ -5182,7 +5182,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
         sc->nr_reclaimed += reclaimed;

         list_for_each_entry_safe_reverse(folio, next, &list, lru) {
-                if (!folio_evictable(folio)) {
+                if (!folio_evictable(folio) || folio_test_unevictable(folio)) {
                         list_del(&folio->lru);
                         folio_putback_lru(folio);
                         continue;



Regards
Yin, Fengwei


2023-07-18 02:46:18

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/17/23 16:12, Yin Fengwei wrote:
>
>
> On 7/17/23 08:35, Yu Zhao wrote:
>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>
>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>> There is a problem here that I didn't have the time to elaborate: we
>>>> can't mlock() a folio that is within the range but not fully mapped
>>>> because this folio can be on the deferred split queue. When the split
>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>> into other vmas) will be stranded on the unevictable lru.
>>>
>>> This should be fine unless I missed something. During large folio split,
>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>
>> It's close but not entirely accurate: munlock can fail on isolated folios.
>
> I suppose normal 4K page can hit this problem also and following patch could
> fix it:
No. This patch is not necessary, as an unevictable folio will not be picked up by
page reclaim. It's not possible to munlock an isolated folio from the lru list.

The possible cases I am aware of are: page_migrate, madvise, damon_pa_pageout and
lru_gen_look_around. The first three already handle this case correctly by calling
folio_putback_lru().

If the folio is isolated, split_folio() will just fail. So it looks like we are fine
for this corner case. Let me know if I missed something here.


Regards
Yin, Fengwei

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1080209a568bb..839b8398aa613 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2498,7 +2498,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> list_del(&folio->lru);
> - if (unlikely(!folio_evictable(folio))) {
> + if (unlikely(!folio_evictable(folio) || folio_test_unevictable(folio))) {
> spin_unlock_irq(&lruvec->lru_lock);
> folio_putback_lru(folio);
> spin_lock_irq(&lruvec->lru_lock);
> @@ -2723,7 +2723,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
> folio = lru_to_folio(&l_hold);
> list_del(&folio->lru);
>
> - if (unlikely(!folio_evictable(folio))) {
> + if (unlikely(!folio_evictable(folio) || folio_test_unevictable(folio))) {
> folio_putback_lru(folio);
> continue;
> }
> @@ -5182,7 +5182,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
> sc->nr_reclaimed += reclaimed;
>
> list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> - if (!folio_evictable(folio)) {
> + if (!folio_evictable(folio) || folio_test_unevictable(folio)) {
> list_del(&folio->lru);
> folio_putback_lru(folio);
> continue;
>
>
>
> Regards
> Yin, Fengwei
>

2023-07-18 04:51:35

by Yu Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Mon, Jul 17, 2023 at 8:07 PM Yin Fengwei <[email protected]> wrote:
>
> On 7/17/23 16:12, Yin Fengwei wrote:
> >
> > On 7/17/23 08:35, Yu Zhao wrote:
> >> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>>
> >>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>>> There is a problem here that I didn't have the time to elaborate: we
> >>>> can't mlock() a folio that is within the range but not fully mapped
> >>>> because this folio can be on the deferred split queue. When the split
> >>>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>>> into other vmas) will be stranded on the unevictable lru.
> >>>
> >>> This should be fine unless I missed something. During large folio split,
> >>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >>
> >> It's close but not entirely accurate: munlock can fail on isolated folios.
> >
> > I suppose normal 4K page can hit this problem also and following patch could
> > fix it:
> No. This patch is not necessary as unevictable folio will not be picked up by
> page reclaim. It's not possible to munlock the isolated folio from lru list.
>
> The possible cases I am ware are: page_migrate, madvise and damon_pa_pageout and
> lru_gen_look_around. The first three already handle this case correctly by call
> folio_putback_lru().
>
> If folio is isolated, the split_folio() will just fail. So looks we are fine
> for this corner case. Let me know if I miss something here.

The race is between isolation and munlock -- split_folio() only fails
if a folio is still isolated when it tries to freeze its refcnt, e.g.,
cpu 1                          cpu 2
                               split_folio()
  isolation                      unmap_folio()
  putback
                                 freeze refcnt

2023-07-18 23:46:12

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/17/23 08:35, Yu Zhao wrote:
> > On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>
> >> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>> There is a problem here that I didn't have the time to elaborate: we
> >>> can't mlock() a folio that is within the range but not fully mapped
> >>> because this folio can be on the deferred split queue. When the split
> >>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>> into other vmas) will be stranded on the unevictable lru.
> >>
> >> This should be fine unless I missed something. During large folio split,
> >> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >
> > It's close but not entirely accurate: munlock can fail on isolated folios.
> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>
> Could this also happen against normal 4K page? I mean when user try to munlock
> a normal 4K page and this 4K page is isolated. So it become unevictable page?

Looks like it can be possible. If cpu 1 is in __munlock_folio() and
cpu 2 is isolating the folio for any purpose:

cpu1                               cpu2
                                   isolate folio
folio_test_clear_lru() // 0
                                   putback folio // add to unevictable list
folio_test_clear_mlocked()


The page would be stranded on the unevictable list in this case, no?
Maybe we should only try to isolate the page (clear PG_lru) after we
possibly clear PG_mlocked? In this case if we fail to isolate we know
for sure that whoever has the page isolated will observe that
PG_mlocked is clear and correctly make the page evictable.

This probably would be complicated with the current implementation, as
we first need to decrement mlock_count to determine if we want to
clear PG_mlocked, and to do so we need to isolate the page as
mlock_count overlays page->lru. With the proposal in [1] to rework
mlock_count, it might be much simpler as far as I can tell. I intend
to refresh this proposal soon-ish.

[1]https://lore.kernel.org/lkml/[email protected]/
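
For readers following along, the overlay mentioned above looks roughly like this
(paraphrased from the struct folio definition of this era; illustrative, not a
verbatim copy):

/*
 * Paraphrased, illustrative layout: for folios on the unevictable
 * "list", the lru list_head slot is reused to hold mlock_count, so a
 * folio cannot sit on an LRU list and carry an mlock_count at the
 * same time.
 */
struct folio_lru_slot_sketch {
        union {
                struct list_head lru;           /* when on a real LRU list */
                struct {
                        void *__filler;         /* always even, to negate PageTail */
                        unsigned int mlock_count;  /* when on the unevictable "list" */
                };
        };
};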

>
>
> Regards
> Yin, Fengwei
>

2023-07-19 00:05:06

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/23 06:48, Yosry Ahmed wrote:
> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/17/23 08:35, Yu Zhao wrote:
>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>>> There is a problem here that I didn't have the time to elaborate: we
>>>>> can't mlock() a folio that is within the range but not fully mapped
>>>>> because this folio can be on the deferred split queue. When the split
>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>>> into other vmas) will be stranded on the unevictable lru.
>>>>
>>>> This should be fine unless I missed something. During large folio split,
>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>>
>>> It's close but not entirely accurate: munlock can fail on isolated folios.
>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>>
>> Could this also happen against normal 4K page? I mean when user try to munlock
>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>
> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> cpu 2 is isolating the folio for any purpose:
>
> cpu1 cpu2
> isolate folio
> folio_test_clear_lru() // 0
> putback folio // add
> to unevictable list
> folio_test_clear_mlocked()
Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
could correct the non-mlocked but unevictable folio, but it doesn't because
of this race.

>
>
> The page would be stranded on the unevictable list in this case, no?
> Maybe we should only try to isolate the page (clear PG_lru) after we
> possibly clear PG_mlocked? In this case if we fail to isolate we know
> for sure that whoever has the page isolated will observe that
> PG_mlocked is clear and correctly make the page evictable.
>
> This probably would be complicated with the current implementation, as
> we first need to decrement mlock_count to determine if we want to
> clear PG_mlocked, and to do so we need to isolate the page as
> mlock_count overlays page->lru. With the proposal in [1] to rework
> mlock_count, it might be much simpler as far as I can tell. I intend
> to refresh this proposal soon-ish.
>
> [1]https://lore.kernel.org/lkml/[email protected]/
>
>>
>>
>> Regards
>> Yin, Fengwei
>>

2023-07-19 01:48:18

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/19/23 06:48, Yosry Ahmed wrote:
> > On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
> >>
> >>
> >>
> >> On 7/17/23 08:35, Yu Zhao wrote:
> >>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>>>
> >>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>>>> There is a problem here that I didn't have the time to elaborate: we
> >>>>> can't mlock() a folio that is within the range but not fully mapped
> >>>>> because this folio can be on the deferred split queue. When the split
> >>>>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>>>> into other vmas) will be stranded on the unevictable lru.
> >>>>
> >>>> This should be fine unless I missed something. During large folio split,
> >>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >>>
> >>> It's close but not entirely accurate: munlock can fail on isolated folios.
> >> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
> >>
> >> Could this also happen against normal 4K page? I mean when user try to munlock
> >> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >
> > Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> > cpu 2 is isolating the folio for any purpose:
> >
> > cpu1 cpu2
> > isolate folio
> > folio_test_clear_lru() // 0
> > putback folio // add
> > to unevictable list
> > folio_test_clear_mlocked()
> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
> could correct the none-mlocked but unevictable folio. But it doesn't because
> of this race.

(+Hugh Dickins for vis)

Yu, I am not familiar with the split_folio() case, so I am not sure it
is the same exact race I stated above.

Can you confirm whether or not doing folio_test_clear_mlocked() before
folio_test_clear_lru() would fix the race you are referring to? IIUC,
in this case, we make sure we clear PG_mlocked before we try to
clear PG_lru. If we fail to clear PG_lru, then someone else has the folio
isolated after we cleared PG_mlocked, so we can be sure that when they
put the folio back it will be correctly made evictable.

Is my understanding correct?

If yes, I can add this fix to my next version of the RFC series to
rework mlock_count. It would be a lot more complicated with the
current implementation (as I stated in a previous email).

>
> >
> >
> > The page would be stranded on the unevictable list in this case, no?
> > Maybe we should only try to isolate the page (clear PG_lru) after we
> > possibly clear PG_mlocked? In this case if we fail to isolate we know
> > for sure that whoever has the page isolated will observe that
> > PG_mlocked is clear and correctly make the page evictable.
> >
> > This probably would be complicated with the current implementation, as
> > we first need to decrement mlock_count to determine if we want to
> > clear PG_mlocked, and to do so we need to isolate the page as
> > mlock_count overlays page->lru. With the proposal in [1] to rework
> > mlock_count, it might be much simpler as far as I can tell. I intend
> > to refresh this proposal soon-ish.
> >
> > [1]https://lore.kernel.org/lkml/[email protected]/
> >
> >>
> >>
> >> Regards
> >> Yin, Fengwei
> >>

2023-07-19 02:06:32

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
> >
> >
> >
> > On 7/19/23 06:48, Yosry Ahmed wrote:
> > > On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
> > >>
> > >>
> > >>
> > >> On 7/17/23 08:35, Yu Zhao wrote:
> > >>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> > >>>>
> > >>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> > >>>>> There is a problem here that I didn't have the time to elaborate: we
> > >>>>> can't mlock() a folio that is within the range but not fully mapped
> > >>>>> because this folio can be on the deferred split queue. When the split
> > >>>>> happens, those unmapped folios (not mapped by this vma but are mapped
> > >>>>> into other vmas) will be stranded on the unevictable lru.
> > >>>>
> > >>>> This should be fine unless I missed something. During large folio split,
> > >>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> > >>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> > >>>
> > >>> It's close but not entirely accurate: munlock can fail on isolated folios.
> > >> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
> > >>
> > >> Could this also happen against normal 4K page? I mean when user try to munlock
> > >> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> > >
> > > Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> > > cpu 2 is isolating the folio for any purpose:
> > >
> > > cpu1 cpu2
> > > isolate folio
> > > folio_test_clear_lru() // 0
> > > putback folio // add
> > > to unevictable list
> > > folio_test_clear_mlocked()
> > Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
> > could correct the none-mlocked but unevictable folio. But it doesn't because
> > of this race.
>
> (+Hugh Dickins for vis)
>
> Yu, I am not familiar with the split_folio() case, so I am not sure it
> is the same exact race I stated above.
>
> Can you confirm whether or not doing folio_test_clear_mlocked() before
> folio_test_clear_lru() would fix the race you are referring to? IIUC,
> in this case, we make sure we clear PG_mlocked before we try to to
> clear PG_lru. If we fail to clear it, then someone else have the folio
> isolated after we clear PG_mlocked, so we can be sure that when they
> put the folio back it will be correctly made evictable.
>
> Is my understanding correct?

Hmm, actually this might not be enough. In folio_add_lru() we will
call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
race can still happen:


cpu1                               cpu2
                                   folio_evictable() //false
folio_test_clear_mlocked()
folio_test_clear_lru() //false
                                   folio_set_lru()

Relying on PG_lru for synchronization might not be enough with the
current code. We might need to revert 2262ace60713 ("mm/munlock:
delete smp_mb() from __pagevec_lru_add_fn()").

Sorry for going back and forth here, I am thinking out loud.
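
For reference, a condensed sketch of the ordering in question (paraphrased from the
mm/swap.c of this era; illustrative, not verbatim):

/*
 * Condensed, paraphrased sketch: the move callback -- lru_add_fn() for
 * additions -- runs first and checks folio_evictable() (i.e. PG_mlocked),
 * and only afterwards is PG_lru set.  That gap is the window in the
 * race above.
 */
static void folio_batch_move_lru_sketch(struct folio_batch *fbatch,
                                        move_fn_t move_fn)
{
        struct lruvec *lruvec = NULL;
        unsigned long flags = 0;
        int i;

        for (i = 0; i < folio_batch_count(fbatch); i++) {
                struct folio *folio = fbatch->folios[i];

                lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
                move_fn(lruvec, folio); /* lru_add_fn(): folio_evictable() checked here */
                folio_set_lru(folio);   /* PG_lru only becomes visible afterwards */
        }
        if (lruvec)
                unlock_page_lruvec_irqrestore(lruvec, flags);
}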

>
> If yes, I can add this fix to my next version of the RFC series to
> rework mlock_count. It would be a lot more complicated with the
> current implementation (as I stated in a previous email).
>
> >
> > >
> > >
> > > The page would be stranded on the unevictable list in this case, no?
> > > Maybe we should only try to isolate the page (clear PG_lru) after we
> > > possibly clear PG_mlocked? In this case if we fail to isolate we know
> > > for sure that whoever has the page isolated will observe that
> > > PG_mlocked is clear and correctly make the page evictable.
> > >
> > > This probably would be complicated with the current implementation, as
> > > we first need to decrement mlock_count to determine if we want to
> > > clear PG_mlocked, and to do so we need to isolate the page as
> > > mlock_count overlays page->lru. With the proposal in [1] to rework
> > > mlock_count, it might be much simpler as far as I can tell. I intend
> > > to refresh this proposal soon-ish.
> > >
> > > [1]https://lore.kernel.org/lkml/[email protected]/
> > >
> > >>
> > >>
> > >> Regards
> > >> Yin, Fengwei
> > >>

2023-07-19 02:12:18

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio


On 7/19/23 09:52, Yosry Ahmed wrote:
> On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
>> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
>>>
>>>
>>> On 7/19/23 06:48, Yosry Ahmed wrote:
>>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 7/17/23 08:35, Yu Zhao wrote:
>>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>>>>>> There is a problem here that I didn't have the time to elaborate: we
>>>>>>>> can't mlock() a folio that is within the range but not fully mapped
>>>>>>>> because this folio can be on the deferred split queue. When the split
>>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>>>>>> into other vmas) will be stranded on the unevictable lru.
>>>>>>> This should be fine unless I missed something. During large folio split,
>>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
>>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>>>>>
>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>> cpu 2 is isolating the folio for any purpose:
>>>>
>>>> cpu1 cpu2
>>>> isolate folio
>>>> folio_test_clear_lru() // 0
>>>> putback folio // add
>>>> to unevictable list
>>>> folio_test_clear_mlocked()
>>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
>>> could correct the none-mlocked but unevictable folio. But it doesn't because
>>> of this race.
>> (+Hugh Dickins for vis)
>>
>> Yu, I am not familiar with the split_folio() case, so I am not sure it
>> is the same exact race I stated above.
>>
>> Can you confirm whether or not doing folio_test_clear_mlocked() before
>> folio_test_clear_lru() would fix the race you are referring to? IIUC,
>> in this case, we make sure we clear PG_mlocked before we try to to
>> clear PG_lru. If we fail to clear it, then someone else have the folio
>> isolated after we clear PG_mlocked, so we can be sure that when they
>> put the folio back it will be correctly made evictable.
>>
>> Is my understanding correct?
> Hmm, actually this might not be enough. In folio_add_lru() we will
> call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
> sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
> race can still happen:
>
>
> cpu1 cpu2
> folio_evictable() //false
> folio_test_clear_mlocked()
> folio_test_clear_lru() //false
> folio_set_lru()
>
> Relying on PG_lru for synchronization might not be enough with the
> current code. We might need to revert 2262ace60713 ("mm/munlock:
> delete smp_mb() from __pagevec_lru_add_fn()").
>
> Sorry for going back and forth here, I am thinking out loud.

Yes. Currently, the order in lru_add_fn() is not correct.

I think we should move folio_test_clear_lru(folio) into the lru locked range,
as the lru lock here was expected to be used for synchronization. Check the
comment in lru_add_fn().


Regards

Yin, Fengwei


>
>> If yes, I can add this fix to my next version of the RFC series to
>> rework mlock_count. It would be a lot more complicated with the
>> current implementation (as I stated in a previous email).
>>
>>>>
>>>> The page would be stranded on the unevictable list in this case, no?
>>>> Maybe we should only try to isolate the page (clear PG_lru) after we
>>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
>>>> for sure that whoever has the page isolated will observe that
>>>> PG_mlocked is clear and correctly make the page evictable.
>>>>
>>>> This probably would be complicated with the current implementation, as
>>>> we first need to decrement mlock_count to determine if we want to
>>>> clear PG_mlocked, and to do so we need to isolate the page as
>>>> mlock_count overlays page->lru. With the proposal in [1] to rework
>>>> mlock_count, it might be much simpler as far as I can tell. I intend
>>>> to refresh this proposal soon-ish.
>>>>
>>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>

2023-07-19 02:32:59

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Tue, Jul 18, 2023 at 6:57 PM Yin Fengwei <[email protected]> wrote:
>
>
> On 7/19/23 09:52, Yosry Ahmed wrote:
> > On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
> >> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
> >>>
> >>>
> >>> On 7/19/23 06:48, Yosry Ahmed wrote:
> >>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> On 7/17/23 08:35, Yu Zhao wrote:
> >>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>>>>>>> There is a problem here that I didn't have the time to elaborate: we
> >>>>>>>> can't mlock() a folio that is within the range but not fully mapped
> >>>>>>>> because this folio can be on the deferred split queue. When the split
> >>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>>>>>>> into other vmas) will be stranded on the unevictable lru.
> >>>>>>> This should be fine unless I missed something. During large folio split,
> >>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
> >>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
> >>>>>
> >>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>> cpu 2 is isolating the folio for any purpose:
> >>>>
> >>>> cpu1 cpu2
> >>>> isolate folio
> >>>> folio_test_clear_lru() // 0
> >>>> putback folio // add
> >>>> to unevictable list
> >>>> folio_test_clear_mlocked()
> >>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
> >>> could correct the none-mlocked but unevictable folio. But it doesn't because
> >>> of this race.
> >> (+Hugh Dickins for vis)
> >>
> >> Yu, I am not familiar with the split_folio() case, so I am not sure it
> >> is the same exact race I stated above.
> >>
> >> Can you confirm whether or not doing folio_test_clear_mlocked() before
> >> folio_test_clear_lru() would fix the race you are referring to? IIUC,
> >> in this case, we make sure we clear PG_mlocked before we try to to
> >> clear PG_lru. If we fail to clear it, then someone else have the folio
> >> isolated after we clear PG_mlocked, so we can be sure that when they
> >> put the folio back it will be correctly made evictable.
> >>
> >> Is my understanding correct?
> > Hmm, actually this might not be enough. In folio_add_lru() we will
> > call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
> > sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
> > race can still happen:
> >
> >
> > cpu1 cpu2
> > folio_evictable() //false
> > folio_test_clear_mlocked()
> > folio_test_clear_lru() //false
> > folio_set_lru()
> >
> > Relying on PG_lru for synchronization might not be enough with the
> > current code. We might need to revert 2262ace60713 ("mm/munlock:
> > delete smp_mb() from __pagevec_lru_add_fn()").
> >
> > Sorry for going back and forth here, I am thinking out loud.
>
> Yes. Currently, the order in lru_add_fn() is not correct.
>
> I think we should move folio_test_clear_lru(folio) into
>
> lru locked range. As the lru lock here was expected to
>
> use for sync here. Check the comment in lru_add_fn().

Right, I am wondering if it would be better to just revert
2262ace60713 and rely on the memory barrier and operations ordering
instead of the lru lock. The lru lock is heavily contended as-is, so
avoiding using it where possible is preferable I assume.

>
>
> Regards
>
> Yin, Fengwei
>
>
> >
> >> If yes, I can add this fix to my next version of the RFC series to
> >> rework mlock_count. It would be a lot more complicated with the
> >> current implementation (as I stated in a previous email).
> >>
> >>>>
> >>>> The page would be stranded on the unevictable list in this case, no?
> >>>> Maybe we should only try to isolate the page (clear PG_lru) after we
> >>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
> >>>> for sure that whoever has the page isolated will observe that
> >>>> PG_mlocked is clear and correctly make the page evictable.
> >>>>
> >>>> This probably would be complicated with the current implementation, as
> >>>> we first need to decrement mlock_count to determine if we want to
> >>>> clear PG_mlocked, and to do so we need to isolate the page as
> >>>> mlock_count overlays page->lru. With the proposal in [1] to rework
> >>>> mlock_count, it might be much simpler as far as I can tell. I intend
> >>>> to refresh this proposal soon-ish.
> >>>>
> >>>> [1]https://lore.kernel.org/lkml/[email protected]/
> >>>>
> >>>>>
> >>>>> Regards
> >>>>> Yin, Fengwei
> >>>>>

2023-07-19 02:44:03

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/23 10:00, Yosry Ahmed wrote:
> On Tue, Jul 18, 2023 at 6:57 PM Yin Fengwei <[email protected]> wrote:
>>
>>
>> On 7/19/23 09:52, Yosry Ahmed wrote:
>>> On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
>>>> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 7/19/23 06:48, Yosry Ahmed wrote:
>>>>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 7/17/23 08:35, Yu Zhao wrote:
>>>>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>>>>>>>> There is a problem here that I didn't have the time to elaborate: we
>>>>>>>>>> can't mlock() a folio that is within the range but not fully mapped
>>>>>>>>>> because this folio can be on the deferred split queue. When the split
>>>>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>>>>>>>> into other vmas) will be stranded on the unevictable lru.
>>>>>>>>> This should be fine unless I missed something. During large folio split,
>>>>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>>>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>>>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
>>>>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>>>>>>>
>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>
>>>>>> cpu1 cpu2
>>>>>> isolate folio
>>>>>> folio_test_clear_lru() // 0
>>>>>> putback folio // add
>>>>>> to unevictable list
>>>>>> folio_test_clear_mlocked()
>>>>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
>>>>> could correct the none-mlocked but unevictable folio. But it doesn't because
>>>>> of this race.
>>>> (+Hugh Dickins for vis)
>>>>
>>>> Yu, I am not familiar with the split_folio() case, so I am not sure it
>>>> is the same exact race I stated above.
>>>>
>>>> Can you confirm whether or not doing folio_test_clear_mlocked() before
>>>> folio_test_clear_lru() would fix the race you are referring to? IIUC,
>>>> in this case, we make sure we clear PG_mlocked before we try to to
>>>> clear PG_lru. If we fail to clear it, then someone else have the folio
>>>> isolated after we clear PG_mlocked, so we can be sure that when they
>>>> put the folio back it will be correctly made evictable.
>>>>
>>>> Is my understanding correct?
>>> Hmm, actually this might not be enough. In folio_add_lru() we will
>>> call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
>>> sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
>>> race can still happen:
>>>
>>>
>>> cpu1 cpu2
>>> folio_evictable() //false
>>> folio_test_clear_mlocked()
>>> folio_test_clear_lru() //false
>>> folio_set_lru()
>>>
>>> Relying on PG_lru for synchronization might not be enough with the
>>> current code. We might need to revert 2262ace60713 ("mm/munlock:
>>> delete smp_mb() from __pagevec_lru_add_fn()").
>>>
>>> Sorry for going back and forth here, I am thinking out loud.
>>
>> Yes. Currently, the order in lru_add_fn() is not correct.
>>
>> I think we should move folio_test_clear_lru(folio) into
>>
>> lru locked range. As the lru lock here was expected to
>>
>> use for sync here. Check the comment in lru_add_fn().
>
> Right, I am wondering if it would be better to just revert
> 2262ace60713 and rely on the memory barrier and operations ordering
> instead of the lru lock. The lru lock is heavily contended as-is, so
> avoiding using it where possible is preferable I assume.
My understanding is that setting PG_lru after adding the folio to the lru list is
correct. Once folio_set_lru() is done, others can isolate this folio. But if
this folio is not on the lru list yet, what could happen? It's not required
to hold the lru lock to do the isolation.

>
>>
>>
>> Regards
>>
>> Yin, Fengwei
>>
>>
>>>
>>>> If yes, I can add this fix to my next version of the RFC series to
>>>> rework mlock_count. It would be a lot more complicated with the
>>>> current implementation (as I stated in a previous email).
>>>>
>>>>>>
>>>>>> The page would be stranded on the unevictable list in this case, no?
>>>>>> Maybe we should only try to isolate the page (clear PG_lru) after we
>>>>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
>>>>>> for sure that whoever has the page isolated will observe that
>>>>>> PG_mlocked is clear and correctly make the page evictable.
>>>>>>
>>>>>> This probably would be complicated with the current implementation, as
>>>>>> we first need to decrement mlock_count to determine if we want to
>>>>>> clear PG_mlocked, and to do so we need to isolate the page as
>>>>>> mlock_count overlays page->lru. With the proposal in [1] to rework
>>>>>> mlock_count, it might be much simpler as far as I can tell. I intend
>>>>>> to refresh this proposal soon-ish.
>>>>>>
>>>>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>>>>
>>>>>>>
>>>>>>> Regards
>>>>>>> Yin, Fengwei
>>>>>>>

2023-07-19 02:50:51

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/23 10:22, Yosry Ahmed wrote:
> On Tue, Jul 18, 2023 at 7:10 PM Yin Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/19/23 10:00, Yosry Ahmed wrote:
>>> On Tue, Jul 18, 2023 at 6:57 PM Yin Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>> On 7/19/23 09:52, Yosry Ahmed wrote:
>>>>> On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
>>>>>> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 7/19/23 06:48, Yosry Ahmed wrote:
>>>>>>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7/17/23 08:35, Yu Zhao wrote:
>>>>>>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>>>>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>>>>>>>>>> There is a problem here that I didn't have the time to elaborate: we
>>>>>>>>>>>> can't mlock() a folio that is within the range but not fully mapped
>>>>>>>>>>>> because this folio can be on the deferred split queue. When the split
>>>>>>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>>>>>>>>>> into other vmas) will be stranded on the unevictable lru.
>>>>>>>>>>> This should be fine unless I missed something. During large folio split,
>>>>>>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>>>>>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>>>>>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
>>>>>>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>>>>>>>>>
>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>
>>>>>>>> cpu1 cpu2
>>>>>>>> isolate folio
>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>> putback folio // add
>>>>>>>> to unevictable list
>>>>>>>> folio_test_clear_mlocked()
>>>>>>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
>>>>>>> could correct the none-mlocked but unevictable folio. But it doesn't because
>>>>>>> of this race.
>>>>>> (+Hugh Dickins for vis)
>>>>>>
>>>>>> Yu, I am not familiar with the split_folio() case, so I am not sure it
>>>>>> is the same exact race I stated above.
>>>>>>
>>>>>> Can you confirm whether or not doing folio_test_clear_mlocked() before
>>>>>> folio_test_clear_lru() would fix the race you are referring to? IIUC,
>>>>>> in this case, we make sure we clear PG_mlocked before we try to to
>>>>>> clear PG_lru. If we fail to clear it, then someone else have the folio
>>>>>> isolated after we clear PG_mlocked, so we can be sure that when they
>>>>>> put the folio back it will be correctly made evictable.
>>>>>>
>>>>>> Is my understanding correct?
>>>>> Hmm, actually this might not be enough. In folio_add_lru() we will
>>>>> call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
>>>>> sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
>>>>> race can still happen:
>>>>>
>>>>>
>>>>> cpu1 cpu2
>>>>> folio_evictable() //false
>>>>> folio_test_clear_mlocked()
>>>>> folio_test_clear_lru() //false
>>>>> folio_set_lru()
>>>>>
>>>>> Relying on PG_lru for synchronization might not be enough with the
>>>>> current code. We might need to revert 2262ace60713 ("mm/munlock:
>>>>> delete smp_mb() from __pagevec_lru_add_fn()").
>>>>>
>>>>> Sorry for going back and forth here, I am thinking out loud.
>>>>
>>>> Yes. Currently, the order in lru_add_fn() is not correct.
>>>>
>>>> I think we should move folio_test_clear_lru(folio) into
>>>>
>>>> lru locked range. As the lru lock here was expected to
>>>>
>>>> use for sync here. Check the comment in lru_add_fn().
>>>
>>> Right, I am wondering if it would be better to just revert
>>> 2262ace60713 and rely on the memory barrier and operations ordering
>>> instead of the lru lock. The lru lock is heavily contended as-is, so
>>> avoiding using it where possible is preferable I assume.
>> My understanding is set_lru after add folio to lru list is correct.
>> Once folio_set_lru(), others can do isolation of this folio. But if
>> this folio is not in lru list yet, what could happen? It's not required
>> to hold lru lock to do isolation.
>
> IIUC, clearing PG_lru is an atomic lockless operation to make sure no
> one else is isolating the folio, but then you need to hold the lruvec
> lock and actually delete the folio from the lru to complete its
> isolation. This is my read of folio_isolate_lru().
>
> Anyway, whether we rely on the lruvec lock or memory barrier +
> operation ordering doesn't make a huge difference (I think?). The code
> seemed to work with the latter before mlock_count was introduced.
>
> If we decide to go with the latter, I can integrate the fix into the
> refresh of my mlock_count rework RFC series (as it would be dependent
> on that series). If we decide to go with the lruvec, then it can be
> done as part of this series, or separately.
Let's wait for the response from Hugh and Yu. :).

>
> Thanks.
>
>>
>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> Yin, Fengwei
>>>>
>>>>
>>>>>
>>>>>> If yes, I can add this fix to my next version of the RFC series to
>>>>>> rework mlock_count. It would be a lot more complicated with the
>>>>>> current implementation (as I stated in a previous email).
>>>>>>
>>>>>>>>
>>>>>>>> The page would be stranded on the unevictable list in this case, no?
>>>>>>>> Maybe we should only try to isolate the page (clear PG_lru) after we
>>>>>>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
>>>>>>>> for sure that whoever has the page isolated will observe that
>>>>>>>> PG_mlocked is clear and correctly make the page evictable.
>>>>>>>>
>>>>>>>> This probably would be complicated with the current implementation, as
>>>>>>>> we first need to decrement mlock_count to determine if we want to
>>>>>>>> clear PG_mlocked, and to do so we need to isolate the page as
>>>>>>>> mlock_count overlays page->lru. With the proposal in [1] to rework
>>>>>>>> mlock_count, it might be much simpler as far as I can tell. I intend
>>>>>>>> to refresh this proposal soon-ish.
>>>>>>>>
>>>>>>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Yin, Fengwei
>>>>>>>>>

2023-07-19 02:53:59

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Tue, Jul 18, 2023 at 7:10 PM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/19/23 10:00, Yosry Ahmed wrote:
> > On Tue, Jul 18, 2023 at 6:57 PM Yin Fengwei <[email protected]> wrote:
> >>
> >>
> >> On 7/19/23 09:52, Yosry Ahmed wrote:
> >>> On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
> >>>> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> On 7/19/23 06:48, Yosry Ahmed wrote:
> >>>>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 7/17/23 08:35, Yu Zhao wrote:
> >>>>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>>>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>>>>>>>>> There is a problem here that I didn't have the time to elaborate: we
> >>>>>>>>>> can't mlock() a folio that is within the range but not fully mapped
> >>>>>>>>>> because this folio can be on the deferred split queue. When the split
> >>>>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>>>>>>>>> into other vmas) will be stranded on the unevictable lru.
> >>>>>>>>> This should be fine unless I missed something. During large folio split,
> >>>>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >>>>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >>>>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
> >>>>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
> >>>>>>>
> >>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>
> >>>>>> cpu1 cpu2
> >>>>>> isolate folio
> >>>>>> folio_test_clear_lru() // 0
> >>>>>> putback folio // add
> >>>>>> to unevictable list
> >>>>>> folio_test_clear_mlocked()
> >>>>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
> >>>>> could correct the none-mlocked but unevictable folio. But it doesn't because
> >>>>> of this race.
> >>>> (+Hugh Dickins for vis)
> >>>>
> >>>> Yu, I am not familiar with the split_folio() case, so I am not sure it
> >>>> is the same exact race I stated above.
> >>>>
> >>>> Can you confirm whether or not doing folio_test_clear_mlocked() before
> >>>> folio_test_clear_lru() would fix the race you are referring to? IIUC,
> >>>> in this case, we make sure we clear PG_mlocked before we try to to
> >>>> clear PG_lru. If we fail to clear it, then someone else have the folio
> >>>> isolated after we clear PG_mlocked, so we can be sure that when they
> >>>> put the folio back it will be correctly made evictable.
> >>>>
> >>>> Is my understanding correct?
> >>> Hmm, actually this might not be enough. In folio_add_lru() we will
> >>> call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
> >>> sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
> >>> race can still happen:
> >>>
> >>>
> >>> cpu1 cpu2
> >>> folio_evictable() //false
> >>> folio_test_clear_mlocked()
> >>> folio_test_clear_lru() //false
> >>> folio_set_lru()
> >>>
> >>> Relying on PG_lru for synchronization might not be enough with the
> >>> current code. We might need to revert 2262ace60713 ("mm/munlock:
> >>> delete smp_mb() from __pagevec_lru_add_fn()").
> >>>
> >>> Sorry for going back and forth here, I am thinking out loud.
> >>
> >> Yes. Currently, the order in lru_add_fn() is not correct.
> >>
> >> I think we should move folio_test_clear_lru(folio) into
> >>
> >> lru locked range. As the lru lock here was expected to
> >>
> >> use for sync here. Check the comment in lru_add_fn().
> >
> > Right, I am wondering if it would be better to just revert
> > 2262ace60713 and rely on the memory barrier and operations ordering
> > instead of the lru lock. The lru lock is heavily contended as-is, so
> > avoiding using it where possible is preferable I assume.
> My understanding is set_lru after add folio to lru list is correct.
> Once folio_set_lru(), others can do isolation of this folio. But if
> this folio is not in lru list yet, what could happen? It's not required
> to hold lru lock to do isolation.

IIUC, clearing PG_lru is an atomic lockless operation to make sure no
one else is isolating the folio, but then you need to hold the lruvec
lock and actually delete the folio from the lru to complete its
isolation. This is my read of folio_isolate_lru().

Anyway, whether we rely on the lruvec lock or memory barrier +
operation ordering doesn't make a huge difference (I think?). The code
seemed to work with the latter before mlock_count was introduced.

If we decide to go with the latter, I can integrate the fix into the
refresh of my mlock_count rework RFC series (as it would be dependent
on that series). If we decide to go with the lruvec, then it can be
done as part of this series, or separately.

Thanks.

>
> >
> >>
> >>
> >> Regards
> >>
> >> Yin, Fengwei
> >>
> >>
> >>>
> >>>> If yes, I can add this fix to my next version of the RFC series to
> >>>> rework mlock_count. It would be a lot more complicated with the
> >>>> current implementation (as I stated in a previous email).
> >>>>
> >>>>>>
> >>>>>> The page would be stranded on the unevictable list in this case, no?
> >>>>>> Maybe we should only try to isolate the page (clear PG_lru) after we
> >>>>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
> >>>>>> for sure that whoever has the page isolated will observe that
> >>>>>> PG_mlocked is clear and correctly make the page evictable.
> >>>>>>
> >>>>>> This probably would be complicated with the current implementation, as
> >>>>>> we first need to decrement mlock_count to determine if we want to
> >>>>>> clear PG_mlocked, and to do so we need to isolate the page as
> >>>>>> mlock_count overlays page->lru. With the proposal in [1] to rework
> >>>>>> mlock_count, it might be much simpler as far as I can tell. I intend
> >>>>>> to refresh this proposal soon-ish.
> >>>>>>
> >>>>>> [1]https://lore.kernel.org/lkml/[email protected]/
> >>>>>>
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Yin, Fengwei
> >>>>>>>

2023-07-19 14:33:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>
> >>>>>>>> cpu1 cpu2
> >>>>>>>> isolate folio
> >>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>> putback folio // add to unevictable list
> >>>>>>>> folio_test_clear_mlocked()
> >>>>> folio_set_lru()
> Let's wait the response from Huge and Yu. :).

I haven't been able to give it enough thought, but I suspect you are right:
that the current __munlock_folio() is deficient when folio_test_clear_lru()
fails.

(Though it has not been reported as a problem in practice: perhaps because
so few places try to isolate from the unevictable "list".)

I forget what my order of development was, but it's likely that I first
wrote the version for our own internal kernel - which used our original
lruvec locking, which did not depend on getting PG_lru first (having got
lru_lock, it checked memcg, then tried again if that had changed).

I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
but it turned out to work okay - elsewhere; but it looks as if I missed
its implication when adapting __munlock_page() for upstream.

If I were trying to fix this __munlock_folio() race myself (sorry, I'm
not), I would first look at that aspect: instead of folio_test_clear_lru()
behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
spin waiting for PG_lru here?
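
A minimal sketch of what such a helper could look like (the name follows the
suggestion above and is purely hypothetical -- no such function exists upstream):

/*
 * Hypothetical helper, illustrative only: spin until PG_lru can be
 * cleared, instead of treating folio_test_clear_lru() as a trylock
 * that simply fails.
 */
static inline void folio_wait_clear_lru(struct folio *folio)
{
        while (!folio_test_clear_lru(folio))
                cpu_relax();
}

Note that Fengwei's reply further down raises a deadlock concern with exactly this
kind of spinning when the caller already holds a pte lock.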

Hugh

2023-07-19 16:28:41

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
>
> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> > >>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> > >>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> > >>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> > >>>>>>>> cpu 2 is isolating the folio for any purpose:
> > >>>>>>>>
> > >>>>>>>> cpu1 cpu2
> > >>>>>>>> isolate folio
> > >>>>>>>> folio_test_clear_lru() // 0
> > >>>>>>>> putback folio // add to unevictable list
> > >>>>>>>> folio_test_clear_mlocked()
> > >>>>> folio_set_lru()
> > Let's wait the response from Huge and Yu. :).
>
> I haven't been able to give it enough thought, but I suspect you are right:
> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> fails.
>
> (Though it has not been reported as a problem in practice: perhaps because
> so few places try to isolate from the unevictable "list".)
>
> I forget what my order of development was, but it's likely that I first
> wrote the version for our own internal kernel - which used our original
> lruvec locking, which did not depend on getting PG_lru first (having got
> lru_lock, it checked memcg, then tried again if that had changed).

Right. Just holding the lruvec lock without clearing PG_lru would not
protect against memcg movement in this case.

>
> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> but it turned out to work okay - elsewhere; but it looks as if I missed
> its implication when adapting __munlock_page() for upstream.
>
> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> not), I would first look at that aspect: instead of folio_test_clear_lru()
> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> spin waiting for PG_lru here?

+Matthew Wilcox

It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
folio_set_lru() before checking folio_evictable(). While this is
probably extraneous since folio_batch_move_lru() will set it again
afterwards, it's probably harmless given that the lruvec lock is held
throughout (so no one can complete the folio isolation anyway), and
given that there were no problems introduced by this extra
folio_set_lru() as far as I can tell.

If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
("mm/munlock:
delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
ordering between manipulating PG_lru and PG_mlocked, I suppose we can
get away without having to spin. Again, that would only be possible if
reworking mlock_count [1] is acceptable. Otherwise, we can't clear
PG_mlocked before PG_lru in __munlock_folio().

I am not saying this is necessarily better than spinning, just a note
(and perhaps selfishly making [1] more appealing ;)).

[1]https://lore.kernel.org/lkml/[email protected]/

>
> Hugh

2023-07-20 02:30:39

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/2023 10:26 PM, Hugh Dickins wrote:
> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>
>>>>>>>>>> cpu1 cpu2
>>>>>>>>>> isolate folio
>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>> putback folio // add to unevictable list
>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>> folio_set_lru()
>> Let's wait the response from Huge and Yu. :).
>
> I haven't been able to give it enough thought, but I suspect you are right:
> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> fails.
>
> (Though it has not been reported as a problem in practice: perhaps because
> so few places try to isolate from the unevictable "list".)
>
> I forget what my order of development was, but it's likely that I first
> wrote the version for our own internal kernel - which used our original
> lruvec locking, which did not depend on getting PG_lru first (having got
> lru_lock, it checked memcg, then tried again if that had changed).
>
> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> but it turned out to work okay - elsewhere; but it looks as if I missed
> its implication when adapting __munlock_page() for upstream.
>
> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> not), I would first look at that aspect: instead of folio_test_clear_lru()
> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> spin waiting for PG_lru here?
Consider the following sequence:

CPU1 (migration)                   CPU2 (mlock)

isolate page (clear lru)           mlock_pte_range
try_to_migrate                       -> take pte lock
  try_to_migrate_one               munlock_folio
    pvmw -> take pte lock          __munlock_folio if batch full
                                   folio_wait_clear_lru

deadlock may happen.


Regards
Yin, Fengwei

>
> Hugh

2023-07-20 12:46:18

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
>>
>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>
>>>>>>>>>>> cpu1 cpu2
>>>>>>>>>>> isolate folio
>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>> putback folio // add to unevictable list
>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>> folio_set_lru()
>>> Let's wait the response from Huge and Yu. :).
>>
>> I haven't been able to give it enough thought, but I suspect you are right:
>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>> fails.
>>
>> (Though it has not been reported as a problem in practice: perhaps because
>> so few places try to isolate from the unevictable "list".)
>>
>> I forget what my order of development was, but it's likely that I first
>> wrote the version for our own internal kernel - which used our original
>> lruvec locking, which did not depend on getting PG_lru first (having got
>> lru_lock, it checked memcg, then tried again if that had changed).
>
> Right. Just holding the lruvec lock without clearing PG_lru would not
> protect against memcg movement in this case.
>
>>
>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>> but it turned out to work okay - elsewhere; but it looks as if I missed
>> its implication when adapting __munlock_page() for upstream.
>>
>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>> spin waiting for PG_lru here?
>
> +Matthew Wilcox
>
> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> folio_set_lru() before checking folio_evictable(). While this is
> probably extraneous since folio_batch_move_lru() will set it again
> afterwards, it's probably harmless given that the lruvec lock is held
> throughout (so no one can complete the folio isolation anyway), and
> given that there were no problems introduced by this extra
> folio_set_lru() as far as I can tell.
After checking the related code: yes, it looks fine to move folio_set_lru()
before the if (folio_evictable(folio)) check in lru_add_fn(), because the
lru lock is held there.

>
> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> ("mm/munlock:
> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> get away without having to spin. Again, that would only be possible if
> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> PG_mlocked before PG_lru in __munlock_folio().
What about the following change, which moves the PG_mlocked handling before
the lru check in __munlock_folio()?

diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c21..514f0d5bfbfd 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
{
int nr_pages = folio_nr_pages(folio);
- bool isolated = false;
+ bool isolated = false, mlocked = true;
+
+ mlocked = folio_test_clear_mlocked(folio);

if (!folio_test_clear_lru(folio))
goto munlock;
@@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
/* Then mlock_count is maintained, but might undercount */
if (folio->mlock_count)
folio->mlock_count--;
- if (folio->mlock_count)
+ if (folio->mlock_count) {
+ if (mlocked)
+ folio_set_mlocked(folio);
goto out;
+ }
}
/* else assume that was the last mlock: reclaim will fix it if not */

munlock:
- if (folio_test_clear_mlocked(folio)) {
+ if (mlocked) {
__zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
if (isolated || !folio_test_unevictable(folio))
__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);


>
> I am not saying this is necessarily better than spinning, just a note
> (and perhaps selfishly making [1] more appealing ;)).
>
> [1]https://lore.kernel.org/lkml/[email protected]/
>
>>
>> Hugh

2023-07-20 21:52:23

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> > On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
> >>
> >> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>>>>
> >>>>>>>>>>> cpu1 cpu2
> >>>>>>>>>>> isolate folio
> >>>>>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>>>>> putback folio // add to unevictable list
> >>>>>>>>>>> folio_test_clear_mlocked()
> >>>>>>>> folio_set_lru()
> >>> Let's wait the response from Huge and Yu. :).
> >>
> >> I haven't been able to give it enough thought, but I suspect you are right:
> >> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> >> fails.
> >>
> >> (Though it has not been reported as a problem in practice: perhaps because
> >> so few places try to isolate from the unevictable "list".)
> >>
> >> I forget what my order of development was, but it's likely that I first
> >> wrote the version for our own internal kernel - which used our original
> >> lruvec locking, which did not depend on getting PG_lru first (having got
> >> lru_lock, it checked memcg, then tried again if that had changed).
> >
> > Right. Just holding the lruvec lock without clearing PG_lru would not
> > protect against memcg movement in this case.
> >
> >>
> >> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> >> but it turned out to work okay - elsewhere; but it looks as if I missed
> >> its implication when adapting __munlock_page() for upstream.
> >>
> >> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> >> not), I would first look at that aspect: instead of folio_test_clear_lru()
> >> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> >> spin waiting for PG_lru here?
> >
> > +Matthew Wilcox
> >
> > It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> > a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> > folio_set_lru() before checking folio_evictable(). While this is
> > probably extraneous since folio_batch_move_lru() will set it again
> > afterwards, it's probably harmless given that the lruvec lock is held
> > throughout (so no one can complete the folio isolation anyway), and
> > given that there were no problems introduced by this extra
> > folio_set_lru() as far as I can tell.
> After checking related code, Yes. Looks fine if we move folio_set_lru()
> before if (folio_evictable(folio)) in lru_add_fn() because of holding
> lru lock.
>
> >
> > If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> > ("mm/munlock:
> > delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> > ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> > get away without having to spin. Again, that would only be possible if
> > reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> > PG_mlocked before PG_lru in __munlock_folio().
> What about following change to move mlocked operation before check lru
> in __munlock_folio()?

It seems correct to me on a high level, but I think there is a subtle problem:

We clear PG_mlocked before trying to isolate, to make sure that if
someone already has the folio isolated they will put it back on an
evictable list; then, if we are able to isolate the folio ourselves and
find that the mlock_count is > 0, we set PG_mlocked again.

There is a small window where PG_mlocked is temporarily cleared but the
folio is not actually munlocked (i.e. we don't update the NR_MLOCK stat).
In that window, a racing reclaimer on a different cpu may find VM_LOCKED
in a different vma and call mlock_folio(). In mlock_folio(), we call
folio_test_set_mlocked(folio), see that PG_mlocked is clear, and so
increment the MLOCK stats even though the folio was already mlocked. This
can cause the MLOCK stats to become unbalanced (more increments than
decrements), no?
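
For reference, the counting in mlock_folio() is shaped roughly like this
(paraphrased, not an exact quote of mm/mlock.c):

	/* Only account when the flag actually flips from clear to set. */
	if (!folio_test_set_mlocked(folio)) {
		int nr_pages = folio_nr_pages(folio);

		zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
	}

so a second mlocker only skips the accounting if it still sees PG_mlocked
set.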

>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0a0c996c5c21..514f0d5bfbfd 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
> {
> int nr_pages = folio_nr_pages(folio);
> - bool isolated = false;
> + bool isolated = false, mlocked = true;
> +
> + mlocked = folio_test_clear_mlocked(folio);
>
> if (!folio_test_clear_lru(folio))
> goto munlock;
> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
> /* Then mlock_count is maintained, but might undercount */
> if (folio->mlock_count)
> folio->mlock_count--;
> - if (folio->mlock_count)
> + if (folio->mlock_count) {
> + if (mlocked)
> + folio_set_mlocked(folio);
> goto out;
> + }
> }
> /* else assume that was the last mlock: reclaim will fix it if not */
>
> munlock:
> - if (folio_test_clear_mlocked(folio)) {
> + if (mlocked) {
> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> if (isolated || !folio_test_unevictable(folio))
> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
>
>
> >
> > I am not saying this is necessarily better than spinning, just a note
> > (and perhaps selfishly making [1] more appealing ;)).
> >
> > [1]https://lore.kernel.org/lkml/[email protected]/
> >
> >>
> >> Hugh

2023-07-21 01:52:56

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
>>>>
>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>>>
>>>>>>>>>>>>> cpu1 cpu2
>>>>>>>>>>>>> isolate folio
>>>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>>>> putback folio // add to unevictable list
>>>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>>>> folio_set_lru()
>>>>> Let's wait the response from Huge and Yu. :).
>>>>
>>>> I haven't been able to give it enough thought, but I suspect you are right:
>>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>>>> fails.
>>>>
>>>> (Though it has not been reported as a problem in practice: perhaps because
>>>> so few places try to isolate from the unevictable "list".)
>>>>
>>>> I forget what my order of development was, but it's likely that I first
>>>> wrote the version for our own internal kernel - which used our original
>>>> lruvec locking, which did not depend on getting PG_lru first (having got
>>>> lru_lock, it checked memcg, then tried again if that had changed).
>>>
>>> Right. Just holding the lruvec lock without clearing PG_lru would not
>>> protect against memcg movement in this case.
>>>
>>>>
>>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>>>> but it turned out to work okay - elsewhere; but it looks as if I missed
>>>> its implication when adapting __munlock_page() for upstream.
>>>>
>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>>>> spin waiting for PG_lru here?
>>>
>>> +Matthew Wilcox
>>>
>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
>>> folio_set_lru() before checking folio_evictable(). While this is
>>> probably extraneous since folio_batch_move_lru() will set it again
>>> afterwards, it's probably harmless given that the lruvec lock is held
>>> throughout (so no one can complete the folio isolation anyway), and
>>> given that there were no problems introduced by this extra
>>> folio_set_lru() as far as I can tell.
>> After checking related code, Yes. Looks fine if we move folio_set_lru()
>> before if (folio_evictable(folio)) in lru_add_fn() because of holding
>> lru lock.
>>
>>>
>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
>>> ("mm/munlock:
>>> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
>>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
>>> get away without having to spin. Again, that would only be possible if
>>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
>>> PG_mlocked before PG_lru in __munlock_folio().
>> What about following change to move mlocked operation before check lru
>> in __munlock_folio()?
>
> It seems correct to me on a high level, but I think there is a subtle problem:
>
> We clear PG_mlocked before trying to isolate to make sure that if
> someone already has the folio isolated they will put it back on an
> evictable list, then if we are able to isolate the folio ourselves and
> find that the mlock_count is > 0, we set PG_mlocked again.
>
> There is a small window where PG_mlocked might be temporarily cleared
> but the folio is not actually munlocked (i.e we don't update the
> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> may find VM_LOCKED from in a different vma, and call mlock_folio(). In
> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> PG_mlocked is clear, so we will increment the MLOCK stats, even though
> the folio was already mlocked. This can cause MLOCK stats to be
> unbalanced (increments more than decrements), no?
It looks like NR_MLOCK is always updated together with the PG_mlocked bit,
so it should not be possible for it to become unbalanced.

Let's say:
mlock_folio() NR_MLOCK increase and set mlocked
mlock_folio() NR_MLOCK NO change as folio is already mlocked

__munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
clear mlocked

folio_putback_lru()
reclaimer: mlock_folio() NR_MLOCK increase and set mlocked

munlock_folio() NR_MLOCK decrease (0) and clear mlocked
munlock_folio() NR_MLOCK NO change as folio has no mlocked set


Regards
Yin, Fengwei

>
>>
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index 0a0c996c5c21..514f0d5bfbfd 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
>> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
>> {
>> int nr_pages = folio_nr_pages(folio);
>> - bool isolated = false;
>> + bool isolated = false, mlocked = true;
>> +
>> + mlocked = folio_test_clear_mlocked(folio);
>>
>> if (!folio_test_clear_lru(folio))
>> goto munlock;
>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
>> /* Then mlock_count is maintained, but might undercount */
>> if (folio->mlock_count)
>> folio->mlock_count--;
>> - if (folio->mlock_count)
>> + if (folio->mlock_count) {
>> + if (mlocked)
>> + folio_set_mlocked(folio);
>> goto out;
>> + }
>> }
>> /* else assume that was the last mlock: reclaim will fix it if not */
>>
>> munlock:
>> - if (folio_test_clear_mlocked(folio)) {
>> + if (mlocked) {
>> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
>> if (isolated || !folio_test_unevictable(folio))
>> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
>>
>>
>>>
>>> I am not saying this is necessarily better than spinning, just a note
>>> (and perhaps selfishly making [1] more appealing ;)).
>>>
>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>
>>>>
>>>> Hugh

2023-07-21 01:54:33

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Thu, Jul 20, 2023 at 6:12 PM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
> >>
> >>
> >>
> >> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> >>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
> >>>>
> >>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> cpu1 cpu2
> >>>>>>>>>>>>> isolate folio
> >>>>>>>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>>>>>>> putback folio // add to unevictable list
> >>>>>>>>>>>>> folio_test_clear_mlocked()
> >>>>>>>>>> folio_set_lru()
> >>>>> Let's wait the response from Huge and Yu. :).
> >>>>
> >>>> I haven't been able to give it enough thought, but I suspect you are right:
> >>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> >>>> fails.
> >>>>
> >>>> (Though it has not been reported as a problem in practice: perhaps because
> >>>> so few places try to isolate from the unevictable "list".)
> >>>>
> >>>> I forget what my order of development was, but it's likely that I first
> >>>> wrote the version for our own internal kernel - which used our original
> >>>> lruvec locking, which did not depend on getting PG_lru first (having got
> >>>> lru_lock, it checked memcg, then tried again if that had changed).
> >>>
> >>> Right. Just holding the lruvec lock without clearing PG_lru would not
> >>> protect against memcg movement in this case.
> >>>
> >>>>
> >>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> >>>> but it turned out to work okay - elsewhere; but it looks as if I missed
> >>>> its implication when adapting __munlock_page() for upstream.
> >>>>
> >>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> >>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
> >>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> >>>> spin waiting for PG_lru here?
> >>>
> >>> +Matthew Wilcox
> >>>
> >>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> >>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> >>> folio_set_lru() before checking folio_evictable(). While this is
> >>> probably extraneous since folio_batch_move_lru() will set it again
> >>> afterwards, it's probably harmless given that the lruvec lock is held
> >>> throughout (so no one can complete the folio isolation anyway), and
> >>> given that there were no problems introduced by this extra
> >>> folio_set_lru() as far as I can tell.
> >> After checking related code, Yes. Looks fine if we move folio_set_lru()
> >> before if (folio_evictable(folio)) in lru_add_fn() because of holding
> >> lru lock.
> >>
> >>>
> >>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> >>> ("mm/munlock:
> >>> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> >>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> >>> get away without having to spin. Again, that would only be possible if
> >>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> >>> PG_mlocked before PG_lru in __munlock_folio().
> >> What about following change to move mlocked operation before check lru
> >> in __munlock_folio()?
> >
> > It seems correct to me on a high level, but I think there is a subtle problem:
> >
> > We clear PG_mlocked before trying to isolate to make sure that if
> > someone already has the folio isolated they will put it back on an
> > evictable list, then if we are able to isolate the folio ourselves and
> > find that the mlock_count is > 0, we set PG_mlocked again.
> >
> > There is a small window where PG_mlocked might be temporarily cleared
> > but the folio is not actually munlocked (i.e we don't update the
> > NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> > may find VM_LOCKED from in a different vma, and call mlock_folio(). In
> > mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> > PG_mlocked is clear, so we will increment the MLOCK stats, even though
> > the folio was already mlocked. This can cause MLOCK stats to be
> > unbalanced (increments more than decrements), no?
> Looks like NR_MLOCK is always connected to PG_mlocked bit. Not possible
> to be unbalanced.
>
> Let's say:
> mlock_folio() NR_MLOCK increase and set mlocked
> mlock_folio() NR_MLOCK NO change as folio is already mlocked
>
> __munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
> clear mlocked
>
> folio_putback_lru()
> reclaimed mlock_folio() NR_MLOCK increase and set mlocked
>
> munlock_folio() NR_MLOCK decrease (0) and clear mlocked
> munlock_folio() NR_MLOCK NO change as folio has no mlocked set

Right. The problem with the diff is that we temporarily clear
PG_mlocked *without* updating NR_MLOCK.

Consider a folio that is mlocked by two vmas. NR_MLOCK = folio_nr_pages.

Assume cpu 1 is doing __munlock_folio from one of the vmas, while cpu
2 is doing reclaim.

cpu 1                           cpu 2
clear PG_mlocked
                                folio_referenced()
                                  mlock_folio()
                                    set PG_mlocked
                                    add to NR_MLOCK
mlock_count > 0
  set PG_mlocked
  goto out

Result: NR_MLOCK = folio_nr_pages * 2.

When the folio is munlock()'d later from the second vma, NR_MLOCK will
be reduced to folio_nr_pages, but there are no mlocked folios left.

This is the scenario that I have in mind. Please correct me if I am wrong.

>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >> diff --git a/mm/mlock.c b/mm/mlock.c
> >> index 0a0c996c5c21..514f0d5bfbfd 100644
> >> --- a/mm/mlock.c
> >> +++ b/mm/mlock.c
> >> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
> >> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
> >> {
> >> int nr_pages = folio_nr_pages(folio);
> >> - bool isolated = false;
> >> + bool isolated = false, mlocked = true;
> >> +
> >> + mlocked = folio_test_clear_mlocked(folio);
> >>
> >> if (!folio_test_clear_lru(folio))
> >> goto munlock;
> >> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
> >> /* Then mlock_count is maintained, but might undercount */
> >> if (folio->mlock_count)
> >> folio->mlock_count--;
> >> - if (folio->mlock_count)
> >> + if (folio->mlock_count) {
> >> + if (mlocked)
> >> + folio_set_mlocked(folio);
> >> goto out;
> >> + }
> >> }
> >> /* else assume that was the last mlock: reclaim will fix it if not */
> >>
> >> munlock:
> >> - if (folio_test_clear_mlocked(folio)) {
> >> + if (mlocked) {
> >> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> >> if (isolated || !folio_test_unevictable(folio))
> >> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> >>
> >>
> >>>
> >>> I am not saying this is necessarily better than spinning, just a note
> >>> (and perhaps selfishly making [1] more appealing ;)).
> >>>
> >>> [1]https://lore.kernel.org/lkml/[email protected]/
> >>>
> >>>>
> >>>> Hugh

2023-07-21 03:46:25

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Thu, Jul 20, 2023 at 8:19 PM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 7/21/2023 9:35 AM, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 6:12 PM Yin, Fengwei <[email protected]> wrote:
> >>
> >>
> >>
> >> On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> >>> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> >>>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
> >>>>>>
> >>>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> cpu1 cpu2
> >>>>>>>>>>>>>>> isolate folio
> >>>>>>>>>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>>>>>>>>> putback folio // add to unevictable list
> >>>>>>>>>>>>>>> folio_test_clear_mlocked()
> >>>>>>>>>>>> folio_set_lru()
> >>>>>>> Let's wait the response from Huge and Yu. :).
> >>>>>>
> >>>>>> I haven't been able to give it enough thought, but I suspect you are right:
> >>>>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> >>>>>> fails.
> >>>>>>
> >>>>>> (Though it has not been reported as a problem in practice: perhaps because
> >>>>>> so few places try to isolate from the unevictable "list".)
> >>>>>>
> >>>>>> I forget what my order of development was, but it's likely that I first
> >>>>>> wrote the version for our own internal kernel - which used our original
> >>>>>> lruvec locking, which did not depend on getting PG_lru first (having got
> >>>>>> lru_lock, it checked memcg, then tried again if that had changed).
> >>>>>
> >>>>> Right. Just holding the lruvec lock without clearing PG_lru would not
> >>>>> protect against memcg movement in this case.
> >>>>>
> >>>>>>
> >>>>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> >>>>>> but it turned out to work okay - elsewhere; but it looks as if I missed
> >>>>>> its implication when adapting __munlock_page() for upstream.
> >>>>>>
> >>>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> >>>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
> >>>>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> >>>>>> spin waiting for PG_lru here?
> >>>>>
> >>>>> +Matthew Wilcox
> >>>>>
> >>>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> >>>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> >>>>> folio_set_lru() before checking folio_evictable(). While this is
> >>>>> probably extraneous since folio_batch_move_lru() will set it again
> >>>>> afterwards, it's probably harmless given that the lruvec lock is held
> >>>>> throughout (so no one can complete the folio isolation anyway), and
> >>>>> given that there were no problems introduced by this extra
> >>>>> folio_set_lru() as far as I can tell.
> >>>> After checking related code, Yes. Looks fine if we move folio_set_lru()
> >>>> before if (folio_evictable(folio)) in lru_add_fn() because of holding
> >>>> lru lock.
> >>>>
> >>>>>
> >>>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> >>>>> ("mm/munlock:
> >>>>> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> >>>>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> >>>>> get away without having to spin. Again, that would only be possible if
> >>>>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> >>>>> PG_mlocked before PG_lru in __munlock_folio().
> >>>> What about following change to move mlocked operation before check lru
> >>>> in __munlock_folio()?
> >>>
> >>> It seems correct to me on a high level, but I think there is a subtle problem:
> >>>
> >>> We clear PG_mlocked before trying to isolate to make sure that if
> >>> someone already has the folio isolated they will put it back on an
> >>> evictable list, then if we are able to isolate the folio ourselves and
> >>> find that the mlock_count is > 0, we set PG_mlocked again.
> >>>
> >>> There is a small window where PG_mlocked might be temporarily cleared
> >>> but the folio is not actually munlocked (i.e we don't update the
> >>> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> >>> may find VM_LOCKED from in a different vma, and call mlock_folio(). In
> >>> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> >>> PG_mlocked is clear, so we will increment the MLOCK stats, even though
> >>> the folio was already mlocked. This can cause MLOCK stats to be
> >>> unbalanced (increments more than decrements), no?
> >> Looks like NR_MLOCK is always connected to PG_mlocked bit. Not possible
> >> to be unbalanced.
> >>
> >> Let's say:
> >> mlock_folio() NR_MLOCK increase and set mlocked
> >> mlock_folio() NR_MLOCK NO change as folio is already mlocked
> >>
> >> __munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
> >> clear mlocked
> >>
> >> folio_putback_lru()
> >> reclaimed mlock_folio() NR_MLOCK increase and set mlocked
> >>
> >> munlock_folio() NR_MLOCK decrease (0) and clear mlocked
> >> munlock_folio() NR_MLOCK NO change as folio has no mlocked set
> >
> > Right. The problem with the diff is that we temporarily clear
> > PG_mlocked *without* updating NR_MLOCK.
> >
> > Consider a folio that is mlocked by two vmas. NR_MLOCK = folio_nr_pages.
> >
> > Assume cpu 1 is doing __munlock_folio from one of the vmas, while cpu
> > 2 is doing reclaim.
> >
> > cpu 1                           cpu 2
> > clear PG_mlocked
> >                                 folio_referenced()
> >                                   mlock_folio()
> >                                     set PG_mlocked
> >                                     add to NR_MLOCK
> > mlock_count > 0
> >   set PG_mlocked
> >   goto out
> >
> > Result: NR_MLOCK = folio_nr_pages * 2.
> >
> > When the folio is munlock()'d later from the second vma, NR_MLOCK will
> > be reduced to folio_nr_pages, but there are not mlocked folios.
> >
> > This is the scenario that I have in mind. Please correct me if I am wrong.
> Yes. Looks possible even may be difficult to hit.
>
> My first thought was it's not possible because unevictable folio will not
> be picked by reclaimer. But it's possible case if things happen between
> clear_mlock and test_and_clear_lru:
> folio_putback_lru() by other isolation user like migration
> reclaimer pick the folio and call mlock_folio()
> reclaimer call folio
>
> The fixing can be following the rules (combine NR_LOCK with PG_mlocked bit)
> strictly.

Yeah probably. I believe restoring the old ordering of manipulating
PG_lru and PG_mlocked with the memory barrier would be a simpler fix,
but this is only possible if the mlock_count rework gets merged.
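
The pairing I have in mind is the usual store/load one; as an illustration
(assuming the restored smp_mb() on the add side, and the full barrier
implied by the value-returning test-and-clear on the munlock side):

	lru add side                         munlock side
	folio_set_lru(folio)                 folio_test_clear_mlocked(folio)
	smp_mb()                             (barrier implied by the RMW)
	folio_evictable() reads PG_mlocked   folio_test_clear_lru() reads PG_lru

With a full barrier on each side, at least one of them must observe the
other's store: either the adder sees PG_mlocked already cleared and keeps
the folio evictable, or the munlocker sees PG_lru set, isolates the folio
and fixes things up.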

>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >>
> >> Regards
> >> Yin, Fengwei
> >>
> >>>
> >>>>
> >>>> diff --git a/mm/mlock.c b/mm/mlock.c
> >>>> index 0a0c996c5c21..514f0d5bfbfd 100644
> >>>> --- a/mm/mlock.c
> >>>> +++ b/mm/mlock.c
> >>>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
> >>>> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
> >>>> {
> >>>> int nr_pages = folio_nr_pages(folio);
> >>>> - bool isolated = false;
> >>>> + bool isolated = false, mlocked = true;
> >>>> +
> >>>> + mlocked = folio_test_clear_mlocked(folio);
> >>>>
> >>>> if (!folio_test_clear_lru(folio))
> >>>> goto munlock;
> >>>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
> >>>> /* Then mlock_count is maintained, but might undercount */
> >>>> if (folio->mlock_count)
> >>>> folio->mlock_count--;
> >>>> - if (folio->mlock_count)
> >>>> + if (folio->mlock_count) {
> >>>> + if (mlocked)
> >>>> + folio_set_mlocked(folio);
> >>>> goto out;
> >>>> + }
> >>>> }
> >>>> /* else assume that was the last mlock: reclaim will fix it if not */
> >>>>
> >>>> munlock:
> >>>> - if (folio_test_clear_mlocked(folio)) {
> >>>> + if (mlocked) {
> >>>> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> >>>> if (isolated || !folio_test_unevictable(folio))
> >>>> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> >>>>
> >>>>
> >>>>>
> >>>>> I am not saying this is necessarily better than spinning, just a note
> >>>>> (and perhaps selfishly making [1] more appealing ;)).
> >>>>>
> >>>>> [1]https://lore.kernel.org/lkml/[email protected]/
> >>>>>
> >>>>>>
> >>>>>> Hugh

2023-07-21 03:50:57

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/21/2023 9:35 AM, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 6:12 PM Yin, Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
>>> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
>>>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
>>>>>>
>>>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> cpu1 cpu2
>>>>>>>>>>>>>>> isolate folio
>>>>>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>>>>>> putback folio // add to unevictable list
>>>>>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>>>>>> folio_set_lru()
>>>>>>> Let's wait the response from Huge and Yu. :).
>>>>>>
>>>>>> I haven't been able to give it enough thought, but I suspect you are right:
>>>>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>>>>>> fails.
>>>>>>
>>>>>> (Though it has not been reported as a problem in practice: perhaps because
>>>>>> so few places try to isolate from the unevictable "list".)
>>>>>>
>>>>>> I forget what my order of development was, but it's likely that I first
>>>>>> wrote the version for our own internal kernel - which used our original
>>>>>> lruvec locking, which did not depend on getting PG_lru first (having got
>>>>>> lru_lock, it checked memcg, then tried again if that had changed).
>>>>>
>>>>> Right. Just holding the lruvec lock without clearing PG_lru would not
>>>>> protect against memcg movement in this case.
>>>>>
>>>>>>
>>>>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>>>>>> but it turned out to work okay - elsewhere; but it looks as if I missed
>>>>>> its implication when adapting __munlock_page() for upstream.
>>>>>>
>>>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>>>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>>>>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>>>>>> spin waiting for PG_lru here?
>>>>>
>>>>> +Matthew Wilcox
>>>>>
>>>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
>>>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
>>>>> folio_set_lru() before checking folio_evictable(). While this is
>>>>> probably extraneous since folio_batch_move_lru() will set it again
>>>>> afterwards, it's probably harmless given that the lruvec lock is held
>>>>> throughout (so no one can complete the folio isolation anyway), and
>>>>> given that there were no problems introduced by this extra
>>>>> folio_set_lru() as far as I can tell.
>>>> After checking related code, Yes. Looks fine if we move folio_set_lru()
>>>> before if (folio_evictable(folio)) in lru_add_fn() because of holding
>>>> lru lock.
>>>>
>>>>>
>>>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
>>>>> ("mm/munlock:
>>>>> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
>>>>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
>>>>> get away without having to spin. Again, that would only be possible if
>>>>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
>>>>> PG_mlocked before PG_lru in __munlock_folio().
>>>> What about following change to move mlocked operation before check lru
>>>> in __munlock_folio()?
>>>
>>> It seems correct to me on a high level, but I think there is a subtle problem:
>>>
>>> We clear PG_mlocked before trying to isolate to make sure that if
>>> someone already has the folio isolated they will put it back on an
>>> evictable list, then if we are able to isolate the folio ourselves and
>>> find that the mlock_count is > 0, we set PG_mlocked again.
>>>
>>> There is a small window where PG_mlocked might be temporarily cleared
>>> but the folio is not actually munlocked (i.e we don't update the
>>> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
>>> may find VM_LOCKED from in a different vma, and call mlock_folio(). In
>>> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
>>> PG_mlocked is clear, so we will increment the MLOCK stats, even though
>>> the folio was already mlocked. This can cause MLOCK stats to be
>>> unbalanced (increments more than decrements), no?
>> Looks like NR_MLOCK is always connected to PG_mlocked bit. Not possible
>> to be unbalanced.
>>
>> Let's say:
>> mlock_folio() NR_MLOCK increase and set mlocked
>> mlock_folio() NR_MLOCK NO change as folio is already mlocked
>>
>> __munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
>> clear mlocked
>>
>> folio_putback_lru()
>> reclaimed mlock_folio() NR_MLOCK increase and set mlocked
>>
>> munlock_folio() NR_MLOCK decrease (0) and clear mlocked
>> munlock_folio() NR_MLOCK NO change as folio has no mlocked set
>
> Right. The problem with the diff is that we temporarily clear
> PG_mlocked *without* updating NR_MLOCK.
>
> Consider a folio that is mlocked by two vmas. NR_MLOCK = folio_nr_pages.
>
> Assume cpu 1 is doing __munlock_folio from one of the vmas, while cpu
> 2 is doing reclaim.
>
> cpu 1                           cpu 2
> clear PG_mlocked
>                                 folio_referenced()
>                                   mlock_folio()
>                                     set PG_mlocked
>                                     add to NR_MLOCK
> mlock_count > 0
>   set PG_mlocked
>   goto out
>
> Result: NR_MLOCK = folio_nr_pages * 2.
>
> When the folio is munlock()'d later from the second vma, NR_MLOCK will
> be reduced to folio_nr_pages, but there are not mlocked folios.
>
> This is the scenario that I have in mind. Please correct me if I am wrong.
Yes. It looks possible, even if it may be difficult to hit.

My first thought was that it's not possible because an unevictable folio
will not be picked by the reclaimer. But it becomes possible if things
happen between clearing PG_mlocked and the test-and-clear of PG_lru:
    folio_putback_lru() by another isolation user like migration
    reclaimer picks the folio up and calls mlock_folio()
    reclaimer call folio

The fix can be to follow the rule (keep NR_MLOCK coupled with the
PG_mlocked bit) strictly.


Regards
Yin, Fengwei

>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>> index 0a0c996c5c21..514f0d5bfbfd 100644
>>>> --- a/mm/mlock.c
>>>> +++ b/mm/mlock.c
>>>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
>>>> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
>>>> {
>>>> int nr_pages = folio_nr_pages(folio);
>>>> - bool isolated = false;
>>>> + bool isolated = false, mlocked = true;
>>>> +
>>>> + mlocked = folio_test_clear_mlocked(folio);
>>>>
>>>> if (!folio_test_clear_lru(folio))
>>>> goto munlock;
>>>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
>>>> /* Then mlock_count is maintained, but might undercount */
>>>> if (folio->mlock_count)
>>>> folio->mlock_count--;
>>>> - if (folio->mlock_count)
>>>> + if (folio->mlock_count) {
>>>> + if (mlocked)
>>>> + folio_set_mlocked(folio);
>>>> goto out;
>>>> + }
>>>> }
>>>> /* else assume that was the last mlock: reclaim will fix it if not */
>>>>
>>>> munlock:
>>>> - if (folio_test_clear_mlocked(folio)) {
>>>> + if (mlocked) {
>>>> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
>>>> if (isolated || !folio_test_unevictable(folio))
>>>> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
>>>>
>>>>
>>>>>
>>>>> I am not saying this is necessarily better than spinning, just a note
>>>>> (and perhaps selfishly making [1] more appealing ;)).
>>>>>
>>>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>>>>
>>>>>> Hugh

2023-07-26 13:47:40

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/15/23 14:06, Yu Zhao wrote:
> On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <[email protected]> wrote:
>>
>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>
>>> Current kernel only lock base size folio during mlock syscall.
>>> Add large folio support with following rules:
>>> - Only mlock large folio when it's in VM_LOCKED VMA range
>>>
>>> - If there is cow folio, mlock the cow folio as cow folio
>>> is also in VM_LOCKED VMA range.
>>>
>>> - munlock will apply to the large folio which is in VMA range
>>> or cross the VMA boundary.
>>>
>>> The last rule is used to handle the case that the large folio is
>>> mlocked, later the VMA is split in the middle of large folio
>>> and this large folio become cross VMA boundary.
>>>
>>> Signed-off-by: Yin Fengwei <[email protected]>
>>> ---
>>> mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 99 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>> index 0a0c996c5c214..f49e079066870 100644
>>> --- a/mm/mlock.c
>>> +++ b/mm/mlock.c
>>> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
>>> local_unlock(&mlock_fbatch.lock);
>>> }
>>>
>>> +static inline bool should_mlock_folio(struct folio *folio,
>>> + struct vm_area_struct *vma)
>>> +{
>>> + if (vma->vm_flags & VM_LOCKED)
>>> + return (!folio_test_large(folio) ||
>>> + folio_within_vma(folio, vma));
>>> +
>>> + /*
>>> + * For unlock, allow munlock large folio which is partially
>>> + * mapped to VMA. As it's possible that large folio is
>>> + * mlocked and VMA is split later.
>>> + *
>>> + * During memory pressure, such kind of large folio can
>>> + * be split. And the pages are not in VM_LOCKed VMA
>>> + * can be reclaimed.
>>> + */
>>> +
>>> + return true;
>>
>> Looks good, or just
>>
>> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
>> {
>> return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
>> }
>>
>>> +}
>>> +
>>> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
>>> + pte_t pte, unsigned long addr, unsigned long end)
>>> +{
>>> + unsigned int nr;
>>> +
>>> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
>>> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
>>> +}
>>> +
>>> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>> + pte_t *pte, unsigned long addr, unsigned int nr)
>>> +{
>>> + struct folio *cow_folio;
>>> + unsigned int step = 1;
>>> +
>>> + mlock_folio(folio);
>>> + if (nr == 1)
>>> + return;
>>> +
>>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>> + pte_t ptent;
>>> +
>>> + step = 1;
>>> + ptent = ptep_get(pte);
>>> +
>>> + if (!pte_present(ptent))
>>> + continue;
>>> +
>>> + cow_folio = vm_normal_folio(vma, addr, ptent);
>>> + if (!cow_folio || cow_folio == folio) {
>>> + continue;
>>> + }
>>> +
>>> + mlock_folio(cow_folio);
>>> + step = get_folio_mlock_step(folio, ptent,
>>> + addr, addr + (nr << PAGE_SHIFT));
>>> + }
>>> +}
>>> +
>>> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>> + pte_t *pte, unsigned long addr, unsigned int nr)
>>> +{
>>> + struct folio *cow_folio;
>>> + unsigned int step = 1;
>>> +
>>> + munlock_folio(folio);
>>> + if (nr == 1)
>>> + return;
>>> +
>>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>> + pte_t ptent;
>>> +
>>> + step = 1;
>>> + ptent = ptep_get(pte);
>>> +
>>> + if (!pte_present(ptent))
>>> + continue;
>>> +
>>> + cow_folio = vm_normal_folio(vma, addr, ptent);
>>> + if (!cow_folio || cow_folio == folio) {
>>> + continue;
>>> + }
>>> +
>>> + munlock_folio(cow_folio);
>>> + step = get_folio_mlock_step(folio, ptent,
>>> + addr, addr + (nr << PAGE_SHIFT));
>>> + }
>>> +}
>>
>> I'll finish the above later.
>
> There is a problem here that I didn't have the time to elaborate: we
> can't mlock() a folio that is within the range but not fully mapped
> because this folio can be on the deferred split queue. When the split
> happens, those unmapped folios (not mapped by this vma but are mapped
> into other vmas) will be stranded on the unevictable lru.
I checked the remap case in the past few days, and I agree we shouldn't
treat a folio that is in the range but not fully mapped as an in-range
folio.

As for the remap case, it's possible that the folio is not on the deferred
split queue, but part of the folio is mapped to a VM_LOCKED vma and the
other part is mapped to a non-VM_LOCKED vma. In this case, the folio can't
be split as it's not on the deferred split queue. So page reclaim should be
allowed to pick this folio up, split it and reclaim the pages in the
non-VM_LOCKED vma. So we can't mlock such a folio.

The same thing can happen with madvise_cold_or_pageout_pte_range().
I will update folio_in_vma() to check the PTEs as well.
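
Something along these lines, only as a sketch of the idea (the helper name
and details are made up here, not the actual change); it assumes pte points
at the entry for the folio's first page and addr is that page's mapping
address:

static bool folio_ptes_fully_mapped(struct folio *folio,
		struct vm_area_struct *vma, unsigned long addr, pte_t *pte)
{
	unsigned long i, nr = folio_nr_pages(folio);
	unsigned long pfn = folio_pfn(folio);

	/* The whole mapping range of the folio must sit inside the VMA. */
	if (addr < vma->vm_start || addr + (nr << PAGE_SHIFT) > vma->vm_end)
		return false;

	/* Every PTE must be present and map the expected page of the folio. */
	for (i = 0; i < nr; i++) {
		pte_t ptent = ptep_get(pte + i);

		if (!pte_present(ptent) || pte_pfn(ptent) != pfn + i)
			return false;
	}

	return true;
}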

Regards
Yin, Fengwei

>
> For that matter, we can't mlock any large folios that are being
> shared, unless you want to overengineer it by checking whether all
> sharing vmas are also mlocked -- mlock is cleared during fork. So the
> condition for mlocking large anon folios is 1) within range 2) fully
> mapped 3) not shared (mapcount is 1). The final patch should look like
> something like this:
>
> - if (folio_test_large(folio))
> + if (folio_pfn(folio) != pte_pfn(ptent))
> + continue;
> + if (!the_aforementioned_condition())
>
> There is another corner case I forgot to mention: for example, what if
> a folio spans two (the only two) adjacent mlocked vmas? No need to
> worry about this since it's not worth optimizing.

2023-07-26 18:23:47

by Yu Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, Jul 26, 2023 at 6:49 AM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/15/23 14:06, Yu Zhao wrote:
> > On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <[email protected]> wrote:
> >>
> >> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >>>
> >>> Current kernel only lock base size folio during mlock syscall.
> >>> Add large folio support with following rules:
> >>> - Only mlock large folio when it's in VM_LOCKED VMA range
> >>>
> >>> - If there is cow folio, mlock the cow folio as cow folio
> >>> is also in VM_LOCKED VMA range.
> >>>
> >>> - munlock will apply to the large folio which is in VMA range
> >>> or cross the VMA boundary.
> >>>
> >>> The last rule is used to handle the case that the large folio is
> >>> mlocked, later the VMA is split in the middle of large folio
> >>> and this large folio become cross VMA boundary.
> >>>
> >>> Signed-off-by: Yin Fengwei <[email protected]>
> >>> ---
> >>> mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>> 1 file changed, 99 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/mm/mlock.c b/mm/mlock.c
> >>> index 0a0c996c5c214..f49e079066870 100644
> >>> --- a/mm/mlock.c
> >>> +++ b/mm/mlock.c
> >>> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
> >>> local_unlock(&mlock_fbatch.lock);
> >>> }
> >>>
> >>> +static inline bool should_mlock_folio(struct folio *folio,
> >>> + struct vm_area_struct *vma)
> >>> +{
> >>> + if (vma->vm_flags & VM_LOCKED)
> >>> + return (!folio_test_large(folio) ||
> >>> + folio_within_vma(folio, vma));
> >>> +
> >>> + /*
> >>> + * For unlock, allow munlock large folio which is partially
> >>> + * mapped to VMA. As it's possible that large folio is
> >>> + * mlocked and VMA is split later.
> >>> + *
> >>> + * During memory pressure, such kind of large folio can
> >>> + * be split. And the pages are not in VM_LOCKed VMA
> >>> + * can be reclaimed.
> >>> + */
> >>> +
> >>> + return true;
> >>
> >> Looks good, or just
> >>
> >> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
> >> {
> >> return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
> >> }
> >>
> >>> +}
> >>> +
> >>> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
> >>> + pte_t pte, unsigned long addr, unsigned long end)
> >>> +{
> >>> + unsigned int nr;
> >>> +
> >>> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> >>> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> >>> +}
> >>> +
> >>> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> >>> + pte_t *pte, unsigned long addr, unsigned int nr)
> >>> +{
> >>> + struct folio *cow_folio;
> >>> + unsigned int step = 1;
> >>> +
> >>> + mlock_folio(folio);
> >>> + if (nr == 1)
> >>> + return;
> >>> +
> >>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> >>> + pte_t ptent;
> >>> +
> >>> + step = 1;
> >>> + ptent = ptep_get(pte);
> >>> +
> >>> + if (!pte_present(ptent))
> >>> + continue;
> >>> +
> >>> + cow_folio = vm_normal_folio(vma, addr, ptent);
> >>> + if (!cow_folio || cow_folio == folio) {
> >>> + continue;
> >>> + }
> >>> +
> >>> + mlock_folio(cow_folio);
> >>> + step = get_folio_mlock_step(folio, ptent,
> >>> + addr, addr + (nr << PAGE_SHIFT));
> >>> + }
> >>> +}
> >>> +
> >>> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> >>> + pte_t *pte, unsigned long addr, unsigned int nr)
> >>> +{
> >>> + struct folio *cow_folio;
> >>> + unsigned int step = 1;
> >>> +
> >>> + munlock_folio(folio);
> >>> + if (nr == 1)
> >>> + return;
> >>> +
> >>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> >>> + pte_t ptent;
> >>> +
> >>> + step = 1;
> >>> + ptent = ptep_get(pte);
> >>> +
> >>> + if (!pte_present(ptent))
> >>> + continue;
> >>> +
> >>> + cow_folio = vm_normal_folio(vma, addr, ptent);
> >>> + if (!cow_folio || cow_folio == folio) {
> >>> + continue;
> >>> + }
> >>> +
> >>> + munlock_folio(cow_folio);
> >>> + step = get_folio_mlock_step(folio, ptent,
> >>> + addr, addr + (nr << PAGE_SHIFT));
> >>> + }
> >>> +}
> >>
> >> I'll finish the above later.
> >
> > There is a problem here that I didn't have the time to elaborate: we
> > can't mlock() a folio that is within the range but not fully mapped
> > because this folio can be on the deferred split queue. When the split
> > happens, those unmapped folios (not mapped by this vma but are mapped
> > into other vmas) will be stranded on the unevictable lru.
> Checked remap case in past few days, I agree we shouldn't treat a folio
> in the range but not fully mapped as in_range folio.
>
> As for remap case, it's possible that the folio is not in deferred split
> queue. But part of folio is mapped to VM_LOCKED vma and other part of
> folio is mapped to none VM_LOCKED vma. In this case, page can't be split
> as it's not in deferred split queue. So page reclaim should be allowed to
> pick this folio up, split it and reclaim the pages in none VM_LOCKED vma.
> So we can't mlock such kind of folio.
>
> The same thing can happen with madvise_cold_or_pageout_pte_range().
> I will update folio_in_vma() to check the PTE also.

Thanks, and I think we should move forward with this series and fix
the potential mlock race problem separately since it's not caused by
this series.

WDYT?

2023-07-27 00:36:57

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/27/23 00:57, Yu Zhao wrote:
> On Wed, Jul 26, 2023 at 6:49 AM Yin Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/15/23 14:06, Yu Zhao wrote:
>>> On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <[email protected]> wrote:
>>>>
>>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>>>
>>>>> Current kernel only lock base size folio during mlock syscall.
>>>>> Add large folio support with following rules:
>>>>> - Only mlock large folio when it's in VM_LOCKED VMA range
>>>>>
>>>>> - If there is cow folio, mlock the cow folio as cow folio
>>>>> is also in VM_LOCKED VMA range.
>>>>>
>>>>> - munlock will apply to the large folio which is in VMA range
>>>>> or cross the VMA boundary.
>>>>>
>>>>> The last rule is used to handle the case that the large folio is
>>>>> mlocked, later the VMA is split in the middle of large folio
>>>>> and this large folio become cross VMA boundary.
>>>>>
>>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>>> ---
>>>>> mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>> 1 file changed, 99 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>>> index 0a0c996c5c214..f49e079066870 100644
>>>>> --- a/mm/mlock.c
>>>>> +++ b/mm/mlock.c
>>>>> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
>>>>> local_unlock(&mlock_fbatch.lock);
>>>>> }
>>>>>
>>>>> +static inline bool should_mlock_folio(struct folio *folio,
>>>>> + struct vm_area_struct *vma)
>>>>> +{
>>>>> + if (vma->vm_flags & VM_LOCKED)
>>>>> + return (!folio_test_large(folio) ||
>>>>> + folio_within_vma(folio, vma));
>>>>> +
>>>>> + /*
>>>>> + * For unlock, allow munlock large folio which is partially
>>>>> + * mapped to VMA. As it's possible that large folio is
>>>>> + * mlocked and VMA is split later.
>>>>> + *
>>>>> + * During memory pressure, such kind of large folio can
>>>>> + * be split. And the pages are not in VM_LOCKed VMA
>>>>> + * can be reclaimed.
>>>>> + */
>>>>> +
>>>>> + return true;
>>>>
>>>> Looks good, or just
>>>>
>>>> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
>>>> {
>>>> return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
>>>> }
>>>>
>>>>> +}
>>>>> +
>>>>> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
>>>>> + pte_t pte, unsigned long addr, unsigned long end)
>>>>> +{
>>>>> + unsigned int nr;
>>>>> +
>>>>> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
>>>>> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
>>>>> +}
>>>>> +
>>>>> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>>>> + pte_t *pte, unsigned long addr, unsigned int nr)
>>>>> +{
>>>>> + struct folio *cow_folio;
>>>>> + unsigned int step = 1;
>>>>> +
>>>>> + mlock_folio(folio);
>>>>> + if (nr == 1)
>>>>> + return;
>>>>> +
>>>>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>>>> + pte_t ptent;
>>>>> +
>>>>> + step = 1;
>>>>> + ptent = ptep_get(pte);
>>>>> +
>>>>> + if (!pte_present(ptent))
>>>>> + continue;
>>>>> +
>>>>> + cow_folio = vm_normal_folio(vma, addr, ptent);
>>>>> + if (!cow_folio || cow_folio == folio) {
>>>>> + continue;
>>>>> + }
>>>>> +
>>>>> + mlock_folio(cow_folio);
>>>>> + step = get_folio_mlock_step(folio, ptent,
>>>>> + addr, addr + (nr << PAGE_SHIFT));
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>>>> + pte_t *pte, unsigned long addr, unsigned int nr)
>>>>> +{
>>>>> + struct folio *cow_folio;
>>>>> + unsigned int step = 1;
>>>>> +
>>>>> + munlock_folio(folio);
>>>>> + if (nr == 1)
>>>>> + return;
>>>>> +
>>>>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>>>> + pte_t ptent;
>>>>> +
>>>>> + step = 1;
>>>>> + ptent = ptep_get(pte);
>>>>> +
>>>>> + if (!pte_present(ptent))
>>>>> + continue;
>>>>> +
>>>>> + cow_folio = vm_normal_folio(vma, addr, ptent);
>>>>> + if (!cow_folio || cow_folio == folio) {
>>>>> + continue;
>>>>> + }
>>>>> +
>>>>> + munlock_folio(cow_folio);
>>>>> + step = get_folio_mlock_step(folio, ptent,
>>>>> + addr, addr + (nr << PAGE_SHIFT));
>>>>> + }
>>>>> +}
>>>>
>>>> I'll finish the above later.
>>>
>>> There is a problem here that I didn't have the time to elaborate: we
>>> can't mlock() a folio that is within the range but not fully mapped
>>> because this folio can be on the deferred split queue. When the split
>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>> into other vmas) will be stranded on the unevictable lru.
>> Checked remap case in past few days, I agree we shouldn't treat a folio
>> in the range but not fully mapped as in_range folio.
>>
>> As for remap case, it's possible that the folio is not in deferred split
>> queue. But part of folio is mapped to VM_LOCKED vma and other part of
>> folio is mapped to none VM_LOCKED vma. In this case, page can't be split
>> as it's not in deferred split queue. So page reclaim should be allowed to
>> pick this folio up, split it and reclaim the pages in none VM_LOCKED vma.
>> So we can't mlock such kind of folio.
>>
>> The same thing can happen with madvise_cold_or_pageout_pte_range().
>> I will update folio_in_vma() to check the PTE also.
>
> Thanks, and I think we should move forward with this series and fix
> the potential mlock race problem separately since it's not caused by
> this series.
>
> WDYT?

Yes. Agree. Will send v3 with remap case covered.


Regards
Yin, Fengwei