2024-01-24 12:43:35

by zhiguojiang

Subject: [PATCH] mm:vmscan: shrink skip folios in the exiting task

If the folio being shrunk belongs to an exiting task, the folio should
be freed in the task exit path rather than reclaimed in the shrink
path, because the former takes less time.

If a folio belonging to an exiting task is reclaimed in the shrink
path, such as an anon folio, the anon folio first needs to be written
out to the swap partition in the shrink path, and then the
corresponding swap entry needs to be released in the task exit path.
As is well known, releasing a swap entry takes more time than
releasing an anon folio directly.

In the scenarios of a low-memory system and mutil backed-applications,
the time cost caused by shrinking the exiting task's folios will be
more severe.

Signed-off-by: Zhiguo Jiang <[email protected]>
---
include/linux/mm.h | 1 +
mm/rmap.c | 12 ++++++++++++
mm/vmscan.c | 4 ++++
3 files changed, 17 insertions(+)
mode change 100644 => 100755 include/linux/mm.h
mode change 100644 => 100755 mm/rmap.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 42652bc8ceca..67e84b7d1928
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -312,6 +312,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */
#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#define VM_EXITING 0x100000000 /* The task which this vma belongs to is exiting */

#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
diff --git a/mm/rmap.c b/mm/rmap.c
index 1cf2bffa48ed..deb419fd095b
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -943,6 +943,18 @@ static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg)
if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
return true;

+ /*
+ * Skip counting references of folios that belong to an exiting task.
+ * These folios will be freed in the task exit path rather than
+ * reclaimed in the shrink path, which takes less time.
+ */
+ if (likely(vma->vm_mm) &&
+ unlikely(!atomic_read(&vma->vm_mm->mm_users) ||
+ test_bit(MMF_OOM_SKIP, &vma->vm_mm->flags))) {
+ pra->vm_flags |= VM_EXITING;
+ return true;
+ }
+
return false;
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b3ed3c9c3e3d..40ea4128044c 100755
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -887,6 +887,10 @@ static enum folio_references folio_check_references(struct folio *folio,
return FOLIOREF_KEEP;
}

+ /* The folio belongs to the exiting task: rotate */
+ if (vm_flags & VM_EXITING)
+ return FOLIOREF_KEEP;
+
/* Reclaim if clean, defer dirty folios to writeback */
if (referenced_folio && folio_is_file_lru(folio))
return FOLIOREF_RECLAIM_CLEAN;
--
2.39.0



2024-01-24 14:42:29

by Matthew Wilcox

Subject: Re: [PATCH] mm:vmscan: shrink skip folios in the exiting task

On Wed, Jan 24, 2024 at 08:43:07PM +0800, Zhiguo Jiang wrote:
> If the shrinking folio is belong to the exiting task, this folio should
> be freed in the task exit flow rather than being reclaimed in the shrink
> flow, because the former takes less time.
>
> If the folio which is belong to the exiting task is reclaimed in the
> shrink flow, such as the anon folio, the anon folio needs to be first
> written to the swap partition by swap-in in shrink flow, and then the
> corresponding swap folio needs to be released in the task exiting flow.
> As is well known, releasing a swap folio will task more time than
> releasing directly an anon folio.
>
> In the scenarios of the low memory system and mutil backed-applications,
> the time-consuming problem caused by shrinking the exiting task's folios
> will be more severe.

What testing have you done of this patch? How often does it happen?
Are there particular workloads that benefit from this? (I'm not sure
what "mutil backed-applications" are?)

And I do mean specifically of this patch, because to my eyes it
shouldn't even compile.

2024-01-25 01:35:10

by zhiguojiang

Subject: Re: [PATCH] mm:vmscan: shrink skip folios in the exiting task



On 2024/1/24 21:20, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 08:43:07PM +0800, Zhiguo Jiang wrote:
>> If the shrinking folio is belong to the exiting task, this folio should
>> be freed in the task exit flow rather than being reclaimed in the shrink
>> flow, because the former takes less time.
>>
>> If the folio which is belong to the exiting task is reclaimed in the
>> shrink flow, such as the anon folio, the anon folio needs to be first
>> written to the swap partition by swap-in in shrink flow, and then the
>> corresponding swap folio needs to be released in the task exiting flow.
>> As is well known, releasing a swap folio will task more time than
>> releasing directly an anon folio.
>>
>> In the scenarios of the low memory system and mutil backed-applications,
>> the time-consuming problem caused by shrinking the exiting task's folios
>> will be more severe.
> What testing have you done of this patch? How often does it happen?
> Are there particular workloads that benefit from this? (I'm not sure
> what "mutil backed-applications" are?)
1 Yes, this patch has been tested.

2 When task exit and shrink_inactive_list occur at the same time, the
   folios which shrink_inactive_list reclaims may be the exiting task's
   folios in the lruvecs. When the system is low on memory, this is more
   likely to occur, because more backend applications will be killed.
   shrink_inactive_list reclaims the exiting task's folios in the lruvecs
   and transforms the exiting task's anon folios into swap memory, which
   increases the load of the exiting task.

3 This patch can alleviate the load of the task exit process. It makes
   the exiting task release its anon folios faster, instead of releasing
   the swap memory produced by swapping out its anon folios in
   shrink_inactive_list.

4 "mutil backed-applications" means that there are a large number of
   backend applications in the system.
Thanks
>
> And I do mean specifically of this patch, because to my eyes it
> shouldn't even compile.
It has been tested.


2024-01-28 06:57:32

by Matthew Wilcox

Subject: Re: [PATCH] mm:vmscan: shrink skip folios in the exiting task

On Thu, Jan 25, 2024 at 09:34:53AM +0800, zhiguojiang wrote:
> > > In the scenarios of the low memory system and mutil backed-applications,
> > > the time-consuming problem caused by shrinking the exiting task's folios
> > > will be more severe.
> > What testing have you done of this patch? How often does it happen?
> > Are there particular workloads that benefit from this? (I'm not sure
> > what "mutil backed-applications" are?)
> 1 Yes, this patch has been tested.
>
> 2 When the exiting tasks and shrink_inactive_list occur at the same time,
>    the folios which shrink_inactive_list reclaims may be the exiting tasks's
>    folios in lruvecs. And when system is low memory, it more likely to occur,
>    because more backend applidatuions will be killed.
>    The shrink_inactive_list reclaims the exiting tasks's folios in lruvecs
>    and transforms the exiting tasks's anon folios into swap memory, which
>    leads to the increasing load of the current exiting tasks.

Ah, we're talking about an OOM scenario. OK, that makes some sense.
You should have mentioned that in the patch description.

> 3 This patch can alleviate the load of the tasks exiting process. This patch
>    can make that the exiting tasks release its anon folios faster instead of
>    releasing its swap memory from its anon folios swap-in in
>    shrink_inactive_list.
>
> 4 "mutil backed-applications" means that there are a large number of
>    the backend applications in the system.
>
> Thanks
> >
> > And I do mean specifically of this patch, because to my eyes it
> > shouldn't even compile.
> Has been tested.

That's odd. I thought GCC used to complain about

long x = 0x100000000;

but it does the right thing. Except on 32-bit where it'll say
"warning: integer constant out of range".

In any case, the number you chose is not going to work on 32-bit systems
or on ARM or x86. It conflicts with protection keys on x86 and MTE on
ARM.

We can do it without any new magic numbers. I'm not sure this is a
great approach, but this should work:

+++ b/mm/rmap.c
@@ -840,6 +840,12 @@ static bool folio_referenced_one(struct folio *folio,
int referenced = 0;
unsigned long start = address, ptes = 0;

+ /* Skip this folio if it's mapped by an exiting task */
+ if (test_bit(MMF_OOM_SKIP, &vma->vm_mm->flags)) {
+ pra->referenced = -1;
+ return false;
+ }
+
while (page_vma_mapped_walk(&pvmw)) {
address = pvmw.address;



2024-01-29 02:21:54

by zhiguojiang

Subject: Re: [PATCH] mm:vmscan: shrink skip folios in the exiting task



On 2024/1/28 14:35, Matthew Wilcox wrote:
> On Thu, Jan 25, 2024 at 09:34:53AM +0800, zhiguojiang wrote:
>>>> In the scenarios of the low memory system and mutil backed-applications,
>>>> the time-consuming problem caused by shrinking the exiting task's folios
>>>> will be more severe.
>>> What testing have you done of this patch? How often does it happen?
>>> Are there particular workloads that benefit from this? (I'm not sure
>>> what "mutil backed-applications" are?)
>> 1 Yes, this patch has been tested.
>>
>> 2 When the exiting tasks and shrink_inactive_list occur at the same time,
>>    the folios which shrink_inactive_list reclaims may be the exiting tasks's
>> folios
>>    in lruvecs. And when system is low memory, it more likely to occur,
>> because
>>    more backend applidatuions will be killed.
>>    The shrink_inactive_list reclaims the exiting tasks's folios in lruvecs
>> and
>>    transforms the exiting tasks's anon folios into swap memory, which leads
>>    to the increasing load of the current exiting tasks.
> Ah, we're talking about an OOM scenario. OK, that makes some sense.
> You should have mentioned that in the patch description.
Hi,

1 Ok, I will update a more comprehensive description in the next version.

2 I think this issue can occur not only in the OOM scenario, but also in
  the normal task exit scenario. So:

1) if (unlikely(!atomic_read(&vma->vm_mm->mm_users))) covers the
   scenario where the task exits normally.

2) if (test_bit(MMF_OOM_SKIP, &vma->vm_mm->flags)) covers the
   OOM reaper scenario.

3 MMF_OOM_SKIP alone can only cover the OOM scenario, not the normal
  task exit scenario: by the time MMF_OOM_SKIP is set in the normal
  task exit path, the task's memory folios have already been released,
  and shrink_inactive_list should recognize the exiting task's lru
  folios before the task releases them.

    tlb_gather_mmu_fullmm(&tlb, mm);
    /* update_hiwater_rss(mm) here? but nobody should be looking */
    /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
    unmap_vmas(&tlb, &mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
    mmap_read_unlock(mm);

    /*
     * Set MMF_OOM_SKIP to hide this task from the oom killer/reaper
     * because the memory has been already freed.
     */
    set_bit(MMF_OOM_SKIP, &mm->flags);
>> 3 This patch can alleviate the load of the tasks exiting process. This patch
>>    can make that the exiting tasks release its anon folios faster instead of
>>    releasing its swap memory from its anon folios swap-in in
>> shrink_inactive_list.
>>
>> 4 "mutil backed-applications" means that there are a large number of
>>     the backend applications in the system.
>>
>> Thanks
>>> And I do mean specifically of this patch, because to my eyes it
>>> shouldn't even compile.
>> Has been tested.
> That's odd. I thought GCC used to complain about
>
> long x = 0x100000000;
>
> but it does the right thing. Except on 32-bit where it'll say
> "warning: integer constant out of range".
>
> In any case, the number you chose is not going to work on 32-bit systems
> or on ARM or x86. It conflicts with protection keys on x86 and MTE on
> ARM.
You're right, I overlooked the situation with the 32-bit system.
> We can do it without any new magic numbers. I'm not sure this is a
> great approach, but this should work:
>
> +++ b/mm/rmap.c
> @@ -840,6 +840,12 @@ static bool folio_referenced_one(struct folio *folio,
> int referenced = 0;
> unsigned long start = address, ptes = 0;
>
> + /* Skip this folio if it's mapped by an exiting task */
> + if (test_bit(MMF_OOM_SKIP, &vma->vm_mm->flags)) {
> + pra->referenced = -1;
> + return false;
> + }
> +
> while (page_vma_mapped_walk(&pvmw)) {
> address = pvmw.address;
>
I agree with your point of view. I think this is currently the best
solution, but I think it also needs the additional check:

if (unlikely(!atomic_read(&vma->vm_mm->mm_users)))

Please help review it again.

Best Regards.