2023-01-13 17:58:56

by Vlastimil Babka

Subject: [PATCH for 6.1 regression] Revert "mm/compaction: fix set skip in fast_find_migrateblock"

This reverts commit 7efc3b7261030da79001c00d92bc3392fd6c664c.

We have got openSUSE reports (Link 1) for 6.1 kernel with khugepaged
stalling CPU for long periods of time. Investigation of tracepoint data
shows that compaction is stuck in repeating fast_find_migrateblock()
based migrate page isolation, and then fails to migrate all isolated
pages. Commit 7efc3b726103 ("mm/compaction: fix set skip in
fast_find_migrateblock") was suspected as it was merged in 6.1 and in
theory can indeed remove a termination condition for
fast_find_migrateblock() under certain conditions, as it removes a place
that always marks a scanned pageblock as skipped so that it is not
re-scanned. There are other such places, but those can be skipped under
certain conditions, which seems to match the tracepoint data.

Testing of revert also appears to have resolved the issue, thus revert
the commit until a more robust solution for the original problem is
developed.

It's also likely this will fix qemu stalls with 6.1 kernel reported in
Link 2, but that is not yet confirmed.

Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
Link: https://lore.kernel.org/kvm/[email protected]/
Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Cc: <[email protected]>
---
mm/compaction.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index ca1603524bbe..8238e83385a7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1839,6 +1839,7 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
 					pfn = cc->zone->zone_start_pfn;
 				cc->fast_search_fail = 0;
 				found_block = true;
+				set_pageblock_skip(freepage);
 				break;
 			}
 		}
--
2.39.0


2023-01-14 07:11:22

by Vlastimil Babka

Subject: Re: [PATCH for 6.1 regression] Revert "mm/compaction: fix set skip in fast_find_migrateblock"

On 1/13/23 18:33, Vlastimil Babka wrote:
> This reverts commit 7efc3b7261030da79001c00d92bc3392fd6c664c.
>
> We have got openSUSE reports (Link 1) for 6.1 kernel with khugepaged
> stalling CPU for long periods of time. Investigation of tracepoint data
> shows that compaction is stuck in repeating fast_find_migrateblock()
> based migrate page isolation, and then fails to migrate all isolated
> pages. Commit 7efc3b726103 ("mm/compaction: fix set skip in
> fast_find_migrateblock") was suspected as it was merged in 6.1 and in
> theory can indeed remove a termination condition for
> fast_find_migrateblock() under certain conditions, as it removes a place
> that always marks a scanned pageblock as skipped so that it is not
> re-scanned. There are other such places, but those can be skipped under
> certain conditions, which seems to match the tracepoint data.
>
> Testing of revert also appears to have resolved the issue, thus revert
> the commit until a more robust solution for the original problem is
> developed.
>
> It's also likely this will fix qemu stalls with 6.1 kernel reported in
> Link 2, but that is not yet confirmed.
>
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
> Link: https://lore.kernel.org/kvm/[email protected]/
> Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
> Cc: <[email protected]>

Oops, forgot:

Signed-off-by: Vlastimil Babka <[email protected]>

> ---
> mm/compaction.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index ca1603524bbe..8238e83385a7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1839,6 +1839,7 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
>  					pfn = cc->zone->zone_start_pfn;
>  				cc->fast_search_fail = 0;
>  				found_block = true;
> +				set_pageblock_skip(freepage);
>  				break;
>  			}
>  		}

2023-01-14 08:49:44

by Pedro Falcato

Subject: Re: [PATCH for 6.1 regression] Revert "mm/compaction: fix set skip in fast_find_migrateblock"

On Sat, Jan 14, 2023 at 6:51 AM Vlastimil Babka <[email protected]> wrote:
>
> On 1/13/23 18:33, Vlastimil Babka wrote:
> > This reverts commit 7efc3b7261030da79001c00d92bc3392fd6c664c.
> >
> > We have got openSUSE reports (Link 1) for 6.1 kernel with khugepaged
> > stalling CPU for long periods of time. Investigation of tracepoint data
> > shows that compaction is stuck in repeating fast_find_migrateblock()
> > based migrate page isolation, and then fails to migrate all isolated
> > pages. Commit 7efc3b726103 ("mm/compaction: fix set skip in
> > fast_find_migrateblock") was suspected as it was merged in 6.1 and in
> > theory can indeed remove a termination condition for
> > fast_find_migrateblock() under certain conditions, as it removes a place
> > that always marks a scanned pageblock as skipped so that it is not
> > re-scanned. There are other such places, but those can be skipped under
> > certain conditions, which seems to match the tracepoint data.
> >
> > Testing of revert also appears to have resolved the issue, thus revert
> > the commit until a more robust solution for the original problem is
> > developed.
> >
> > It's also likely this will fix qemu stalls with 6.1 kernel reported in
> > Link 2, but that is not yet confirmed.
> >
> > Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
> > Link: https://lore.kernel.org/kvm/[email protected]/
> > Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
> > Cc: <[email protected]>
>
> Oops, forgot:
>
> Signed-off-by: Vlastimil Babka <[email protected]>
>
> > ---
> > mm/compaction.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index ca1603524bbe..8238e83385a7 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1839,6 +1839,7 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
> >  					pfn = cc->zone->zone_start_pfn;
> >  				cc->fast_search_fail = 0;
> >  				found_block = true;
> > +				set_pageblock_skip(freepage);
> >  				break;
> >  			}
> >  		}
>

Vlastimil,

Thank you so much for looking into this. I've been daily driving it
for the past half day and it seems to have fixed my QEMU issues.
Of course, I don't have exactly a test suite for this but I've tried
everything and I can't get any of the original problems to show up.

That being said,
Tested-by: Pedro Falcato <[email protected]>

I'll report back if QEMU freezes the system again.

--
Pedro