by Johannes Weiner

Subject: Re: [PATCH 09/10] mm: page_isolation: prepare for hygienic freelists

On Thu, Mar 21, 2024 at 09:13:57PM +0800, kernel test robot wrote:
> Hi Johannes,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-page_alloc-remove-pcppage-migratetype-caching/20240321-020814
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20240320180429.678181-10-hannes%40cmpxchg.org
> patch subject: [PATCH 09/10] mm: page_isolation: prepare for hygienic freelists
> config: i386-randconfig-003-20240321 (https://download.01.org/0day-ci/archive/20240321/[email protected]/config)
> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240321/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All warnings (new ones prefixed by >>):
>
> mm/page_alloc.c: In function 'move_freepages_block_isolate':
> >> mm/page_alloc.c:688:17: warning: array subscript 11 is above array bounds of 'struct free_area[11]' [-Warray-bounds]
> 688 | zone->free_area[order].nr_free--;
> | ~~~~~~~~~~~~~~~^~~~~~~
> >> mm/page_alloc.c:688:17: warning: array subscript 11 is above array bounds of 'struct free_area[11]' [-Warray-bounds]

I think this is a bug in the old gcc.

We have this in move_freepages_block_isolate():

	/* We're the starting block of a larger buddy */
	if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
		int mt = get_pfnblock_migratetype(page, pfn);
		int order = buddy_order(page);

		if (!is_migrate_isolate(mt))
			__mod_zone_freepage_state(zone, -(1UL << order), mt);
		del_page_from_free_list(page, zone, order);

And this config doesn't have hugetlb enabled, so:

/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
#define pageblock_order MAX_PAGE_ORDER

If buddies were indeed >MAX_PAGE_ORDER, this would be an out-of-bounds
access when del_page_from_free_list() updates the freelist count. Of
course, buddies by definition cannot be larger than MAX_PAGE_ORDER. But
the older gcc doesn't seem to realize that this branch is dead in this
configuration.
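
To make the bound concrete, here is a minimal userspace sketch, not the
kernel code. It assumes MAX_PAGE_ORDER is 10, which matches the
'struct free_area[11]' in the warning; struct zone and struct free_area
are simplified stand-ins, and order_in_bounds() and buddy_spans_blocks()
are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: 11 free_area slots, as in the warning above. */
#define MAX_PAGE_ORDER	10
#define NR_PAGE_ORDERS	(MAX_PAGE_ORDER + 1)
/* Without hugetlb, a pageblock is as large as the largest buddy. */
#define pageblock_order	MAX_PAGE_ORDER

/* Simplified stand-ins for the kernel structures. */
struct free_area { unsigned long nr_free; };
struct zone { struct free_area free_area[NR_PAGE_ORDERS]; };

/* Valid indexes are 0..MAX_PAGE_ORDER; 11 is one past the end. */
static bool order_in_bounds(unsigned int order)
{
	return order < NR_PAGE_ORDERS;
}

/*
 * Hypothetical helper: can a buddy of this order span several pageblocks?
 * With pageblock_order == MAX_PAGE_ORDER it never can, because no valid
 * order is greater than MAX_PAGE_ORDER.
 */
static bool buddy_spans_blocks(unsigned int order)
{
	return order_in_bounds(order) && order > pageblock_order;
}
```

In this model buddy_spans_blocks() is false for every order, which is
the property the compiler would have to prove to see that the branch is
dead.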

Maybe we can help it out and make the impossible scenario a bit more
explicit? Does this fixlet silence the warning?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index efb2581ac142..4cdc356e73f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1698,6 +1698,10 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
 				       NULL, NULL))
 		return false;
 
+	/* No splits needed if buddies can't span multiple blocks */
+	if (pageblock_order == MAX_PAGE_ORDER)
+		goto move;
+
 	/* We're a tail block in a larger buddy */
 	pfn = find_large_buddy(start_pfn);
 	if (pfn != start_pfn) {
@@ -1725,7 +1729,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
 		split_large_buddy(zone, page, pfn, order);
 		return true;
 	}
-
+move:
 	mt = get_pfnblock_migratetype(page, start_pfn);
 	nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
 	if (!is_migrate_isolate(mt))

Zi Yan, does this look sane to you as well?

2024-03-21 15:04:20

On Wed, Mar 27, 2024 at 09:54:01AM +0100, Vlastimil Babka wrote:
> On 3/20/24 7:02 PM, Johannes Weiner wrote:
> > Free page accounting currently happens a bit too high up the call
> > stack, where it has to deal with guard pages, compaction capturing,
> > block stealing and even page isolation. This is subtle and fragile,
> > and makes it difficult to hack on the code.
> >
> > Now that type violations on the freelists have been fixed, push the
> > accounting down to where pages enter and leave the freelist.
>
> Awesome!
>
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Reviewed-by: Vlastimil Babka <[email protected]>
>
> Just some nits:
>
> > @@ -1314,10 +1349,10 @@ static inline void expand(struct zone *zone, struct page *page,
> > * Corresponding page table entries will not be touched,
> > * pages will stay not present in virtual address space
> > */
> > - if (set_page_guard(zone, &page[size], high, migratetype))
> > + if (set_page_guard(zone, &page[size], high))
> > continue;
> >
> > - add_to_free_list(&page[size], zone, high, migratetype);
> > + add_to_free_list(&page[size], zone, high, migratetype, false);
>
> This is account_freepages() in the hot loop, what if we instead used
> __add_to_free_list(), sum up nr_pages and called account_freepages() once
> outside of the loop?

Good idea. I'll send a fixlet for that.
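
A minimal userspace sketch of the batching idea, under stated
assumptions: this is not the actual fixlet. account_freepages() here is
a stub that only counts calls and pages, expand_batched() is a
hypothetical name, and the freelist insertion is reduced to a comment.
It shows the loop summing the split-off pages and accounting for them
once after the loop.

```c
#include <assert.h>

/* Stub counters standing in for the zone's vmstat state. */
static unsigned long nr_free_pages;
static unsigned int nr_account_calls;

/* Stub: the real function updates per-zone free page statistics. */
static void account_freepages(unsigned long nr_pages)
{
	nr_free_pages += nr_pages;
	nr_account_calls++;
}

/*
 * Split a block of order 'high' down to order 'low'. Each iteration
 * hands the upper half back to the freelist; the accounting for all
 * halves is done with a single call after the loop.
 */
static unsigned long expand_batched(unsigned int low, unsigned int high)
{
	unsigned long size = 1UL << high;
	unsigned long nr_added = 0;

	while (high > low) {
		high--;
		size >>= 1;
		/* Here the kernel would link &page[size] into the freelist. */
		nr_added += size;
	}

	account_freepages(nr_added);
	return nr_added;
}
```

Splitting an order-3 block down to order 0 returns halves of 4, 2 and 1
pages, so the single accounting call covers 7 pages.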

> > set_buddy_order(&page[size], high);
> > }
> > }
>
> <snip>
>
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 042937d5abe4..914a71c580d8 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -252,7 +252,8 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
> > * Isolating this block already succeeded, so this
> > * should not fail on zone boundaries.
> > */
> > - WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
> > + WARN_ON_ONCE(!move_freepages_block_isolate(zone, page,
> > + migratetype));
> > } else {
> > set_pageblock_migratetype(page, migratetype);
> > __putback_isolated_page(page, order, migratetype);
>
> Looks like a drive-by edit of an extra file just to adjust indentation.

Argh, yeah, I think an earlier version mucked with the signature and I
didn't undo that cleanly. I'll send a fixlet for that too.

Thanks for the review!

2024-03-27 15:20:53

On Mon, Jun 10, 2024 at 9:28 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Jun 04, 2024 at 10:53:55PM -0600, Yu Zhao wrote:
> > On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <[email protected]> wrote:
> > > On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > > > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <[email protected]> wrote:
> > > > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > > > This series significantly regresses Android and ChromeOS under memory
> > > > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > > > it was mentioned in the early discussions that potential regressions
> > > > > > for such a case are somewhat expected?
> > > > >
> > > > > This is not expected for the 10 patches here. You might be referring
> > > > > to the discussion around the huge page allocator series, which had
> > > > > fallback restrictions and many changes to reclaim and compaction.
> > > >
> > > > Right, now I remember.
> > > >
> > > > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > > > (full and some) on both platforms increased by over 20%. I could post
> > > > > > the details of the benchmarks and the metrics they measure, but I
> > > > > > doubt they would mean much to you. I did ask our test teams to save
> > > > > > extra kernel logs that might be more helpful, and I could forward them
> > > > > > to you.
> > > > >
> > > > > If the issue persists with the latest patches in -mm, a kernel config
> > > > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > > > before/during/after the problematic behavior would be very helpful.
> > > >
> > > > Assuming all the fixes were included, do you want the logs from 6.8?
> > > > We have them available now.
> > >
> > > Yes, that would be helpful.
> > >
> > > If you have them, it would also be quite useful to have the vmstat
> > > before-after-test delta from a good kernel, for baseline comparison.
> >
> > Sorry for taking this long -- I wanted to see if the regression is
> > still reproducible on v6.9.
> >
> > Apparently we got similar results on v6.9 with the following
> > patches cherry-picked cleanly from v6.10-rc1:
> >
> > 1 mm: page_alloc: remove pcppage migratetype caching
> > 2 mm: page_alloc: optimize free_unref_folios()
> > 3 mm: page_alloc: fix up block types when merging compatible blocks
> > 4 mm: page_alloc: move free pages when converting block during isolation
> > 5 mm: page_alloc: fix move_freepages_block() range error
> > 6 mm: page_alloc: fix freelist movement during block conversion
> > 7 mm: page_alloc: close migratetype race between freeing and stealing
> > 8 mm: page_alloc: set migratetype inside move_freepages()
> > 9 mm: page_isolation: prepare for hygienic freelists
> > 10 mm: page_alloc: consolidate free page accounting
> > 11 mm: page_alloc: change move_freepages() to __move_freepages_block()
> > 12 mm: page_alloc: batch vmstat updates in expand()
> >
> > Unfortunately I just realized that the automated benchmark didn't
> > collect the kernel stats before it starts (since it always starts on a
> > freshly booted device). While this is being fixed, I'm attaching the
> > kernel stats collected after the benchmark finished. I grabbed 10 runs
> > for each (baseline/patched), and if you need more, please let me know.
> > (And we should have the stats before the benchmark soon.)
>
> Thanks for grabbing these, and sorry about the delay, I was traveling
> last week.
>
> You mentioned "THPs are virtually nonexistent". But the workload
> doesn't seem to allocate anon THPs at all.

Sorry for not being clear there: you are correct.

I meant that client devices rarely use 2MB THPs or __GFP_COMP. (They
simply can't due to both internal and external fragmentation, but we
are trying!)

> For file THP, the patched
> kernel's median for allocation success is 90% of baseline, but the
> inter-run min/max deviation from the median is 85%/108% in baseline
> and 85%/112% in patched, so this is quite noisy. Was
> that initial comment regarding a different workload?

No, in both cases (Android and ChromeOS) we tried, we were hoping the
series could help with mTHP (64KB and 32KB). But we hit the
regressions with their baseline (4KB). Again, 2MB THPs, if they are
used, are reserved (allocated and mlocked to hold text/code sections
after a reboot). So they shouldn't matter, and I highly doubt the
regressions are because of them.

> This other data point has me stumped. Comparing medians, there is a
> 1.5% reduction in anon refaults and a 4.8% increase in file
> refaults. And indeed, more file and less anon memory is being scanned.
> I think this could explain the PSI delta, since AFAIK you have zram or
> zswap, and anon decompression loads are cheaper than filesystem IO.
>
> The above patches don't do anything that directly influences the
> anon-file reclaim balance. However, if file THPs fall back to 4k file
> pages more, that *might* be able to explain a shift in reclaim
> balance, if some hot subpages in those THPs were protecting colder
> subpages from being reclaimed and refaulting.
>
> In that case, the root cause would still be a simple THP success rate
> regression. To confirm this theory, could you run the baseline and the
> patched sets both with THP disabled entirely?

Will try this. And is bisecting within this series possible?

> Can you elaborate more on what the workload is doing exactly?

These are simple benchmarks that measure the system and foreground
app/tab performance under memory pressure, e.g., [1]. They open a
bunch of apps/tabs (respectively on Android/ChromeOS) and switch
between them. At a given time, one of them is foreground and the rest
are background, obviously. When an app/tab has been in the background
for a while, the userspace may call madvise(PAGEOUT) to reclaim (most
of) its LRU pages, leaving unmovable kernel memory there. This
strategy allows client systems to cache more apps/tabs in the
background and reduce their startup/switch time. But it's also a major
source of fragmentation (I'm sure you get why so I won't go into
details here. And userspace also tries to make a better decision
between reclaim/compact/kill based on fragmentation, but it's not
easy.)
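
For readers unfamiliar with the call, a minimal sketch of the reclaim
hint on Linux 5.4 or later. It operates on the caller's own anonymous
mapping only; pageout_idle_buffer() is a hypothetical name, and this is
not how Android or ChromeOS actually drive background reclaim.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* value from the Linux uapi headers */
#endif

/*
 * Map an anonymous buffer, touch it so the pages are populated, then
 * ask the kernel to reclaim them. Returns 0 on success and -1 on
 * failure (for example EINVAL on kernels without MADV_PAGEOUT).
 */
static int pageout_idle_buffer(size_t len)
{
	int ret;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0x5a, len);			/* fault the pages in */
	ret = madvise(buf, len, MADV_PAGEOUT);	/* hint: reclaim them now */
	munmap(buf, len);
	return ret;
}
```

The hint only targets the pages of the given range; kernel allocations
made on behalf of the process are not affected by it.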

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/refs/heads/main/src/go.chromium.org/tast-tests/cros/local/bundles/cros/platform/memory_pressure.go

> What are
> the parameters of the test machine (CPUs, memory size)? It'd be
> helpful if I could reproduce this locally as well.

The data I shared previously is from an Intel i7-1255U + 4GB Chromebook.

More data attached -- it contains vmstat, zoneinfo and pagetypeinfo
files collected before the benchmark (after fresh reboots) and after
the benchmark.
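
A minimal sketch of how a before/after delta for one counter could be
computed from two such snapshots, assuming the usual "name value" line
format of /proc/vmstat. vmstat_get() and vmstat_delta() are hypothetical
helpers, not part of the attached logs or of any existing tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find one "name value" line in a /proc/vmstat-style snapshot.
 * Returns 0 and stores the value on success, -1 if the counter is missing.
 */
static int vmstat_get(const char *snapshot, const char *name, long long *val)
{
	const char *line = snapshot;
	size_t len = strlen(name);

	while (line && *line) {
		/* Require a space after the name to avoid prefix matches. */
		if (!strncmp(line, name, len) && line[len] == ' ' &&
		    sscanf(line + len, "%lld", val) == 1)
			return 0;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Delta of one counter between two snapshots; -1 if either lacks it. */
static int vmstat_delta(const char *before, const char *after,
			const char *name, long long *delta)
{
	long long b, a;

	if (vmstat_get(before, name, &b) || vmstat_get(after, name, &a))
		return -1;
	*delta = a - b;
	return 0;
}
```

Reading the two files into strings and calling vmstat_delta() for, say,
workingset_refault_anon and workingset_refault_file gives the per-run
numbers that can then be compared between baseline and patched kernels.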

Attachments:

log.tar.xz (24.56 kB)

2024-06-13 15:39:43

by Johannes Weiner

Subject: Re: [PATCH V4 00/10] mm: page_alloc: freelist migratetype hygiene

On Wed, Jun 12, 2024 at 12:52:20PM -0600, Yu Zhao wrote:
> On Mon, Jun 10, 2024 at 9:28 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Tue, Jun 04, 2024 at 10:53:55PM -0600, Yu Zhao wrote:
> > > On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <[email protected]> wrote:
> > > > On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > > > > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <[email protected]> wrote:
> > > > > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > > > > This series significantly regresses Android and ChromeOS under memory
> > > > > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > > > > it was mentioned in the early discussions that potential regressions
> > > > > > > for such a case are somewhat expected?
> > > > > >
> > > > > > This is not expected for the 10 patches here. You might be referring
> > > > > > to the discussion around the huge page allocator series, which had
> > > > > > fallback restrictions and many changes to reclaim and compaction.
> > > > >
> > > > > Right, now I remember.
> > > > >
> > > > > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > > > > (full and some) on both platforms increased by over 20%. I could post
> > > > > > > the details of the benchmarks and the metrics they measure, but I
> > > > > > > doubt they would mean much to you. I did ask our test teams to save
> > > > > > > extra kernel logs that might be more helpful, and I could forward them
> > > > > > > to you.
> > > > > >
> > > > > > If the issue persists with the latest patches in -mm, a kernel config
> > > > > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > > > > before/during/after the problematic behavior would be very helpful.
> > > > >
> > > > > Assuming all the fixes were included, do you want the logs from 6.8?
> > > > > We have them available now.
> > > >
> > > > Yes, that would be helpful.
> > > >
> > > > If you have them, it would also be quite useful to have the vmstat
> > > > before-after-test delta from a good kernel, for baseline comparison.
> > >
> > > Sorry for taking this long -- I wanted to see if the regression is
> > > still reproducible on v6.9.
> > >
> > > Apparently we got similar results on v6.9 with the following
> > > patches cherry-picked cleanly from v6.10-rc1:
> > >
> > > 1 mm: page_alloc: remove pcppage migratetype caching
> > > 2 mm: page_alloc: optimize free_unref_folios()
> > > 3 mm: page_alloc: fix up block types when merging compatible blocks
> > > 4 mm: page_alloc: move free pages when converting block during isolation
> > > 5 mm: page_alloc: fix move_freepages_block() range error
> > > 6 mm: page_alloc: fix freelist movement during block conversion
> > > 7 mm: page_alloc: close migratetype race between freeing and stealing
> > > 8 mm: page_alloc: set migratetype inside move_freepages()
> > > 9 mm: page_isolation: prepare for hygienic freelists
> > > 10 mm: page_alloc: consolidate free page accounting
> > > 11 mm: page_alloc: change move_freepages() to __move_freepages_block()
> > > 12 mm: page_alloc: batch vmstat updates in expand()
> > >
> > > Unfortunately I just realized that the automated benchmark didn't
> > > collect the kernel stats before it starts (since it always starts on a
> > > freshly booted device). While this is being fixed, I'm attaching the
> > > kernel stats collected after the benchmark finished. I grabbed 10 runs
> > > for each (baseline/patched), and if you need more, please let me know.
> > > (And we should have the stats before the benchmark soon.)
> >
> > Thanks for grabbing these, and sorry about the delay, I was traveling
> > last week.
> >
> > You mentioned "THPs are virtually nonexistent". But the workload
> > doesn't seem to allocate anon THPs at all.
>
> Sorry for not being clear there: you are correct.
>
> I meant that client devices rarely use 2MB THPs or __GFP_COMP. (They
> simply can't due to both internal and external fragmentation, but we
> are trying!)

Ah, understood. So this is nominally a non-THP workload, and we're
suspecting a simple 4k allocation issue in low memory conditions.

Thanks for clarifying.

However, I don't think 4k alone would explain pressure just yet. PSI
is triggered by reclaim and compaction, but with this series type
fallbacks are still allowed to the full extent before entering any
such remediation. The series merely fixes type safety and eliminates
avoidable/accidental mixing.

So I'm thinking something else must still be going on. Either THP
(however limited the use in this workload); or the userspace feedback
mechanism you mention below...

> > For file THP, the patched
> > kernel's median for allocation success is 90% of baseline, but the
> > inter-run min/max deviation from the median is 85%/108% in baseline
> > and 85%/112% in patched, so this is quite noisy. Was
> > that initial comment regarding a different workload?
>
> No, in both cases (Android and ChromeOS) we tried, we were hoping the
> series could help with mTHP (64KB and 32KB). But we hit the
> regressions with their baseline (4KB). Again, 2MB THPs, if they are
> used, are reserved (allocated and mlocked to hold text/code sections
> after a reboot). So they shouldn't matter, and I highly doubt the
> regressions are because of them.

Ok.

> > This other data point has me stumped. Comparing medians, there is a
> > 1.5% reduction in anon refaults and a 4.8% increase in file
> > refaults. And indeed, more file and less anon memory is being scanned.
> > I think this could explain the PSI delta, since AFAIK you have zram or
> > zswap, and anon decompression loads are cheaper than filesystem IO.
> >
> > The above patches don't do anything that directly influences the
> > anon-file reclaim balance. However, if file THPs fall back to 4k file
> > pages more, that *might* be able to explain a shift in reclaim
> > balance, if some hot subpages in those THPs were protecting colder
> > subpages from being reclaimed and refaulting.
> >
> > In that case, the root cause would still be a simple THP success rate
> > regression. To confirm this theory, could you run the baseline and the
> > patched sets both with THP disabled entirely?
>
> Will try this. And is bisecting within this series possible?

Yes. I built and put each commit incrementally through my test
machinery before sending them out. I can't vouch for all
configurations, of course, but I'd expect it to work.

> > Can you elaborate more on what the workload is doing exactly?
>
> These are simple benchmarks that measure the system and foreground
> app/tab performance under memory pressure, e.g., [1]. They open a
> bunch of apps/tabs (respectively on Android/ChromeOS) and switch
> between them. At a given time, one of them is foreground and the rest
> are background, obviously. When an app/tab has been in the background
> for a while, the userspace may call madvise(PAGEOUT) to reclaim (most
> of) its LRU pages, leaving unmovable kernel memory there. This
> strategy allows client systems to cache more apps/tabs in the
> background and reduce their startup/switch time. But it's also a major
> source of fragmentation (I'm sure you get why so I won't go into
> details here. And userspace also tries to make a better decision
> between reclaim/compact/kill based on fragmentation, but it's not
> easy.)

Thanks for the detailed explanation.

That last bit is interesting: how does it determine "fragmentation"?
The series might well affect this metric.

> [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/refs/heads/main/src/go.chromium.org/tast-tests/cros/local/bundles/cros/platform/memory_pressure.go
>
> > What are
> > the parameters of the test machine (CPUs, memory size)? It'd be
> > helpful if I could reproduce this locally as well.
>
> The data I shared previously is from an Intel i7-1255U + 4GB Chromebook.
>
> More data attached -- it contains vmstat, zoneinfo and pagetypeinfo
> files collected before the benchmark (after fresh reboots) and after
> the benchmark.

Thanks, I'll take a look.