Date: Tue, 12 Sep 2023 10:50:28 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Vlastimil Babka
Cc: Andrew Morton, Mel Gorman, Miaohe Lin, Kefeng Wang, Zi Yan,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/6] mm: page_alloc: remove pcppage migratetype caching
Message-ID: <20230912145028.GA3228@cmpxchg.org>
References: <20230911195023.247694-1-hannes@cmpxchg.org>
	<20230911195023.247694-2-hannes@cmpxchg.org>

On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
> On 9/11/23 21:41, Johannes Weiner wrote:
> > The idea behind the cache is to save get_pageblock_migratetype()
> > lookups during bulk freeing. A microbenchmark suggests this isn't
> > helping, though. The pcp migratetype can get stale, which means that
> > bulk freeing has an extra branch to check if the pageblock was
> > isolated while on the pcp.
> >
> > While the variance overlaps, the cache write and the branch seem to
> > make this a net negative. The following test allocates and frees
> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> >
> > Before:
> >           8,668.48 msec task-clock        #  99.735 CPUs utilized    ( +- 2.90% )
> >                 19      context-switches  #   4.341 /sec             ( +- 3.24% )
> >                  0      cpu-migrations    #   0.000 /sec
> >             17,440      page-faults       #   3.984 K/sec            ( +- 2.90% )
> >     41,758,692,473      cycles            #   9.541 GHz              ( +- 2.90% )
> >    126,201,294,231      instructions      #   5.98  insn per cycle   ( +- 2.90% )
> >     25,348,098,335      branches          #   5.791 G/sec            ( +- 2.90% )
> >         33,436,921      branch-misses     #   0.26% of all branches  ( +- 2.90% )
> >
> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +- 0.03% )
> >
> > After:
> >           8,444.81 msec task-clock        #  99.726 CPUs utilized    ( +- 2.90% )
> >                 22      context-switches  #   5.160 /sec             ( +- 3.23% )
> >                  0      cpu-migrations    #   0.000 /sec
> >             17,443      page-faults       #   4.091 K/sec            ( +- 2.90% )
> >     40,616,738,355      cycles            #   9.527 GHz              ( +- 2.90% )
> >    126,383,351,792      instructions      #   6.16  insn per cycle   ( +- 2.90% )
> >     25,224,985,153      branches          #   5.917 G/sec            ( +- 2.90% )
> >         32,236,793      branch-misses     #   0.25% of all branches  ( +- 2.90% )
> >
> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +- 0.05% )
> >
> > A side effect is that this also ensures that pages whose pageblock
> > gets stolen while on the pcplist end up on the right freelist and we
> > don't perform potentially type-incompatible buddy merges (or skip
> > merges when we shouldn't), which is likely beneficial to long-term
> > fragmentation management, although the effects would be harder to
> > measure. Settle for simpler and faster code as justification here.
>
> Makes sense to me, so
>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Reviewed-by: Vlastimil Babka

Thanks!
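To make the tradeoff above concrete, here is a minimal user-space
sketch (hypothetical names and a toy pageblock table, not the kernel
code): the cached migratetype still has to be cross-checked against
the authoritative pageblock state for isolation, whereas the direct
lookup has no stale state to reconcile.

	#include <stdio.h>

	/*
	 * Toy model: pageblock_mt[] stands in for the pageblock bitmap
	 * that get_pageblock_migratetype() consults; pcp_mt models the
	 * per-page cache this patch removes.
	 */
	enum migratetype { MIGRATE_MOVABLE, MIGRATE_ISOLATE };

	#define NR_BLOCKS 4
	static enum migratetype pageblock_mt[NR_BLOCKS];	/* authoritative */

	struct page {
		int block;			/* pageblock the page belongs to */
		enum migratetype pcp_mt;	/* cached at alloc; can go stale */
	};

	/* Old scheme: trust the cache, but branch to catch stale isolation. */
	static enum migratetype free_mt_cached(const struct page *p)
	{
		if (pageblock_mt[p->block] == MIGRATE_ISOLATE)	/* extra branch */
			return MIGRATE_ISOLATE;
		return p->pcp_mt;
	}

	/* New scheme: one authoritative lookup, nothing to go stale. */
	static enum migratetype free_mt_lookup(const struct page *p)
	{
		return pageblock_mt[p->block];
	}

	int main(void)
	{
		struct page p = { .block = 1, .pcp_mt = MIGRATE_MOVABLE };

		pageblock_mt[1] = MIGRATE_ISOLATE;	/* isolated while on pcp */
		printf("cached: %d, lookup: %d\n",
		       free_mt_cached(&p), free_mt_lookup(&p));
		return 0;
	}

Both paths end up reading the pageblock state; the cached path just
pays an extra write at allocation time and an extra branch at free,
which matches the perf numbers above.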
> > @@ -1577,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >  			continue;
> >  		del_page_from_free_list(page, zone, current_order);
> >  		expand(zone, page, order, current_order, migratetype);
> > -		set_pcppage_migratetype(page, migratetype);
>
> Hm interesting, just noticed that __rmqueue_fallback() never did this
> AFAICS, sounds like a bug.

I don't quite follow. Which part?

Keep in mind that at this point __rmqueue_fallback() doesn't return a
page. It just moves pages to the desired freelist, and then
__rmqueue_smallest() gets called again. This changes in 5/6, but until
now at least all of the above would apply to fallback pages.
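A toy model of the retry loop being described (hypothetical toy_*
names, heavily simplified; the real __rmqueue() also has a CMA
fallback path): because the fallback step only restocks the desired
freelist and the allocation retries through the "smallest" path,
fallback pages go through the same exit, including the removed
set_pcppage_migratetype() call.

	#include <stdio.h>
	#include <stdbool.h>

	#define MT_COUNT 2		/* desired type and one fallback */
	static int freelist[MT_COUNT];	/* free pages per migratetype */

	static bool toy_rmqueue_smallest(int mt)
	{
		if (freelist[mt] > 0) {
			freelist[mt]--;	/* this path "returns the page" */
			return true;
		}
		return false;
	}

	static bool toy_rmqueue_fallback(int mt)
	{
		int other = !mt;

		if (freelist[other] == 0)
			return false;
		/* Steal the block: move pages over, but return no page. */
		freelist[mt] += freelist[other];
		freelist[other] = 0;
		return true;
	}

	static bool toy_rmqueue(int mt)
	{
	retry:
		if (toy_rmqueue_smallest(mt))
			return true;
		if (toy_rmqueue_fallback(mt))
			goto retry;
		return false;
	}

	int main(void)
	{
		freelist[0] = 0;	/* desired type is empty */
		freelist[1] = 3;	/* fallback type has pages */
		printf("allocated: %s, freelists: [%d, %d]\n",
		       toy_rmqueue(0) ? "yes" : "no",
		       freelist[0], freelist[1]);
		return 0;
	}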
> > @@ -2145,7 +2123,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >  		 * pages are ordered properly.
> >  		 */
> >  		list_add_tail(&page->pcp_list, list);
> > -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> > +		if (is_migrate_cma(get_pageblock_migratetype(page)))
> >  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> >  					      -(1 << order));
>
> This is potentially a source of overhead, I assume patch 6/6 might
> change that.

Yes, 6/6 removes it altogether. But the test results in this patch's
changelog are from this patch in isolation, so it doesn't appear to be
a concern even on its own.

> > @@ -2457,7 +2423,7 @@ void free_unref_page_list(struct list_head *list)
> >  		 * Free isolated pages directly to the allocator, see
> >  		 * comment in free_unref_page.
> >  		 */
> > -		migratetype = get_pcppage_migratetype(page);
> > +		migratetype = get_pfnblock_migratetype(page, pfn);
> >  		if (unlikely(is_migrate_isolate(migratetype))) {
> >  			list_del(&page->lru);
> >  			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
>
> I think after this change we should move the isolated pages handling to
> the second loop below, so that we wouldn't have to call
> get_pfnblock_migratetype() twice per page. Dunno yet if some later
> patch does that. It would need to unlock pcp when necessary.

That sounds like a great idea. Something like the following? Lightly
tested. If you're good with it, I'll beat some more on it and submit it
as a follow-up.

---
From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 12 Sep 2023 10:16:10 -0400
Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()

Move direct freeing of isolated pages to the lock-breaking block in
the second loop. This saves an unnecessary migratetype reassessment.

Minor comment and local variable scoping cleanups.

Suggested-by: Vlastimil Babka
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 49 +++++++++++++++++++++----------------------------
 1 file changed, 21 insertions(+), 28 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3f1c777feed..9cad31de1bf5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2408,48 +2408,41 @@ void free_unref_page_list(struct list_head *list)
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
 	int batch_count = 0;
-	int migratetype;
-
-	/* Prepare pages for freeing */
-	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_to_pfn(page);
-		if (!free_pages_prepare(page, 0, FPI_NONE)) {
+	list_for_each_entry_safe(page, next, list, lru)
+		if (!free_pages_prepare(page, 0, FPI_NONE))
 			list_del(&page->lru);
-			continue;
-		}
-
-		/*
-		 * Free isolated pages directly to the allocator, see
-		 * comment in free_unref_page.
-		 */
-		migratetype = get_pfnblock_migratetype(page, pfn);
-		if (unlikely(is_migrate_isolate(migratetype))) {
-			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
-			continue;
-		}
-	}
 
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
 		struct zone *zone = page_zone(page);
+		int migratetype;
 
 		list_del(&page->lru);
 		migratetype = get_pfnblock_migratetype(page, pfn);
 
 		/*
-		 * Either different zone requiring a different pcp lock or
-		 * excessive lock hold times when freeing a large list of
-		 * pages.
+		 * Zone switch, batch complete, or non-pcp freeing?
+		 * Drop the pcp lock and evaluate.
 		 */
-		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
+		if (unlikely(zone != locked_zone ||
+			     batch_count == SWAP_CLUSTER_MAX ||
+			     is_migrate_isolate(migratetype))) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
 				pcp_trylock_finish(UP_flags);
+				locked_zone = NULL;
 			}
-			batch_count = 0;
 
+			/*
+			 * Free isolated pages directly to the
+			 * allocator, see comment in free_unref_page.
+			 */
+			if (is_migrate_isolate(migratetype)) {
+				free_one_page(zone, page, pfn, 0,
+					      migratetype, FPI_NONE);
+				continue;
+			}
 
 			/*
 			 * trylock is necessary as pages may be getting freed
@@ -2459,12 +2452,12 @@ void free_unref_page_list(struct list_head *list)
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, pfn,
-					      0, migratetype, FPI_NONE);
-				locked_zone = NULL;
+				free_one_page(zone, page, pfn, 0,
+					      migratetype, FPI_NONE);
 				continue;
 			}
 			locked_zone = zone;
+			batch_count = 0;
 		}
 
 		/*
-- 
2.42.0