Changes since v2 [8]:
- add acks/reviews (thanks David and Oscar)
- small wording and style changes
- rebase to next-20201111
Changes since v1 [7]:
- add acks/reviews (thanks David and Michal)
- drop "mm, page_alloc: make per_cpu_pageset accessible only after init" as
that's orthogonal and needs more consideration
- squash "mm, page_alloc: drain all pcplists during memory offline" into the
last patch, and move new zone_pcp_* functions into mm/page_alloc. As such,
the new 'force all cpus' param of __drain_all_pages() is never exported
outside page_alloc.c so I didn't add a new wrapper function to hide the bool
- keep pcp_batch_high_lock a mutex as offline_pages is synchronized anyway,
as suggested by Michal. Thus we don't need atomic variable and sync around
it, and patch is much smaller. If alloc_contic_range() wants to use the new
functionality and keep parallelism, we can add that on top.
As per the discussions [1] [2] this is an attempt to implement David's
suggestion that page isolation should disable pcplists to avoid races with page
freeing in progress. This is done without extra checks in fast paths, as
explained in Patch 9. The repeated draining done by [2] is then no longer
needed. Previous version (RFC) is at [3].
The RFC tried to hide pcplists disabling/enabling into page isolation, but it
wasn't completely possible, as memory offline does not unisolation. Michal
suggested an explicit API in [4] so that's the current implementation and it
seems indeed nicer.
Once we accept that page isolation users need to do explicit actions around it
depending on the needed guarantees, we can also IMHO accept that the current
pcplist draining can be also done by the callers, which is more effective.
After all, there are only two users of page isolation. So patch 6 does
effectively the same thing as Pavel proposed in [5], and patch 7 implement
stronger guarantees only for memory offline. If CMA decides to opt-in to the
stronger guarantee, it can be added later.
Patches 1-5 are preparatory cleanups for pcplist disabling.
Patchset was briefly tested in QEMU so that memory online/offline works, but
I haven't done a stress test that would prove the race fixed by [2] is
eliminated.
Note that patch 7 could be avoided if we instead adjusted page freeing in shown
in [6], but I believe the current implementation of disabling pcplists is not
too much complex, so I would prefer this instead of adding new checks and
longer irq-disabled section into page freeing hotpaths.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/
[6] https://lore.kernel.org/linux-mm/[email protected]/
[7] https://lore.kernel.org/linux-mm/[email protected]/
[8] https://lore.kernel.org/linux-mm/[email protected]/
Vlastimil Babka (7):
mm, page_alloc: clean up pageset high and batch update
mm, page_alloc: calculate pageset high and batch once per zone
mm, page_alloc: remove setup_pageset()
mm, page_alloc: simplify pageset_update()
mm, page_alloc: cache pageset high and batch in struct zone
mm, page_alloc: move draining pcplists to page isolation users
mm, page_alloc: disable pcplists during memory offline
include/linux/mmzone.h | 6 ++
mm/internal.h | 2 +
mm/memory_hotplug.c | 27 +++---
mm/page_alloc.c | 195 ++++++++++++++++++++++++-----------------
mm/page_isolation.c | 10 +--
5 files changed, 141 insertions(+), 99 deletions(-)
--
2.29.1
All per-cpu pagesets for a zone use the same high and batch values, that are
duplicated there just for performance (locality) reasons. This patch adds the
same variables also to struct zone as a shared copy.
This will be useful later for making possible to disable pcplists temporarily
by setting high value to 0, while remembering the values for restoring them
later. But we can also immediately benefit from not updating pagesets of all
possible cpus in case the newly recalculated values (after sysctl change or
memory online/offline) are actually unchanged from the previous ones.
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
include/linux/mmzone.h | 6 ++++++
mm/page_alloc.c | 16 ++++++++++++++--
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7385871768d4..bb9188de2718 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -470,6 +470,12 @@ struct zone {
#endif
struct pglist_data *zone_pgdat;
struct per_cpu_pageset __percpu *pageset;
+ /*
+ * the high and batch values are copied to individual pagesets for
+ * faster access
+ */
+ int pageset_high;
+ int pageset_batch;
#ifndef CONFIG_SPARSEMEM
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b43d6d1a288..25bc9bb77696 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5912,6 +5912,9 @@ static void build_zonelists(pg_data_t *pgdat)
* Other parts of the kernel may not check if the zone is available.
*/
static void pageset_init(struct per_cpu_pageset *p);
+/* These effectively disable the pcplists in the boot pageset completely */
+#define BOOT_PAGESET_HIGH 0
+#define BOOT_PAGESET_BATCH 1
static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
@@ -6301,8 +6304,8 @@ static void pageset_init(struct per_cpu_pageset *p)
* need to be as careful as pageset_update() as nobody can access the
* pageset yet.
*/
- pcp->high = 0;
- pcp->batch = 1;
+ pcp->high = BOOT_PAGESET_HIGH;
+ pcp->batch = BOOT_PAGESET_BATCH;
}
/*
@@ -6326,6 +6329,13 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
new_batch = max(1UL, 1 * new_batch);
}
+ if (zone->pageset_high == new_high &&
+ zone->pageset_batch == new_batch)
+ return;
+
+ zone->pageset_high = new_high;
+ zone->pageset_batch = new_batch;
+
for_each_possible_cpu(cpu) {
p = per_cpu_ptr(zone->pageset, cpu);
pageset_update(&p->pcp, new_high, new_batch);
@@ -6386,6 +6396,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
* offset of a (static) per cpu variable into the per cpu area.
*/
zone->pageset = &boot_pageset;
+ zone->pageset_high = BOOT_PAGESET_HIGH;
+ zone->pageset_batch = BOOT_PAGESET_BATCH;
if (populated_zone(zone))
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n",
--
2.29.1
Memory offlining relies on page isolation to guarantee a forward
progress because pages cannot be reused while they are isolated. But the
page isolation itself doesn't prevent from races while freed pages are
stored on pcp lists and thus can be reused. This can be worked around by
repeated draining of pcplists, as done by commit 968318261221
("mm/memory_hotplug: drain per-cpu pages again during memory offline").
David and Michal would prefer that this race was closed in a way that callers
of page isolation who need stronger guarantees don't need to repeatedly drain.
David suggested disabling pcplists usage completely during page isolation,
instead of repeatedly draining them.
To achieve this without adding special cases in alloc/free fastpath, we can use
the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
will be immediately flushed.
The race can thus be closed by setting pcp->high to 0 and draining pcplists
once, before calling start_isolate_page_range(). The draining will serialize
after processes that already disabled interrupts and read the old value of
pcp->high in free_unref_page_commit(), and processes that have not yet disabled
interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
pcplists. This guarantees no stray pages on pcplists in zones where isolation
happens.
This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
page isolation users can call before start_isolate_page_range() and after
unisolating (or offlining) the isolated pages.
Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
not empty. The check can however race with a free to pcplist that has not yet
increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
racy check and drain on all cpus, and use this option in zone_pcp_disable().
As we have to avoid external updates to high and batch while pcplists are
disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
zone_pcp_enable(). This also synchronizes multiple users of
zone_pcp_disable()/enable().
Currently the only user of this functionality is offline_pages().
Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/internal.h | 2 ++
mm/memory_hotplug.c | 28 ++++++++----------
mm/page_alloc.c | 69 +++++++++++++++++++++++++++++++++++----------
mm/page_isolation.c | 6 ++--
4 files changed, 71 insertions(+), 34 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..2966496680bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
extern void zone_pcp_update(struct zone *zone);
extern void zone_pcp_reset(struct zone *zone);
+extern void zone_pcp_disable(struct zone *zone);
+extern void zone_pcp_enable(struct zone *zone);
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3c494ab0d075..e0a561c550b3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1491,17 +1491,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
}
node = zone_to_nid(zone);
+ /*
+ * Disable pcplists so that page isolation cannot race with freeing
+ * in a way that pages from isolated pageblock are left on pcplists.
+ */
+ zone_pcp_disable(zone);
+
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
MIGRATE_MOVABLE,
MEMORY_OFFLINE | REPORT_FAILURE);
if (ret) {
reason = "failure to isolate range";
- goto failed_removal;
+ goto failed_removal_pcplists_disabled;
}
- drain_all_pages(zone);
-
arg.start_pfn = start_pfn;
arg.nr_pages = nr_pages;
node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1551,20 +1555,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
goto failed_removal_isolated;
}
- /*
- * per-cpu pages are drained after start_isolate_page_range, but
- * if there are still pages that are not free, make sure that we
- * drain again, because when we isolated range we might have
- * raced with another thread that was adding pages to pcp list.
- *
- * Forward progress should be still guaranteed because
- * pages on the pcp list can only belong to MOVABLE_ZONE
- * because has_unmovable_pages explicitly checks for
- * PageBuddy on freed pages on other zones.
- */
ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
- if (ret)
- drain_all_pages(zone);
+
} while (ret);
/* Mark all sections offline and remove free pages from the buddy. */
@@ -1580,6 +1572,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
spin_unlock_irqrestore(&zone->lock, flags);
+ zone_pcp_enable(zone);
+
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
zone->present_pages -= nr_pages;
@@ -1612,6 +1606,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
failed_removal_isolated:
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
memory_notify(MEM_CANCEL_OFFLINE, &arg);
+failed_removal_pcplists_disabled:
+ zone_pcp_enable(zone);
failed_removal:
pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
(unsigned long long) start_pfn << PAGE_SHIFT,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec3e1e27169..ba4943727793 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3030,14 +3030,7 @@ static void drain_local_pages_wq(struct work_struct *work)
preempt_enable();
}
-/*
- * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
- *
- * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
- */
-void drain_all_pages(struct zone *zone)
+void __drain_all_pages(struct zone *zone, bool force_all_cpus)
{
int cpu;
@@ -3076,7 +3069,13 @@ void drain_all_pages(struct zone *zone)
struct zone *z;
bool has_pcps = false;
- if (zone) {
+ if (force_all_cpus) {
+ /*
+ * The pcp.count check is racy, some callers need a
+ * guarantee that no cpu is missed.
+ */
+ has_pcps = true;
+ } else if (zone) {
pcp = per_cpu_ptr(zone->pageset, cpu);
if (pcp->pcp.count)
has_pcps = true;
@@ -3109,6 +3108,18 @@ void drain_all_pages(struct zone *zone)
mutex_unlock(&pcpu_drain_mutex);
}
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
+ *
+ * When zone parameter is non-NULL, spill just the single zone's pages.
+ *
+ * Note that this can be extremely slow as the draining happens in a workqueue.
+ */
+void drain_all_pages(struct zone *zone)
+{
+ __drain_all_pages(zone, false);
+}
+
#ifdef CONFIG_HIBERNATION
/*
@@ -6308,6 +6319,18 @@ static void pageset_init(struct per_cpu_pageset *p)
pcp->batch = BOOT_PAGESET_BATCH;
}
+void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
+ unsigned long batch)
+{
+ struct per_cpu_pageset *p;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ p = per_cpu_ptr(zone->pageset, cpu);
+ pageset_update(&p->pcp, high, batch);
+ }
+}
+
/*
* Calculate and set new high and batch values for all per-cpu pagesets of a
* zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
@@ -6315,8 +6338,6 @@ static void pageset_init(struct per_cpu_pageset *p)
static void zone_set_pageset_high_and_batch(struct zone *zone)
{
unsigned long new_high, new_batch;
- struct per_cpu_pageset *p;
- int cpu;
if (percpu_pagelist_fraction) {
new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6336,10 +6357,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
zone->pageset_high = new_high;
zone->pageset_batch = new_batch;
- for_each_possible_cpu(cpu) {
- p = per_cpu_ptr(zone->pageset, cpu);
- pageset_update(&p->pcp, new_high, new_batch);
- }
+ __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
}
void __meminit setup_zone_pageset(struct zone *zone)
@@ -8757,6 +8775,27 @@ void __meminit zone_pcp_update(struct zone *zone)
mutex_unlock(&pcp_batch_high_lock);
}
+/*
+ * Effectively disable pcplists for the zone by setting the high limit to 0
+ * and draining all cpus. A concurrent page freeing on another CPU that's about
+ * to put the page on pcplist will either finish before the drain and the page
+ * will be drained, or observe the new high limit and skip the pcplist.
+ *
+ * Must be paired with a call to zone_pcp_enable().
+ */
+void zone_pcp_disable(struct zone *zone)
+{
+ mutex_lock(&pcp_batch_high_lock);
+ __zone_set_pageset_high_and_batch(zone, 0, 1);
+ __drain_all_pages(zone, true);
+}
+
+void zone_pcp_enable(struct zone *zone)
+{
+ __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+ mutex_unlock(&pcp_batch_high_lock);
+}
+
void zone_pcp_reset(struct zone *zone)
{
unsigned long flags;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index feab446d1982..a254e1f370a3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -174,9 +174,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* A call to drain_all_pages() after isolation can flush most of them. However
* in some cases pages might still end up on pcp lists and that would allow
* for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs, further drain_all_pages()
- * might be needed (e.g. __offline_pages will need to call it after check for
- * isolated range for a next retry).
+ * on how strong of a guarantee the caller needs, zone_pcp_disable/enable()
+ * might be used to flush and disable pcplist before isolation and enable after
+ * unisolation.
*
* Return: 0 on success and -EBUSY if any part of range cannot be isolated.
*/
--
2.29.1
pageset_update() attempts to update pcplist's high and batch values in a way
that readers don't observe batch > high. It uses smp_wmb() to order the updates
in a way to achieve this. However, without proper pairing read barriers in
readers this guarantee doesn't hold, and there are no such barriers in
e.g. free_unref_page_commit().
Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in free_pcppages_bulk()")
already showed this is problematic, and solved this by ultimately only trusing
pcp->count of the current cpu with interrupts disabled.
The update dance with unpaired write barriers thus makes no sense. Replace
them with plain WRITE_ONCE to prevent store tearing, and document that the
values can change asynchronously and should not be trusted for correctness.
All current readers appear to be OK after 88e8ac11d2ea. Convert them to
READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
making future changes to the code that special care is needed.
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 40 ++++++++++++++++++----------------------
1 file changed, 18 insertions(+), 22 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5a8ec7d94884..8b43d6d1a288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1353,7 +1353,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
{
int migratetype = 0;
int batch_free = 0;
- int prefetch_nr = 0;
+ int prefetch_nr = READ_ONCE(pcp->batch);
bool isolated_pageblocks;
struct page *page, *tmp;
LIST_HEAD(head);
@@ -1404,8 +1404,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
* avoid excessive prefetching due to large count, only
* prefetch buddy for the first pcp->batch nr of pages.
*/
- if (prefetch_nr++ < pcp->batch)
+ if (prefetch_nr) {
prefetch_buddy(page);
+ prefetch_nr--;
+ }
} while (--count && --batch_free && !list_empty(list));
}
@@ -3202,10 +3204,8 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
pcp = &this_cpu_ptr(zone->pageset)->pcp;
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
- if (pcp->count >= pcp->high) {
- unsigned long batch = READ_ONCE(pcp->batch);
- free_pcppages_bulk(zone, batch, pcp);
- }
+ if (pcp->count >= READ_ONCE(pcp->high))
+ free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
}
/*
@@ -3390,7 +3390,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
do {
if (list_empty(list)) {
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, list,
+ READ_ONCE(pcp->batch), list,
migratetype, alloc_flags);
if (unlikely(list_empty(list)))
return NULL;
@@ -6262,13 +6262,16 @@ static int zone_batchsize(struct zone *zone)
}
/*
- * pcp->high and pcp->batch values are related and dependent on one another:
- * ->batch must never be higher then ->high.
- * The following function updates them in a safe manner without read side
- * locking.
+ * pcp->high and pcp->batch values are related and generally batch is lower
+ * than high. They are also related to pcp->count such that count is lower
+ * than high, and as soon as it reaches high, the pcplist is flushed.
*
- * Any new users of pcp->batch and pcp->high should ensure they can cope with
- * those fields changing asynchronously (acording to the above rule).
+ * However, guaranteeing these relations at all times would require e.g. write
+ * barriers here but also careful usage of read barriers at the read side, and
+ * thus be prone to error and bad for performance. Thus the update only prevents
+ * store tearing. Any new users of pcp->batch and pcp->high should ensure they
+ * can cope with those fields changing asynchronously, and fully trust only the
+ * pcp->count field on the local CPU with interrupts disabled.
*
* mutex_is_locked(&pcp_batch_high_lock) required when calling this function
* outside of boot time (or some other assurance that no concurrent updaters
@@ -6277,15 +6280,8 @@ static int zone_batchsize(struct zone *zone)
static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
unsigned long batch)
{
- /* start with a fail safe value for batch */
- pcp->batch = 1;
- smp_wmb();
-
- /* Update high, then batch, in order */
- pcp->high = high;
- smp_wmb();
-
- pcp->batch = batch;
+ WRITE_ONCE(pcp->batch, batch);
+ WRITE_ONCE(pcp->high, high);
}
static void pageset_init(struct per_cpu_pageset *p)
--
2.29.1
The updates to pcplists' high and batch values are handled by multiple
functions that make the calculations hard to follow. Consolidate everything
to pageset_set_high_and_batch() and remove pageset_set_batch() and
pageset_set_high() wrappers.
The only special case using one of the removed wrappers was:
build_all_zonelists_init()
setup_pageset()
pageset_set_batch()
which was hardcoding batch as 0, so we can just open-code a call to
pageset_update() with constant parameters instead.
No functional change.
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 49 ++++++++++++++++++++-----------------------------
1 file changed, 20 insertions(+), 29 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7560240fcf7d..3f1c57344e73 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5911,7 +5911,7 @@ static void build_zonelists(pg_data_t *pgdat)
* not check if the processor is online before following the pageset pointer.
* Other parts of the kernel may not check if the zone is available.
*/
-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
+static void setup_pageset(struct per_cpu_pageset *p);
static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
@@ -5979,7 +5979,7 @@ build_all_zonelists_init(void)
* (a chicken-egg dilemma).
*/
for_each_possible_cpu(cpu)
- setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+ setup_pageset(&per_cpu(boot_pageset, cpu));
mminit_verify_zonelist();
cpuset_init_current_mems_allowed();
@@ -6288,12 +6288,6 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
pcp->batch = batch;
}
-/* a companion to pageset_set_high() */
-static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
-{
- pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
-}
-
static void pageset_init(struct per_cpu_pageset *p)
{
struct per_cpu_pages *pcp;
@@ -6306,35 +6300,32 @@ static void pageset_init(struct per_cpu_pageset *p)
INIT_LIST_HEAD(&pcp->lists[migratetype]);
}
-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+static void setup_pageset(struct per_cpu_pageset *p)
{
pageset_init(p);
- pageset_set_batch(p, batch);
+ pageset_update(&p->pcp, 0, 1);
}
/*
- * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
- * to the value high for the pageset p.
+ * Calculate and set new high and batch values for given per-cpu pageset of a
+ * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
*/
-static void pageset_set_high(struct per_cpu_pageset *p,
- unsigned long high)
-{
- unsigned long batch = max(1UL, high / 4);
- if ((high / 4) > (PAGE_SHIFT * 8))
- batch = PAGE_SHIFT * 8;
-
- pageset_update(&p->pcp, high, batch);
-}
-
static void pageset_set_high_and_batch(struct zone *zone,
- struct per_cpu_pageset *pcp)
+ struct per_cpu_pageset *p)
{
- if (percpu_pagelist_fraction)
- pageset_set_high(pcp,
- (zone_managed_pages(zone) /
- percpu_pagelist_fraction));
- else
- pageset_set_batch(pcp, zone_batchsize(zone));
+ unsigned long new_high, new_batch;
+
+ if (percpu_pagelist_fraction) {
+ new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
+ new_batch = max(1UL, new_high / 4);
+ if ((new_high / 4) > (PAGE_SHIFT * 8))
+ new_batch = PAGE_SHIFT * 8;
+ } else {
+ new_batch = zone_batchsize(zone);
+ new_high = 6 * new_batch;
+ new_batch = max(1UL, 1 * new_batch);
+ }
+ pageset_update(&p->pcp, new_high, new_batch);
}
static void __meminit zone_pageset_init(struct zone *zone, int cpu)
--
2.29.1
Currently, pcplists are drained during set_migratetype_isolate() which means
once per pageblock processed start_isolate_page_range(). This is somewhat
wasteful. Moreover, the callers might need different guarantees, and the
draining is currently prone to races and does not guarantee that no page
from isolated pageblock will end up on the pcplist after the drain.
Better guarantees are added by later patches and require explicit actions
by page isolation users that need them. Thus it makes sense to move the
current imperfect draining to the callers also as a preparation step.
Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Pavel Tatashin <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/memory_hotplug.c | 11 ++++++-----
mm/page_alloc.c | 2 ++
mm/page_isolation.c | 10 +++++-----
3 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 41c62295292b..3c494ab0d075 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1500,6 +1500,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
goto failed_removal;
}
+ drain_all_pages(zone);
+
arg.start_pfn = start_pfn;
arg.nr_pages = nr_pages;
node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1550,11 +1552,10 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
}
/*
- * per-cpu pages are drained in start_isolate_page_range, but if
- * there are still pages that are not free, make sure that we
- * drain again, because when we isolated range we might
- * have raced with another thread that was adding pages to pcp
- * list.
+ * per-cpu pages are drained after start_isolate_page_range, but
+ * if there are still pages that are not free, make sure that we
+ * drain again, because when we isolated range we might have
+ * raced with another thread that was adding pages to pcp list.
*
* Forward progress should be still guaranteed because
* pages on the pcp list can only belong to MOVABLE_ZONE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 25bc9bb77696..2ec3e1e27169 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8539,6 +8539,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
if (ret)
return ret;
+ drain_all_pages(cc.zone);
+
/*
* In case of -EBUSY, we'd like to know which page causes problem.
* So, just fall through. test_pages_isolated() has a tracepoint
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index abbf42214485..feab446d1982 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -49,7 +49,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
__mod_zone_freepage_state(zone, -nr_pages, mt);
spin_unlock_irqrestore(&zone->lock, flags);
- drain_all_pages(zone);
return 0;
}
@@ -172,11 +171,12 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
*
* Please note that there is no strong synchronization with the page allocator
* either. Pages might be freed while their page blocks are marked ISOLATED.
- * In some cases pages might still end up on pcp lists and that would allow
+ * A call to drain_all_pages() after isolation can flush most of them. However
+ * in some cases pages might still end up on pcp lists and that would allow
* for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs drain_all_pages might be needed
- * (e.g. __offline_pages will need to call it after check for isolated range for
- * a next retry).
+ * on how strong of a guarantee the caller needs, further drain_all_pages()
+ * might be needed (e.g. __offline_pages will need to call it after check for
+ * isolated range for a next retry).
*
* Return: 0 on success and -EBUSY if any part of range cannot be isolated.
*/
--
2.29.1
We initialize boot-time pagesets with setup_pageset(), which sets high and
batch values that effectively disable pcplists.
We can remove this wrapper if we just set these values for all pagesets in
pageset_init(). Non-boot pagesets then subsequently update them to the proper
values.
No functional change.
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2fa432762908..5a8ec7d94884 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5911,7 +5911,7 @@ static void build_zonelists(pg_data_t *pgdat)
* not check if the processor is online before following the pageset pointer.
* Other parts of the kernel may not check if the zone is available.
*/
-static void setup_pageset(struct per_cpu_pageset *p);
+static void pageset_init(struct per_cpu_pageset *p);
static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
@@ -5979,7 +5979,7 @@ build_all_zonelists_init(void)
* (a chicken-egg dilemma).
*/
for_each_possible_cpu(cpu)
- setup_pageset(&per_cpu(boot_pageset, cpu));
+ pageset_init(&per_cpu(boot_pageset, cpu));
mminit_verify_zonelist();
cpuset_init_current_mems_allowed();
@@ -6298,12 +6298,15 @@ static void pageset_init(struct per_cpu_pageset *p)
pcp = &p->pcp;
for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
INIT_LIST_HEAD(&pcp->lists[migratetype]);
-}
-static void setup_pageset(struct per_cpu_pageset *p)
-{
- pageset_init(p);
- pageset_update(&p->pcp, 0, 1);
+ /*
+ * Set batch and high values safe for a boot pageset. A true percpu
+ * pageset's initialization will update them subsequently. Here we don't
+ * need to be as careful as pageset_update() as nobody can access the
+ * pageset yet.
+ */
+ pcp->high = 0;
+ pcp->batch = 1;
}
/*
--
2.29.1
We currently call pageset_set_high_and_batch() for each possible cpu, which
repeats the same calculations of high and batch values.
Instead call the function just once per zone, and make it apply the calculated
values to all per-cpu pagesets of the zone.
This also allows removing the zone_pageset_init() and __zone_pcp_update()
wrappers.
No functional change.
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 42 ++++++++++++++++++------------------------
1 file changed, 18 insertions(+), 24 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f1c57344e73..2fa432762908 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6307,13 +6307,14 @@ static void setup_pageset(struct per_cpu_pageset *p)
}
/*
- * Calculate and set new high and batch values for given per-cpu pageset of a
+ * Calculate and set new high and batch values for all per-cpu pagesets of a
* zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
*/
-static void pageset_set_high_and_batch(struct zone *zone,
- struct per_cpu_pageset *p)
+static void zone_set_pageset_high_and_batch(struct zone *zone)
{
unsigned long new_high, new_batch;
+ struct per_cpu_pageset *p;
+ int cpu;
if (percpu_pagelist_fraction) {
new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6325,23 +6326,25 @@ static void pageset_set_high_and_batch(struct zone *zone,
new_high = 6 * new_batch;
new_batch = max(1UL, 1 * new_batch);
}
- pageset_update(&p->pcp, new_high, new_batch);
-}
-
-static void __meminit zone_pageset_init(struct zone *zone, int cpu)
-{
- struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
- pageset_init(pcp);
- pageset_set_high_and_batch(zone, pcp);
+ for_each_possible_cpu(cpu) {
+ p = per_cpu_ptr(zone->pageset, cpu);
+ pageset_update(&p->pcp, new_high, new_batch);
+ }
}
void __meminit setup_zone_pageset(struct zone *zone)
{
+ struct per_cpu_pageset *p;
int cpu;
+
zone->pageset = alloc_percpu(struct per_cpu_pageset);
- for_each_possible_cpu(cpu)
- zone_pageset_init(zone, cpu);
+ for_each_possible_cpu(cpu) {
+ p = per_cpu_ptr(zone->pageset, cpu);
+ pageset_init(p);
+ }
+
+ zone_set_pageset_high_and_batch(zone);
}
/*
@@ -8080,15 +8083,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
return 0;
}
-static void __zone_pcp_update(struct zone *zone)
-{
- unsigned int cpu;
-
- for_each_possible_cpu(cpu)
- pageset_set_high_and_batch(zone,
- per_cpu_ptr(zone->pageset, cpu));
-}
-
/*
* percpu_pagelist_fraction - changes the pcp->high for each zone on each
* cpu. It is the fraction of total pages in each zone that a hot per cpu
@@ -8121,7 +8115,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
goto out;
for_each_populated_zone(zone)
- __zone_pcp_update(zone);
+ zone_set_pageset_high_and_batch(zone);
out:
mutex_unlock(&pcp_batch_high_lock);
return ret;
@@ -8746,7 +8740,7 @@ EXPORT_SYMBOL(free_contig_range);
void __meminit zone_pcp_update(struct zone *zone)
{
mutex_lock(&pcp_batch_high_lock);
- __zone_pcp_update(zone);
+ zone_set_pageset_high_and_batch(zone);
mutex_unlock(&pcp_batch_high_lock);
}
--
2.29.1
> The updates to pcplists' high and batch values are handled by multiple
> functions that make the calculations hard to follow. Consolidate everything
> to pageset_set_high_and_batch() and remove pageset_set_batch() and
> pageset_set_high() wrappers.
>
> The only special case using one of the removed wrappers was:
> build_all_zonelists_init()
> setup_pageset()
> pageset_set_batch()
> which was hardcoding batch as 0, so we can just open-code a call to
> pageset_update() with constant parameters instead.
>
> No functional change.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 49 ++++++++++++++++++++-----------------------------
> 1 file changed, 20 insertions(+), 29 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7560240fcf7d..3f1c57344e73 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5911,7 +5911,7 @@ static void build_zonelists(pg_data_t *pgdat)
> * not check if the processor is online before following the pageset pointer.
> * Other parts of the kernel may not check if the zone is available.
> */
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static void setup_pageset(struct per_cpu_pageset *p);
> static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
> static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>
> @@ -5979,7 +5979,7 @@ build_all_zonelists_init(void)
> * (a chicken-egg dilemma).
> */
> for_each_possible_cpu(cpu)
> - setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> + setup_pageset(&per_cpu(boot_pageset, cpu));
>
> mminit_verify_zonelist();
> cpuset_init_current_mems_allowed();
> @@ -6288,12 +6288,6 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
> pcp->batch = batch;
> }
>
> -/* a companion to pageset_set_high() */
> -static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> -{
> - pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
> -}
> -
> static void pageset_init(struct per_cpu_pageset *p)
> {
> struct per_cpu_pages *pcp;
> @@ -6306,35 +6300,32 @@ static void pageset_init(struct per_cpu_pageset *p)
> INIT_LIST_HEAD(&pcp->lists[migratetype]);
> }
>
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> +static void setup_pageset(struct per_cpu_pageset *p)
> {
> pageset_init(p);
> - pageset_set_batch(p, batch);
> + pageset_update(&p->pcp, 0, 1);
> }
>
> /*
> - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
> - * to the value high for the pageset p.
> + * Calculate and set new high and batch values for given per-cpu pageset of a
> + * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> */
> -static void pageset_set_high(struct per_cpu_pageset *p,
> - unsigned long high)
> -{
> - unsigned long batch = max(1UL, high / 4);
> - if ((high / 4) > (PAGE_SHIFT * 8))
> - batch = PAGE_SHIFT * 8;
> -
> - pageset_update(&p->pcp, high, batch);
> -}
> -
> static void pageset_set_high_and_batch(struct zone *zone,
> - struct per_cpu_pageset *pcp)
> + struct per_cpu_pageset *p)
> {
> - if (percpu_pagelist_fraction)
> - pageset_set_high(pcp,
> - (zone_managed_pages(zone) /
> - percpu_pagelist_fraction));
> - else
> - pageset_set_batch(pcp, zone_batchsize(zone));
> + unsigned long new_high, new_batch;
> +
> + if (percpu_pagelist_fraction) {
> + new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> + new_batch = max(1UL, new_high / 4);
> + if ((new_high / 4) > (PAGE_SHIFT * 8))
> + new_batch = PAGE_SHIFT * 8;
> + } else {
> + new_batch = zone_batchsize(zone);
> + new_high = 6 * new_batch;
> + new_batch = max(1UL, 1 * new_batch);
> + }
> + pageset_update(&p->pcp, new_high, new_batch);
> }
>
> static void __meminit zone_pageset_init(struct zone *zone, int cpu)
Looks good to me.
Acked-by: Pankaj Gupta <[email protected]>
> We currently call pageset_set_high_and_batch() for each possible cpu, which
> repeats the same calculations of high and batch values.
>
> Instead call the function just once per zone, and make it apply the calculated
> values to all per-cpu pagesets of the zone.
>
> This also allows removing the zone_pageset_init() and __zone_pcp_update()
> wrappers.
>
> No functional change.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 42 ++++++++++++++++++------------------------
> 1 file changed, 18 insertions(+), 24 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3f1c57344e73..2fa432762908 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6307,13 +6307,14 @@ static void setup_pageset(struct per_cpu_pageset *p)
> }
>
> /*
> - * Calculate and set new high and batch values for given per-cpu pageset of a
> + * Calculate and set new high and batch values for all per-cpu pagesets of a
> * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> */
> -static void pageset_set_high_and_batch(struct zone *zone,
> - struct per_cpu_pageset *p)
> +static void zone_set_pageset_high_and_batch(struct zone *zone)
> {
> unsigned long new_high, new_batch;
> + struct per_cpu_pageset *p;
> + int cpu;
>
> if (percpu_pagelist_fraction) {
> new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6325,23 +6326,25 @@ static void pageset_set_high_and_batch(struct zone *zone,
> new_high = 6 * new_batch;
> new_batch = max(1UL, 1 * new_batch);
> }
> - pageset_update(&p->pcp, new_high, new_batch);
> -}
> -
> -static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> -{
> - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
>
> - pageset_init(pcp);
> - pageset_set_high_and_batch(zone, pcp);
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_update(&p->pcp, new_high, new_batch);
> + }
> }
>
> void __meminit setup_zone_pageset(struct zone *zone)
> {
> + struct per_cpu_pageset *p;
> int cpu;
> +
> zone->pageset = alloc_percpu(struct per_cpu_pageset);
> - for_each_possible_cpu(cpu)
> - zone_pageset_init(zone, cpu);
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_init(p);
> + }
> +
> + zone_set_pageset_high_and_batch(zone);
> }
>
> /*
> @@ -8080,15 +8083,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
> return 0;
> }
>
> -static void __zone_pcp_update(struct zone *zone)
> -{
> - unsigned int cpu;
> -
> - for_each_possible_cpu(cpu)
> - pageset_set_high_and_batch(zone,
> - per_cpu_ptr(zone->pageset, cpu));
> -}
> -
> /*
> * percpu_pagelist_fraction - changes the pcp->high for each zone on each
> * cpu. It is the fraction of total pages in each zone that a hot per cpu
> @@ -8121,7 +8115,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
> goto out;
>
> for_each_populated_zone(zone)
> - __zone_pcp_update(zone);
> + zone_set_pageset_high_and_batch(zone);
> out:
> mutex_unlock(&pcp_batch_high_lock);
> return ret;
> @@ -8746,7 +8740,7 @@ EXPORT_SYMBOL(free_contig_range);
> void __meminit zone_pcp_update(struct zone *zone)
> {
> mutex_lock(&pcp_batch_high_lock);
> - __zone_pcp_update(zone);
> + zone_set_pageset_high_and_batch(zone);
> mutex_unlock(&pcp_batch_high_lock);
> }
Acked-by: Pankaj Gupta <[email protected]>
> We initialize boot-time pagesets with setup_pageset(), which sets high and
> batch values that effectively disable pcplists.
>
> We can remove this wrapper if we just set these values for all pagesets in
> pageset_init(). Non-boot pagesets then subsequently update them to the proper
> values.
>
> No functional change.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2fa432762908..5a8ec7d94884 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5911,7 +5911,7 @@ static void build_zonelists(pg_data_t *pgdat)
> * not check if the processor is online before following the pageset pointer.
> * Other parts of the kernel may not check if the zone is available.
> */
> -static void setup_pageset(struct per_cpu_pageset *p);
> +static void pageset_init(struct per_cpu_pageset *p);
> static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
> static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>
> @@ -5979,7 +5979,7 @@ build_all_zonelists_init(void)
> * (a chicken-egg dilemma).
> */
> for_each_possible_cpu(cpu)
> - setup_pageset(&per_cpu(boot_pageset, cpu));
> + pageset_init(&per_cpu(boot_pageset, cpu));
>
> mminit_verify_zonelist();
> cpuset_init_current_mems_allowed();
> @@ -6298,12 +6298,15 @@ static void pageset_init(struct per_cpu_pageset *p)
> pcp = &p->pcp;
> for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
> INIT_LIST_HEAD(&pcp->lists[migratetype]);
> -}
>
> -static void setup_pageset(struct per_cpu_pageset *p)
> -{
> - pageset_init(p);
> - pageset_update(&p->pcp, 0, 1);
> + /*
> + * Set batch and high values safe for a boot pageset. A true percpu
> + * pageset's initialization will update them subsequently. Here we don't
> + * need to be as careful as pageset_update() as nobody can access the
> + * pageset yet.
> + */
> + pcp->high = 0;
> + pcp->batch = 1;
> }
Acked-by: Pankaj Gupta <[email protected]>
On 11.11.20 10:28, Vlastimil Babka wrote:
> Memory offlining relies on page isolation to guarantee a forward
> progress because pages cannot be reused while they are isolated. But the
> page isolation itself doesn't prevent from races while freed pages are
> stored on pcp lists and thus can be reused. This can be worked around by
> repeated draining of pcplists, as done by commit 968318261221
> ("mm/memory_hotplug: drain per-cpu pages again during memory offline").
>
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
>
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
>
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
>
> This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
> page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages.
>
> Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
> not empty. The check can however race with a free to pcplist that has not yet
> increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
> racy check and drain on all cpus, and use this option in zone_pcp_disable().
^ there it is, can you move that comment into the code? People (e.g., me
:) ) staring at that will be confused otherwise.
--
Thanks,
David / dhildenb
On 11.11.20 10:28, Vlastimil Babka wrote:
> Memory offlining relies on page isolation to guarantee a forward
> progress because pages cannot be reused while they are isolated. But the
> page isolation itself doesn't prevent from races while freed pages are
> stored on pcp lists and thus can be reused. This can be worked around by
> repeated draining of pcplists, as done by commit 968318261221
> ("mm/memory_hotplug: drain per-cpu pages again during memory offline").
>
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
>
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
>
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
>
> This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
> page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages.
>
> Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
> not empty. The check can however race with a free to pcplist that has not yet
> increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
> racy check and drain on all cpus, and use this option in zone_pcp_disable().
>
> As we have to avoid external updates to high and batch while pcplists are
> disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
> zone_pcp_enable(). This also synchronizes multiple users of
> zone_pcp_disable()/enable().
>
> Currently the only user of this functionality is offline_pages().
>
> Suggested-by: David Hildenbrand <[email protected]>
> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> mm/internal.h | 2 ++
> mm/memory_hotplug.c | 28 ++++++++----------
> mm/page_alloc.c | 69 +++++++++++++++++++++++++++++++++++----------
> mm/page_isolation.c | 6 ++--
> 4 files changed, 71 insertions(+), 34 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c43ccdddb0f6..2966496680bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
>
> extern void zone_pcp_update(struct zone *zone);
> extern void zone_pcp_reset(struct zone *zone);
> +extern void zone_pcp_disable(struct zone *zone);
> +extern void zone_pcp_enable(struct zone *zone);
>
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 3c494ab0d075..e0a561c550b3 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1491,17 +1491,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> }
> node = zone_to_nid(zone);
>
> + /*
> + * Disable pcplists so that page isolation cannot race with freeing
> + * in a way that pages from isolated pageblock are left on pcplists.
> + */
> + zone_pcp_disable(zone);
> +
> /* set above range as isolated */
> ret = start_isolate_page_range(start_pfn, end_pfn,
> MIGRATE_MOVABLE,
> MEMORY_OFFLINE | REPORT_FAILURE);
> if (ret) {
> reason = "failure to isolate range";
> - goto failed_removal;
> + goto failed_removal_pcplists_disabled;
> }
>
> - drain_all_pages(zone);
> -
> arg.start_pfn = start_pfn;
> arg.nr_pages = nr_pages;
> node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1551,20 +1555,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> goto failed_removal_isolated;
> }
>
> - /*
> - * per-cpu pages are drained after start_isolate_page_range, but
> - * if there are still pages that are not free, make sure that we
> - * drain again, because when we isolated range we might have
> - * raced with another thread that was adding pages to pcp list.
> - *
> - * Forward progress should be still guaranteed because
> - * pages on the pcp list can only belong to MOVABLE_ZONE
> - * because has_unmovable_pages explicitly checks for
> - * PageBuddy on freed pages on other zones.
> - */
> ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> - if (ret)
> - drain_all_pages(zone);
> +
Why two empty lines before the "} while (ret);" ? (unless I'm confused
while looking at this diff)
[...]
> +void __drain_all_pages(struct zone *zone, bool force_all_cpus)
> {
> int cpu;
>
> @@ -3076,7 +3069,13 @@ void drain_all_pages(struct zone *zone)
> struct zone *z;
> bool has_pcps = false;
>
> - if (zone) {
> + if (force_all_cpus) {
> + /*
> + * The pcp.count check is racy, some callers need a
> + * guarantee that no cpu is missed.
Why this comment is helpful, it doesn't tell the whole story. Who
exactly/in which situations?
> + */
> + has_pcps = true;
> + } else if (zone) {
> pcp = per_cpu_ptr(zone->pageset, cpu);
> if (pcp->pcp.count)
> has_pcps = true;
> @@ -3109,6 +3108,18 @@ void drain_all_pages(struct zone *zone)
> mutex_unlock(&pcpu_drain_mutex);
> }
>
> +/*
> + * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> + *
> + * When zone parameter is non-NULL, spill just the single zone's pages.
> + *
> + * Note that this can be extremely slow as the draining happens in a workqueue.
> + */
> +void drain_all_pages(struct zone *zone)
> +{
> + __drain_all_pages(zone, false);
It's still somewhat unclear to me why we don't need "force_all_cpus"
here. Can you clarify that? (e.g., add a comment somewhere?)
[...]
> +void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
> + unsigned long batch)
> +{
> + struct per_cpu_pageset *p;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_update(&p->pcp, high, batch);
> + }
> +}
> +
> /*
> * Calculate and set new high and batch values for all per-cpu pagesets of a
> * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> @@ -6315,8 +6338,6 @@ static void pageset_init(struct per_cpu_pageset *p)
> static void zone_set_pageset_high_and_batch(struct zone *zone)
> {
> unsigned long new_high, new_batch;
> - struct per_cpu_pageset *p;
> - int cpu;
>
> if (percpu_pagelist_fraction) {
> new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6336,10 +6357,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
> zone->pageset_high = new_high;
> zone->pageset_batch = new_batch;
>
> - for_each_possible_cpu(cpu) {
> - p = per_cpu_ptr(zone->pageset, cpu);
> - pageset_update(&p->pcp, new_high, new_batch);
> - }
> + __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
> }
These two hunks look like an unrelated cleanup, or am I missing something?
Thanks for looking into this!
--
Thanks,
David / dhildenb
On 11/11/20 6:58 PM, David Hildenbrand wrote:
> On 11.11.20 10:28, Vlastimil Babka wrote:
>> - /*
>> - * per-cpu pages are drained after start_isolate_page_range, but
>> - * if there are still pages that are not free, make sure that we
>> - * drain again, because when we isolated range we might have
>> - * raced with another thread that was adding pages to pcp list.
>> - *
>> - * Forward progress should be still guaranteed because
>> - * pages on the pcp list can only belong to MOVABLE_ZONE
>> - * because has_unmovable_pages explicitly checks for
>> - * PageBuddy on freed pages on other zones.
>> - */
>> ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>> - if (ret)
>> - drain_all_pages(zone);
>> +
>
> Why two empty lines before the "} while (ret);" ? (unless I'm confused
> while looking at this diff)
>
No there's just a single emply line after "ret = test_pages_isolated..."
Before there was none, which looked ok with the extra identation of the
now-removed "drain_all_pages(zone);"
>> +void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
>> + unsigned long batch)
>> +{
>> + struct per_cpu_pageset *p;
>> + int cpu;
>> +
>> + for_each_possible_cpu(cpu) {
>> + p = per_cpu_ptr(zone->pageset, cpu);
>> + pageset_update(&p->pcp, high, batch);
>> + }
>> +}
>> +
>> /*
>> * Calculate and set new high and batch values for all per-cpu pagesets of a
>> * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>> @@ -6315,8 +6338,6 @@ static void pageset_init(struct per_cpu_pageset *p)
>> static void zone_set_pageset_high_and_batch(struct zone *zone)
>> {
>> unsigned long new_high, new_batch;
>> - struct per_cpu_pageset *p;
>> - int cpu;
>>
>> if (percpu_pagelist_fraction) {
>> new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
>> @@ -6336,10 +6357,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>> zone->pageset_high = new_high;
>> zone->pageset_batch = new_batch;
>>
>> - for_each_possible_cpu(cpu) {
>> - p = per_cpu_ptr(zone->pageset, cpu);
>> - pageset_update(&p->pcp, new_high, new_batch);
>> - }
>> + __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
>> }
>
> These two hunks look like an unrelated cleanup, or am I missing something?
It's extracting part of functionality to __zone_set_pageset_high_and_batch()
that's now called also from zone_pcp_enable() and zone_pcp_disable() - to only
adjust the per-cpu zone->pageset values without the usual calculation.
> Thanks for looking into this!
Thanks for review. Here's updated version that adds more detailed comment about
force_all_cpus parameter to __drain_all_pages() header, hopefully that clarifies
your concerns.
----8<----
From 5077d00c195fa66e613bdb27a195bbcc1e892515 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Fri, 18 Sep 2020 18:35:32 +0200
Subject: [PATCH] mm, page_alloc: disable pcplists during memory offline
Memory offlining relies on page isolation to guarantee a forward
progress because pages cannot be reused while they are isolated. But the
page isolation itself doesn't prevent from races while freed pages are
stored on pcp lists and thus can be reused. This can be worked around by
repeated draining of pcplists, as done by commit 968318261221
("mm/memory_hotplug: drain per-cpu pages again during memory offline").
David and Michal would prefer that this race was closed in a way that callers
of page isolation who need stronger guarantees don't need to repeatedly drain.
David suggested disabling pcplists usage completely during page isolation,
instead of repeatedly draining them.
To achieve this without adding special cases in alloc/free fastpath, we can use
the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
will be immediately flushed.
The race can thus be closed by setting pcp->high to 0 and draining pcplists
once, before calling start_isolate_page_range(). The draining will serialize
after processes that already disabled interrupts and read the old value of
pcp->high in free_unref_page_commit(), and processes that have not yet disabled
interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
pcplists. This guarantees no stray pages on pcplists in zones where isolation
happens.
This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
page isolation users can call before start_isolate_page_range() and after
unisolating (or offlining) the isolated pages.
Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
not empty. The check can however race with a free to pcplist that has not yet
increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
racy check and drain on all cpus, and use this option in zone_pcp_disable().
As we have to avoid external updates to high and batch while pcplists are
disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
zone_pcp_enable(). This also synchronizes multiple users of
zone_pcp_disable()/enable().
Currently the only user of this functionality is offline_pages().
Suggested-by: David Hildenbrand <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/internal.h | 2 ++
mm/memory_hotplug.c | 28 ++++++++---------
mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++++++++--------
mm/page_isolation.c | 6 ++--
4 files changed, 78 insertions(+), 31 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..2966496680bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
extern void zone_pcp_update(struct zone *zone);
extern void zone_pcp_reset(struct zone *zone);
+extern void zone_pcp_disable(struct zone *zone);
+extern void zone_pcp_enable(struct zone *zone);
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3c494ab0d075..e0a561c550b3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1491,17 +1491,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
}
node = zone_to_nid(zone);
+ /*
+ * Disable pcplists so that page isolation cannot race with freeing
+ * in a way that pages from isolated pageblock are left on pcplists.
+ */
+ zone_pcp_disable(zone);
+
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
MIGRATE_MOVABLE,
MEMORY_OFFLINE | REPORT_FAILURE);
if (ret) {
reason = "failure to isolate range";
- goto failed_removal;
+ goto failed_removal_pcplists_disabled;
}
- drain_all_pages(zone);
-
arg.start_pfn = start_pfn;
arg.nr_pages = nr_pages;
node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1551,20 +1555,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
goto failed_removal_isolated;
}
- /*
- * per-cpu pages are drained after start_isolate_page_range, but
- * if there are still pages that are not free, make sure that we
- * drain again, because when we isolated range we might have
- * raced with another thread that was adding pages to pcp list.
- *
- * Forward progress should be still guaranteed because
- * pages on the pcp list can only belong to MOVABLE_ZONE
- * because has_unmovable_pages explicitly checks for
- * PageBuddy on freed pages on other zones.
- */
ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
- if (ret)
- drain_all_pages(zone);
+
} while (ret);
/* Mark all sections offline and remove free pages from the buddy. */
@@ -1580,6 +1572,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
spin_unlock_irqrestore(&zone->lock, flags);
+ zone_pcp_enable(zone);
+
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
zone->present_pages -= nr_pages;
@@ -1612,6 +1606,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
failed_removal_isolated:
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
memory_notify(MEM_CANCEL_OFFLINE, &arg);
+failed_removal_pcplists_disabled:
+ zone_pcp_enable(zone);
failed_removal:
pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
(unsigned long long) start_pfn << PAGE_SHIFT,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec3e1e27169..51134ce608ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3031,13 +3031,16 @@ static void drain_local_pages_wq(struct work_struct *work)
}
/*
- * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
- *
- * When zone parameter is non-NULL, spill just the single zone's pages.
+ * The implementation of drain_all_pages(), exposing an extra parameter to
+ * drain on all cpus.
*
- * Note that this can be extremely slow as the draining happens in a workqueue.
+ * drain_all_pages() is optimized to only execute on cpus where pcplists are
+ * not empty. The check for non-emptiness can however race with a free to
+ * pcplist that has not yet increased the pcp->count from 0 to 1. Callers
+ * that need the guarantee that every CPU has drained can disable the
+ * optimizing racy check.
*/
-void drain_all_pages(struct zone *zone)
+void __drain_all_pages(struct zone *zone, bool force_all_cpus)
{
int cpu;
@@ -3076,7 +3079,13 @@ void drain_all_pages(struct zone *zone)
struct zone *z;
bool has_pcps = false;
- if (zone) {
+ if (force_all_cpus) {
+ /*
+ * The pcp.count check is racy, some callers need a
+ * guarantee that no cpu is missed.
+ */
+ has_pcps = true;
+ } else if (zone) {
pcp = per_cpu_ptr(zone->pageset, cpu);
if (pcp->pcp.count)
has_pcps = true;
@@ -3109,6 +3118,18 @@ void drain_all_pages(struct zone *zone)
mutex_unlock(&pcpu_drain_mutex);
}
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
+ *
+ * When zone parameter is non-NULL, spill just the single zone's pages.
+ *
+ * Note that this can be extremely slow as the draining happens in a workqueue.
+ */
+void drain_all_pages(struct zone *zone)
+{
+ __drain_all_pages(zone, false);
+}
+
#ifdef CONFIG_HIBERNATION
/*
@@ -6308,6 +6329,18 @@ static void pageset_init(struct per_cpu_pageset *p)
pcp->batch = BOOT_PAGESET_BATCH;
}
+void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
+ unsigned long batch)
+{
+ struct per_cpu_pageset *p;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ p = per_cpu_ptr(zone->pageset, cpu);
+ pageset_update(&p->pcp, high, batch);
+ }
+}
+
/*
* Calculate and set new high and batch values for all per-cpu pagesets of a
* zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
@@ -6315,8 +6348,6 @@ static void pageset_init(struct per_cpu_pageset *p)
static void zone_set_pageset_high_and_batch(struct zone *zone)
{
unsigned long new_high, new_batch;
- struct per_cpu_pageset *p;
- int cpu;
if (percpu_pagelist_fraction) {
new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6336,10 +6367,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
zone->pageset_high = new_high;
zone->pageset_batch = new_batch;
- for_each_possible_cpu(cpu) {
- p = per_cpu_ptr(zone->pageset, cpu);
- pageset_update(&p->pcp, new_high, new_batch);
- }
+ __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
}
void __meminit setup_zone_pageset(struct zone *zone)
@@ -8757,6 +8785,27 @@ void __meminit zone_pcp_update(struct zone *zone)
mutex_unlock(&pcp_batch_high_lock);
}
+/*
+ * Effectively disable pcplists for the zone by setting the high limit to 0
+ * and draining all cpus. A concurrent page freeing on another CPU that's about
+ * to put the page on pcplist will either finish before the drain and the page
+ * will be drained, or observe the new high limit and skip the pcplist.
+ *
+ * Must be paired with a call to zone_pcp_enable().
+ */
+void zone_pcp_disable(struct zone *zone)
+{
+ mutex_lock(&pcp_batch_high_lock);
+ __zone_set_pageset_high_and_batch(zone, 0, 1);
+ __drain_all_pages(zone, true);
+}
+
+void zone_pcp_enable(struct zone *zone)
+{
+ __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+ mutex_unlock(&pcp_batch_high_lock);
+}
+
void zone_pcp_reset(struct zone *zone)
{
unsigned long flags;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index feab446d1982..a254e1f370a3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -174,9 +174,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* A call to drain_all_pages() after isolation can flush most of them. However
* in some cases pages might still end up on pcp lists and that would allow
* for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs, further drain_all_pages()
- * might be needed (e.g. __offline_pages will need to call it after check for
- * isolated range for a next retry).
+ * on how strong of a guarantee the caller needs, zone_pcp_disable/enable()
+ * might be used to flush and disable pcplist before isolation and enable after
+ * unisolation.
*
* Return: 0 on success and -EBUSY if any part of range cannot be isolated.
*/
--
2.29.1
On 11.11.20 10:28, Vlastimil Babka wrote:
> All per-cpu pagesets for a zone use the same high and batch values, that are
> duplicated there just for performance (locality) reasons. This patch adds the
> same variables also to struct zone as a shared copy.
>
> This will be useful later for making possible to disable pcplists temporarily
> by setting high value to 0, while remembering the values for restoring them
> later. But we can also immediately benefit from not updating pagesets of all
> possible cpus in case the newly recalculated values (after sysctl change or
> memory online/offline) are actually unchanged from the previous ones.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> include/linux/mmzone.h | 6 ++++++
> mm/page_alloc.c | 16 ++++++++++++++--
> 2 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7385871768d4..bb9188de2718 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -470,6 +470,12 @@ struct zone {
> #endif
> struct pglist_data *zone_pgdat;
> struct per_cpu_pageset __percpu *pageset;
> + /*
> + * the high and batch values are copied to individual pagesets for
> + * faster access
> + */
> + int pageset_high;
> + int pageset_batch;
>
> #ifndef CONFIG_SPARSEMEM
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8b43d6d1a288..25bc9bb77696 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5912,6 +5912,9 @@ static void build_zonelists(pg_data_t *pgdat)
> * Other parts of the kernel may not check if the zone is available.
> */
> static void pageset_init(struct per_cpu_pageset *p);
> +/* These effectively disable the pcplists in the boot pageset completely */
> +#define BOOT_PAGESET_HIGH 0
> +#define BOOT_PAGESET_BATCH 1
> static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
> static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>
> @@ -6301,8 +6304,8 @@ static void pageset_init(struct per_cpu_pageset *p)
> * need to be as careful as pageset_update() as nobody can access the
> * pageset yet.
> */
> - pcp->high = 0;
> - pcp->batch = 1;
> + pcp->high = BOOT_PAGESET_HIGH;
> + pcp->batch = BOOT_PAGESET_BATCH;
> }
>
> /*
> @@ -6326,6 +6329,13 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
> new_batch = max(1UL, 1 * new_batch);
> }
>
> + if (zone->pageset_high == new_high &&
> + zone->pageset_batch == new_batch)
> + return;
> +
> + zone->pageset_high = new_high;
> + zone->pageset_batch = new_batch;
> +
> for_each_possible_cpu(cpu) {
> p = per_cpu_ptr(zone->pageset, cpu);
> pageset_update(&p->pcp, new_high, new_batch);
> @@ -6386,6 +6396,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
> * offset of a (static) per cpu variable into the per cpu area.
> */
> zone->pageset = &boot_pageset;
> + zone->pageset_high = BOOT_PAGESET_HIGH;
> + zone->pageset_batch = BOOT_PAGESET_BATCH;
>
> if (populated_zone(zone))
> printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n",
>
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On 12.11.20 16:18, Vlastimil Babka wrote:
> On 11/11/20 6:58 PM, David Hildenbrand wrote:
>> On 11.11.20 10:28, Vlastimil Babka wrote:
>>> - /*
>>> - * per-cpu pages are drained after start_isolate_page_range, but
>>> - * if there are still pages that are not free, make sure that we
>>> - * drain again, because when we isolated range we might have
>>> - * raced with another thread that was adding pages to pcp list.
>>> - *
>>> - * Forward progress should be still guaranteed because
>>> - * pages on the pcp list can only belong to MOVABLE_ZONE
>>> - * because has_unmovable_pages explicitly checks for
>>> - * PageBuddy on freed pages on other zones.
>>> - */
>>> ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>>> - if (ret)
>>> - drain_all_pages(zone);
>>> +
>>
>> Why two empty lines before the "} while (ret);" ? (unless I'm confused
>> while looking at this diff)
>>
>
> No there's just a single emply line after "ret = test_pages_isolated..."
> Before there was none, which looked ok with the extra identation of the
> now-removed "drain_all_pages(zone);"
>
>>> +void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
>>> + unsigned long batch)
>>> +{
>>> + struct per_cpu_pageset *p;
>>> + int cpu;
>>> +
>>> + for_each_possible_cpu(cpu) {
>>> + p = per_cpu_ptr(zone->pageset, cpu);
>>> + pageset_update(&p->pcp, high, batch);
>>> + }
>>> +}
>>> +
>>> /*
>>> * Calculate and set new high and batch values for all per-cpu pagesets of a
>>> * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>>> @@ -6315,8 +6338,6 @@ static void pageset_init(struct per_cpu_pageset *p)
>>> static void zone_set_pageset_high_and_batch(struct zone *zone)
>>> {
>>> unsigned long new_high, new_batch;
>>> - struct per_cpu_pageset *p;
>>> - int cpu;
>>>
>>> if (percpu_pagelist_fraction) {
>>> new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
>>> @@ -6336,10 +6357,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>>> zone->pageset_high = new_high;
>>> zone->pageset_batch = new_batch;
>>>
>>> - for_each_possible_cpu(cpu) {
>>> - p = per_cpu_ptr(zone->pageset, cpu);
>>> - pageset_update(&p->pcp, new_high, new_batch);
>>> - }
>>> + __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
>>> }
>>
>> These two hunks look like an unrelated cleanup, or am I missing something?
>
> It's extracting part of functionality to __zone_set_pageset_high_and_batch()
> that's now called also from zone_pcp_enable() and zone_pcp_disable() - to only
> adjust the per-cpu zone->pageset values without the usual calculation.
>
>> Thanks for looking into this!
>
> Thanks for review. Here's updated version that adds more detailed comment about
> force_all_cpus parameter to __drain_all_pages() header, hopefully that clarifies
> your concerns.
LGTM!
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb