Regarding the first patch: The actual hard work of removing the lock was
done by Oscar, but I have set myself as author for fear of saying
something stupid in the commit message that would then be attributed to
him... Not sure what the best practice is there, please feel free to
reset the authorship to him (or I can send a v2) if that's preferred.
Signed-off-by: Brendan Jackman <[email protected]>
---
Brendan Jackman (2):
mm,memory_hotplug: Remove un-taken lock
mm,memory_hotplug: {READ,WRITE}_ONCE unsynchronized zone data
include/linux/memory_hotplug.h | 35 -----------------------------------
include/linux/mmzone.h | 37 +++++++++++++++----------------------
mm/compaction.c | 2 +-
mm/memory_hotplug.c | 20 ++++++++++++--------
mm/mm_init.c | 3 +--
mm/page_alloc.c | 12 ++++--------
mm/show_mem.c | 8 ++++----
mm/vmstat.c | 4 ++--
8 files changed, 39 insertions(+), 82 deletions(-)
---
base-commit: 8f6a15f095a63a83b096d9b29aaff4f0fbe6f6e6
change-id: 20240521-mm-hotplug-sync-310ebbf34062
Best regards,
--
Brendan Jackman <[email protected]>
It seems that [1] was acked, and a v2 was written[2] which improved
upon it, but got bogged down in discussion of other topics, so the
improvements were not included. Then [1] got merged as commit
27cacaad16c5 ("mm,memory_hotplug: drop unneeded locking") and we ended
up with locks that get taken for read but never for write.
So, let's remove the read locking.
Compared to Oscar's original v2[2], I have added a READ_ONCE in
page_outside_zone_boundaries; this is a substitute for the compiler
barrier that was implied by read_seqretry(). I believe this is necessary
to insure against UB, although the value being read here is only used
for a printk so the stakes seem very low (and this is all debug code
anyway). I believe a compiler barrier is also needed in zone_spans_pfn,
but I'll address that in a separate patch.
That read_seqretry() also implied a CPU-level memory barrier, which I
don't think needs replacing: page_outside_zone_boundaries() is used in
the alloc and free paths, but you can't allocate or free pages from
the span that is in the middle of being added/removed by hotplug.
In other words, page_outside_zone_boundaries() doesn't require a
strictly up-to-date view of spanned_pages, but I think it does require
a value that was once/will eventually be correct, hence READ_ONCE.
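To illustrate what the marked read buys, here is a standalone userspace
model (not the kernel code; READ_ONCE is modelled as a plain volatile
access):

#include <stdio.h>

#define READ_ONCE(x) (*(const volatile typeof(x) *)&(x))

static unsigned long start_pfn = 1000;
static unsigned long spanned = 4096;    /* imagine hotplug writing this */

static int pfn_outside_span(unsigned long pfn)
{
        /*
         * One load, shared by the check and the diagnostic below. With a
         * plain read the compiler could legally reload "spanned" for each
         * use, and the two uses could then disagree.
         */
        unsigned long sp = READ_ONCE(spanned);
        int outside = pfn < start_pfn || pfn >= start_pfn + sp;

        if (outside)
                printf("pfn %lu outside [%lu, %lu)\n",
                       pfn, start_pfn, start_pfn + sp);
        return outside;
}

int main(void)
{
        return !pfn_outside_span(6000);
}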
[1] https://lore.kernel.org/all/[email protected]/T/#u
[2] https://lore.kernel.org/linux-mm/[email protected]/#t
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Co-developed-by: Oscar Salvador <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
Signed-off-by: Brendan Jackman <[email protected]>
---
include/linux/memory_hotplug.h | 35 -----------------------------------
include/linux/mmzone.h | 23 +++++------------------
mm/mm_init.c | 1 -
mm/page_alloc.c | 10 +++-------
4 files changed, 8 insertions(+), 61 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7a9ff464608d..f9577e67e5ee 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -141,31 +141,7 @@ bool mhp_supports_memmap_on_memory(void);
/*
* Zone resizing functions
- *
- * Note: any attempt to resize a zone should has pgdat_resize_lock()
- * zone_span_writelock() both held. This ensure the size of a zone
- * can't be changed while pgdat_resize_lock() held.
*/
-static inline unsigned zone_span_seqbegin(struct zone *zone)
-{
- return read_seqbegin(&zone->span_seqlock);
-}
-static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
-{
- return read_seqretry(&zone->span_seqlock, iv);
-}
-static inline void zone_span_writelock(struct zone *zone)
-{
- write_seqlock(&zone->span_seqlock);
-}
-static inline void zone_span_writeunlock(struct zone *zone)
-{
- write_sequnlock(&zone->span_seqlock);
-}
-static inline void zone_seqlock_init(struct zone *zone)
-{
- seqlock_init(&zone->span_seqlock);
-}
extern void adjust_present_page_count(struct page *page,
struct memory_group *group,
long nr_pages);
@@ -251,17 +227,6 @@ static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat)
___page; \
})
-static inline unsigned zone_span_seqbegin(struct zone *zone)
-{
- return 0;
-}
-static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
-{
- return 0;
-}
-static inline void zone_span_writelock(struct zone *zone) {}
-static inline void zone_span_writeunlock(struct zone *zone) {}
-static inline void zone_seqlock_init(struct zone *zone) {}
static inline int try_online_node(int nid)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..194ef7fed9d6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -14,7 +14,6 @@
#include <linux/threads.h>
#include <linux/numa.h>
#include <linux/init.h>
-#include <linux/seqlock.h>
#include <linux/nodemask.h>
#include <linux/pageblock-flags.h>
#include <linux/page-flags-layout.h>
@@ -896,18 +895,11 @@ struct zone {
*
* Locking rules:
*
- * zone_start_pfn and spanned_pages are protected by span_seqlock.
- * It is a seqlock because it has to be read outside of zone->lock,
- * and it is done in the main allocator path. But, it is written
- * quite infrequently.
- *
- * The span_seq lock is declared along with zone->lock because it is
- * frequently read in proximity to zone->lock. It's good to
- * give them a chance of being in the same cacheline.
- *
- * Write access to present_pages at runtime should be protected by
- * mem_hotplug_begin/done(). Any reader who can't tolerant drift of
- * present_pages should use get_online_mems() to get a stable value.
+ * Besides system initialization functions, memory-hotplug is the only
+ * user that can change zone's {spanned,present} pages at runtime, and
+ * it does so by holding the mem_hotplug_lock lock. Any readers who
+ * can't tolerate drift values should use {get,put}_online_mems to get
+ * a stable value.
*/
atomic_long_t managed_pages;
unsigned long spanned_pages;
@@ -930,11 +922,6 @@ struct zone {
unsigned long nr_isolate_pageblock;
#endif
-#ifdef CONFIG_MEMORY_HOTPLUG
- /* see spanned/present_pages for more description */
- seqlock_t span_seqlock;
-#endif
-
int initialized;
/* Write-intensive fields used from the page allocator */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f72b852bd5b8..c725618aeb58 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1383,7 +1383,6 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
zone->name = zone_names[idx];
zone->zone_pgdat = NODE_DATA(nid);
spin_lock_init(&zone->lock);
- zone_seqlock_init(zone);
zone_pcp_init(zone);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..5116a2b9ea6e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,16 +426,12 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
int ret;
- unsigned seq;
unsigned long pfn = page_to_pfn(page);
unsigned long sp, start_pfn;
- do {
- seq = zone_span_seqbegin(zone);
- start_pfn = zone->zone_start_pfn;
- sp = zone->spanned_pages;
- ret = !zone_spans_pfn(zone, pfn);
- } while (zone_span_seqretry(zone, seq));
+ start_pfn = zone->zone_start_pfn;
+ sp = READ_ONCE(zone->spanned_pages);
+ ret = !zone_spans_pfn(zone, pfn);
if (ret)
pr_err("page 0x%lx outside node %d zone %s [ 0x%lx - 0x%lx ]\n",
--
2.45.0.rc1.225.g2a3ae87e7f-goog
These fields are written by memory hotplug under mem_hotplug_lock but
read without any lock. It seems like reader code is robust against the
value being stale or "from the future", but we also need to account
for:
1. Load/store tearing (according to Linus[1], this really happens,
even when everything is aligned as you would hope).
2. Invented loads[2] - the compiler can spill and re-read these fields
([2] calls this "invented loads") and assume that they have not
changed.
Note we don't need READ_ONCE in paths that have the mem_hotplug_lock
for write, but we still need WRITE_ONCE to prevent store-tearing.
[1] https://lore.kernel.org/all/CAHk-=wj2t+GK+DGQ7Xy6U7zMf72e7Jkxn4_-kGyfH3WFEoH+YQ@mail.gmail.com/T/#u
As discovered via the original big-bad article[2]
[2] https://lwn.net/Articles/793253/
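For illustration only, here is a toy userspace model of that scheme
(build with -pthread; the mutex stands in for mem_hotplug_lock and this
is not kernel code):

#include <pthread.h>
#include <stdio.h>

#define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long present_pages;

/*
 * Writer: serialized by the lock, so its loads may be plain, but its
 * store still races with lockless readers and must not be torn.
 */
static void adjust_present(long nr)
{
        pthread_mutex_lock(&hotplug_lock);
        WRITE_ONCE(present_pages, present_pages + nr);
        pthread_mutex_unlock(&hotplug_lock);
}

/* Lockless reader: a single marked load the compiler cannot tear or redo. */
static int populated(void)
{
        return READ_ONCE(present_pages) != 0;
}

int main(void)
{
        adjust_present(512);
        printf("populated: %d\n", populated());
        return 0;
}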
Signed-off-by: Brendan Jackman <[email protected]>
---
include/linux/mmzone.h | 14 ++++++++++----
mm/compaction.c | 2 +-
mm/memory_hotplug.c | 20 ++++++++++++--------
mm/mm_init.c | 2 +-
mm/page_alloc.c | 2 +-
mm/show_mem.c | 8 ++++----
mm/vmstat.c | 4 ++--
7 files changed, 31 insertions(+), 21 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 194ef7fed9d6..bdb3be76d10c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1018,11 +1018,13 @@ static inline unsigned long zone_cma_pages(struct zone *zone)
#endif
}
+/* This is unstable unless you hold mem_hotplug_lock. */
static inline unsigned long zone_end_pfn(const struct zone *zone)
{
- return zone->zone_start_pfn + zone->spanned_pages;
+ return zone->zone_start_pfn + READ_ONCE(zone->spanned_pages);
}
+/* This is unstable unless you hold mem_hotplug_lock. */
static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
@@ -1033,9 +1035,10 @@ static inline bool zone_is_initialized(struct zone *zone)
return zone->initialized;
}
+/* This is unstable unless you hold mem_hotplug_lock. */
static inline bool zone_is_empty(struct zone *zone)
{
- return zone->spanned_pages == 0;
+ return READ_ONCE(zone->spanned_pages) == 0;
}
#ifndef BUILD_VDSO32_64
@@ -1485,10 +1488,13 @@ static inline bool managed_zone(struct zone *zone)
return zone_managed_pages(zone);
}
-/* Returns true if a zone has memory */
+/*
+ * Returns true if a zone has memory.
+ * This is unstable unless you hold mem_hotplug_lock.
+ */
static inline bool populated_zone(struct zone *zone)
{
- return zone->present_pages;
+ return READ_ONCE(zone->present_pages);
}
#ifdef CONFIG_NUMA
diff --git a/mm/compaction.c b/mm/compaction.c
index e731d45befc7..b8066d1fdcf5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2239,7 +2239,7 @@ static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
{
unsigned long score;
- score = zone->present_pages * fragmentation_score_zone(zone);
+ score = READ_ONCE(zone->present_pages) * fragmentation_score_zone(zone);
return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 431b1f6753c0..71b5e3d314a2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -463,6 +463,8 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
int nid = zone_to_nid(zone);
if (zone->zone_start_pfn == start_pfn) {
+ unsigned long old_end_pfn = zone_end_pfn(zone);
+
/*
* If the section is smallest section in the zone, it need
* shrink zone->zone_start_pfn and zone->zone_spanned_pages.
@@ -470,13 +472,13 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
* for shrinking zone.
*/
pfn = find_smallest_section_pfn(nid, zone, end_pfn,
- zone_end_pfn(zone));
+ old_end_pfn);
if (pfn) {
- zone->spanned_pages = zone_end_pfn(zone) - pfn;
+ WRITE_ONCE(zone->spanned_pages, old_end_pfn - pfn);
zone->zone_start_pfn = pfn;
} else {
zone->zone_start_pfn = 0;
- zone->spanned_pages = 0;
+ WRITE_ONCE(zone->spanned_pages, 0);
}
} else if (zone_end_pfn(zone) == end_pfn) {
/*
@@ -488,10 +490,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
start_pfn);
if (pfn)
- zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
+ WRITE_ONCE(zone->spanned_pages,
+ pfn - zone->zone_start_pfn + 1);
else {
zone->zone_start_pfn = 0;
- zone->spanned_pages = 0;
+ WRITE_ONCE(zone->spanned_pages, 0);
}
}
}
@@ -710,7 +713,8 @@ static void __meminit resize_zone_range(struct zone *zone, unsigned long start_p
if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
zone->zone_start_pfn = start_pfn;
- zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
+ WRITE_ONCE(zone->spanned_pages,
+ max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn);
}
static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn,
@@ -795,7 +799,7 @@ static void auto_movable_stats_account_zone(struct auto_movable_stats *stats,
struct zone *zone)
{
if (zone_idx(zone) == ZONE_MOVABLE) {
- stats->movable_pages += zone->present_pages;
+ stats->movable_pages += READ_ONCE(zone->present_pages);
} else {
stats->kernel_early_pages += zone->present_early_pages;
#ifdef CONFIG_CMA
@@ -1077,7 +1081,7 @@ void adjust_present_page_count(struct page *page, struct memory_group *group,
*/
if (early_section(__pfn_to_section(page_to_pfn(page))))
zone->present_early_pages += nr_pages;
- zone->present_pages += nr_pages;
+ WRITE_ONCE(zone->present_pages, zone->present_pages + nr_pages);
zone->zone_pgdat->node_present_pages += nr_pages;
if (group && movable)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c725618aeb58..ec66f2eadb95 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1540,7 +1540,7 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
for (z = 0; z < MAX_NR_ZONES; z++) {
struct zone *zone = pgdat->node_zones + z;
- zone->present_pages = 0;
+ WRITE_ONCE(zone->present_pages, 0);
zone_init_internals(zone, z, nid, 0);
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5116a2b9ea6e..1eb9000ec7d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5728,7 +5728,7 @@ __meminit void zone_pcp_init(struct zone *zone)
if (populated_zone(zone))
pr_debug(" %s zone: %lu pages, LIFO batch:%u\n", zone->name,
- zone->present_pages, zone_batchsize(zone));
+ READ_ONCE(zone->present_pages), zone_batchsize(zone));
}
void adjust_managed_page_count(struct page *page, long count)
diff --git a/mm/show_mem.c b/mm/show_mem.c
index bdb439551eef..667680a6107b 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -337,7 +337,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
- K(zone->present_pages),
+ K(READ_ONCE(zone->present_pages)),
K(zone_managed_pages(zone)),
K(zone_page_state(zone, NR_MLOCK)),
K(zone_page_state(zone, NR_BOUNCE)),
@@ -407,11 +407,11 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
for_each_populated_zone(zone) {
- total += zone->present_pages;
- reserved += zone->present_pages - zone_managed_pages(zone);
+ total += READ_ONCE(zone->present_pages);
+ reserved += READ_ONCE(zone->present_pages) - zone_managed_pages(zone);
if (is_highmem(zone))
- highmem += zone->present_pages;
+ highmem += READ_ONCE(zone->present_pages);
}
printk("%lu pages RAM\n", total);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8507c497218b..5a9c4b5768e5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1708,8 +1708,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
- zone->spanned_pages,
- zone->present_pages,
+ READ_ONCE(zone->spanned_pages),
+ READ_ONCE(zone->present_pages),
zone_managed_pages(zone),
zone_cma_pages(zone));
--
2.45.0.rc1.225.g2a3ae87e7f-goog
Hi Brendan,
On Tue, May 21, 2024 at 8:57 PM Brendan Jackman <[email protected]> wrote:
>
> These fields are written by memory hotplug under mem_hotplug_lock but
> read without any lock. It seems like reader code is robust against the
> value being stale or "from the future", but we also need to account
> for:
>
> 1. Load/store tearing (according to Linus[1], this really happens,
> even when everything is aligned as you would hope).
>
> 2. Invented loads[2] - the compiler can spill and re-read these fields
> ([2] calls this "invented loads") and assume that they have not
> changed.
>
> Note we don't need READ_ONCE in paths that have the mem_hotplug_lock
> for write, but we still need WRITE_ONCE to prevent store-tearing.
>
> [1] https://lore.kernel.org/all/CAHk-=wj2t+GK+DGQ7Xy6U7zMf72e7Jkxn4_-kGyfH3WFEoH+YQ@mail.gmail.com/T/#u
> As discovered via the original big-bad article[2]
> [2] https://lwn.net/Articles/793253/
>
> Signed-off-by: Brendan Jackman <[email protected]>
> ---
> include/linux/mmzone.h | 14 ++++++++++----
> mm/compaction.c | 2 +-
> mm/memory_hotplug.c | 20 ++++++++++++--------
> mm/mm_init.c | 2 +-
> mm/page_alloc.c | 2 +-
> mm/show_mem.c | 8 ++++----
> mm/vmstat.c | 4 ++--
> 7 files changed, 31 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 194ef7fed9d6..bdb3be76d10c 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1018,11 +1018,13 @@ static inline unsigned long zone_cma_pages(struct zone *zone)
> #endif
> }
>
> +/* This is unstable unless you hold mem_hotplug_lock. */
> static inline unsigned long zone_end_pfn(const struct zone *zone)
> {
> - return zone->zone_start_pfn + zone->spanned_pages;
> + return zone->zone_start_pfn + READ_ONCE(zone->spanned_pages);
> }
>
> +/* This is unstable unless you hold mem_hotplug_lock. */
> static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> {
> return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> @@ -1033,9 +1035,10 @@ static inline bool zone_is_initialized(struct zone *zone)
> return zone->initialized;
> }
>
> +/* This is unstable unless you hold mem_hotplug_lock. */
> static inline bool zone_is_empty(struct zone *zone)
> {
> - return zone->spanned_pages == 0;
> + return READ_ONCE(zone->spanned_pages) == 0;
> }
>
> #ifndef BUILD_VDSO32_64
> @@ -1485,10 +1488,13 @@ static inline bool managed_zone(struct zone *zone)
> return zone_managed_pages(zone);
> }
>
> -/* Returns true if a zone has memory */
> +/*
> + * Returns true if a zone has memory.
> + * This is unstable unless you hold mem_hotplug_lock.
> + */
> static inline bool populated_zone(struct zone *zone)
> {
> - return zone->present_pages;
> + return READ_ONCE(zone->present_pages);
> }
>
> #ifdef CONFIG_NUMA
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e731d45befc7..b8066d1fdcf5 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2239,7 +2239,7 @@ static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
> {
> unsigned long score;
>
> - score = zone->present_pages * fragmentation_score_zone(zone);
> + score = READ_ONCE(zone->present_pages) * fragmentation_score_zone(zone);
> return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
> }
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 431b1f6753c0..71b5e3d314a2 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -463,6 +463,8 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> int nid = zone_to_nid(zone);
>
> if (zone->zone_start_pfn == start_pfn) {
> + unsigned long old_end_pfn = zone_end_pfn(zone);
> +
> /*
> * If the section is smallest section in the zone, it need
> * shrink zone->zone_start_pfn and zone->zone_spanned_pages.
> @@ -470,13 +472,13 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> * for shrinking zone.
> */
> pfn = find_smallest_section_pfn(nid, zone, end_pfn,
> - zone_end_pfn(zone));
> + old_end_pfn);
> if (pfn) {
> - zone->spanned_pages = zone_end_pfn(zone) - pfn;
> + WRITE_ONCE(zone->spanned_pages, old_end_pfn - pfn);
> zone->zone_start_pfn = pfn;
> } else {
> zone->zone_start_pfn = 0;
> - zone->spanned_pages = 0;
> + WRITE_ONCE(zone->spanned_pages, 0);
> }
> } else if (zone_end_pfn(zone) == end_pfn) {
> /*
> @@ -488,10 +490,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
> start_pfn);
> if (pfn)
> - zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
> + WRITE_ONCE(zone->spanned_pages,
> + pfn - zone->zone_start_pfn + 1);
> else {
> zone->zone_start_pfn = 0;
> - zone->spanned_pages = 0;
> + WRITE_ONCE(zone->spanned_pages, 0);
> }
> }
> }
> @@ -710,7 +713,8 @@ static void __meminit resize_zone_range(struct zone *zone, unsigned long start_p
> if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
> zone->zone_start_pfn = start_pfn;
>
> - zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
> + WRITE_ONCE(zone->spanned_pages,
> + max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn);
> }
>
> static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn,
> @@ -795,7 +799,7 @@ static void auto_movable_stats_account_zone(struct auto_movable_stats *stats,
> struct zone *zone)
> {
> if (zone_idx(zone) == ZONE_MOVABLE) {
> - stats->movable_pages += zone->present_pages;
> + stats->movable_pages += READ_ONCE(zone->present_pages);
> } else {
> stats->kernel_early_pages += zone->present_early_pages;
> #ifdef CONFIG_CMA
> @@ -1077,7 +1081,7 @@ void adjust_present_page_count(struct page *page, struct memory_group *group,
> */
> if (early_section(__pfn_to_section(page_to_pfn(page))))
> zone->present_early_pages += nr_pages;
> - zone->present_pages += nr_pages;
> + WRITE_ONCE(zone->present_pages, zone->present_pages + nr_pages);
I'm not sure that using the WRITE_ONCE() wrapper would prevent load tearing
on 'zone->present_pages', but it's probably just me overthinking it :)
Thanks,
Lance
> zone->zone_pgdat->node_present_pages += nr_pages;
>
> if (group && movable)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index c725618aeb58..ec66f2eadb95 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1540,7 +1540,7 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
> for (z = 0; z < MAX_NR_ZONES; z++) {
> struct zone *zone = pgdat->node_zones + z;
>
> - zone->present_pages = 0;
> + WRITE_ONCE(zone->present_pages, 0);
> zone_init_internals(zone, z, nid, 0);
> }
> }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5116a2b9ea6e..1eb9000ec7d7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5728,7 +5728,7 @@ __meminit void zone_pcp_init(struct zone *zone)
>
> if (populated_zone(zone))
> pr_debug(" %s zone: %lu pages, LIFO batch:%u\n", zone->name,
> - zone->present_pages, zone_batchsize(zone));
> + READ_ONCE(zone->present_pages), zone_batchsize(zone));
> }
>
> void adjust_managed_page_count(struct page *page, long count)
> diff --git a/mm/show_mem.c b/mm/show_mem.c
> index bdb439551eef..667680a6107b 100644
> --- a/mm/show_mem.c
> +++ b/mm/show_mem.c
> @@ -337,7 +337,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
> K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
> K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
> K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
> - K(zone->present_pages),
> + K(READ_ONCE(zone->present_pages)),
> K(zone_managed_pages(zone)),
> K(zone_page_state(zone, NR_MLOCK)),
> K(zone_page_state(zone, NR_BOUNCE)),
> @@ -407,11 +407,11 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
>
> for_each_populated_zone(zone) {
>
> - total += zone->present_pages;
> - reserved += zone->present_pages - zone_managed_pages(zone);
> + total += READ_ONCE(zone->present_pages);
> + reserved += READ_ONCE(zone->present_pages) - zone_managed_pages(zone);
>
> if (is_highmem(zone))
> - highmem += zone->present_pages;
> + highmem += READ_ONCE(zone->present_pages);
> }
>
> printk("%lu pages RAM\n", total);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8507c497218b..5a9c4b5768e5 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1708,8 +1708,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> min_wmark_pages(zone),
> low_wmark_pages(zone),
> high_wmark_pages(zone),
> - zone->spanned_pages,
> - zone->present_pages,
> + READ_ONCE(zone->spanned_pages),
> + READ_ONCE(zone->present_pages),
> zone_managed_pages(zone),
> zone_cma_pages(zone));
>
>
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
>
Hi Lance, thanks for taking a look.
On Wed, May 22, 2024 at 12:25:30PM +0800, Lance Yang wrote:
> Hi Brendan,
>
> On Tue, May 21, 2024 at 8:57 PM Brendan Jackman <[email protected]> wrote:
> > @@ -1077,7 +1081,7 @@ void adjust_present_page_count(struct page *page, struct memory_group *group,
> > */
> > if (early_section(__pfn_to_section(page_to_pfn(page))))
> > zone->present_early_pages += nr_pages;
> > - zone->present_pages += nr_pages;
> > + WRITE_ONCE(zone->present_pages, zone->present_pages + nr_pages);
>
> I'm not sure that using the WRITE_ONCE() wrapper would prevent load tearing
> on 'zone->present_pages', but it's probably just me overthinking it :)
Hmm.. this isn't for load-tearing, it's for store-tearing. I have a
feeling I might be missing your point here though, can you elaborate?
I have just noticed that the original "big bad optimizing compiler"
article[1] only says store-tearing has been observed in the wild when
the value being stored can be split into immediates (i.e. is
constant). But it doesn't really seem wise to rely on that. From what
I can tell from tools/memory-model/Documentation you are really out in
the wild with unmarked accesses.
[1] https://lwn.net/Articles/793253
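For concreteness, the immediate-splitting case looks something like this
(illustrative only, assuming a 64-bit unsigned long; whether a given
compiler actually tears the plain store is of course not guaranteed):

#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

unsigned long global_flags;

void set_flags_plain(void)
{
        /*
         * A compiler may emit this as two 32-bit immediate stores, which
         * a concurrent lockless reader could observe half-done.
         */
        global_flags = 0x0123456789abcdefUL;
}

void set_flags_marked(void)
{
        /* Volatile access: in practice emitted as one aligned word store. */
        WRITE_ONCE(global_flags, 0x0123456789abcdefUL);
}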
On Tue, May 21, 2024 at 12:57:19PM +0000, Brendan Jackman wrote:
> These fields are written by memory hotplug under mem_hotplug_lock but
> read without any lock. It seems like reader code is robust against the
> value being stale or "from the future", but we also need to account
> for:
>
> 1. Load/store tearing (according to Linus[1], this really happens,
> even when everything is aligned as you would hope).
>
> 2. Invented loads[2] - the compiler can spill and re-read these fields
> ([2] calls this "invented loads") and assume that they have not
> changed.
>
> Note we don't need READ_ONCE in paths that have the mem_hotplug_lock
> for write, but we still need WRITE_ONCE to prevent store-tearing.
>
> [1] https://lore.kernel.org/all/CAHk-=wj2t+GK+DGQ7Xy6U7zMf72e7Jkxn4_-kGyfH3WFEoH+YQ@mail.gmail.com/T/#u
> As discovered via the original big-bad article[2]
> [2] https://lwn.net/Articles/793253/
>
> Signed-off-by: Brendan Jackman <[email protected]>
Oh, from a quick look it seems cma_pages would need this too.
present_early_pages seems fine.
I'll wait a few days in case anyone points out this whole thing is
garbage, then check more carefully and send a v2.
On Wed, May 22, 2024 at 4:38 PM Brendan Jackman <[email protected]> wrote:
>
> Hi Lance, thanks for taking a look.
>
> On Wed, May 22, 2024 at 12:25:30PM +0800, Lance Yang wrote:
> > Hi Brendan,
> >
> > On Tue, May 21, 2024 at 8:57 PM Brendan Jackman <[email protected]> wrote:
> > > @@ -1077,7 +1081,7 @@ void adjust_present_page_count(struct page *page, struct memory_group *group,
> > > */
> > > if (early_section(__pfn_to_section(page_to_pfn(page))))
> > > zone->present_early_pages += nr_pages;
> > > - zone->present_pages += nr_pages;
> > > + WRITE_ONCE(zone->present_pages, zone->present_pages + nr_pages);
> >
> > I'm not sure that using the WRITE_ONCE() wrapper would prevent load tearing
> > on 'zone->present_pages', but it's probably just me overthinking it :)
>
> Hmm.. this isn't for load-tearing, it's for store-tearing. I have a
> feeling I might be missing your point here though, can you elaborate?
Sorry, my explanation wasn't clear :(
I'm a bit confused about whether 'WRITE_ONCE(zone->present_pages,
zone->present_pages + nr_pages);'
is equivalent to the following:
1 a = zone->present_pages + nr_pages;
2 WRITE_ONCE(zone->present_pages, a);
If so, is there any possibility of load tearing on
'zone->present_pages' in line 1?
>
> I have just noticed that the original "big bad optimizing compiler"
> article[1] only says store-tearing has been observed in the wild when
> the value being stored can be split into immediates (i.e. is
> constant). But it doesn't really seem wise to rely on that. From what
> I can tell from tools/memory-model/Documentation you are really out in
> the wild with unmarked accesses.
>
> [1] https://lwn.net/Articles/793253
Thanks for clarifying!
Lance
On Wed, May 22, 2024 at 05:20:08PM +0800, Lance Yang wrote:
> On Wed, May 22, 2024 at 4:38 PM Brendan Jackman <[email protected]> wrote:
> >
> > Hi Lance, thanks for taking a look.
> >
> > On Wed, May 22, 2024 at 12:25:30PM +0800, Lance Yang wrote:
> > > Hi Brendan,
> > >
> > > On Tue, May 21, 2024 at 8:57 PM Brendan Jackman <[email protected]> wrote:
> > > > @@ -1077,7 +1081,7 @@ void adjust_present_page_count(struct page *page, struct memory_group *group,
> > > > */
> > > > if (early_section(__pfn_to_section(page_to_pfn(page))))
> > > > zone->present_early_pages += nr_pages;
> > > > - zone->present_pages += nr_pages;
> > > > + WRITE_ONCE(zone->present_pages, zone->present_pages + nr_pages);
> > >
> > > I'm not sure that using the WRITE_ONCE() wrapper would prevent load tearing
> > > on 'zone->present_pages', but it's probably just me overthinking it :)
> >
> > Hmm.. this isn't for load-tearing, it's for store-tearing. I have a
> > feeling I might be missing your point here though, can you elaborate?
>
> Sorry, my explanation wasn't clear :(
>
> I'm a bit confused about whether 'WRITE_ONCE(zone->present_pages,
> zone->present_pages + nr_pages);'
> is equivalent to the following:
>
> 1 a = zone->present_pages + nr_pages;
> 2 WRITE_ONCE(zone->present_pages, a);
>
> If so, is there any possibility of load tearing on
> 'zone->present_pages' in line 1?
Ah gotcha, thanks for clarifying. Loads are protected by
mem_hotplug_lock here, so it's fine for them to get split up (because
the value can't change between loads). This is what I was referring to
in the bit of the commit message about not needing READ_ONCE.
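i.e. roughly this, as a sketch of how that line decomposes:

        unsigned long tmp;

        /*
         * Plain load: fine, because every writer of present_pages holds
         * mem_hotplug_lock, so the value cannot change between the loads.
         */
        tmp = zone->present_pages + nr_pages;

        /*
         * Marked store: lockless readers can run concurrently, so keep the
         * compiler from tearing or deferring the store.
         */
        WRITE_ONCE(zone->present_pages, tmp);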
On Wed, May 22, 2024 at 6:10 PM Brendan Jackman <[email protected]> wrote:
>
> On Wed, May 22, 2024 at 05:20:08PM +0800, Lance Yang wrote:
> > On Wed, May 22, 2024 at 4:38 PM Brendan Jackman <[email protected]> wrote:
> > >
> > > Hi Lance, thanks for taking a look.
> > >
> > > On Wed, May 22, 2024 at 12:25:30PM +0800, Lance Yang wrote:
> > > > Hi Brendan,
> > > >
> > > > On Tue, May 21, 2024 at 8:57 PM Brendan Jackman <[email protected]> wrote:
> > > > > @@ -1077,7 +1081,7 @@ void adjust_present_page_count(struct page *page, struct memory_group *group,
> > > > > */
> > > > > if (early_section(__pfn_to_section(page_to_pfn(page))))
> > > > > zone->present_early_pages += nr_pages;
> > > > > - zone->present_pages += nr_pages;
> > > > > + WRITE_ONCE(zone->present_pages, zone->present_pages + nr_pages);
> > > >
> > > > I'm not sure that using the WRITE_ONCE() wrapper would prevent load tearing
> > > > on 'zone->present_pages', but it's probably just me overthinking it :)
> > >
> > > Hmm.. this isn't for load-tearing, it's for store-tearing. I have a
> > > feeling I might be missing your point here though, can you elaborate?
> >
> > Sorry, my explanation wasn't clear :(
> >
> > I'm a bit confused about whether 'WRITE_ONCE(zone->present_pages,
> > zone->present_pages + nr_pages);'
> > is equivalent to the following:
> >
> > 1 a = zone->present_pages + nr_pages;
> > 2 WRITE_ONCE(zone->present_pages, a);
> >
> > If so, is there any possibility of load tearing on
> > 'zone->present_pages' in line 1?
>
> Ah gotcha, thanks for clarifying. Loads are protected by
> mem_hotplug_lock here, so it's fine for them to get split up (because
> the value can't change between loads). This is what I was referring to
> in the bit of the commit message about not needing READ_ONCE.
I see, thanks again for clarifying!
Lance
On 21.05.24 14:57, Brendan Jackman wrote:
> These fields are written by memory hotplug under mem_hotplug_lock but
> read without any lock. It seems like reader code is robust against the
> value being stale or "from the future", but we also need to account
> for:
>
> 1. Load/store tearing (according to Linus[1], this really happens,
> even when everything is aligned as you would hope).
>
> 2. Invented loads[2] - the compiler can spill and re-read these fields
> ([2] calls this "invented loads") and assume that they have not
> changed.
>
> Note we don't need READ_ONCE in paths that have the mem_hotplug_lock
> for write, but we still need WRITE_ONCE to prevent store-tearing.
>
> [1] https://lore.kernel.org/all/CAHk-=wj2t+GK+DGQ7Xy6U7zMf72e7Jkxn4_-kGyfH3WFEoH+YQ@mail.gmail.com/T/#u
> As discovered via the original big-bad article[2]
> [2] https://lwn.net/Articles/793253/
>
> Signed-off-by: Brendan Jackman <[email protected]>
> ---
> include/linux/mmzone.h | 14 ++++++++++----
> mm/compaction.c | 2 +-
> mm/memory_hotplug.c | 20 ++++++++++++--------
> mm/mm_init.c | 2 +-
> mm/page_alloc.c | 2 +-
> mm/show_mem.c | 8 ++++----
> mm/vmstat.c | 4 ++--
> 7 files changed, 31 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 194ef7fed9d6..bdb3be76d10c 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1018,11 +1018,13 @@ static inline unsigned long zone_cma_pages(struct zone *zone)
> #endif
> }
>
> +/* This is unstable unless you hold mem_hotplug_lock. */
> static inline unsigned long zone_end_pfn(const struct zone *zone)
> {
> - return zone->zone_start_pfn + zone->spanned_pages;
> + return zone->zone_start_pfn + READ_ONCE(zone->spanned_pages);
It's weird to apply that logic only to spanned_pages, whereby
zone_start_pfn can (and will) similarly change when onlining/offlining
memory.
--
Cheers,
David / dhildenb
On 21.05.24 14:57, Brendan Jackman wrote:
> It seems that [1] was acked, and a v2 was written[2] which improved
> upon it, but got bogged down in discussion of other topics, so the
> improvements were not included. Then [1] got merged as commit
> 27cacaad16c5 ("mm,memory_hotplug: drop unneeded locking") and we ended
> up with locks that get taken for read but never for write.
>
> So, let's remove the read locking.
>
> Compared to Oscar's original v2[2], I have added a READ_ONCE in
> page_outside_zone_boundaries; this is a substitute for the compiler
> barrier that was implied by read_seqretry(). I believe this is necessary
> to insure against UB, although the value being read here is only used
> for a printk so the stakes seem very low (and this is all debug code
> anyway). I believe a compiler barrier is also needed in zone_spans_pfn,
> but I'll address that in a separate patch.
>
> That read_seqretry() also implied a CPU-level memory barrier, which I
> don't think needs replacing: page_outside_zone_boundaries() is used in
> the alloc and free paths, but you can't allocate or free pages from
> the span that is in the middle of being added/removed by hotplug.
>
> In other words, page_outside_zone_boundaries() doesn't require a
> strictly up-to-date view of spanned_pages, but I think it does require
> a value that was once/will eventually be correct, hence READ_ONCE.
>
> [1] https://lore.kernel.org/all/[email protected]/T/#u
> [2] https://lore.kernel.org/linux-mm/[email protected]/#t
>
> Cc: David Hildenbrand <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Anshuman Khandual <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Co-developed-by: Oscar Salvador <[email protected]>
> Signed-off-by: Oscar Salvador <[email protected]>
> Signed-off-by: Brendan Jackman <[email protected]>
> ---
> include/linux/memory_hotplug.h | 35 -----------------------------------
> include/linux/mmzone.h | 23 +++++------------------
> mm/mm_init.c | 1 -
> mm/page_alloc.c | 10 +++-------
> 4 files changed, 8 insertions(+), 61 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 7a9ff464608d..f9577e67e5ee 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -141,31 +141,7 @@ bool mhp_supports_memmap_on_memory(void);
>
> /*
> * Zone resizing functions
> - *
> - * Note: any attempt to resize a zone should has pgdat_resize_lock()
> - * zone_span_writelock() both held. This ensure the size of a zone
> - * can't be changed while pgdat_resize_lock() held.
> */
> -static inline unsigned zone_span_seqbegin(struct zone *zone)
> -{
> - return read_seqbegin(&zone->span_seqlock);
> -}
> -static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
> -{
> - return read_seqretry(&zone->span_seqlock, iv);
> -}
> -static inline void zone_span_writelock(struct zone *zone)
> -{
> - write_seqlock(&zone->span_seqlock);
> -}
> -static inline void zone_span_writeunlock(struct zone *zone)
> -{
> - write_sequnlock(&zone->span_seqlock);
> -}
> -static inline void zone_seqlock_init(struct zone *zone)
> -{
> - seqlock_init(&zone->span_seqlock);
> -}
> extern void adjust_present_page_count(struct page *page,
> struct memory_group *group,
> long nr_pages);
> @@ -251,17 +227,6 @@ static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat)
> ___page; \
> })
>
> -static inline unsigned zone_span_seqbegin(struct zone *zone)
> -{
> - return 0;
> -}
> -static inline int zone_span_seqretry(struct zone *zone, unsigned iv)
> -{
> - return 0;
> -}
> -static inline void zone_span_writelock(struct zone *zone) {}
> -static inline void zone_span_writeunlock(struct zone *zone) {}
> -static inline void zone_seqlock_init(struct zone *zone) {}
>
> static inline int try_online_node(int nid)
> {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8f9c9590a42c..194ef7fed9d6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -14,7 +14,6 @@
> #include <linux/threads.h>
> #include <linux/numa.h>
> #include <linux/init.h>
> -#include <linux/seqlock.h>
> #include <linux/nodemask.h>
> #include <linux/pageblock-flags.h>
> #include <linux/page-flags-layout.h>
> @@ -896,18 +895,11 @@ struct zone {
> *
> * Locking rules:
> *
> - * zone_start_pfn and spanned_pages are protected by span_seqlock.
> - * It is a seqlock because it has to be read outside of zone->lock,
> - * and it is done in the main allocator path. But, it is written
> - * quite infrequently.
> - *
> - * The span_seq lock is declared along with zone->lock because it is
> - * frequently read in proximity to zone->lock. It's good to
> - * give them a chance of being in the same cacheline.
> - *
> - * Write access to present_pages at runtime should be protected by
> - * mem_hotplug_begin/done(). Any reader who can't tolerant drift of
> - * present_pages should use get_online_mems() to get a stable value.
> + * Besides system initialization functions, memory-hotplug is the only
> + * user that can change zone's {spanned,present} pages at runtime, and
> + * it does so by holding the mem_hotplug_lock lock. Any readers who
> + * can't tolerate drift values should use {get,put}_online_mems to get
> + * a stable value.
> */
> atomic_long_t managed_pages;
> unsigned long spanned_pages;
> @@ -930,11 +922,6 @@ struct zone {
> unsigned long nr_isolate_pageblock;
> #endif
>
> -#ifdef CONFIG_MEMORY_HOTPLUG
> - /* see spanned/present_pages for more description */
> - seqlock_t span_seqlock;
> -#endif
> -
> int initialized;
>
> /* Write-intensive fields used from the page allocator */
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index f72b852bd5b8..c725618aeb58 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1383,7 +1383,6 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
> zone->name = zone_names[idx];
> zone->zone_pgdat = NODE_DATA(nid);
> spin_lock_init(&zone->lock);
> - zone_seqlock_init(zone);
> zone_pcp_init(zone);
> }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5675ca..5116a2b9ea6e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -426,16 +426,12 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
> static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
> {
> int ret;
> - unsigned seq;
> unsigned long pfn = page_to_pfn(page);
> unsigned long sp, start_pfn;
>
> - do {
> - seq = zone_span_seqbegin(zone);
> - start_pfn = zone->zone_start_pfn;
> - sp = zone->spanned_pages;
> - ret = !zone_spans_pfn(zone, pfn);
> - } while (zone_span_seqretry(zone, seq));
> + start_pfn = zone->zone_start_pfn;
> + sp = READ_ONCE(zone->spanned_pages);
> + ret = !zone_spans_pfn(zone, pfn);
The old seqlock guaranteed that we would have obtained consistent values
here. start + spanned_pages defines a range. For example, growing a zone
to the beginning implies that both ranges must be changed.
I do wonder if it might be better to instead have zone->zone_start_pfn
and zone->zone_end_pfn. That way, both can be changed individually, not
requiring adjustment of both to grow/shrink a zone at the beginning.
Then, using READ_ONCE() on both might actually make sense ...
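Something like the following, purely as a sketch of the idea (a stored
zone_end_pfn field is hypothetical, it does not exist today):

/*
 * Hypothetical layout: store both ends, each updated with WRITE_ONCE()
 * by the hotplug path, so a lockless reader sees two values that were
 * each valid at some point in time.
 */
static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
        unsigned long start = READ_ONCE(zone->zone_start_pfn);
        unsigned long end = READ_ONCE(zone->zone_end_pfn);      /* hypothetical */

        return start <= pfn && pfn < end;
}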
--
Cheers,
David / dhildenb
On Wed, May 22, 2024 at 04:05:12PM +0200, David Hildenbrand wrote:
> On 21.05.24 14:57, Brendan Jackman wrote:
> > + return zone->zone_start_pfn + READ_ONCE(zone->spanned_pages);
>
> It's weird to apply that logic only to spanned_pages, whereby zone_start_pfn
> can (and will) similarly change when onlining/offlining memory.
>
Oh, yep. For some reason I had decided that zone_start_pfn was fixed
but that is (actually very obviously) not true!
Will take a closer look and extend v2 to cover that too, unless
someone finds a reason this whole patch is nonsense.
Thanks for the review.
On Wed, May 22, 2024 at 04:09:41PM +0200, David Hildenbrand wrote:
> On 21.05.24 14:57, Brendan Jackman wrote:
> The old seqlock guaranteed that we would have obtained consistent values
> here. start + spanned_pages defines a range. For example, growing a zone to
> the beginning implies that both ranges must be changed.
>
> I do wonder if it might be better to instead have zone->zone_start_pfn and
> zone->zone_end_pfn. That way, both can be changed individually, not
> requiring adjustment of both to grow/shrink a zone at the beginning.
Thanks this is a good point.
So basically the fact that spanned_pages is "once or eventually"
correct is certainly not enough because it only has meaning with
reference to zone_start_pfn. I didn't realise this because of my
spontaneous inspiration to believe that zone_start_pfn was fixed.
By the way, some noob questions: am I OK with my assumption that it's
fine for reader code to operate on zone spans that are both stale and
"from the future"? thinking abstractly I guess that seeing a stale
value when racing with offline_pages is roughly the same as seeing a
value "from the future" when racing with online_pages?
Also, is it ever possible for pages to get removed and then added back
and end up in a different zone than before?
On 22.05.24 16:27, Brendan Jackman wrote:
> On Wed, May 22, 2024 at 04:09:41PM +0200, David Hildenbrand wrote:
>> On 21.05.24 14:57, Brendan Jackman wrote:
>> The old seqlock guaranteed that we would have obtained consistent values
>> here. start + spanned_pages defines a range. For example, growing a zone to
>> the beginning implies that both ranges must be changed.
>>
>> I do wonder if it might be better to instead have zone->zone_start_pfn and
>> zone->zone_end_pfn. That way, both can be changed individually, not
>> requiring adjustment of both to grow/shrink a zone at the beginning.
>
> Thanks this is a good point.
>
> So basically the fact that spanned_pages is "once or eventually"
> correct is certainly not enough because it only has meaning with
> reference to zone_start_pfn. I didn't realise this because of my
> spontaneous inspiration to believe that zone_start_pfn was fixed.
Right, it isn't.
>
> By the way, some noob questions: am I OK with my assumption that it's
> fine for reader code to operate on zone spans that are both stale and
> "from the future"? thinking abstractly I guess that seeing a stale
> value when racing with offline_pages is roughly the same as seeing a
> value "from the future" when racing with online_pages?
Right. PFN walkers should be using pfn_to_online_page(), where races are
possible but barely seen in practice.
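(The usual pattern is roughly the sketch below, with the per-page work
and error handling elided.)

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page = pfn_to_online_page(pfn);

                if (!page)
                        continue;       /* hole, or offline / mid-hotplug */

                /* ... inspect or handle "page" ... */
        }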
zone handlers like mm/compaction.c can likely deal with races, although
it might all be cleaner (and safer?) when using start+end. I recall it
also relies on pfn_to_online_page().
Regarding page_outside_zone_boundaries(), it should be fine if we can
read start+end atomically, that way we would not accidentally report
"page outside ..." when changing the start address. I think with your
current patch that might happen (although likely extremely hard to
trigger) when growing the zone at the start, reducing zone_start_pfn.
>
> Also, is it ever possible for pages to get removed and then added back
> and end up in a different zone than before?
Yes. Changing between MOVABLE and NORMAL is possible and can easily be
triggered by offlining+re-onlining memory blocks.
--
Cheers,
David / dhildenb
On Wed, May 22, 2024 at 05:24:17PM +0200, David Hildenbrand wrote:
> On 22.05.24 16:27, Brendan Jackman wrote:
> > On Wed, May 22, 2024 at 04:09:41PM +0200, David Hildenbrand wrote:
> > By the way, some noob questions: am I OK with my assumption that it's
> > fine for reader code to operate on zone spans that are both stale and
> > "from the future"? thinking abstractly I guess that seeing a stale
> > value when racing with offline_pages is roughly the same as seeing a
> > value "from the future" when racing with online_pages?
>
> Right. PFN walkers should be using pfn_to_online_page(), where races are
> possible but barely seen in practice.
>
> zone handlers like mm/compaction.c can likely deal with races, although it
> might all be cleaner (and safer?) when using start+end. I recall it also
> relies on pfn_to_online_page().
>
> Regarding page_outside_zone_boundaries(), it should be fine if we can read
> start+end atomically, that way we would not accidentally report "page
> outside ..." when changing the start address. I think with your current
> patch that might happen (although likely extremely hard to trigger) when
> growing the zone at the start, reducing zone_start_pfn.
Thanks a lot, this is very helpful
> > Also, is it ever possible for pages to get removed and then added back
> > and end up in a different zone than before?
>
> Yes. Changing between MOVABLE and NORMAL is possible and can easily be
> triggered by offlining+re-onlining memory blocks.
So, even if we make it impossible to see a totally bogus zone span,
you can observe a stale/futuristic span which currently contains pages
from a different zone?
That seems to imply you could look up a page from a PFN within
zone A's apparent span, lock zone A and assume you can safely modify
the freelist the page is on, but actually that page is now in zone B.
So for example:
1. compact_zone() sets cc->free_pfn based on zone_end_pfn
2. isolate_freepages() sets isolate_start_pfn = cc->free_pfn
3. isolate_freepages_block() looks up a page based on that PFN
4. ... then takes the cc->zone lock
5. ... then calls __isolate_free_page which removes the page from
whatever freelist it's on.
Is anything stopping part 5 from modifying a zone that wasn't locked
in part 4?
On 24.05.24 14:02, Brendan Jackman wrote:
> On Wed, May 22, 2024 at 05:24:17PM +0200, David Hildenbrand wrote:
>> On 22.05.24 16:27, Brendan Jackman wrote:
>>> On Wed, May 22, 2024 at 04:09:41PM +0200, David Hildenbrand wrote:
>
>>> By the way, some noob questions: am I OK with my assumption that it's
>>> fine for reader code to operate on zone spans that are both stale and
>>> "from the future"? thinking abstractly I guess that seeing a stale
>>> value when racing with offline_pages is roughly the same as seeing a
>>> value "from the future" when racing with online_pages?
>>
>> Right. PFN walkers should be using pfn_to_online_page(), where races are
>> possible but barely seen in practice.
>>
>> zone handlers like mm/compaction.c can likely deal with races, although it
>> might all be cleaner (and safer?) when using start+end. I recall it also
>> recalls on pfn_to_online_page().
>>
>> Regarding page_outside_zone_boundaries(), it should be fine if we can read
>> start+end atomically, that way we would not accidentally report "page
>> outside ..." when changing the start address. I think with your current
>> patch that might happen (although likely extremely hard to trigger) when
>> growing the zone at the start, reducing zone_start_pfn.
>
> Thanks a lot, this is very helpful
>
>>> Also, is it ever possible for pages to get removed and then added back
>>> and end up in a different zone than before?
>>
>> Yes. Changing between MOVABLE and NORMAL is possible and can easily be
>> triggered by offlining+re-onlining memory blocks.
>
> So, even if we make it impossible to see a totally bogus zone span,
> you can observe a stale/futuristic span which currently contains pages
> from a different zone?
Yes. Note that zones/nodes can easily overlap, so a page being spanned by
another zone is common and supported already.
>
> That seems to imply you could look up a page from a PFN within
> zone A's apparent span, lock zone A and assume you can safely modify
> the freelist the page is on, but actually that page is now in zone B.
That's why we obtain the zone/node always from the page itself (stored in page
flags). This data can only change when offlining+reonlining memory (and
pfn_to_online_page() would refuse to hand out the page while it is temporarily offline).
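(That is, the freelist code trusts page_zone(), which derives the zone
from the node/zone bits in the page flags, roughly:)

static inline struct zone *page_zone(const struct page *page)
{
        /* The zone comes from the page itself, not from any span lookup. */
        return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}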
There were discussions around using RCU to improve pfn_to_online_page() racing
with memory offlining, but the motivation to do that has been rather small: we
barely see such races in practice. Memory offlining+re-onlining simply takes too
long.
>
> So for example:
>
> 1. compact_zone() sets cc->free_pfn based on zone_end_pfn
> 2. isolate_freepages() sets isolate_start_pfn = cc->free_pfn
> 3. isolate_freepages_block() looks up a page based on that PFN
> 4. ... then takes the cc->zone lock
> 5. ... then calls __isolate_free_page which removes the page from
> whatever freelist it's on.
>
> Is anything stopping part 5 from modifying a zone that wasn't locked
> in part 4?
Likely that overlapping zones already exist and are handled accordingly.
--
Thanks,
David / dhildenb
On Wed, May 22, 2024 at 02:11:16PM +0000, Brendan Jackman wrote:
> On Wed, May 22, 2024 at 04:05:12PM +0200, David Hildenbrand wrote:
> > On 21.05.24 14:57, Brendan Jackman wrote:
> > > + return zone->zone_start_pfn + READ_ONCE(zone->spanned_pages);
> >
> > It's weird to apply that logic only to spanned_pages, whereby zone_start_pfn
> > can (and will) similarly change when onlining/offlining memory.
> >
> Oh, yep. For some reason I had decided that zone_start_pfn was fixed
> but that is (actually very obviously) not true!
>
> Will take a closer look and extend v2 to cover that too, unless
> someone finds a reason this whole patch is nonsense.
>
> Thanks for the review.
Hmm so while poking around during spare moments this week I learned
that compaction.c also stores a bunch of data in struct zone that is
unsynchronized.
It seems pretty unlikely that you can corrupt any memory there (unless
there's some race possible with pfn_to_online_page, which is an
orthogonal question), but it does seem like if the compiler gets smart
with us we could maybe have a compaction run that takes quasi-forever
or something weird like that.
It seems easy enough to just spam READ_ONCE/WRITE_ONCE everywhere
there too; this would remove that risk, make KCSAN happy and serve as
a kinda "this is unsynchronized, take care" comment. (There's also at
least one place where we could put data_race()).
On the other hand it's a bit verbose & visually ugly. Personally I
think it's a pretty minor downside, but anyone feel differently?
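E.g. for the cached scan positions it might look roughly like this
(just a sketch, the exact fields would need auditing):

        /* Compaction writer: the store can race with lockless readers. */
        WRITE_ONCE(zone->compact_cached_free_pfn, free_pfn);

        /*
         * A reader that only feeds heuristics and tolerates a racy value;
         * data_race() documents that and keeps KCSAN quiet.
         */
        unsigned long cached = data_race(zone->compact_cached_free_pfn);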