2024-03-25 16:45:31

by Baoquan He

Subject: [PATCH v2 0/6] mm/mm_init.c: refactor free_area_init_core()

In free_area_init_core(), the code that calculates zone->managed_pages and
subtracts dma_reserve from the DMA zone is very confusing.

From the git history, the code calculating zone->managed_pages was
originally for zone->present_pages. That early rough assignment was meant
to optimize the zone's pcp and watermark setup. Later, managed_pages was
introduced into struct zone to represent the number of pages managed by
the buddy allocator. Nowadays, zone->managed_pages is zeroed out and reset
in mem_init() when memblock_free_all() is called, and the zone's pcp and
wmark setup, which relies on the actual zone->managed_pages, is done later
than the mem_init() invocation. So we don't need to rush to calculate and
set zone->managed_pages early; just set it to zone->present_pages and
adjust it in mem_init().

Also add a new function, calc_nr_kernel_pages(), to count the free but not
reserved pages in memblock, and assign the result to nr_all_pages and
nr_kernel_pages after the memmap pages have been allocated.

Changelog:
----------
v1->v2:
=======
These are all suggested by Mike, thanks to him.
- Swap the order of patches 1 and 2 of v1 to describe the code change
  better.
- Change to initialize zone->managed_pages as 0 in free_area_init_core(),
  since no page has been added into the buddy system at that point, and
  improve the ambiguous description in the log. These changes are all in
  patch 4.

Baoquan He (6):
x86: remove unneeded memblock_find_dma_reserve()
mm/mm_init.c: remove the useless dma_reserve
mm/mm_init.c: add new function calc_nr_all_pages()
mm/mm_init.c: remove meaningless calculation of zone->managed_pages in
free_area_init_core()
mm/mm_init.c: remove unneeded calc_memmap_size()
mm/mm_init.c: remove arch_reserved_kernel_pages()

arch/powerpc/include/asm/mmu.h | 4 --
arch/powerpc/kernel/fadump.c | 5 --
arch/x86/include/asm/pgtable.h | 1 -
arch/x86/kernel/setup.c | 2 -
arch/x86/mm/init.c | 47 -------------
include/linux/mm.h | 4 --
mm/mm_init.c | 125 ++++++++-------------------------
7 files changed, 29 insertions(+), 159 deletions(-)

--
2.41.0



2024-03-25 16:46:01

by Baoquan He

Subject: [PATCH v2 2/6] mm/mm_init.c: remove the useless dma_reserve

Now nobody calls set_dma_reserve() to set a value for dma_reserve. Remove
set_dma_reserve(), the global variable dma_reserve, and the code using it.

Signed-off-by: Baoquan He <[email protected]>
---
include/linux/mm.h | 1 -
mm/mm_init.c | 23 -----------------------
2 files changed, 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0436b919f1c7..ad19350e1538 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3210,7 +3210,6 @@ static inline int early_pfn_to_nid(unsigned long pfn)
extern int __meminit early_pfn_to_nid(unsigned long pfn);
#endif

-extern void set_dma_reserve(unsigned long new_dma_reserve);
extern void mem_init(void);
extern void __init mmap_init(void);

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..153fb2dc666f 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -226,7 +226,6 @@ static unsigned long required_movablecore_percent __initdata;

static unsigned long nr_kernel_pages __initdata;
static unsigned long nr_all_pages __initdata;
-static unsigned long dma_reserve __initdata;

static bool deferred_struct_pages __meminitdata;

@@ -1583,12 +1582,6 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
zone_names[j], memmap_pages, freesize);
}

- /* Account for reserved pages */
- if (j == 0 && freesize > dma_reserve) {
- freesize -= dma_reserve;
- pr_debug(" %s zone: %lu pages reserved\n", zone_names[0], dma_reserve);
- }
-
if (!is_highmem_idx(j))
nr_kernel_pages += freesize;
/* Charge for highmem memmap if there are enough kernel pages */
@@ -2547,22 +2540,6 @@ void *__init alloc_large_system_hash(const char *tablename,
return table;
}

-/**
- * set_dma_reserve - set the specified number of pages reserved in the first zone
- * @new_dma_reserve: The number of pages to mark reserved
- *
- * The per-cpu batchsize and zone watermarks are determined by managed_pages.
- * In the DMA zone, a significant percentage may be consumed by kernel image
- * and other unfreeable allocations which can skew the watermarks badly. This
- * function may optionally be used to account for unfreeable pages in the
- * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and
- * smaller per-cpu batchsize.
- */
-void __init set_dma_reserve(unsigned long new_dma_reserve)
-{
- dma_reserve = new_dma_reserve;
-}
-
void __init memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
--
2.41.0


2024-03-25 16:46:26

by Baoquan He

Subject: [PATCH v2 5/6] mm/mm_init.c: remove unneeded calc_memmap_size()

Nobody calls calc_memmap_size() now.

Signed-off-by: Baoquan He <[email protected]>
---
mm/mm_init.c | 20 --------------------
1 file changed, 20 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f71e56e83f3..e269a724f70e 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1331,26 +1331,6 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
}

-static unsigned long __init calc_memmap_size(unsigned long spanned_pages,
- unsigned long present_pages)
-{
- unsigned long pages = spanned_pages;
-
- /*
- * Provide a more accurate estimation if there are holes within
- * the zone and SPARSEMEM is in use. If there are holes within the
- * zone, each populated memory region may cost us one or two extra
- * memmap pages due to alignment because memmap pages for each
- * populated regions may not be naturally aligned on page boundary.
- * So the (present_pages >> 4) heuristic is a tradeoff for that.
- */
- if (spanned_pages > present_pages + (present_pages >> 4) &&
- IS_ENABLED(CONFIG_SPARSEMEM))
- pages = present_pages;
-
- return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void pgdat_init_split_queue(struct pglist_data *pgdat)
{
--
2.41.0


2024-03-25 16:46:32

by Baoquan He

Subject: [PATCH v2 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

Currently, in free_area_init_core(), when the zone's fields are
initialized, a rough value is set for zone->managed_pages. That value is
calculated as (zone->present_pages - memmap_pages).

In the meantime, the value is added to nr_all_pages and nr_kernel_pages,
which represent the free pages of the system (nr_kernel_pages counting low
memory only, nr_all_pages also including HIGHMEM memory). Both of them will
be used in alloc_large_system_hash().

However, the rough calculation and setting of zone->managed_pages is
meaningless because
a) memmap pages are allocated in units of node in sparse_init() or
   alloc_node_mem_map(pgdat); the simple (zone->present_pages -
   memmap_pages) is too rough to make sense for a zone;
b) the zone->managed_pages set here will be zeroed out and reset to the
   actual value in mem_init() via memblock_free_all(). Before that
   resetting, no buddy allocation request is issued.

Here, remove the meaningless and complicated calculation of
(zone->present_pages - memmap_pages) and initialize zone->managed_pages
as 0, which reflects its actual value since no page has been added into the
buddy system yet. It will be reset in mem_init().

Also remove the assignment of nr_all_pages and nr_kernel_pages in
free_area_init_core(). Instead, call the newly added calc_nr_kernel_pages()
to count all free but not reserved memory in memblock and assign the result
to nr_all_pages and nr_kernel_pages. That counting excludes memmap_pages
and other data used by the kernel, which is more accurate than the old way
and simpler, and also covers the arch_reserved_kernel_pages() case required
by ppc.

Also clean up the outdated code comment above free_area_init_core().
free_area_init_core() is easy to understand now, so there is no need to add
more words to explain it.

Signed-off-by: Baoquan He <[email protected]>
---
mm/mm_init.c | 46 +++++-----------------------------------------
1 file changed, 5 insertions(+), 41 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index c57a7fc97a16..7f71e56e83f3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1565,15 +1565,6 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
}
#endif

-/*
- * Set up the zone data structures:
- * - mark all pages reserved
- * - mark all memory queues empty
- * - clear the memory bitmaps
- *
- * NOTE: pgdat should get zeroed by caller.
- * NOTE: this function is only called during early init.
- */
static void __init free_area_init_core(struct pglist_data *pgdat)
{
enum zone_type j;
@@ -1584,41 +1575,13 @@ static void __init free_area_init_core(struct pglist_data *pgdat)

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, freesize, memmap_pages;
-
- size = zone->spanned_pages;
- freesize = zone->present_pages;
-
- /*
- * Adjust freesize so that it accounts for how much memory
- * is used by this zone for memmap. This affects the watermark
- * and per-cpu initialisations
- */
- memmap_pages = calc_memmap_size(size, freesize);
- if (!is_highmem_idx(j)) {
- if (freesize >= memmap_pages) {
- freesize -= memmap_pages;
- if (memmap_pages)
- pr_debug(" %s zone: %lu pages used for memmap\n",
- zone_names[j], memmap_pages);
- } else
- pr_warn(" %s zone: %lu memmap pages exceeds freesize %lu\n",
- zone_names[j], memmap_pages, freesize);
- }
-
- if (!is_highmem_idx(j))
- nr_kernel_pages += freesize;
- /* Charge for highmem memmap if there are enough kernel pages */
- else if (nr_kernel_pages > memmap_pages * 2)
- nr_kernel_pages -= memmap_pages;
- nr_all_pages += freesize;
+ unsigned long size = zone->spanned_pages;

/*
- * Set an approximate value for lowmem here, it will be adjusted
- * when the bootmem allocator frees pages into the buddy system.
- * And all highmem pages will be managed by the buddy system.
+ * Initialize zone->managed_pages as 0 , it will be reset
+ * when memblock allocator frees pages into buddy system.
*/
- zone_init_internals(zone, j, nid, freesize);
+ zone_init_internals(zone, j, nid, 0);

if (!size)
continue;
@@ -1915,6 +1878,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
check_for_memory(pgdat);
}

+ calc_nr_kernel_pages();
memmap_init();

/* disable hash distribution for systems with a single node */
--
2.41.0


2024-03-25 16:48:18

by Baoquan He

Subject: [PATCH v2 6/6] mm/mm_init.c: remove arch_reserved_kernel_pages()

Since the current calculation in calc_nr_kernel_pages() already takes
kernel reserved memory into consideration, there is no need to have
arch_reserved_kernel_pages() any more.

Signed-off-by: Baoquan He <[email protected]>
---
arch/powerpc/include/asm/mmu.h | 4 ----
arch/powerpc/kernel/fadump.c | 5 -----
include/linux/mm.h | 3 ---
mm/mm_init.c | 12 ------------
4 files changed, 24 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index 3b72c7ed24cf..aa5c0fd5edb1 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -406,9 +406,5 @@ extern void *abatron_pteptrs[2];
#include <asm/nohash/mmu.h>
#endif

-#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP)
-#define __HAVE_ARCH_RESERVED_KERNEL_PAGES
-#endif
-
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_MMU_H_ */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d14eda1e8589..ae8c7619e597 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1735,8 +1735,3 @@ static void __init fadump_reserve_crash_area(u64 base)
memblock_reserve(mstart, msize);
}
}
-
-unsigned long __init arch_reserved_kernel_pages(void)
-{
- return memblock_reserved_size() / PAGE_SIZE;
-}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad19350e1538..ab1ba0a31429 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3221,9 +3221,6 @@ static inline void show_mem(void)
extern long si_mem_available(void);
extern void si_meminfo(struct sysinfo * val);
extern void si_meminfo_node(struct sysinfo *val, int nid);
-#ifdef __HAVE_ARCH_RESERVED_KERNEL_PAGES
-extern unsigned long arch_reserved_kernel_pages(void);
-#endif

extern __printf(3, 4)
void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index e269a724f70e..089dc60159e9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2373,17 +2373,6 @@ void __init page_alloc_init_late(void)
page_alloc_sysctl_init();
}

-#ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
-/*
- * Returns the number of pages that arch has reserved but
- * is not known to alloc_large_system_hash().
- */
-static unsigned long __init arch_reserved_kernel_pages(void)
-{
- return 0;
-}
-#endif
-
/*
* Adaptive scale is meant to reduce sizes of hash tables on large memory
* machines. As memory size is increased the scale is also increased but at
@@ -2426,7 +2415,6 @@ void *__init alloc_large_system_hash(const char *tablename,
if (!numentries) {
/* round applicable memory size up to nearest megabyte */
numentries = nr_kernel_pages;
- numentries -= arch_reserved_kernel_pages();

/* It isn't necessary when PAGE_SIZE >= 1MB */
if (PAGE_SIZE < SZ_1M)
--
2.41.0


2024-03-25 16:53:49

by Baoquan He

Subject: [PATCH v2 1/6] x86: remove unneeded memblock_find_dma_reserve()

The variable dma_reserve and its usage were introduced in commit
0e0b864e069c ("[PATCH] Account for memmap and optionally the kernel image
as holes"). Its original purpose was to account for the reserved pages in
the DMA zone so that the DMA zone's watermark calculation would be more
accurate on x86.

However, currently there is zone->managed_pages to account for all pages
available to the buddy allocator, and zone->present_pages to account for
all present physical pages in a zone. More importantly, on x86, calculating
and setting zone->managed_pages this early is only a temporary move; every
zone's managed_pages will be zeroed out and reset to the actual value,
according to how many pages are added to the buddy allocator, in
mem_init(). Before mem_init(), no buddy allocation is requested. And the
zone's pcp and watermark setup is all done after mem_init(). So there is no
need to worry about the accuracy of the DMA zone's setting during
free_area_init().

Hence, remove memblock_find_dma_reserve() to stop calculating and
setting dma_reserve.

Signed-off-by: Baoquan He <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 -
arch/x86/kernel/setup.c | 2 --
arch/x86/mm/init.c | 47 ----------------------------------
3 files changed, 50 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 315535ffb258..cefc7a84f7a4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1200,7 +1200,6 @@ static inline int pgd_none(pgd_t pgd)
extern int direct_gbpages;
void init_mem_mapping(void);
void early_alloc_pgt_buf(void);
-extern void memblock_find_dma_reserve(void);
void __init poking_init(void);
unsigned long init_memory_mapping(unsigned long start,
unsigned long end, pgprot_t prot);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ef206500ed6f..74873bd04ad1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1106,8 +1106,6 @@ void __init setup_arch(char **cmdline_p)
*/
arch_reserve_crashkernel();

- memblock_find_dma_reserve();
-
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 679893ea5e68..615f0bf4bda6 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -990,53 +990,6 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
}
#endif

-/*
- * Calculate the precise size of the DMA zone (first 16 MB of RAM),
- * and pass it to the MM layer - to help it set zone watermarks more
- * accurately.
- *
- * Done on 64-bit systems only for the time being, although 32-bit systems
- * might benefit from this as well.
- */
-void __init memblock_find_dma_reserve(void)
-{
-#ifdef CONFIG_X86_64
- u64 nr_pages = 0, nr_free_pages = 0;
- unsigned long start_pfn, end_pfn;
- phys_addr_t start_addr, end_addr;
- int i;
- u64 u;
-
- /*
- * Iterate over all memory ranges (free and reserved ones alike),
- * to calculate the total number of pages in the first 16 MB of RAM:
- */
- nr_pages = 0;
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
- start_pfn = min(start_pfn, MAX_DMA_PFN);
- end_pfn = min(end_pfn, MAX_DMA_PFN);
-
- nr_pages += end_pfn - start_pfn;
- }
-
- /*
- * Iterate over free memory ranges to calculate the number of free
- * pages in the DMA zone, while not counting potential partial
- * pages at the beginning or the end of the range:
- */
- nr_free_pages = 0;
- for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
- start_pfn = min_t(unsigned long, PFN_UP(start_addr), MAX_DMA_PFN);
- end_pfn = min_t(unsigned long, PFN_DOWN(end_addr), MAX_DMA_PFN);
-
- if (start_pfn < end_pfn)
- nr_free_pages += end_pfn - start_pfn;
- }
-
- set_dma_reserve(nr_pages - nr_free_pages);
-#endif
-}
-
void __init zone_sizes_init(void)
{
unsigned long max_zone_pfns[MAX_NR_ZONES];
--
2.41.0


2024-03-25 19:24:30

by Baoquan He

Subject: [PATCH v2 3/6] mm/mm_init.c: add new function calc_nr_all_pages()

This is a preparation for calculating nr_kernel_pages and nr_all_pages,
both of which will be used later in alloc_large_system_hash().

nr_all_pages counts all free but not reserved memory in the memblock
allocator, including HIGHMEM memory, while nr_kernel_pages counts all free
but not reserved low memory, excluding HIGHMEM memory.

Signed-off-by: Baoquan He <[email protected]>
---
mm/mm_init.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 153fb2dc666f..c57a7fc97a16 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1264,6 +1264,30 @@ static void __init reset_memoryless_node_totalpages(struct pglist_data *pgdat)
pr_debug("On node %d totalpages: 0\n", pgdat->node_id);
}

+static void __init calc_nr_kernel_pages(void)
+{
+ unsigned long start_pfn, end_pfn;
+ phys_addr_t start_addr, end_addr;
+ u64 u;
+#ifdef CONFIG_HIGHMEM
+ unsigned long high_zone_low = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
+#endif
+
+ for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
+ start_pfn = PFN_UP(start_addr);
+ end_pfn = PFN_DOWN(end_addr);
+
+ if (start_pfn < end_pfn) {
+ nr_all_pages += end_pfn - start_pfn;
+#ifdef CONFIG_HIGHMEM
+ start_pfn = clamp(start_pfn, 0, high_zone_low);
+ end_pfn = clamp(end_pfn, 0, high_zone_low);
+#endif
+ nr_kernel_pages += end_pfn - start_pfn;
+ }
+ }
+}
+
static void __init calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long node_start_pfn,
unsigned long node_end_pfn)
--
2.41.0


2024-03-26 06:45:01

by Mike Rapoport

Subject: Re: [PATCH v2 1/6] x86: remove unneeded memblock_find_dma_reserve()

On Mon, Mar 25, 2024 at 10:56:41PM +0800, Baoquan He wrote:
> Variable dma_reserve and its usage was introduced in commit 0e0b864e069c
> ("[PATCH] Account for memmap and optionally the kernel image as holes").
> Its original purpose was to accounting for the reserved pages in DMA
> zone to make DMA zone's watermarks calculation more accurate on x86.
>
> However, currently there's zone->managed_pages to account for all
> available pages for buddy, zone->present_pages to account for all
> present physical pages in zone. What is more important, on x86,
> calculating and setting the zone->managed_pages is a temporary move,
> all zone's managed_pages will be zeroed out and reset to the actual
> value according to how many pages are added to buddy allocator in
> mem_init(). Before mem_init(), no buddy alloction is requested. And
> zone's pcp and watermark setting are all done after mem_init(). So,
> no need to worry about the DMA zone's setting accuracy during
> free_area_init().
>
> Hence, remove memblock_find_dma_reserve() to stop calculating and
> setting dma_reserve.
>
> Signed-off-by: Baoquan He <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/x86/include/asm/pgtable.h | 1 -
> arch/x86/kernel/setup.c | 2 --
> arch/x86/mm/init.c | 47 ----------------------------------
> 3 files changed, 50 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 315535ffb258..cefc7a84f7a4 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1200,7 +1200,6 @@ static inline int pgd_none(pgd_t pgd)
> extern int direct_gbpages;
> void init_mem_mapping(void);
> void early_alloc_pgt_buf(void);
> -extern void memblock_find_dma_reserve(void);
> void __init poking_init(void);
> unsigned long init_memory_mapping(unsigned long start,
> unsigned long end, pgprot_t prot);
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index ef206500ed6f..74873bd04ad1 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1106,8 +1106,6 @@ void __init setup_arch(char **cmdline_p)
> */
> arch_reserve_crashkernel();
>
> - memblock_find_dma_reserve();
> -
> if (!early_xdbc_setup_hardware())
> early_xdbc_register_console();
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 679893ea5e68..615f0bf4bda6 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -990,53 +990,6 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
> }
> #endif
>
> -/*
> - * Calculate the precise size of the DMA zone (first 16 MB of RAM),
> - * and pass it to the MM layer - to help it set zone watermarks more
> - * accurately.
> - *
> - * Done on 64-bit systems only for the time being, although 32-bit systems
> - * might benefit from this as well.
> - */
> -void __init memblock_find_dma_reserve(void)
> -{
> -#ifdef CONFIG_X86_64
> - u64 nr_pages = 0, nr_free_pages = 0;
> - unsigned long start_pfn, end_pfn;
> - phys_addr_t start_addr, end_addr;
> - int i;
> - u64 u;
> -
> - /*
> - * Iterate over all memory ranges (free and reserved ones alike),
> - * to calculate the total number of pages in the first 16 MB of RAM:
> - */
> - nr_pages = 0;
> - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> - start_pfn = min(start_pfn, MAX_DMA_PFN);
> - end_pfn = min(end_pfn, MAX_DMA_PFN);
> -
> - nr_pages += end_pfn - start_pfn;
> - }
> -
> - /*
> - * Iterate over free memory ranges to calculate the number of free
> - * pages in the DMA zone, while not counting potential partial
> - * pages at the beginning or the end of the range:
> - */
> - nr_free_pages = 0;
> - for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
> - start_pfn = min_t(unsigned long, PFN_UP(start_addr), MAX_DMA_PFN);
> - end_pfn = min_t(unsigned long, PFN_DOWN(end_addr), MAX_DMA_PFN);
> -
> - if (start_pfn < end_pfn)
> - nr_free_pages += end_pfn - start_pfn;
> - }
> -
> - set_dma_reserve(nr_pages - nr_free_pages);
> -#endif
> -}
> -
> void __init zone_sizes_init(void)
> {
> unsigned long max_zone_pfns[MAX_NR_ZONES];
> --
> 2.41.0
>

--
Sincerely yours,
Mike.

2024-03-26 06:45:43

by Mike Rapoport

Subject: Re: [PATCH v2 2/6] mm/mm_init.c: remove the useless dma_reserve

On Mon, Mar 25, 2024 at 10:56:42PM +0800, Baoquan He wrote:
> Now nobody calls set_dma_reserve() to set value for dma_reserve, remove
> set_dma_reserve(), global variable dma_reserve and the codes using it.
>
> Signed-off-by: Baoquan He <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> include/linux/mm.h | 1 -
> mm/mm_init.c | 23 -----------------------
> 2 files changed, 24 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0436b919f1c7..ad19350e1538 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3210,7 +3210,6 @@ static inline int early_pfn_to_nid(unsigned long pfn)
> extern int __meminit early_pfn_to_nid(unsigned long pfn);
> #endif
>
> -extern void set_dma_reserve(unsigned long new_dma_reserve);
> extern void mem_init(void);
> extern void __init mmap_init(void);
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 549e76af8f82..153fb2dc666f 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -226,7 +226,6 @@ static unsigned long required_movablecore_percent __initdata;
>
> static unsigned long nr_kernel_pages __initdata;
> static unsigned long nr_all_pages __initdata;
> -static unsigned long dma_reserve __initdata;
>
> static bool deferred_struct_pages __meminitdata;
>
> @@ -1583,12 +1582,6 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
> zone_names[j], memmap_pages, freesize);
> }
>
> - /* Account for reserved pages */
> - if (j == 0 && freesize > dma_reserve) {
> - freesize -= dma_reserve;
> - pr_debug(" %s zone: %lu pages reserved\n", zone_names[0], dma_reserve);
> - }
> -
> if (!is_highmem_idx(j))
> nr_kernel_pages += freesize;
> /* Charge for highmem memmap if there are enough kernel pages */
> @@ -2547,22 +2540,6 @@ void *__init alloc_large_system_hash(const char *tablename,
> return table;
> }
>
> -/**
> - * set_dma_reserve - set the specified number of pages reserved in the first zone
> - * @new_dma_reserve: The number of pages to mark reserved
> - *
> - * The per-cpu batchsize and zone watermarks are determined by managed_pages.
> - * In the DMA zone, a significant percentage may be consumed by kernel image
> - * and other unfreeable allocations which can skew the watermarks badly. This
> - * function may optionally be used to account for unfreeable pages in the
> - * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and
> - * smaller per-cpu batchsize.
> - */
> -void __init set_dma_reserve(unsigned long new_dma_reserve)
> -{
> - dma_reserve = new_dma_reserve;
> -}
> -
> void __init memblock_free_pages(struct page *page, unsigned long pfn,
> unsigned int order)
> {
> --
> 2.41.0
>
>

--
Sincerely yours,
Mike.

2024-03-26 06:58:14

by Mike Rapoport

Subject: Re: [PATCH v2 3/6] mm/mm_init.c: add new function calc_nr_all_pages()

Hi Baoquan,

On Mon, Mar 25, 2024 at 10:56:43PM +0800, Baoquan He wrote:
> This is a preparation to calculate nr_kernel_pages and nr_all_pages,
> both of which will be used later in alloc_large_system_hash().
>
> nr_all_pages counts up all free but not reserved memory in memblock
> allocator, including HIGHMEM memory. While nr_kernel_pages counts up
> all free but not reserved low memory in memblock allocator, excluding
> HIGHMEM memory.

Sorry I've missed this in the previous review, but I think this patch and
the patch "remove unneeded calc_memmap_size()" can be merged into "remove
meaningless calculation of zone->managed_pages in free_area_init_core()"
with an appropriate update of the commit message.

With the current patch splitting there will be a compilation warning about
an unused function for this patch and the next one.

> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/mm_init.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 153fb2dc666f..c57a7fc97a16 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1264,6 +1264,30 @@ static void __init reset_memoryless_node_totalpages(struct pglist_data *pgdat)
> pr_debug("On node %d totalpages: 0\n", pgdat->node_id);
> }
>
> +static void __init calc_nr_kernel_pages(void)
> +{
> + unsigned long start_pfn, end_pfn;
> + phys_addr_t start_addr, end_addr;
> + u64 u;
> +#ifdef CONFIG_HIGHMEM
> + unsigned long high_zone_low = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
> +#endif
> +
> + for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
> + start_pfn = PFN_UP(start_addr);
> + end_pfn = PFN_DOWN(end_addr);
> +
> + if (start_pfn < end_pfn) {
> + nr_all_pages += end_pfn - start_pfn;
> +#ifdef CONFIG_HIGHMEM
> + start_pfn = clamp(start_pfn, 0, high_zone_low);
> + end_pfn = clamp(end_pfn, 0, high_zone_low);
> +#endif
> + nr_kernel_pages += end_pfn - start_pfn;
> + }
> + }
> +}
> +
> static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> unsigned long node_start_pfn,
> unsigned long node_end_pfn)
> --
> 2.41.0
>

--
Sincerely yours,
Mike.

2024-03-26 06:58:44

by Mike Rapoport

Subject: Re: [PATCH v2 6/6] mm/mm_init.c: remove arch_reserved_kernel_pages()

On Mon, Mar 25, 2024 at 10:56:46PM +0800, Baoquan He wrote:
> Since the current calculation of calc_nr_kernel_pages() has taken into
> consideration of kernel reserved memory, no need to have
> arch_reserved_kernel_pages() any more.
>
> Signed-off-by: Baoquan He <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/powerpc/include/asm/mmu.h | 4 ----
> arch/powerpc/kernel/fadump.c | 5 -----
> include/linux/mm.h | 3 ---
> mm/mm_init.c | 12 ------------
> 4 files changed, 24 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
> index 3b72c7ed24cf..aa5c0fd5edb1 100644
> --- a/arch/powerpc/include/asm/mmu.h
> +++ b/arch/powerpc/include/asm/mmu.h
> @@ -406,9 +406,5 @@ extern void *abatron_pteptrs[2];
> #include <asm/nohash/mmu.h>
> #endif
>
> -#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP)
> -#define __HAVE_ARCH_RESERVED_KERNEL_PAGES
> -#endif
> -
> #endif /* __KERNEL__ */
> #endif /* _ASM_POWERPC_MMU_H_ */
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index d14eda1e8589..ae8c7619e597 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -1735,8 +1735,3 @@ static void __init fadump_reserve_crash_area(u64 base)
> memblock_reserve(mstart, msize);
> }
> }
> -
> -unsigned long __init arch_reserved_kernel_pages(void)
> -{
> - return memblock_reserved_size() / PAGE_SIZE;
> -}
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ad19350e1538..ab1ba0a31429 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3221,9 +3221,6 @@ static inline void show_mem(void)
> extern long si_mem_available(void);
> extern void si_meminfo(struct sysinfo * val);
> extern void si_meminfo_node(struct sysinfo *val, int nid);
> -#ifdef __HAVE_ARCH_RESERVED_KERNEL_PAGES
> -extern unsigned long arch_reserved_kernel_pages(void);
> -#endif
>
> extern __printf(3, 4)
> void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index e269a724f70e..089dc60159e9 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2373,17 +2373,6 @@ void __init page_alloc_init_late(void)
> page_alloc_sysctl_init();
> }
>
> -#ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
> -/*
> - * Returns the number of pages that arch has reserved but
> - * is not known to alloc_large_system_hash().
> - */
> -static unsigned long __init arch_reserved_kernel_pages(void)
> -{
> - return 0;
> -}
> -#endif
> -
> /*
> * Adaptive scale is meant to reduce sizes of hash tables on large memory
> * machines. As memory size is increased the scale is also increased but at
> @@ -2426,7 +2415,6 @@ void *__init alloc_large_system_hash(const char *tablename,
> if (!numentries) {
> /* round applicable memory size up to nearest megabyte */
> numentries = nr_kernel_pages;
> - numentries -= arch_reserved_kernel_pages();
>
> /* It isn't necessary when PAGE_SIZE >= 1MB */
> if (PAGE_SIZE < SZ_1M)
> --
> 2.41.0
>

--
Sincerely yours,
Mike.

2024-03-26 13:49:44

by Baoquan He

Subject: Re: [PATCH v2 3/6] mm/mm_init.c: add new function calc_nr_all_pages()

On 03/26/24 at 08:57am, Mike Rapoport wrote:
> Hi Baoquan,
>
> On Mon, Mar 25, 2024 at 10:56:43PM +0800, Baoquan He wrote:
> > This is a preparation to calculate nr_kernel_pages and nr_all_pages,
> > both of which will be used later in alloc_large_system_hash().
> >
> > nr_all_pages counts up all free but not reserved memory in memblock
> > allocator, including HIGHMEM memory. While nr_kernel_pages counts up
> > all free but not reserved low memory in memblock allocator, excluding
> > HIGHMEM memory.
>
> Sorry I've missed this in the previous review, but I think this patch and
> the patch "remove unneeded calc_memmap_size()" can be merged into "remove
> meaningless calculation of zone->managed_pages in free_area_init_core()"
> with an appropriate update of the commit message.
>
> With the current patch splitting there will be compilation warning about unused
> function for this and the next patch.

Thanks for the careful checking.

We need to keep patches bisectable so that compiling isn't broken and
people can spot the offending commit, that's for sure. But do we need to
care about a compile warning from an intermediate patch in a series? I'm
not sure about it. I always suggest that people separate this kind of newly
added function out into a standalone patch for easier review and later
checking, and I have seen a lot of commits like this by searching with
'git log --oneline | grep helper'.

>
> > Signed-off-by: Baoquan He <[email protected]>
> > ---
> > mm/mm_init.c | 24 ++++++++++++++++++++++++
> > 1 file changed, 24 insertions(+)
> >
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 153fb2dc666f..c57a7fc97a16 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -1264,6 +1264,30 @@ static void __init reset_memoryless_node_totalpages(struct pglist_data *pgdat)
> > pr_debug("On node %d totalpages: 0\n", pgdat->node_id);
> > }
> >
> > +static void __init calc_nr_kernel_pages(void)
> > +{
> > + unsigned long start_pfn, end_pfn;
> > + phys_addr_t start_addr, end_addr;
> > + u64 u;
> > +#ifdef CONFIG_HIGHMEM
> > + unsigned long high_zone_low = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
> > +#endif
> > +
> > + for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
> > + start_pfn = PFN_UP(start_addr);
> > + end_pfn = PFN_DOWN(end_addr);
> > +
> > + if (start_pfn < end_pfn) {
> > + nr_all_pages += end_pfn - start_pfn;
> > +#ifdef CONFIG_HIGHMEM
> > + start_pfn = clamp(start_pfn, 0, high_zone_low);
> > + end_pfn = clamp(end_pfn, 0, high_zone_low);
> > +#endif
> > + nr_kernel_pages += end_pfn - start_pfn;
> > + }
> > + }
> > +}
> > +
> > static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> > unsigned long node_start_pfn,
> > unsigned long node_end_pfn)
> > --
> > 2.41.0
> >
>
> --
> Sincerely yours,
> Mike.
>


2024-03-27 15:44:18

by Mike Rapoport

Subject: Re: [PATCH v2 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

On Mon, Mar 25, 2024 at 10:56:44PM +0800, Baoquan He wrote:
> Currently, in free_area_init_core(), when initialize zone's field, a
> rough value is set to zone->managed_pages. That value is calculated by
> (zone->present_pages - memmap_pages).
>
> In the meantime, add the value to nr_all_pages and nr_kernel_pages which
> represent all free pages of system (only low memory or including HIGHMEM
> memory separately). Both of them are gonna be used in
> alloc_large_system_hash().
>
> However, the rough calculation and setting of zone->managed_pages is
> meaningless because
> a) memmap pages are allocated on units of node in sparse_init() or
> alloc_node_mem_map(pgdat); The simple (zone->present_pages -
> memmap_pages) is too rough to make sense for zone;
> b) the set zone->managed_pages will be zeroed out and reset with
> acutal value in mem_init() via memblock_free_all(). Before the
> resetting, no buddy allocation request is issued.
>
> Here, remove the meaningless and complicated calculation of
> (zone->present_pages - memmap_pages), initialize zone->managed_pages as 0
> which reflect its actual value because no any page is added into buddy
> system right now. It will be reset in mem_init().
>
> And also remove the assignment of nr_all_pages and nr_kernel_pages in
> free_area_init_core(). Instead, call the newly added calc_nr_kernel_pages()
> to count up all free but not reserved memory in memblock and assign to
> nr_all_pages and nr_kernel_pages. The counting excludes memmap_pages,
> and other kernel used data, which is more accurate than old way and
> simpler, and can also cover the ppc required arch_reserved_kernel_pages()
> case.
>
> And also clean up the outdated code comment above free_area_init_core().
> And free_area_init_core() is easy to understand now, no need to add
> words to explain.
>
> Signed-off-by: Baoquan He <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> mm/mm_init.c | 46 +++++-----------------------------------------
> 1 file changed, 5 insertions(+), 41 deletions(-)

2024-03-27 15:44:56

by Mike Rapoport

Subject: Re: [PATCH v2 6/6] mm/mm_init.c: remove arch_reserved_kernel_pages()

On Mon, Mar 25, 2024 at 10:56:46PM +0800, Baoquan He wrote:
> Since the current calculation of calc_nr_kernel_pages() has taken into
> consideration of kernel reserved memory, no need to have
> arch_reserved_kernel_pages() any more.
>
> Signed-off-by: Baoquan He <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/powerpc/include/asm/mmu.h | 4 ----
> arch/powerpc/kernel/fadump.c | 5 -----
> include/linux/mm.h | 3 ---
> mm/mm_init.c | 12 ------------
> 4 files changed, 24 deletions(-)
>

2024-03-27 15:50:19

by Mike Rapoport

Subject: Re: [PATCH v2 3/6] mm/mm_init.c: add new function calc_nr_all_pages()

On Mon, Mar 25, 2024 at 10:56:43PM +0800, Baoquan He wrote:
> This is a preparation to calculate nr_kernel_pages and nr_all_pages,
> both of which will be used later in alloc_large_system_hash().
>
> nr_all_pages counts up all free but not reserved memory in memblock
> allocator, including HIGHMEM memory. While nr_kernel_pages counts up
> all free but not reserved low memory in memblock allocator, excluding
> HIGHMEM memory.
>
> Signed-off-by: Baoquan He <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> mm/mm_init.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 153fb2dc666f..c57a7fc97a16 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1264,6 +1264,30 @@ static void __init reset_memoryless_node_totalpages(struct pglist_data *pgdat)
> pr_debug("On node %d totalpages: 0\n", pgdat->node_id);
> }
>
> +static void __init calc_nr_kernel_pages(void)
> +{
> + unsigned long start_pfn, end_pfn;
> + phys_addr_t start_addr, end_addr;
> + u64 u;
> +#ifdef CONFIG_HIGHMEM
> + unsigned long high_zone_low = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
> +#endif
> +
> + for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
> + start_pfn = PFN_UP(start_addr);
> + end_pfn = PFN_DOWN(end_addr);
> +
> + if (start_pfn < end_pfn) {
> + nr_all_pages += end_pfn - start_pfn;
> +#ifdef CONFIG_HIGHMEM
> + start_pfn = clamp(start_pfn, 0, high_zone_low);
> + end_pfn = clamp(end_pfn, 0, high_zone_low);
> +#endif
> + nr_kernel_pages += end_pfn - start_pfn;
> + }
> + }
> +}
> +
> static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> unsigned long node_start_pfn,
> unsigned long node_end_pfn)
> --
> 2.41.0
>

--
Sincerely yours,
Mike.

2024-03-27 16:22:12

by Mike Rapoport

Subject: Re: [PATCH v2 5/6] mm/mm_init.c: remove unneeded calc_memmap_size()

On Mon, Mar 25, 2024 at 10:56:45PM +0800, Baoquan He wrote:
> Nobody calls calc_memmap_size() now.
>
> Signed-off-by: Baoquan He <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

Looks like I replied to patch 6/6 twice by mistake and missed this one.

> ---
> mm/mm_init.c | 20 --------------------
> 1 file changed, 20 deletions(-)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 7f71e56e83f3..e269a724f70e 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1331,26 +1331,6 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
> }
>
> -static unsigned long __init calc_memmap_size(unsigned long spanned_pages,
> - unsigned long present_pages)
> -{
> - unsigned long pages = spanned_pages;
> -
> - /*
> - * Provide a more accurate estimation if there are holes within
> - * the zone and SPARSEMEM is in use. If there are holes within the
> - * zone, each populated memory region may cost us one or two extra
> - * memmap pages due to alignment because memmap pages for each
> - * populated regions may not be naturally aligned on page boundary.
> - * So the (present_pages >> 4) heuristic is a tradeoff for that.
> - */
> - if (spanned_pages > present_pages + (present_pages >> 4) &&
> - IS_ENABLED(CONFIG_SPARSEMEM))
> - pages = present_pages;
> -
> - return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> -}
> -
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> static void pgdat_init_split_queue(struct pglist_data *pgdat)
> {
> --
> 2.41.0
>

--
Sincerely yours,
Mike.

2024-03-28 01:24:48

by Baoquan He

Subject: Re: [PATCH v2 5/6] mm/mm_init.c: remove unneeded calc_memmap_size()

On 03/27/24 at 06:21pm, Mike Rapoport wrote:
> On Mon, Mar 25, 2024 at 10:56:45PM +0800, Baoquan He wrote:
> > Nobody calls calc_memmap_size() now.
> >
> > Signed-off-by: Baoquan He <[email protected]>
>
> Reviewed-by: Mike Rapoport (IBM) <[email protected]>
>
> Looks like I replied to patch 6/6 twice by mistake and missed this one.

Thanks for your careful review.

>
> > ---
> > mm/mm_init.c | 20 --------------------
> > 1 file changed, 20 deletions(-)
> >
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 7f71e56e83f3..e269a724f70e 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -1331,26 +1331,6 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> > pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
> > }
> >
> > -static unsigned long __init calc_memmap_size(unsigned long spanned_pages,
> > - unsigned long present_pages)
> > -{
> > - unsigned long pages = spanned_pages;
> > -
> > - /*
> > - * Provide a more accurate estimation if there are holes within
> > - * the zone and SPARSEMEM is in use. If there are holes within the
> > - * zone, each populated memory region may cost us one or two extra
> > - * memmap pages due to alignment because memmap pages for each
> > - * populated regions may not be naturally aligned on page boundary.
> > - * So the (present_pages >> 4) heuristic is a tradeoff for that.
> > - */
> > - if (spanned_pages > present_pages + (present_pages >> 4) &&
> > - IS_ENABLED(CONFIG_SPARSEMEM))
> > - pages = present_pages;
> > -
> > - return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> > -}
> > -
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > static void pgdat_init_split_queue(struct pglist_data *pgdat)
> > {
> > --
> > 2.41.0
> >
>
> --
> Sincerely yours,
> Mike.
>


2024-03-28 08:32:58

by Baoquan He

Subject: Re: [PATCH v2 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

On 03/25/24 at 10:56pm, Baoquan He wrote:
> Currently, in free_area_init_core(), when initialize zone's field, a
> rough value is set to zone->managed_pages. That value is calculated by
> (zone->present_pages - memmap_pages).
>
> In the meantime, add the value to nr_all_pages and nr_kernel_pages which
> represent all free pages of system (only low memory or including HIGHMEM
> memory separately). Both of them are gonna be used in
> alloc_large_system_hash().
>
> However, the rough calculation and setting of zone->managed_pages is
> meaningless because
> a) memmap pages are allocated on units of node in sparse_init() or
> alloc_node_mem_map(pgdat); The simple (zone->present_pages -
> memmap_pages) is too rough to make sense for zone;
> b) the set zone->managed_pages will be zeroed out and reset with
> acutal value in mem_init() via memblock_free_all(). Before the
> resetting, no buddy allocation request is issued.
>
> Here, remove the meaningless and complicated calculation of
> (zone->present_pages - memmap_pages), initialize zone->managed_pages as 0
> which reflect its actual value because no any page is added into buddy
> system right now. It will be reset in mem_init().
>
> And also remove the assignment of nr_all_pages and nr_kernel_pages in
> free_area_init_core(). Instead, call the newly added calc_nr_kernel_pages()
> to count up all free but not reserved memory in memblock and assign to
> nr_all_pages and nr_kernel_pages. The counting excludes memmap_pages,
> and other kernel used data, which is more accurate than old way and
> simpler, and can also cover the ppc required arch_reserved_kernel_pages()
> case.
>
> And also clean up the outdated code comment above free_area_init_core().
> And free_area_init_core() is easy to understand now, no need to add
> words to explain.
>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/mm_init.c | 46 +++++-----------------------------------------
> 1 file changed, 5 insertions(+), 41 deletions(-)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index c57a7fc97a16..7f71e56e83f3 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1565,15 +1565,6 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
> }
> #endif
>
> -/*
> - * Set up the zone data structures:
> - * - mark all pages reserved
> - * - mark all memory queues empty
> - * - clear the memory bitmaps
> - *
> - * NOTE: pgdat should get zeroed by caller.
> - * NOTE: this function is only called during early init.
> - */
> static void __init free_area_init_core(struct pglist_data *pgdat)
> {
> enum zone_type j;
> @@ -1584,41 +1575,13 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
>
> for (j = 0; j < MAX_NR_ZONES; j++) {
> struct zone *zone = pgdat->node_zones + j;
> - unsigned long size, freesize, memmap_pages;
> -
> - size = zone->spanned_pages;
> - freesize = zone->present_pages;
> -
> - /*
> - * Adjust freesize so that it accounts for how much memory
> - * is used by this zone for memmap. This affects the watermark
> - * and per-cpu initialisations
> - */
> - memmap_pages = calc_memmap_size(size, freesize);
> - if (!is_highmem_idx(j)) {
> - if (freesize >= memmap_pages) {
> - freesize -= memmap_pages;
> - if (memmap_pages)
> - pr_debug(" %s zone: %lu pages used for memmap\n",
> - zone_names[j], memmap_pages);
> - } else
> - pr_warn(" %s zone: %lu memmap pages exceeds freesize %lu\n",
> - zone_names[j], memmap_pages, freesize);
> - }
> -
> - if (!is_highmem_idx(j))
> - nr_kernel_pages += freesize;
> - /* Charge for highmem memmap if there are enough kernel pages */
> - else if (nr_kernel_pages > memmap_pages * 2)
> - nr_kernel_pages -= memmap_pages;
> - nr_all_pages += freesize;
> + unsigned long size = zone->spanned_pages;
>
> /*
> - * Set an approximate value for lowmem here, it will be adjusted
> - * when the bootmem allocator frees pages into the buddy system.
> - * And all highmem pages will be managed by the buddy system.
> + * Initialize zone->managed_pages as 0 , it will be reset
> + * when memblock allocator frees pages into buddy system.
> */
> - zone_init_internals(zone, j, nid, freesize);
> + zone_init_internals(zone, j, nid, 0);

Here, we should initialize zone->managed_pages as zone->present_pages,
because page_group_by_mobility_disabled needs to be set later according to
zone->managed_pages. Otherwise page_group_by_mobility_disabled will always
be set to 1. I will send out v3.

From a17b0921b4bd00596330f61ee9ea4b82386a9fed Mon Sep 17 00:00:00 2001
From: Baoquan He <[email protected]>
Date: Thu, 28 Mar 2024 16:20:15 +0800
Subject: [PATCH] mm/mm_init.c: set zone's ->managed_pages as ->present_pages
for now
Content-type: text/plain

Because page_group_by_mobility_disabled needs to be set according to the
zone's managed_pages later.

Signed-off-by: Baoquan He <[email protected]>
---
mm/mm_init.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index cc24e7958c0c..dd875f943cbb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1561,7 +1561,7 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
* Initialize zone->managed_pages as 0 , it will be reset
* when memblock allocator frees pages into buddy system.
*/
- zone_init_internals(zone, j, nid, 0);
+ zone_init_internals(zone, j, nid, zone->present_pages);

if (!size)
continue;
--
2.41.0


>
> if (!size)
> continue;
> @@ -1915,6 +1878,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> check_for_memory(pgdat);
> }
>
> + calc_nr_kernel_pages();
> memmap_init();
>
> /* disable hash distribution for systems with a single node */
> --
> 2.41.0
>


2024-03-28 09:12:34

by Baoquan He

Subject: [PATCH v3 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

Currently, in free_area_init_core(), when the zone's fields are
initialized, a rough value is set for zone->managed_pages. That value is
calculated as (zone->present_pages - memmap_pages).

In the meantime, the value is added to nr_all_pages and nr_kernel_pages,
which represent the free pages of the system (nr_kernel_pages counting low
memory only, nr_all_pages also including HIGHMEM memory). Both of them will
be used in alloc_large_system_hash().

However, the rough calculation and setting of zone->managed_pages is
meaningless because
a) memmap pages are allocated in units of node in sparse_init() or
   alloc_node_mem_map(pgdat); the simple (zone->present_pages -
   memmap_pages) is too rough to make sense for a zone;
b) the zone->managed_pages set here will be zeroed out and reset to the
   actual value in mem_init() via memblock_free_all(). Before that
   resetting, no buddy allocation request is issued.

Here, remove the meaningless and complicated calculation of
(zone->present_pages - memmap_pages) and directly set zone->managed_pages
to zone->present_pages for now. It will be adjusted in mem_init().

Also remove the assignment of nr_all_pages and nr_kernel_pages in
free_area_init_core(). Instead, call the newly added calc_nr_kernel_pages()
to count all free but not reserved memory in memblock and assign the result
to nr_all_pages and nr_kernel_pages. That counting excludes memmap_pages
and other data used by the kernel, which is more accurate than the old way
and simpler, and also covers the arch_reserved_kernel_pages() case required
by ppc.

Also clean up the outdated code comment above free_area_init_core().
free_area_init_core() is easy to understand now, so there is no need to add
more words to explain it.

Signed-off-by: Baoquan He <[email protected]>
---
v2->v3:
- Change to initialize zone->managed_pages as zone->present_pages for now,
  because page_group_by_mobility_disabled needs to be set later according
  to zone->managed_pages. Otherwise page_group_by_mobility_disabled would
  always be set to 1.

mm/mm_init.c | 46 +++++-----------------------------------------
1 file changed, 5 insertions(+), 41 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index c57a7fc97a16..a4b80e8276bb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1565,15 +1565,6 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
}
#endif

-/*
- * Set up the zone data structures:
- * - mark all pages reserved
- * - mark all memory queues empty
- * - clear the memory bitmaps
- *
- * NOTE: pgdat should get zeroed by caller.
- * NOTE: this function is only called during early init.
- */
static void __init free_area_init_core(struct pglist_data *pgdat)
{
enum zone_type j;
@@ -1584,41 +1575,13 @@ static void __init free_area_init_core(struct pglist_data *pgdat)

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, freesize, memmap_pages;
-
- size = zone->spanned_pages;
- freesize = zone->present_pages;
-
- /*
- * Adjust freesize so that it accounts for how much memory
- * is used by this zone for memmap. This affects the watermark
- * and per-cpu initialisations
- */
- memmap_pages = calc_memmap_size(size, freesize);
- if (!is_highmem_idx(j)) {
- if (freesize >= memmap_pages) {
- freesize -= memmap_pages;
- if (memmap_pages)
- pr_debug(" %s zone: %lu pages used for memmap\n",
- zone_names[j], memmap_pages);
- } else
- pr_warn(" %s zone: %lu memmap pages exceeds freesize %lu\n",
- zone_names[j], memmap_pages, freesize);
- }
-
- if (!is_highmem_idx(j))
- nr_kernel_pages += freesize;
- /* Charge for highmem memmap if there are enough kernel pages */
- else if (nr_kernel_pages > memmap_pages * 2)
- nr_kernel_pages -= memmap_pages;
- nr_all_pages += freesize;
+ unsigned long size = zone->spanned_pages;

/*
- * Set an approximate value for lowmem here, it will be adjusted
- * when the bootmem allocator frees pages into the buddy system.
- * And all highmem pages will be managed by the buddy system.
+ * Set zone->managed_pages to zone->present_pages for now, it will
+ * be adjusted when memblock allocator frees pages into buddy system.
*/
- zone_init_internals(zone, j, nid, freesize);
+ zone_init_internals(zone, j, nid, zone->present_pages);

if (!size)
continue;
@@ -1915,6 +1878,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
check_for_memory(pgdat);
}

+ calc_nr_kernel_pages();
memmap_init();

/* disable hash distribution for systems with a single node */
--
2.41.0


2024-03-28 10:02:39

by Mike Rapoport

Subject: Re: [PATCH v2 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

On Thu, Mar 28, 2024 at 04:32:38PM +0800, Baoquan He wrote:
> On 03/25/24 at 10:56pm, Baoquan He wrote:
> >
> > /*
> > - * Set an approximate value for lowmem here, it will be adjusted
> > - * when the bootmem allocator frees pages into the buddy system.
> > - * And all highmem pages will be managed by the buddy system.
> > + * Initialize zone->managed_pages as 0 , it will be reset
> > + * when memblock allocator frees pages into buddy system.
> > */
> > - zone_init_internals(zone, j, nid, freesize);
> > + zone_init_internals(zone, j, nid, 0);
>
> Here, we should initialize zone->managed_pages as zone->present_pages
> because later page_group_by_mobility_disabled need be set according to
> zone->managed_pages. Otherwise page_group_by_mobility_disabled will be
> set to 1 always. I will sent out v3.

With zone->managed_pages set to zone->present_pages we won't account for
the reserved memory when initializing page_group_by_mobility_disabled.

As watermarks are still not initialized at the time build_all_zonelists()
is called, we may use nr_all_pages - nr_kernel_pages instead of
nr_free_zone_pages(), IMO.

> From a17b0921b4bd00596330f61ee9ea4b82386a9fed Mon Sep 17 00:00:00 2001
> From: Baoquan He <[email protected]>
> Date: Thu, 28 Mar 2024 16:20:15 +0800
> Subject: [PATCH] mm/mm_init.c: set zone's ->managed_pages as ->present_pages
> for now
> Content-type: text/plain
>
> Because page_group_by_mobility_disabled need be set according to zone's
> managed_pages later.
>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/mm_init.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index cc24e7958c0c..dd875f943cbb 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1561,7 +1561,7 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
> * Initialize zone->managed_pages as 0 , it will be reset
> * when memblock allocator frees pages into buddy system.
> */
> - zone_init_internals(zone, j, nid, 0);
> + zone_init_internals(zone, j, nid, zone->present_pages);
>
> if (!size)
> continue;
> --
> 2.41.0
>
>
> >
> > if (!size)
> > continue;
> > @@ -1915,6 +1878,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> > check_for_memory(pgdat);
> > }
> >
> > + calc_nr_kernel_pages();
> > memmap_init();
> >
> > /* disable hash distribution for systems with a single node */
> > --
> > 2.41.0
> >
>

--
Sincerely yours,
Mike.

2024-03-28 14:47:05

by Baoquan He

Subject: Re: [PATCH v2 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

On 03/28/24 at 11:53am, Mike Rapoport wrote:
> On Thu, Mar 28, 2024 at 04:32:38PM +0800, Baoquan He wrote:
> > On 03/25/24 at 10:56pm, Baoquan He wrote:
> > >
> > > /*
> > > - * Set an approximate value for lowmem here, it will be adjusted
> > > - * when the bootmem allocator frees pages into the buddy system.
> > > - * And all highmem pages will be managed by the buddy system.
> > > + * Initialize zone->managed_pages as 0 , it will be reset
> > > + * when memblock allocator frees pages into buddy system.
> > > */
> > > - zone_init_internals(zone, j, nid, freesize);
> > > + zone_init_internals(zone, j, nid, 0);
> >
> > Here, we should initialize zone->managed_pages as zone->present_pages
> > because later page_group_by_mobility_disabled need be set according to
> > zone->managed_pages. Otherwise page_group_by_mobility_disabled will be
> > set to 1 always. I will sent out v3.
>
> With zone->managed_pages set to zone->present_pages we won't account for
> the reserved memory for initialization of page_group_by_mobility_disabled.

The old zone->managed_pages didn't account for the reserved pages either;
it was calculated as (zone->present_pages - memmap_pages), and the memmap
pages are only a very small portion of memory. E.g. on x86_64 with 4K page
size, assuming struct page is 64 bytes, that's 64/4096 = 1/64 (about 1.6%)
of system memory; on arm64 with 64K page size it's 64/65536 = 1/1024
(about 0.1%).

And as for the setting of page_group_by_mobility_disabled, the value it is
compared against, pageblock_nr_pages * MIGRATE_TYPES, is very small. On
x86_64 it's 4M*6 = 24M; on arm64 with 64K page size it's 128M*6 = 768M,
which should be the biggest among the architectures.

if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
page_group_by_mobility_disabled = 1;
else
page_group_by_mobility_disabled = 0;

So page_group_by_mobility_disabled could be set to 1 only on a system with
very little memory, which is rarely seen. And setting zone->managed_pages
to zone->present_pages is very close to its old value,
(zone->present_pages - memmap_pages). We don't need to be very accurate
here, just a rough value.
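
To put rough numbers on it (a hypothetical 16G x86_64 box, 4K pages,
64-byte struct page, 4M pageblocks, assumed values):

	present_pages     = 16G / 4K           = 4,194,304 pages
	memmap_pages      ~ present_pages / 64 ~    65,536 pages (256M)
	old approximation = present - memmap   ~ 4,128,768 pages
	threshold         = pageblock(4M) * 6  =     6,144 pages (24M)

Both the old approximation and plain present_pages are far above the
threshold, so the present_pages approximation can't flip
page_group_by_mobility_disabled on such a system.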

>
> As watermarks are still not initialized at the time build_all_zonelists()
> is called, we may use nr_all_pages - nr_kernel_pages instead of
> nr_free_zone_pages(), IMO.

nr_all_pages should be fine if we take that way. nr_kernel_pages is a
misleading name: it's all low memory pages excluding kernel reserved pages,
while nr_all_pages is all memory pages including highmem and excluding
kernel reserved pages.

Both are fine to me. The first one is easier, simply setting
zone->managed_pages as zone->present_pages. The second way is a little more
accurate.

>
> > From a17b0921b4bd00596330f61ee9ea4b82386a9fed Mon Sep 17 00:00:00 2001
> > From: Baoquan He <[email protected]>
> > Date: Thu, 28 Mar 2024 16:20:15 +0800
> > Subject: [PATCH] mm/mm_init.c: set zone's ->managed_pages as ->present_pages
> > for now
> > Content-type: text/plain
> >
> > Because page_group_by_mobility_disabled need be set according to zone's
> > managed_pages later.
> >
> > Signed-off-by: Baoquan He <[email protected]>
> > ---
> > mm/mm_init.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index cc24e7958c0c..dd875f943cbb 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -1561,7 +1561,7 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
> > * Initialize zone->managed_pages as 0 , it will be reset
> > * when memblock allocator frees pages into buddy system.
> > */
> > - zone_init_internals(zone, j, nid, 0);
> > + zone_init_internals(zone, j, nid, zone->present_pages);
> >
> > if (!size)
> > continue;
> > --
> > 2.41.0
> >
> >
> > >
> > > if (!size)
> > > continue;
> > > @@ -1915,6 +1878,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> > > check_for_memory(pgdat);
> > > }
> > >
> > > + calc_nr_kernel_pages();
> > > memmap_init();
> > >
> > > /* disable hash distribution for systems with a single node */
> > > --
> > > 2.41.0
> > >
> >
>
> --
> Sincerely yours,
> Mike.
>