2024-03-18 14:22:22

by Baoquan He

Subject: [PATCH 0/6] mm/mm_init.c: refactor free_area_init_core()

In free_area_init_core(), the code that calculates zone->managed_pages
and subtracts dma_reserve from the DMA zone is very confusing.

From the git history, the code calculating zone->managed_pages originally
operated on zone->present_pages, and that early rough assignment was meant
to optimize the zone's pcp and watermark setup. Later, managed_pages was
introduced into struct zone to represent the number of pages managed by
the buddy allocator. Nowadays, zone->managed_pages is zeroed out and reset
in mem_init() when memblock_free_all() is called, and the zone's pcp and
watermark setup, which relies on the actual zone->managed_pages, happens
after mem_init(). So there is no need to rush to calculate and set
zone->managed_pages early; just set it to zone->present_pages and adjust
it in mem_init().

Also add a new function, calc_nr_kernel_pages(), to count the free but
not reserved pages in memblock, then assign the result to nr_all_pages
and nr_kernel_pages after the memmap pages are allocated.
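
For context, the zeroing and reset mentioned above happen in
memblock_free_all(); a simplified sketch, paraphrased from mm/memblock.c
at the base this series applies to (details are illustrative, not an
exact quote):

void __init memblock_free_all(void)
{
        unsigned long pages;

        /* Zero every zone's managed_pages ... */
        reset_all_zones_managed_pages();

        /*
         * ... then hand all free memblock memory to the buddy allocator;
         * __free_pages_core() re-accumulates each zone's managed_pages
         * as the ranges are freed.
         */
        pages = free_low_memory_core_early();
        totalram_pages_add(pages);
}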


Baoquan He (6):
mm/mm_init.c: remove the useless dma_reserve
x86: remove unneeded memblock_find_dma_reserve()
mm/mm_init.c: add new function calc_nr_kernel_pages()
mm/mm_init.c: remove meaningless calculation of zone->managed_pages in
free_area_init_core()
mm/mm_init.c: remove unneeded calc_memmap_size()
mm/mm_init.c: remove arch_reserved_kernel_pages()

arch/powerpc/include/asm/mmu.h | 4 --
arch/powerpc/kernel/fadump.c | 5 --
arch/x86/include/asm/pgtable.h | 1 -
arch/x86/kernel/setup.c | 2 -
arch/x86/mm/init.c | 47 -------------
include/linux/mm.h | 4 --
mm/mm_init.c | 117 +++++++++------------------------
7 files changed, 30 insertions(+), 150 deletions(-)

--
2.41.0



2024-03-18 14:22:40

by Baoquan He

Subject: [PATCH 1/6] mm/mm_init.c: remove the useless dma_reserve

Variable dma_reserve and its usage were introduced in commit 0e0b864e069c
("[PATCH] Account for memmap and optionally the kernel image as holes").
Its original purpose was to account for the reserved pages in the DMA
zone so that the DMA zone's watermark calculation would be more accurate
on x86.

However, we now have zone->managed_pages to account for all pages
available to the buddy allocator, and zone->present_pages to account for
all present physical pages in a zone. More importantly, on x86 the early
calculation and setting of zone->managed_pages is only temporary: every
zone's managed_pages is zeroed out and reset to the actual value,
according to how many pages are handed to the buddy allocator, in
mem_init(). Before mem_init(), no buddy allocation is requested, and the
zone's pcp and watermark setup is done after mem_init(). So there is no
need to worry about the accuracy of the DMA zone's setting during
free_area_init().

Hence, remove dma_reserve and its handling in free_area_init_core();
it is useless and only causes confusion.
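
For reference, the relationship between spanned_pages, present_pages and
managed_pages, paraphrased from the comment in include/linux/mmzone.h
(wording and layout are approximate, not an exact quote):

struct zone {
        ...
        /*
         * spanned_pages = zone_end_pfn - zone_start_pfn
         *      (pages spanned by the zone, including holes)
         * present_pages = spanned_pages - absent_pages(pages in holes)
         * managed_pages = present_pages - reserved_pages
         *      (pages handed to and managed by the buddy allocator)
         */
        atomic_long_t   managed_pages;
        unsigned long   spanned_pages;
        unsigned long   present_pages;
        ...
};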

Signed-off-by: Baoquan He <[email protected]>
---
arch/x86/mm/init.c | 2 --
include/linux/mm.h | 1 -
mm/mm_init.c | 23 -----------------------
3 files changed, 26 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 679893ea5e68..5209549e8192 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1032,8 +1032,6 @@ void __init memblock_find_dma_reserve(void)
if (start_pfn < end_pfn)
nr_free_pages += end_pfn - start_pfn;
}
-
- set_dma_reserve(nr_pages - nr_free_pages);
#endif
}

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2c0910bc3e4a..1888b1935103 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3210,7 +3210,6 @@ static inline int early_pfn_to_nid(unsigned long pfn)
extern int __meminit early_pfn_to_nid(unsigned long pfn);
#endif

-extern void set_dma_reserve(unsigned long new_dma_reserve);
extern void mem_init(void);
extern void __init mmap_init(void);

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..153fb2dc666f 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -226,7 +226,6 @@ static unsigned long required_movablecore_percent __initdata;

static unsigned long nr_kernel_pages __initdata;
static unsigned long nr_all_pages __initdata;
-static unsigned long dma_reserve __initdata;

static bool deferred_struct_pages __meminitdata;

@@ -1583,12 +1582,6 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
zone_names[j], memmap_pages, freesize);
}

- /* Account for reserved pages */
- if (j == 0 && freesize > dma_reserve) {
- freesize -= dma_reserve;
- pr_debug(" %s zone: %lu pages reserved\n", zone_names[0], dma_reserve);
- }
-
if (!is_highmem_idx(j))
nr_kernel_pages += freesize;
/* Charge for highmem memmap if there are enough kernel pages */
@@ -2547,22 +2540,6 @@ void *__init alloc_large_system_hash(const char *tablename,
return table;
}

-/**
- * set_dma_reserve - set the specified number of pages reserved in the first zone
- * @new_dma_reserve: The number of pages to mark reserved
- *
- * The per-cpu batchsize and zone watermarks are determined by managed_pages.
- * In the DMA zone, a significant percentage may be consumed by kernel image
- * and other unfreeable allocations which can skew the watermarks badly. This
- * function may optionally be used to account for unfreeable pages in the
- * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and
- * smaller per-cpu batchsize.
- */
-void __init set_dma_reserve(unsigned long new_dma_reserve)
-{
- dma_reserve = new_dma_reserve;
-}
-
void __init memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
--
2.41.0


2024-03-18 14:22:56

by Baoquan He

Subject: [PATCH 2/6] x86: remove memblock_find_dma_reserve()

This is not needed any more.

Signed-off-by: Baoquan He <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 -
arch/x86/kernel/setup.c | 2 --
arch/x86/mm/init.c | 45 ----------------------------------
3 files changed, 48 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 315535ffb258..cefc7a84f7a4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1200,7 +1200,6 @@ static inline int pgd_none(pgd_t pgd)
extern int direct_gbpages;
void init_mem_mapping(void);
void early_alloc_pgt_buf(void);
-extern void memblock_find_dma_reserve(void);
void __init poking_init(void);
unsigned long init_memory_mapping(unsigned long start,
unsigned long end, pgprot_t prot);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3e1e96efadfe..5aa00938051f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1106,8 +1106,6 @@ void __init setup_arch(char **cmdline_p)
*/
arch_reserve_crashkernel();

- memblock_find_dma_reserve();
-
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 5209549e8192..615f0bf4bda6 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -990,51 +990,6 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
}
#endif

-/*
- * Calculate the precise size of the DMA zone (first 16 MB of RAM),
- * and pass it to the MM layer - to help it set zone watermarks more
- * accurately.
- *
- * Done on 64-bit systems only for the time being, although 32-bit systems
- * might benefit from this as well.
- */
-void __init memblock_find_dma_reserve(void)
-{
-#ifdef CONFIG_X86_64
- u64 nr_pages = 0, nr_free_pages = 0;
- unsigned long start_pfn, end_pfn;
- phys_addr_t start_addr, end_addr;
- int i;
- u64 u;
-
- /*
- * Iterate over all memory ranges (free and reserved ones alike),
- * to calculate the total number of pages in the first 16 MB of RAM:
- */
- nr_pages = 0;
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
- start_pfn = min(start_pfn, MAX_DMA_PFN);
- end_pfn = min(end_pfn, MAX_DMA_PFN);
-
- nr_pages += end_pfn - start_pfn;
- }
-
- /*
- * Iterate over free memory ranges to calculate the number of free
- * pages in the DMA zone, while not counting potential partial
- * pages at the beginning or the end of the range:
- */
- nr_free_pages = 0;
- for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
- start_pfn = min_t(unsigned long, PFN_UP(start_addr), MAX_DMA_PFN);
- end_pfn = min_t(unsigned long, PFN_DOWN(end_addr), MAX_DMA_PFN);
-
- if (start_pfn < end_pfn)
- nr_free_pages += end_pfn - start_pfn;
- }
-#endif
-}
-
void __init zone_sizes_init(void)
{
unsigned long max_zone_pfns[MAX_NR_ZONES];
--
2.41.0


2024-03-18 14:23:10

by Baoquan He

Subject: [PATCH 3/6] mm/mm_init.c: add new function calc_nr_kernel_pages()

This is a preparation for calculating nr_kernel_pages and nr_all_pages,
both of which will be used later in alloc_large_system_hash().

nr_all_pages counts all free but not reserved memory in the memblock
allocator, including HIGHMEM memory, while nr_kernel_pages counts all
free but not reserved low memory, i.e. excluding HIGHMEM memory.
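
For reference, the consumer in alloc_large_system_hash() looks roughly
like the excerpt below (lightly trimmed from mm/mm_init.c at this
series' base; the arch_reserved_kernel_pages() subtraction is removed by
a later patch in this series):

        if (!numentries) {
                /* round applicable memory size up to nearest megabyte */
                numentries = nr_kernel_pages;
                numentries -= arch_reserved_kernel_pages();

                /* It isn't necessary when PAGE_SIZE >= 1MB */
                if (PAGE_SIZE < SZ_1M)
                        numentries = round_up(numentries, (1 << 20) / PAGE_SIZE);
                ...
        }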

Signed-off-by: Baoquan He <[email protected]>
---
mm/mm_init.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 153fb2dc666f..c57a7fc97a16 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1264,6 +1264,30 @@ static void __init reset_memoryless_node_totalpages(struct pglist_data *pgdat)
pr_debug("On node %d totalpages: 0\n", pgdat->node_id);
}

+static void __init calc_nr_kernel_pages(void)
+{
+ unsigned long start_pfn, end_pfn;
+ phys_addr_t start_addr, end_addr;
+ u64 u;
+#ifdef CONFIG_HIGHMEM
+ unsigned long high_zone_low = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
+#endif
+
+ for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
+ start_pfn = PFN_UP(start_addr);
+ end_pfn = PFN_DOWN(end_addr);
+
+ if (start_pfn < end_pfn) {
+ nr_all_pages += end_pfn - start_pfn;
+#ifdef CONFIG_HIGHMEM
+ start_pfn = clamp(start_pfn, 0, high_zone_low);
+ end_pfn = clamp(end_pfn, 0, high_zone_low);
+#endif
+ nr_kernel_pages += end_pfn - start_pfn;
+ }
+ }
+}
+
static void __init calculate_node_totalpages(struct pglist_data *pgdat,
unsigned long node_start_pfn,
unsigned long node_end_pfn)
--
2.41.0


2024-03-18 14:23:53

by Baoquan He

Subject: [PATCH 6/6] mm/mm_init.c: remove arch_reserved_kernel_pages()

Since calc_nr_kernel_pages() walks only the free memblock ranges, memory
reserved in memblock, including the powerpc fadump reservation that
arch_reserved_kernel_pages() used to report, is already excluded from
its count. So there is no need for arch_reserved_kernel_pages() any more.

Signed-off-by: Baoquan He <[email protected]>
---
arch/powerpc/include/asm/mmu.h | 4 ----
arch/powerpc/kernel/fadump.c | 5 -----
include/linux/mm.h | 3 ---
mm/mm_init.c | 12 ------------
4 files changed, 24 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index 3b72c7ed24cf..aa5c0fd5edb1 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -406,9 +406,5 @@ extern void *abatron_pteptrs[2];
#include <asm/nohash/mmu.h>
#endif

-#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP)
-#define __HAVE_ARCH_RESERVED_KERNEL_PAGES
-#endif
-
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_MMU_H_ */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d14eda1e8589..ae8c7619e597 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1735,8 +1735,3 @@ static void __init fadump_reserve_crash_area(u64 base)
memblock_reserve(mstart, msize);
}
}
-
-unsigned long __init arch_reserved_kernel_pages(void)
-{
- return memblock_reserved_size() / PAGE_SIZE;
-}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1888b1935103..8147b1302413 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3221,9 +3221,6 @@ static inline void show_mem(void)
extern long si_mem_available(void);
extern void si_meminfo(struct sysinfo * val);
extern void si_meminfo_node(struct sysinfo *val, int nid);
-#ifdef __HAVE_ARCH_RESERVED_KERNEL_PAGES
-extern unsigned long arch_reserved_kernel_pages(void);
-#endif

extern __printf(3, 4)
void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9ed4b9e77c4a..370a057dae97 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2383,17 +2383,6 @@ void __init page_alloc_init_late(void)
page_alloc_sysctl_init();
}

-#ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
-/*
- * Returns the number of pages that arch has reserved but
- * is not known to alloc_large_system_hash().
- */
-static unsigned long __init arch_reserved_kernel_pages(void)
-{
- return 0;
-}
-#endif
-
/*
* Adaptive scale is meant to reduce sizes of hash tables on large memory
* machines. As memory size is increased the scale is also increased but at
@@ -2436,7 +2425,6 @@ void *__init alloc_large_system_hash(const char *tablename,
if (!numentries) {
/* round applicable memory size up to nearest megabyte */
numentries = nr_kernel_pages;
- numentries -= arch_reserved_kernel_pages();

/* It isn't necessary when PAGE_SIZE >= 1MB */
if (PAGE_SIZE < SZ_1M)
--
2.41.0


2024-03-18 14:24:28

by Baoquan He

Subject: [PATCH 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

Currently, in free_area_init_core(), when initializing a zone's fields,
a rough value is set for zone->managed_pages. That value is calculated
as (zone->present_pages - memmap_pages).

In the meantime, that value is added to nr_all_pages and nr_kernel_pages,
which represent all free pages of the system (including HIGHMEM, and low
memory only, respectively). Both of them are going to be used in
alloc_large_system_hash().

However, the rough calculation and setting of zone->managed_pages is
meaningless because
a) memmap pages are allocated in units of node in sparse_init() or
   alloc_node_mem_map(pgdat); the simple (zone->present_pages -
   memmap_pages) is too rough to make sense for a zone;
b) the early zone->managed_pages will be zeroed out and reset with the
   actual value in mem_init() via memblock_free_all(). Before the
   resetting, no buddy allocation request is issued.

Here, remove the meaningless and complicated calculation of
(zone->present_pages - memmap_pages), directly set zone->present_pages to
zone->managed_pages. It will be adjusted in mem_init().

And also remove the assignment of nr_all_pages and nr_kernel_pages in
free_area_init_core(). Instead, call the newly added calc_nr_kernel_pages()
to count up all free but not reserved memory in memblock and assign the
results to nr_all_pages and nr_kernel_pages. That counting excludes
memmap_pages and other kernel-used data, which is more accurate and
simpler than the old way, and also covers the arch_reserved_kernel_pages()
case required by ppc.
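
For reference on point b), the actual accumulation happens when memblock
hands pages to the buddy allocator; a trimmed sketch of
__free_pages_core(), paraphrased from mm/page_alloc.c at this series'
base (illustrative, not an exact quote):

void __init __free_pages_core(struct page *page, unsigned int order)
{
        unsigned int nr_pages = 1 << order;
        struct page *p = page;
        unsigned int loop;

        /* Expose the pages to the buddy allocator ... */
        for (loop = 0; loop < nr_pages; loop++, p++) {
                __ClearPageReserved(p);
                set_page_count(p, 0);
        }

        /* ... and only here does zone->managed_pages get its real value. */
        atomic_long_add(nr_pages, &page_zone(page)->managed_pages);

        __free_pages_ok(page, order, FPI_TO_TAIL);
}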

Signed-off-by: Baoquan He <[email protected]>
---
mm/mm_init.c | 38 ++++++--------------------------------
1 file changed, 6 insertions(+), 32 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index c57a7fc97a16..55a2b886b7a6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1584,41 +1584,14 @@ static void __init free_area_init_core(struct pglist_data *pgdat)

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, freesize, memmap_pages;
-
- size = zone->spanned_pages;
- freesize = zone->present_pages;
-
- /*
- * Adjust freesize so that it accounts for how much memory
- * is used by this zone for memmap. This affects the watermark
- * and per-cpu initialisations
- */
- memmap_pages = calc_memmap_size(size, freesize);
- if (!is_highmem_idx(j)) {
- if (freesize >= memmap_pages) {
- freesize -= memmap_pages;
- if (memmap_pages)
- pr_debug(" %s zone: %lu pages used for memmap\n",
- zone_names[j], memmap_pages);
- } else
- pr_warn(" %s zone: %lu memmap pages exceeds freesize %lu\n",
- zone_names[j], memmap_pages, freesize);
- }
-
- if (!is_highmem_idx(j))
- nr_kernel_pages += freesize;
- /* Charge for highmem memmap if there are enough kernel pages */
- else if (nr_kernel_pages > memmap_pages * 2)
- nr_kernel_pages -= memmap_pages;
- nr_all_pages += freesize;
+ unsigned long size = zone->spanned_pages;

/*
- * Set an approximate value for lowmem here, it will be adjusted
- * when the bootmem allocator frees pages into the buddy system.
- * And all highmem pages will be managed by the buddy system.
+ * Set zone->managed_pages as zone->present_pages roughly; it will
+ * be zeroed out and reset when the memblock allocator frees pages
+ * into the buddy system.
*/
- zone_init_internals(zone, j, nid, freesize);
+ zone_init_internals(zone, j, nid, zone->present_pages);

if (!size)
continue;
@@ -1915,6 +1888,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
check_for_memory(pgdat);
}

+ calc_nr_kernel_pages();
memmap_init();

/* disable hash distribution for systems with a single node */
--
2.41.0


2024-03-18 14:24:33

by Baoquan He

Subject: [PATCH 5/6] mm/mm_init.c: remove unneeded calc_memmap_size()

Nobody calls calc_memmap_size() now.

Signed-off-by: Baoquan He <[email protected]>
---
mm/mm_init.c | 20 --------------------
1 file changed, 20 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 55a2b886b7a6..9ed4b9e77c4a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1331,26 +1331,6 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
}

-static unsigned long __init calc_memmap_size(unsigned long spanned_pages,
- unsigned long present_pages)
-{
- unsigned long pages = spanned_pages;
-
- /*
- * Provide a more accurate estimation if there are holes within
- * the zone and SPARSEMEM is in use. If there are holes within the
- * zone, each populated memory region may cost us one or two extra
- * memmap pages due to alignment because memmap pages for each
- * populated regions may not be naturally aligned on page boundary.
- * So the (present_pages >> 4) heuristic is a tradeoff for that.
- */
- if (spanned_pages > present_pages + (present_pages >> 4) &&
- IS_ENABLED(CONFIG_SPARSEMEM))
- pages = present_pages;
-
- return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void pgdat_init_split_queue(struct pglist_data *pgdat)
{
--
2.41.0


2024-03-19 15:50:34

by Mike Rapoport

Subject: Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()

Hi Baoquan,

On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> This is not needed any more.

I'd swap this and the first patch, so that the first patch removes
memblock_find_dma_reserve() and its changelog explains why it's not
needed, and then the second patch simply drops the unused set_dma_reserve().

> Signed-off-by: Baoquan He <[email protected]>
> ---
> arch/x86/include/asm/pgtable.h | 1 -
> arch/x86/kernel/setup.c | 2 --
> arch/x86/mm/init.c | 45 ----------------------------------
> 3 files changed, 48 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 315535ffb258..cefc7a84f7a4 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1200,7 +1200,6 @@ static inline int pgd_none(pgd_t pgd)
> extern int direct_gbpages;
> void init_mem_mapping(void);
> void early_alloc_pgt_buf(void);
> -extern void memblock_find_dma_reserve(void);
> void __init poking_init(void);
> unsigned long init_memory_mapping(unsigned long start,
> unsigned long end, pgprot_t prot);
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3e1e96efadfe..5aa00938051f 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1106,8 +1106,6 @@ void __init setup_arch(char **cmdline_p)
> */
> arch_reserve_crashkernel();
>
> - memblock_find_dma_reserve();
> -
> if (!early_xdbc_setup_hardware())
> early_xdbc_register_console();
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 5209549e8192..615f0bf4bda6 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -990,51 +990,6 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
> }
> #endif
>
> -/*
> - * Calculate the precise size of the DMA zone (first 16 MB of RAM),
> - * and pass it to the MM layer - to help it set zone watermarks more
> - * accurately.
> - *
> - * Done on 64-bit systems only for the time being, although 32-bit systems
> - * might benefit from this as well.
> - */
> -void __init memblock_find_dma_reserve(void)
> -{
> -#ifdef CONFIG_X86_64
> - u64 nr_pages = 0, nr_free_pages = 0;
> - unsigned long start_pfn, end_pfn;
> - phys_addr_t start_addr, end_addr;
> - int i;
> - u64 u;
> -
> - /*
> - * Iterate over all memory ranges (free and reserved ones alike),
> - * to calculate the total number of pages in the first 16 MB of RAM:
> - */
> - nr_pages = 0;
> - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> - start_pfn = min(start_pfn, MAX_DMA_PFN);
> - end_pfn = min(end_pfn, MAX_DMA_PFN);
> -
> - nr_pages += end_pfn - start_pfn;
> - }
> -
> - /*
> - * Iterate over free memory ranges to calculate the number of free
> - * pages in the DMA zone, while not counting potential partial
> - * pages at the beginning or the end of the range:
> - */
> - nr_free_pages = 0;
> - for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) {
> - start_pfn = min_t(unsigned long, PFN_UP(start_addr), MAX_DMA_PFN);
> - end_pfn = min_t(unsigned long, PFN_DOWN(end_addr), MAX_DMA_PFN);
> -
> - if (start_pfn < end_pfn)
> - nr_free_pages += end_pfn - start_pfn;
> - }
> -#endif
> -}
> -
> void __init zone_sizes_init(void)
> {
> unsigned long max_zone_pfns[MAX_NR_ZONES];
> --
> 2.41.0
>

--
Sincerely yours,
Mike.

2024-03-19 16:25:58

by Mike Rapoport

Subject: Re: [PATCH 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

On Mon, Mar 18, 2024 at 10:21:36PM +0800, Baoquan He wrote:
> Currently, in free_area_init_core(), when initialize zone's field, a
> rough value is set to zone->managed_pages. That value is calculated by
> (zone->present_pages - memmap_pages).
>
> In the meantime, add the value to nr_all_pages and nr_kernel_pages which
> represent all free pages of system (only low memory or including HIGHMEM
> memory separately). Both of them are gonna be used in
> alloc_large_system_hash().
>
> However, the rough calculation and setting of zone->managed_pages is
> meaningless because
> a) memmap pages are allocated on units of node in sparse_init() or
> alloc_node_mem_map(pgdat); The simple (zone->present_pages -
> memmap_pages) is too rough to make sense for zone;
> b) the set zone->managed_pages will be zeroed out and reset with
> acutal value in mem_init() via memblock_free_all(). Before the
> resetting, no buddy allocation request is issued.
>
> Here, remove the meaningless and complicated calculation of
> (zone->present_pages - memmap_pages), directly set zone->present_pages to
> zone->managed_pages. It will be adjusted in mem_init().

Do you mean "set zone->managed_pages to zone->present_pages"?

I think we can just set zone->managed_pages to 0 in free_area_init_core().
Anyway it will be reset before the first use.

> And also remove the assignment of nr_all_pages and nr_kernel_pages in
> free_area_init_core(). Instead, call the newly added calc_nr_kernel_pages()
> to count up all free but not reserved memory in memblock and assign to
> nr_all_pages and nr_kernel_pages. The counting excludes memmap_pages,
> and other kernel used data, which is more accurate than old way and
> simpler, and can also cover the ppc required arch_reserved_kernel_pages()
> case.
>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/mm_init.c | 38 ++++++--------------------------------
> 1 file changed, 6 insertions(+), 32 deletions(-)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index c57a7fc97a16..55a2b886b7a6 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1584,41 +1584,14 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
>
> for (j = 0; j < MAX_NR_ZONES; j++) {
> struct zone *zone = pgdat->node_zones + j;
> - unsigned long size, freesize, memmap_pages;
> -
> - size = zone->spanned_pages;
> - freesize = zone->present_pages;
> -
> - /*
> - * Adjust freesize so that it accounts for how much memory
> - * is used by this zone for memmap. This affects the watermark
> - * and per-cpu initialisations
> - */
> - memmap_pages = calc_memmap_size(size, freesize);
> - if (!is_highmem_idx(j)) {
> - if (freesize >= memmap_pages) {
> - freesize -= memmap_pages;
> - if (memmap_pages)
> - pr_debug(" %s zone: %lu pages used for memmap\n",
> - zone_names[j], memmap_pages);
> - } else
> - pr_warn(" %s zone: %lu memmap pages exceeds freesize %lu\n",
> - zone_names[j], memmap_pages, freesize);
> - }
> -
> - if (!is_highmem_idx(j))
> - nr_kernel_pages += freesize;
> - /* Charge for highmem memmap if there are enough kernel pages */
> - else if (nr_kernel_pages > memmap_pages * 2)
> - nr_kernel_pages -= memmap_pages;
> - nr_all_pages += freesize;
> + unsigned long size = zone->spanned_pages;
>
> /*
> - * Set an approximate value for lowmem here, it will be adjusted
> - * when the bootmem allocator frees pages into the buddy system.
> - * And all highmem pages will be managed by the buddy system.
> + * Set the zone->managed_pages as zone->present_pages roughly, it
> + * be zeroed out and reset when memblock allocator frees pages into
> + * buddy system.
> */
> - zone_init_internals(zone, j, nid, freesize);
> + zone_init_internals(zone, j, nid, zone->present_pages);
>
> if (!size)
> continue;
> @@ -1915,6 +1888,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> check_for_memory(pgdat);
> }
>
> + calc_nr_kernel_pages();
> memmap_init();
>
> /* disable hash distribution for systems with a single node */
> --
> 2.41.0
>

--
Sincerely yours,
Mike.

2024-03-20 07:53:12

by Baoquan He

Subject: Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()

On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> Hi Baoquan,
>
> On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > This is not needed any more.
>
> I'd swap this and the first patch, so that the first patch would remove
> memblock_find_dma_reserve() and it's changelog will explain why it's not
> needed and then the second patch will simply drop unused set_dma_reserve()

Thanks, Mike.

My thinking on the patch 1/2 split is:
patch 1 removes all the relevant code in mm, including the use of
dma_reserve in free_area_init_core() and the set_dma_reserve() interface
exported for any arch that wants to subtract dma_reserve from its DMA
zone.

Patch 2 purely removes the x86 code that computes dma_reserve.

Your suggestion also works for me; I can rearrange the order and
repost.


2024-03-20 08:18:48

by Baoquan He

Subject: Re: [PATCH 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

On 03/19/24 at 06:17pm, Mike Rapoport wrote:
> On Mon, Mar 18, 2024 at 10:21:36PM +0800, Baoquan He wrote:
> > Currently, in free_area_init_core(), when initialize zone's field, a
> > rough value is set to zone->managed_pages. That value is calculated by
> > (zone->present_pages - memmap_pages).
> >
> > In the meantime, add the value to nr_all_pages and nr_kernel_pages which
> > represent all free pages of system (only low memory or including HIGHMEM
> > memory separately). Both of them are gonna be used in
> > alloc_large_system_hash().
> >
> > However, the rough calculation and setting of zone->managed_pages is
> > meaningless because
> > a) memmap pages are allocated on units of node in sparse_init() or
> > alloc_node_mem_map(pgdat); The simple (zone->present_pages -
> > memmap_pages) is too rough to make sense for zone;
> > b) the set zone->managed_pages will be zeroed out and reset with
> > acutal value in mem_init() via memblock_free_all(). Before the
> > resetting, no buddy allocation request is issued.
> >
> > Here, remove the meaningless and complicated calculation of
> > (zone->present_pages - memmap_pages), directly set zone->present_pages to
> > zone->managed_pages. It will be adjusted in mem_init().
>
> Do you mean "set zone->managed_pages to zone->present_pages"?

Hmm, maybe 'set zone->managed_pages as zone->present_pages'
or
'assign zone->present_pages to zone->managed_pages',
which is more precise.

Will update.

>
> I think we can just set zone->managed_pages to 0 in free_area_init_core().
> Anyway it will be reset before the first use.

Yeah, setting it to 0 is also fine; I did consider 0. But
zone->present_pages is a closer value to the actual zone->managed_pages
than 0 is, and it may be needed in some way before mem_init() in the
future. If there's no strong objection, I will keep assigning
'zone->present_pages' to 'zone->managed_pages'.

Thanks again for the careful review.


2024-03-20 08:54:39

by Baoquan He

Subject: Re: [PATCH 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

On 03/20/24 at 04:18pm, Baoquan He wrote:
> On 03/19/24 at 06:17pm, Mike Rapoport wrote:
> > On Mon, Mar 18, 2024 at 10:21:36PM +0800, Baoquan He wrote:
> > > Currently, in free_area_init_core(), when initialize zone's field, a
> > > rough value is set to zone->managed_pages. That value is calculated by
> > > (zone->present_pages - memmap_pages).
> > >
> > > In the meantime, add the value to nr_all_pages and nr_kernel_pages which
> > > represent all free pages of system (only low memory or including HIGHMEM
> > > memory separately). Both of them are gonna be used in
> > > alloc_large_system_hash().
> > >
> > > However, the rough calculation and setting of zone->managed_pages is
> > > meaningless because
> > > a) memmap pages are allocated on units of node in sparse_init() or
> > > alloc_node_mem_map(pgdat); The simple (zone->present_pages -
> > > memmap_pages) is too rough to make sense for zone;
> > > b) the set zone->managed_pages will be zeroed out and reset with
> > > acutal value in mem_init() via memblock_free_all(). Before the
> > > resetting, no buddy allocation request is issued.
> > >
> > > Here, remove the meaningless and complicated calculation of
> > > (zone->present_pages - memmap_pages), directly set zone->present_pages to
> > > zone->managed_pages. It will be adjusted in mem_init().
> >
> > Do you mean "set zone->managed_pages to zone->present_pages"?
>
> Hmm, maybe 'set zone->managed_pages as zone->present_pages'
> or
> 'assign zone->present_pages to zone->managed_pages'
> which is more precise.
>
> Wwill update.
>
> >
> > I think we can just set zone->managed_pages to 0 in free_area_init_core().
> > Anyway it will be reset before the first use.

Rethinking about this, it's better to set zone->managed_pages to 0,
because no page has been added to the buddy allocator yet. Will update.

>
> Yeah, setting to 0 is also fine. I thougt of 0 ever. Considering
> zone->present_pages is closer value to actual zone->managed_pages
> than 0, and it may be needed in the future in some way before
> mem_init(). If no strong objection, I will keep the assigning
> 'zone->present_pages' to 'zone->managed_pages'.
>
> Thanks again for careful reviewing.
>


2024-03-20 09:38:02

by Mike Rapoport

Subject: Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()

On Wed, Mar 20, 2024 at 03:52:52PM +0800, Baoquan He wrote:
> On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> > Hi Baoquan,
> >
> > On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > > This is not needed any more.
> >
> > I'd swap this and the first patch, so that the first patch would remove
> > memblock_find_dma_reserve() and it's changelog will explain why it's not
> > needed and then the second patch will simply drop unused set_dma_reserve()
>
> Thanks, Mike.
>
> My thought on the patch 1/2 splitting is:
> patch 1 is removing all relevant codes in mm, including the usage of
> dma_reserve in free_area_init_core() and exporting set_dma_reserve()
> to any ARCH which want to subtract the dma_reserve from DMA zone.
>
> Patch 2 purely remove the code in x86 ARCH about how to get dma_reserve.

I think it's better first to remove the usage of set_dma_reserve() in x86
and then clean up the unused code.

> Your suggestion is also good to me, I can rearrange the order and
> repost.

--
Sincerely yours,
Mike.

2024-03-20 13:14:49

by Baoquan He

Subject: Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()

On 03/20/24 at 11:36am, Mike Rapoport wrote:
> On Wed, Mar 20, 2024 at 03:52:52PM +0800, Baoquan He wrote:
> > On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> > > Hi Baoquan,
> > >
> > > On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > > > This is not needed any more.
> > >
> > > I'd swap this and the first patch, so that the first patch would remove
> > > memblock_find_dma_reserve() and it's changelog will explain why it's not
> > > needed and then the second patch will simply drop unused set_dma_reserve()
> >
> > Thanks, Mike.
> >
> > My thought on the patch 1/2 splitting is:
> > patch 1 is removing all relevant codes in mm, including the usage of
> > dma_reserve in free_area_init_core() and exporting set_dma_reserve()
> > to any ARCH which want to subtract the dma_reserve from DMA zone.
> >
> > Patch 2 purely remove the code in x86 ARCH about how to get dma_reserve.
>
> I think it's better first to remove the usage of set_dma_reserve() in x86
> and then clean up the unused code.

OK, removing the only user first sounds reasonable. Will change.
Thanks.