2018-07-18 12:48:39

by Oscar Salvador

Subject: [PATCH 0/3] Re-structure free_area_init_node / free_area_init_core

From: Oscar Salvador <[email protected]>

When free_area_init_node()->free_area_init_core() get called
from the memhotplug path, only some of the operations they perform
actually need to run.

This patchset intends to make it clearer what gets executed when
those two functions are called, depending on the context
(memhotplug vs. non-memhotplug path).
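
For context, the two call paths look roughly like this. This is a
trimmed sketch based on the 4.18-era sources (not part of the
patchset), so details may differ:

/* Boot path: mm/page_alloc.c. Every node gets initialized with its
 * real pfn range.
 */
void __init free_area_init_nodes(unsigned long *max_zone_pfn)
{
	...
	for_each_online_node(nid) {
		pg_data_t *pgdat = NODE_DATA(nid);

		free_area_init_node(nid, NULL,
				    find_min_pfn_for_node(nid), NULL);
		...
	}
}

/* Memhotplug path: mm/memory_hotplug.c. The node is still offline
 * and all zone sizes are zero at this point.
 */
static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
{
	unsigned long zones_size[MAX_NR_ZONES] = {0};
	unsigned long zholes_size[MAX_NR_ZONES] = {0};
	unsigned long start_pfn = PFN_DOWN(start);
	...
	free_area_init_node(nid, zones_size, start_pfn, zholes_size);
	...
}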


I tested it on x86_64 / powerpc and I did not see anything wrong there.
But some feedback would be appreciated.

We might also conclude that we can live with the code as it is now.

Oscar Salvador (3):
mm/page_alloc: Move ifdefery out of free_area_init_core
mm/page_alloc: Refactor free_area_init_core
mm/page_alloc: Split context in free_area_init_node

mm/page_alloc.c | 181 +++++++++++++++++++++++++++++++++++---------------------
1 file changed, 114 insertions(+), 67 deletions(-)

--
2.13.6



2018-07-18 12:48:44

by Oscar Salvador

Subject: [PATCH 1/3] mm/page_alloc: Move ifdefery out of free_area_init_core

From: Oscar Salvador <[email protected]>

Moving the #ifdefs out of the function makes it easier to follow.

Signed-off-by: Oscar Salvador <[email protected]>
---
mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++++++-------------
1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e357189cd24a..8a73305f7c55 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6206,6 +6206,37 @@ static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
}

+#ifdef CONFIG_NUMA_BALANCING
+static void pgdat_init_numabalancing(struct pglist_data *pgdat)
+{
+ spin_lock_init(&pgdat->numabalancing_migrate_lock);
+ pgdat->numabalancing_migrate_nr_pages = 0;
+ pgdat->numabalancing_migrate_next_window = jiffies;
+}
+#else
+static void pgdat_init_numabalancing(struct pglist_data *pgdat) {}
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pgdat_init_split_queue(struct pglist_data *pgdat)
+{
+ spin_lock_init(&pgdat->split_queue_lock);
+ INIT_LIST_HEAD(&pgdat->split_queue);
+ pgdat->split_queue_len = 0;
+}
+#else
+static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
+#endif
+
+#ifdef CONFIG_COMPACTION
+static void pgdat_init_kcompactd(struct pglist_data *pgdat)
+{
+ init_waitqueue_head(&pgdat->kcompactd_wait);
+}
+#else
+static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
+#endif
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -6220,21 +6251,14 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
int nid = pgdat->node_id;

pgdat_resize_init(pgdat);
-#ifdef CONFIG_NUMA_BALANCING
- spin_lock_init(&pgdat->numabalancing_migrate_lock);
- pgdat->numabalancing_migrate_nr_pages = 0;
- pgdat->numabalancing_migrate_next_window = jiffies;
-#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- spin_lock_init(&pgdat->split_queue_lock);
- INIT_LIST_HEAD(&pgdat->split_queue);
- pgdat->split_queue_len = 0;
-#endif
+
+ pgdat_init_numabalancing(pgdat);
+ pgdat_init_split_queue(pgdat);
+ pgdat_init_kcompactd(pgdat);
+
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
-#ifdef CONFIG_COMPACTION
- init_waitqueue_head(&pgdat->kcompactd_wait);
-#endif
+
pgdat_page_ext_init(pgdat);
spin_lock_init(&pgdat->lru_lock);
lruvec_init(node_lruvec(pgdat));
--
2.13.6


2018-07-18 12:48:49

by Oscar Salvador

Subject: [PATCH 3/3] mm/page_alloc: Split context in free_area_init_node

From: Oscar Salvador <[email protected]>

If free_area_init_node gets called from memhotplug code,
we do not need to call calculate_node_totalpages(),
as the node has no pages.

The same goes for the deferred initialization, as
memmap_init_zone skips that when the context is MEMMAP_HOTPLUG.
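
For reference, this is roughly the check in memmap_init_zone() that
makes the hotplug context skip the deferred-init bookkeeping (a
trimmed sketch of the 4.18-era code, details may differ):

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		/*
		 * Deferred-init bookkeeping only runs in the
		 * MEMMAP_EARLY (boot) context; MEMMAP_HOTPLUG jumps
		 * straight to initializing the pages.
		 */
		if (context != MEMMAP_EARLY)
			goto not_early;
		...
		/* stop early and defer the rest to later in boot */
		if (!update_defer_init(pgdat, pfn, end_pfn, &nr_initialised))
			break;
		...
	}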

Signed-off-by: Oscar Salvador <[email protected]>
---
mm/page_alloc.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d652a3ad720c..99c342eeb5db 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6388,12 +6388,28 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { }
#endif /* CONFIG_FLAT_NODE_MEM_MAP */

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static void pgdat_set_deferred_range(pg_data_t *pgdat)
+{
+ /*
+ * We start only with one section of pages, more pages are added as
+ * needed until the rest of deferred pages are initialized.
+ */
+ pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
+ pgdat->node_spanned_pages);
+ pgdat->first_deferred_pfn = ULONG_MAX;
+}
+#else
+static void pgdat_set_deferred_range(pg_data_t *pgdat) {}
+#endif
+
void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
unsigned long node_start_pfn, unsigned long *zholes_size)
{
pg_data_t *pgdat = NODE_DATA(nid);
unsigned long start_pfn = 0;
unsigned long end_pfn = 0;
+ bool no_hotplug_context = node_online(nid);

/* pg_data_t should be reset to zero when it's allocated */
WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
@@ -6409,20 +6425,17 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
#else
start_pfn = node_start_pfn;
#endif
- calculate_node_totalpages(pgdat, start_pfn, end_pfn,
- zones_size, zholes_size);

- alloc_node_mem_map(pgdat);
-
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
- /*
- * We start only with one section of pages, more pages are added as
- * needed until the rest of deferred pages are initialized.
+ /* Memhotplug is the only place where free_area_init_node gets called
+ * with the node being still offline.
*/
- pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
- pgdat->node_spanned_pages);
- pgdat->first_deferred_pfn = ULONG_MAX;
-#endif
+ if (no_hotplug_context) {
+ calculate_node_totalpages(pgdat, start_pfn, end_pfn,
+ zones_size, zholes_size);
+ alloc_node_mem_map(pgdat);
+ pgdat_set_deferred_range(pgdat);
+ }
+
free_area_init_core(pgdat);
}

--
2.13.6


2018-07-18 12:49:45

by Oscar Salvador

Subject: [PATCH 2/3] mm/page_alloc: Refactor free_area_init_core

From: Oscar Salvador <[email protected]>

When free_area_init_core gets called from the memhotplug code,
we only need to perform some of the operations in
there.

Since memhotplug code is the only place where free_area_init_core
gets called while the node is still offline, we can better separate
the context it is called from.

This patch re-structures the code for that purpose.
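
The distinction works because the memhotplug path only marks the node
online after the pgdat has been set up. A trimmed sketch of the
4.18-era mm/memory_hotplug.c (details may differ):

int __ref add_memory_resource(int nid, struct resource *res, bool online)
{
	...
	/* ends up calling free_area_init_node() with the node offline */
	pgdat = hotadd_new_pgdat(nid, start);
	...
	node_set_online(nid);
	...
}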

Signed-off-by: Oscar Salvador <[email protected]>
---
mm/page_alloc.c | 94 +++++++++++++++++++++++++++++++--------------------------
1 file changed, 52 insertions(+), 42 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a73305f7c55..d652a3ad720c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6237,6 +6237,40 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
#endif

+static unsigned long calc_remaining_pages(enum zone_type type, unsigned long freesize,
+ unsigned long size)
+{
+ unsigned long memmap_pages = calc_memmap_size(size, freesize);
+
+ if(!is_highmem_idx(type)) {
+ if (freesize >= memmap_pages) {
+ freesize -= memmap_pages;
+ if (memmap_pages)
+ printk(KERN_DEBUG
+ " %s zone: %lu pages used for memmap\n",
+ zone_names[type], memmap_pages);
+ } else
+ pr_warn(" %s zone: %lu pages exceeds freesize %lu\n",
+ zone_names[type], memmap_pages, freesize);
+ }
+
+ /* Account for reserved pages */
+ if (type == 0 && freesize > dma_reserve) {
+ freesize -= dma_reserve;
+ printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
+ zone_names[0], dma_reserve);
+ }
+
+ if (!is_highmem_idx(type))
+ nr_kernel_pages += freesize;
+ /* Charge for highmem memmap if there are enough kernel pages */
+ else if (nr_kernel_pages > memmap_pages * 2)
+ nr_kernel_pages -= memmap_pages;
+ nr_all_pages += freesize;
+
+ return freesize;
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -6249,6 +6283,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
{
enum zone_type j;
int nid = pgdat->node_id;
+ bool no_hotplug_context;

pgdat_resize_init(pgdat);

@@ -6265,45 +6300,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)

pgdat->per_cpu_nodestats = &boot_nodestats;

+ /* Memhotplug is the only place where free_area_init_node gets called
+ * with the node being still offline.
+ */
+ no_hotplug_context = node_online(nid);
+
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, freesize, memmap_pages;
- unsigned long zone_start_pfn = zone->zone_start_pfn;
+ unsigned long size = zone->spanned_pages;
+ unsigned long freesize = zone->present_pages;

- size = zone->spanned_pages;
- freesize = zone->present_pages;
-
- /*
- * Adjust freesize so that it accounts for how much memory
- * is used by this zone for memmap. This affects the watermark
- * and per-cpu initialisations
- */
- memmap_pages = calc_memmap_size(size, freesize);
- if (!is_highmem_idx(j)) {
- if (freesize >= memmap_pages) {
- freesize -= memmap_pages;
- if (memmap_pages)
- printk(KERN_DEBUG
- " %s zone: %lu pages used for memmap\n",
- zone_names[j], memmap_pages);
- } else
- pr_warn(" %s zone: %lu pages exceeds freesize %lu\n",
- zone_names[j], memmap_pages, freesize);
- }
-
- /* Account for reserved pages */
- if (j == 0 && freesize > dma_reserve) {
- freesize -= dma_reserve;
- printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
- zone_names[0], dma_reserve);
- }
-
- if (!is_highmem_idx(j))
- nr_kernel_pages += freesize;
- /* Charge for highmem memmap if there are enough kernel pages */
- else if (nr_kernel_pages > memmap_pages * 2)
- nr_kernel_pages -= memmap_pages;
- nr_all_pages += freesize;
+ if (no_hotplug_context)
+ freesize = calc_remaining_pages(j, freesize, size);

/*
* Set an approximate value for lowmem here, it will be adjusted
@@ -6311,6 +6319,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
* And all highmem pages will be managed by the buddy system.
*/
zone->managed_pages = freesize;
+
#ifdef CONFIG_NUMA
zone->node = nid;
#endif
@@ -6320,13 +6329,14 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
zone_seqlock_init(zone);
zone_pcp_init(zone);

- if (!size)
- continue;
+ if (size && no_hotplug_context) {
+ unsigned long zone_start_pfn = zone->zone_start_pfn;

- set_pageblock_order();
- setup_usemap(pgdat, zone, zone_start_pfn, size);
- init_currently_empty_zone(zone, zone_start_pfn, size);
- memmap_init(size, nid, j, zone_start_pfn);
+ set_pageblock_order();
+ setup_usemap(pgdat, zone, zone_start_pfn, size);
+ init_currently_empty_zone(zone, zone_start_pfn, size);
+ memmap_init(size, nid, j, zone_start_pfn);
+ }
}
}

--
2.13.6


2018-07-18 13:38:01

by Michal Hocko

Subject: Re: [PATCH 2/3] mm/page_alloc: Refactor free_area_init_core

On Wed 18-07-18 14:47:21, [email protected] wrote:
> From: Oscar Salvador <[email protected]>
>
> When free_area_init_core gets called from the memhotplug code,
> we only need to perform some of the operations in
> there.

Which ones? Or, the other way around: which ones do we not want to do,
and why?

> Since memhotplug code is the only place where free_area_init_core
> gets called while the node is still offline, we can better separate
> the context it is called from.

I really do not like this "if the node is offline then only perform
half of the function" approach. It will generate more mess in the
future. If we can split this code out into logical units then let's
do that, but please do not add random ifs for hotplug code paths.
Sooner or later somebody will simply not know what is needed and
what is not.

> This patch re-structures the code for that purpose.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> mm/page_alloc.c | 94 +++++++++++++++++++++++++++++++--------------------------
> 1 file changed, 52 insertions(+), 42 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8a73305f7c55..d652a3ad720c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6237,6 +6237,40 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
> static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
> #endif
>
> +static unsigned long calc_remaining_pages(enum zone_type type, unsigned long freesize,
> + unsigned long size)
> +{
> + unsigned long memmap_pages = calc_memmap_size(size, freesize);
> +
> + if(!is_highmem_idx(type)) {
> + if (freesize >= memmap_pages) {
> + freesize -= memmap_pages;
> + if (memmap_pages)
> + printk(KERN_DEBUG
> + " %s zone: %lu pages used for memmap\n",
> + zone_names[type], memmap_pages);
> + } else
> + pr_warn(" %s zone: %lu pages exceeds freesize %lu\n",
> + zone_names[type], memmap_pages, freesize);
> + }
> +
> + /* Account for reserved pages */
> + if (type == 0 && freesize > dma_reserve) {
> + freesize -= dma_reserve;
> + printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
> + zone_names[0], dma_reserve);
> + }
> +
> + if (!is_highmem_idx(type))
> + nr_kernel_pages += freesize;
> + /* Charge for highmem memmap if there are enough kernel pages */
> + else if (nr_kernel_pages > memmap_pages * 2)
> + nr_kernel_pages -= memmap_pages;
> + nr_all_pages += freesize;
> +
> + return freesize;
> +}
> +
> /*
> * Set up the zone data structures:
> * - mark all pages reserved
> @@ -6249,6 +6283,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> {
> enum zone_type j;
> int nid = pgdat->node_id;
> + bool no_hotplug_context;
>
> pgdat_resize_init(pgdat);
>
> @@ -6265,45 +6300,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>
> pgdat->per_cpu_nodestats = &boot_nodestats;
>
> + /* Memhotplug is the only place where free_area_init_node gets called
> + * with the node being still offline.
> + */
> + no_hotplug_context = node_online(nid);
> +
> for (j = 0; j < MAX_NR_ZONES; j++) {
> struct zone *zone = pgdat->node_zones + j;
> - unsigned long size, freesize, memmap_pages;
> - unsigned long zone_start_pfn = zone->zone_start_pfn;
> + unsigned long size = zone->spanned_pages;
> + unsigned long freesize = zone->present_pages;
>
> - size = zone->spanned_pages;
> - freesize = zone->present_pages;
> -
> - /*
> - * Adjust freesize so that it accounts for how much memory
> - * is used by this zone for memmap. This affects the watermark
> - * and per-cpu initialisations
> - */
> - memmap_pages = calc_memmap_size(size, freesize);
> - if (!is_highmem_idx(j)) {
> - if (freesize >= memmap_pages) {
> - freesize -= memmap_pages;
> - if (memmap_pages)
> - printk(KERN_DEBUG
> - " %s zone: %lu pages used for memmap\n",
> - zone_names[j], memmap_pages);
> - } else
> - pr_warn(" %s zone: %lu pages exceeds freesize %lu\n",
> - zone_names[j], memmap_pages, freesize);
> - }
> -
> - /* Account for reserved pages */
> - if (j == 0 && freesize > dma_reserve) {
> - freesize -= dma_reserve;
> - printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
> - zone_names[0], dma_reserve);
> - }
> -
> - if (!is_highmem_idx(j))
> - nr_kernel_pages += freesize;
> - /* Charge for highmem memmap if there are enough kernel pages */
> - else if (nr_kernel_pages > memmap_pages * 2)
> - nr_kernel_pages -= memmap_pages;
> - nr_all_pages += freesize;
> + if (no_hotplug_context)
> + freesize = calc_remaining_pages(j, freesize, size);
>
> /*
> * Set an approximate value for lowmem here, it will be adjusted
> @@ -6311,6 +6319,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> * And all highmem pages will be managed by the buddy system.
> */
> zone->managed_pages = freesize;
> +
> #ifdef CONFIG_NUMA
> zone->node = nid;
> #endif
> @@ -6320,13 +6329,14 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> zone_seqlock_init(zone);
> zone_pcp_init(zone);
>
> - if (!size)
> - continue;
> + if (size && no_hotplug_context) {
> + unsigned long zone_start_pfn = zone->zone_start_pfn;
>
> - set_pageblock_order();
> - setup_usemap(pgdat, zone, zone_start_pfn, size);
> - init_currently_empty_zone(zone, zone_start_pfn, size);
> - memmap_init(size, nid, j, zone_start_pfn);
> + set_pageblock_order();
> + setup_usemap(pgdat, zone, zone_start_pfn, size);
> + init_currently_empty_zone(zone, zone_start_pfn, size);
> + memmap_init(size, nid, j, zone_start_pfn);
> + }
> }
> }
>
> --
> 2.13.6
>

--
Michal Hocko
SUSE Labs

2018-07-18 13:38:50

by Michal Hocko

Subject: Re: [PATCH 1/3] mm/page_alloc: Move ifdefery out of free_area_init_core

On Wed 18-07-18 14:47:20, [email protected] wrote:
> From: Oscar Salvador <[email protected]>
>
> Moving the #ifdefs out of the function makes it easier to follow.
>
> Signed-off-by: Oscar Salvador <[email protected]>

OK, this makes some sense.
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 37 insertions(+), 13 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e357189cd24a..8a73305f7c55 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6206,6 +6206,37 @@ static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
> return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> }
>
> +#ifdef CONFIG_NUMA_BALANCING
> +static void pgdat_init_numabalancing(struct pglist_data *pgdat)
> +{
> + spin_lock_init(&pgdat->numabalancing_migrate_lock);
> + pgdat->numabalancing_migrate_nr_pages = 0;
> + pgdat->numabalancing_migrate_next_window = jiffies;
> +}
> +#else
> +static void pgdat_init_numabalancing(struct pglist_data *pgdat) {}
> +#endif
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pgdat_init_split_queue(struct pglist_data *pgdat)
> +{
> + spin_lock_init(&pgdat->split_queue_lock);
> + INIT_LIST_HEAD(&pgdat->split_queue);
> + pgdat->split_queue_len = 0;
> +}
> +#else
> +static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
> +#endif
> +
> +#ifdef CONFIG_COMPACTION
> +static void pgdat_init_kcompactd(struct pglist_data *pgdat)
> +{
> + init_waitqueue_head(&pgdat->kcompactd_wait);
> +}
> +#else
> +static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
> +#endif
> +
> /*
> * Set up the zone data structures:
> * - mark all pages reserved
> @@ -6220,21 +6251,14 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
> int nid = pgdat->node_id;
>
> pgdat_resize_init(pgdat);
> -#ifdef CONFIG_NUMA_BALANCING
> - spin_lock_init(&pgdat->numabalancing_migrate_lock);
> - pgdat->numabalancing_migrate_nr_pages = 0;
> - pgdat->numabalancing_migrate_next_window = jiffies;
> -#endif
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - spin_lock_init(&pgdat->split_queue_lock);
> - INIT_LIST_HEAD(&pgdat->split_queue);
> - pgdat->split_queue_len = 0;
> -#endif
> +
> + pgdat_init_numabalancing(pgdat);
> + pgdat_init_split_queue(pgdat);
> + pgdat_init_kcompactd(pgdat);
> +
> init_waitqueue_head(&pgdat->kswapd_wait);
> init_waitqueue_head(&pgdat->pfmemalloc_wait);
> -#ifdef CONFIG_COMPACTION
> - init_waitqueue_head(&pgdat->kcompactd_wait);
> -#endif
> +
> pgdat_page_ext_init(pgdat);
> spin_lock_init(&pgdat->lru_lock);
> lruvec_init(node_lruvec(pgdat));
> --
> 2.13.6
>

--
Michal Hocko
SUSE Labs

2018-07-18 14:14:03

by Oscar Salvador

Subject: Re: [PATCH 2/3] mm/page_alloc: Refactor free_area_init_core

On Wed, Jul 18, 2018 at 03:36:47PM +0200, Michal Hocko wrote:
> On Wed 18-07-18 14:47:21, [email protected] wrote:
> > From: Oscar Salvador <[email protected]>
> >
> > When free_area_init_core gets called from the memhotplug code,
> > we only need to perform some of the operations in
> > there.
>
> Which ones? Or, the other way around: which ones do we not want to do,
> and why?
>
> > Since memhotplug code is the only place where free_area_init_core
> > gets called while the node is still offline, we can better separate
> > the context it is called from.
>
> I really do not like this "if the node is offline then only perform
> half of the function" approach. It will generate more mess in the
> future. If we can split this code out into logical units then let's
> do that, but please do not add random ifs for hotplug code paths.
> Sooner or later somebody will simply not know what is needed and
> what is not.

Yes, you are right.
I gave it another thought and it was not really a good idea.
I still think the code in free_area_init_core can be simplified, though.

I will try to come up with something that makes more sense.

Thanks
--
Oscar Salvador
SUSE L3

2018-07-18 14:14:16

by Pavel Tatashin

Subject: Re: [PATCH 1/3] mm/page_alloc: Move ifdefery out of free_area_init_core

On 18-07-18 14:47:20, [email protected] wrote:
> From: Oscar Salvador <[email protected]>
>
> Moving the #ifdefs out of the function makes it easier to follow.
>
> Signed-off-by: Oscar Salvador <[email protected]>

Hi Oscar,

Reviewed-by: Pavel Tatashin <[email protected]>

Please include the following patch in your series, to get rid of the last
ifdef in this function.

From f841184e141b21e79c4d017d3b7678c697016d2a Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <[email protected]>
Date: Wed, 18 Jul 2018 09:56:52 -0400
Subject: [PATCH] mm: access zone->node via zone_to_nid() and zone_set_nid()

zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
have inline functions to access this field in order to avoid #ifdefs in
.c files.

Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/mm.h | 9 ---------
include/linux/mmzone.h | 26 ++++++++++++++++++++------
mm/mempolicy.c | 4 ++--
mm/mm_init.c | 9 ++-------
mm/page_alloc.c | 10 ++++------
5 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3982c83fdcbf..8634eeafad0b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -940,15 +940,6 @@ static inline int page_zone_id(struct page *page)
return (page->flags >> ZONEID_PGSHIFT) & ZONEID_MASK;
}

-static inline int zone_to_nid(struct zone *zone)
-{
-#ifdef CONFIG_NUMA
- return zone->node;
-#else
- return 0;
-#endif
-}
-
#ifdef NODE_NOT_IN_PAGE_FLAGS
extern int page_to_nid(const struct page *page);
#else
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..aeb6c1499da5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -841,6 +841,25 @@ static inline bool populated_zone(struct zone *zone)
return zone->present_pages;
}

+#ifdef CONFIG_NUMA
+static inline int zone_to_nid(struct zone *zone)
+{
+ return zone->node;
+}
+
+static inline void zone_set_nid(struct zone *zone, int nid)
+{
+ zone->node = nid;
+}
+#else
+static inline int zone_to_nid(struct zone *zone)
+{
+ return 0;
+}
+
+static inline void zone_set_nid(struct zone *zone, int nid) {}
+#endif
+
extern int movable_zone;

#ifdef CONFIG_HIGHMEM
@@ -956,12 +975,7 @@ static inline int zonelist_zone_idx(struct zoneref *zoneref)

static inline int zonelist_node_idx(struct zoneref *zoneref)
{
-#ifdef CONFIG_NUMA
- /* zone_to_nid not available in this context */
- return zoneref->zone->node;
-#else
- return 0;
-#endif /* CONFIG_NUMA */
+ return zone_to_nid(zoneref->zone);
}

struct zoneref *__next_zones_zonelist(struct zoneref *z,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9ac49ef17b4e..9cfccb54d207 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1784,7 +1784,7 @@ unsigned int mempolicy_slab_node(void)
zonelist = &NODE_DATA(node)->node_zonelists[ZONELIST_FALLBACK];
z = first_zones_zonelist(zonelist, highest_zoneidx,
&policy->v.nodes);
- return z->zone ? z->zone->node : node;
+ return z->zone ? zone_to_nid(z->zone) : node;
}

default:
@@ -2326,7 +2326,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
node_zonelist(numa_node_id(), GFP_HIGHUSER),
gfp_zone(GFP_HIGHUSER),
&pol->v.nodes);
- polnid = z->zone->node;
+ polnid = zone_to_nid(z->zone);
break;

default:
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5b72266b4b03..6838a530789b 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -53,13 +53,8 @@ void __init mminit_verify_zonelist(void)
zone->name);

/* Iterate the zonelist */
- for_each_zone_zonelist(zone, z, zonelist, zoneid) {
-#ifdef CONFIG_NUMA
- pr_cont("%d:%s ", zone->node, zone->name);
-#else
- pr_cont("0:%s ", zone->name);
-#endif /* CONFIG_NUMA */
- }
+ for_each_zone_zonelist(zone, z, zonelist, zoneid)
+ pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
pr_cont("\n");
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 790722afa7cf..dbce6c0ced74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2908,10 +2908,10 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
if (!static_branch_likely(&vm_numa_stat_key))
return;

- if (z->node != numa_node_id())
+ if (zone_to_nid(z) != numa_node_id())
local_stat = NUMA_OTHER;

- if (z->node == preferred_zone->node)
+ if (zone_to_nid(z) == zone_to_nid(preferred_zone))
__inc_numa_state(z, NUMA_HIT);
else {
__inc_numa_state(z, NUMA_MISS);
@@ -5280,7 +5280,7 @@ int local_memory_node(int node)
z = first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
gfp_zone(GFP_KERNEL),
NULL);
- return z->zone->node;
+ return zone_to_nid(z->zone);
}
#endif

@@ -6302,9 +6302,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
* And all highmem pages will be managed by the buddy system.
*/
zone->managed_pages = freesize;
-#ifdef CONFIG_NUMA
- zone->node = nid;
-#endif
+ zone_set_nid(nid);
zone->name = zone_names[j];
zone->zone_pgdat = pgdat;
spin_lock_init(&zone->lock);
--
2.18.0



2018-07-18 14:36:21

by Pavel Tatashin

Subject: Re: [PATCH 3/3] mm/page_alloc: Split context in free_area_init_node

On Wed, Jul 18, 2018 at 8:47 AM <[email protected]> wrote:
>
> From: Oscar Salvador <[email protected]>
>
> If free_area_init_node gets called from memhotplug code,
> we do not need to call calculate_node_totalpages(),
> as the node has no pages.

I am not positive this is safe. calculate_node_totalpages() sets some
pgdat fields. Even if those fields are currently always set to zero,
the pgdat may be reused (i.e. the node went offline and later came
back online), so we might still need to reset those fields to zero.
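
For reference, a trimmed sketch of the fields that
calculate_node_totalpages() writes (based on the 4.18-era code;
details may differ):

static void calculate_node_totalpages(struct pglist_data *pgdat, ...)
{
	unsigned long realtotalpages = 0, totalpages = 0;
	enum zone_type i;

	for (i = 0; i < MAX_NR_ZONES; i++) {
		struct zone *zone = pgdat->node_zones + i;
		...
		/* per-zone span/present fields are (re)written here */
		zone->zone_start_pfn = size ? zone_start_pfn : 0;
		zone->spanned_pages = size;
		zone->present_pages = real_size;
		...
	}
	/* ... as are the node-wide totals */
	pgdat->node_spanned_pages = totalpages;
	pgdat->node_present_pages = realtotalpages;
}

Skipping the call on a reused pgdat would leave these fields holding
stale values from the previous online/offline cycle.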

Pavel

2018-07-18 15:12:47

by Oscar Salvador

Subject: Re: [PATCH 2/3] mm/page_alloc: Refactor free_area_init_core

On Wed, Jul 18, 2018 at 04:12:26PM +0200, Oscar Salvador wrote:
> On Wed, Jul 18, 2018 at 03:36:47PM +0200, Michal Hocko wrote:
> > I really do not like this "if the node is offline then only perform
> > half of the function" approach. It will generate more mess in the
> > future. If we can split this code out into logical units then let's
> > do that, but please do not add random ifs for hotplug code paths.
> > Sooner or later somebody will simply not know what is needed and
> > what is not.
>
> Yes, you are right.
> I gave it another thought and it was not really a good idea.
> I still think the code in free_area_init_core can be simplified, though.
>
> I will try to come up with something that makes more sense.

I guess we could do something like:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a73305f7c55..70fe4c80643f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6237,6 +6237,40 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
#endif

+static unsigned long calc_remaining_pages(enum zone_type type, unsigned long freesize,
+ unsigned long size)
+{
+ unsigned long memmap_pages = calc_memmap_size(size, freesize);
+
+ if(!is_highmem_idx(type)) {
+ if (freesize >= memmap_pages) {
+ freesize -= memmap_pages;
+ if (memmap_pages)
+ printk(KERN_DEBUG
+ " %s zone: %lu pages used for memmap\n",
+ zone_names[type], memmap_pages);
+ } else
+ pr_warn(" %s zone: %lu pages exceeds freesize %lu\n",
+ zone_names[type], memmap_pages, freesize);
+ }
+
+ /* Account for reserved pages */
+ if (type == 0 && freesize > dma_reserve) {
+ freesize -= dma_reserve;
+ printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
+ zone_names[0], dma_reserve);
+ }
+
+ if (!is_highmem_idx(type))
+ nr_kernel_pages += freesize;
+ /* Charge for highmem memmap if there are enough kernel pages */
+ else if (nr_kernel_pages > memmap_pages * 2)
+ nr_kernel_pages -= memmap_pages;
+ nr_all_pages += freesize;
+
+ return freesize;
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -6267,43 +6301,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, freesize, memmap_pages;
+ unsigned long size = zone->spanned_pages;
+ unsigned long freesize = zone->present_pages;
unsigned long zone_start_pfn = zone->zone_start_pfn;

- size = zone->spanned_pages;
- freesize = zone->present_pages;
-
- /*
- * Adjust freesize so that it accounts for how much memory
- * is used by this zone for memmap. This affects the watermark
- * and per-cpu initialisations
- */
- memmap_pages = calc_memmap_size(size, freesize);
- if (!is_highmem_idx(j)) {
- if (freesize >= memmap_pages) {
- freesize -= memmap_pages;
- if (memmap_pages)
- printk(KERN_DEBUG
- " %s zone: %lu pages used for memmap\n",
- zone_names[j], memmap_pages);
- } else
- pr_warn(" %s zone: %lu pages exceeds freesize %lu\n",
- zone_names[j], memmap_pages, freesize);
- }
-
- /* Account for reserved pages */
- if (j == 0 && freesize > dma_reserve) {
- freesize -= dma_reserve;
- printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
- zone_names[0], dma_reserve);
- }
-
- if (!is_highmem_idx(j))
- nr_kernel_pages += freesize;
- /* Charge for highmem memmap if there are enough kernel pages */
- else if (nr_kernel_pages > memmap_pages * 2)
- nr_kernel_pages -= memmap_pages;
- nr_all_pages += freesize;
+ if (freesize)
+ freesize = calc_remaining_pages(j, freesize, size);


So we only do the page accounting (nr_kernel_pages, nr_all_pages,
memmap_pages, etc.) if freesize is not 0.
Otherwise it does not make sense to do it (AFAICS).
--
Oscar Salvador
SUSE L3

2018-07-18 15:17:58

by Oscar Salvador

Subject: Re: [PATCH 1/3] mm/page_alloc: Move ifdefery out of free_area_init_core

On Wed, Jul 18, 2018 at 10:11:50AM -0400, Pavel Tatashin wrote:
> On 18-07-18 14:47:20, [email protected] wrote:
> > From: Oscar Salvador <[email protected]>
> >
> > Moving the #ifdefs out of the function makes it easier to follow.
> >
> > Signed-off-by: Oscar Salvador <[email protected]>
>
> Hi Oscar,
>
> Reviewed-by: Pavel Tatashin <[email protected]>
>
> Please include the following patch in your series, to get rid of the last
> ifdef in this function.
>
> From f841184e141b21e79c4d017d3b7678c697016d2a Mon Sep 17 00:00:00 2001
> From: Pavel Tatashin <[email protected]>
> Date: Wed, 18 Jul 2018 09:56:52 -0400
> Subject: [PATCH] mm: access zone->node via zone_to_nid() and zone_set_nid()
>
> zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
> have inline functions to access this field in order to avoid #ifdefs in
> .c files.

Hi Pavel

I will! Thanks for the patch ;-)
--
Oscar Salvador
SUSE L3

2018-07-19 07:36:25

by Oscar Salvador

Subject: Re: [PATCH 3/3] mm/page_alloc: Split context in free_area_init_node

On Wed, Jul 18, 2018 at 10:34:19AM -0400, Pavel Tatashin wrote:
> On Wed, Jul 18, 2018 at 8:47 AM <[email protected]> wrote:
> >
> > From: Oscar Salvador <[email protected]>
> >
> > If free_area_init_node gets called from memhotplug code,
> > we do not need to call calculate_node_totalpages(),
> > as the node has no pages.
>
> I am not positive this is safe. calculate_node_totalpages() sets some
> pgdat fields. Even if those fields are currently always set to zero,
> the pgdat may be reused (i.e. the node went offline and later came
> back online), so we might still need to reset those fields to zero.
>

You are right. I do not know why, but I thought that we were zeroing
the pgdat struct before entering the function.

I will leave that part out.
Since we should only care about deferred pfns during boot, maybe we
can change it to something like:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 70fe4c80643f..89fc8f4240ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6381,6 +6381,21 @@ static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { }
#endif /* CONFIG_FLAT_NODE_MEM_MAP */

+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static void pgdat_set_deferred_range(pg_data_t *pgdat)
+{
+ /*
+ * We start only with one section of pages, more pages are added as
+ * needed until the rest of deferred pages are initialized.
+ */
+ pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
+ pgdat->node_spanned_pages);
+ pgdat->first_deferred_pfn = ULONG_MAX;
+}
+#else
+static void pgdat_set_deferred_range(pg_data_t *pgdat) {}
+#endif
+
void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
unsigned long node_start_pfn, unsigned long *zholes_size)
{
@@ -6402,20 +6417,14 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
#else
start_pfn = node_start_pfn;
#endif
- calculate_node_totalpages(pgdat, start_pfn, end_pfn,
- zones_size, zholes_size);

+ calculate_node_totalpages(pgdat, start_pfn, end_pfn,
+ zones_size, zholes_size);
alloc_node_mem_map(pgdat);

-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
- /*
- * We start only with one section of pages, more pages are added as
- * needed until the rest of deferred pages are initialized.
- */
- pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
- pgdat->node_spanned_pages);
- pgdat->first_deferred_pfn = ULONG_MAX;
-#endif
+ if (system_state == SYSTEM_BOOTING)
+ pgdat_set_deferred_range(pgdat);
+
free_area_init_core(pgdat);
}
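
For reference, system_state is SYSTEM_BOOTING only during early boot.
A trimmed sketch of the 4.18-era include/linux/kernel.h (details may
differ):

extern enum system_states {
	SYSTEM_BOOTING,
	SYSTEM_SCHEDULING,
	SYSTEM_RUNNING,
	SYSTEM_HALT,
	SYSTEM_POWER_OFF,
	SYSTEM_RESTART,
} system_state;

so the check above only triggers for the boot-time call path, never
for the memhotplug one.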

Thanks
--
Oscar Salvador
SUSE L3

2018-07-19 12:20:14

by Oscar Salvador

Subject: Re: [PATCH 1/3] mm/page_alloc: Move ifdefery out of free_area_init_core

On Wed, Jul 18, 2018 at 10:11:50AM -0400, Pavel Tatashin wrote:
> On 18-07-18 14:47:20, [email protected] wrote:
> > From: Oscar Salvador <[email protected]>
> >
> > Moving the #ifdefs out of the function makes it easier to follow.
> >
> > Signed-off-by: Oscar Salvador <[email protected]>
>
> Hi Oscar,
>
> Reviewed-by: Pavel Tatashin <[email protected]>
>
> Please include the following patch in your series, to get rid of the last
> ifdef in this function.

Hi Pavel,

I am about to send v2 with this patch included, but I just wanted to let you know
this:

> + zone_set_nid(nid);

This should be:

zone_set_nid(zone, nid);

I fixed it up in your patch, I hope that is ok.

Thanks
--
Oscar Salvador
SUSE L3

2018-07-19 13:20:57

by Pavel Tatashin

Subject: Re: [PATCH 1/3] mm/page_alloc: Move ifdefery out of free_area_init_core

> + zone_set_nid(nid);
>
> This should be:
>
> zone_set_nid(zone, nid);
>
> I fixed it up in your patch, I hope that is ok.

Yes, thank you. I fixed this when compile-testing the patch, but I must
have forgotten to regenerate it before sending.

Thank you,
Pavel

>
> Thanks
> --
> Oscar Salvador
> SUSE L3
>