2018-08-17 15:43:00

by Oscar Salvador

Subject: [RFC v2 0/2] Do not touch pages in remove_memory path

From: Oscar Salvador <[email protected]>

This patchset moves all zone/page handling from the remove_memory
path back to the offline_pages stage.
This is done for two reasons:

1) We can access stale pages if we remove memory that was never onlined [1]
2) Code consistency

Currently, when we online memory, online_pages() takes care of initializing
the pages and putting the memory range into its corresponding zone,
so the zone/pgdat's spanned/present pages get resized.

But the opposite does not happen when we offline memory.
Only present_pages is decremented, and shrinking the zone/node's
spanned_pages is deferred until we remove that memory.
As explained above, this is wrong.

So this patchset addresses this by moving that handling to the place
where it belongs.

The main difficulty I faced here was with regard to HMM/devm, as they handle
memory hot-add/remove in a particular way, and, more importantly,
the resources as well.

I scratched my head for ideas on how to handle this case, and
after some failed attempts I came up with the idea of checking
res->flags.

Memory resources that go through the "official" memory-hotplug channels
have the IORESOURCE_SYSTEM_RAM flag.
This flag is made of (IORESOURCE_MEM|IORESOURCE_SYSRAM).

HMM/devm, on the other hand, request and release the resources
through devm_request_mem_region/devm_release_mem_region, and
these resources do not contain the IORESOURCE_SYSRAM flag.

So what I ended up doing is checking for IORESOURCE_SYSRAM
in release_mem_region_adjustable.
If we see that a resource does not have this flag, we know that
we are dealing with a resource coming from HMM/devm, and so
we do not need to do anything, as HMM/devm will take care of that part.

I only compiled the code; I have not tested it yet (I will next week),
but I am sending this RFCv2 mainly because I would like to get feedback
and see whether the direction I took is the right one.

This time I left out [2] because I am working on it in a separate patch,
and it does not really belong to this patchset.

[1] https://patchwork.kernel.org/patch/10547445/ (Reported by David)
[2] https://patchwork.kernel.org/patch/10558723/

Oscar Salvador (2):
mm/memory_hotplug: Add nid parameter to arch_remove_memory
mm/memory_hotplug: Shrink spanned pages when offlining memory

arch/ia64/mm/init.c | 6 +-
arch/powerpc/mm/mem.c | 12 +---
arch/s390/mm/init.c | 2 +-
arch/sh/mm/init.c | 6 +-
arch/x86/mm/init_32.c | 6 +-
arch/x86/mm/init_64.c | 10 +--
include/linux/memory_hotplug.h | 11 +++-
kernel/memremap.c | 16 ++---
kernel/resource.c | 16 +++++
mm/hmm.c | 34 +++++-----
mm/memory_hotplug.c | 145 ++++++++++++++++++++++++++---------------
mm/sparse.c | 4 +-
12 files changed, 157 insertions(+), 111 deletions(-)

--
2.13.6



2018-08-17 15:43:00

by Oscar Salvador

Subject: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

From: Oscar Salvador <[email protected]>

Currently, we decrement zone/node spanned_pages when we
remove memory and not when we offline it.

This, besides being inconsistent with the rest of the code,
means that we can access stale pages if that memory never gets
onlined.

In order to prevent that, we have to move all zone/page handling to
the memory-offlining stage.
The memory-removal path should only care about memory sections and memory
blocks.

Another thing to note is that this is not so easy to do,
as HMM/devm handle memory hotplug in a particular way.
They do not go through the common path, and so they call
neither offline_pages() nor online_pages().

All they care about is to add the sections, move the pages to
ZONE_DEVICE, and in some cases, to create the linear mapping.

To make this smoother, two new functions are created
to deal with these particular cases:

del_device_memory
add_device_memory

add_device_memory is in charge of

a) calling either arch_add_memory() or add_pages(), depending on whether
we want a linear mapping
b) onlining the memory sections that correspond to the pfn range
c) calling move_pfn_range_to_zone(), with the zone being ZONE_DEVICE, to
expand zone/pgdat spanned pages and initialize its pages

del_device_memory, on the other hand, is in charge of

a) offlining the memory sections that correspond to the pfn range
b) calling shrink_pages(), which shrinks node/zone spanned pages
c) calling either arch_remove_memory() or __remove_pages(), depending on
whether we need to tear down the linear mapping or not

These two functions are called from:

add_device_memory:
- devm_memremap_pages()
- hmm_devmem_pages_create()

del_device_memory:
- devm_memremap_pages_release()
- hmm_devmem_release()

I think that this will get easier as soon as [1] gets merged.

Finally, shrink_pages() is moved to offline_pages(), so now all
page/zone handling is taken care of in the online/offline_pages stage.

[1] https://lkml.org/lkml/2018/6/19/110

Signed-off-by: Oscar Salvador <[email protected]>
---
arch/ia64/mm/init.c | 4 +-
arch/powerpc/mm/mem.c | 10 +--
arch/sh/mm/init.c | 4 +-
arch/x86/mm/init_32.c | 4 +-
arch/x86/mm/init_64.c | 8 +--
include/linux/memory_hotplug.h | 9 ++-
kernel/memremap.c | 14 ++--
kernel/resource.c | 16 +++++
mm/hmm.c | 32 ++++-----
mm/memory_hotplug.c | 143 +++++++++++++++++++++++++++--------------
mm/sparse.c | 4 +-
11 files changed, 145 insertions(+), 103 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index b54d0ee74b53..30ed58e5e86c 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -666,11 +666,9 @@ int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- struct zone *zone;
int ret;

- zone = page_zone(pfn_to_page(start_pfn));
- ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+ ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
if (ret)
pr_warn("%s: Problem encountered in __remove_pages() as"
" ret=%d\n", __func__, ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 9e17d57a9948..f3b47d02b879 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -146,15 +146,7 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altma
struct page *page;
int ret;

- /*
- * If we have an altmap then we need to skip over any reserved PFNs
- * when querying the zone.
- */
- page = pfn_to_page(start_pfn);
- if (altmap)
- page += vmem_altmap_offset(altmap);
-
- ret = __remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
+ ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
if (ret)
return ret;

diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 5c91bb6abc35..6f6e7d321b1f 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -448,11 +448,9 @@ int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
- struct zone *zone;
int ret;

- zone = page_zone(pfn_to_page(start_pfn));
- ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+ ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
if (unlikely(ret))
pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
ret);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 9fa503f2f56c..88909e88c873 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -865,10 +865,8 @@ int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- struct zone *zone;

- zone = page_zone(pfn_to_page(start_pfn));
- return __remove_pages(zone, start_pfn, nr_pages, altmap);
+ return __remove_pages(nid, start_pfn, nr_pages, altmap);
}
#endif
#endif
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f90fa077b0c4..d04338ffabec 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1152,15 +1152,9 @@ int __ref arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *a
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- struct page *page = pfn_to_page(start_pfn);
- struct zone *zone;
int ret;

- /* With altmap the first mapped page is offset from @start */
- if (altmap)
- page += vmem_altmap_offset(altmap);
- zone = page_zone(page);
- ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+ ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
WARN_ON_ONCE(ret);
kernel_physical_mapping_remove(start, start + size);

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index c68b03fd87bd..e1446dbb37dc 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -109,14 +109,19 @@ static inline bool movable_node_is_enabled(void)
#ifdef CONFIG_MEMORY_HOTREMOVE
extern int arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap);
-extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
+extern int __remove_pages(int nid, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
+extern int del_device_memory(int nid, unsigned long start, unsigned long size,
+ struct vmem_altmap *altmap, bool mapping);
#endif /* CONFIG_MEMORY_HOTREMOVE */

/* reasonably generic interface to expand the physical pages */
extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
struct vmem_altmap *altmap, bool want_memblock);

+extern int add_device_memory(int nid, unsigned long start, unsigned long size,
+ struct vmem_altmap *altmap, bool mapping);
+
#ifndef CONFIG_ARCH_HAS_ADD_PAGES
static inline int add_pages(int nid, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap,
@@ -333,7 +338,7 @@ extern bool is_memblock_offlined(struct memory_block *mem);
extern void remove_memory(int nid, u64 start, u64 size);
extern int sparse_add_one_section(struct pglist_data *pgdat,
unsigned long start_pfn, struct vmem_altmap *altmap);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+extern void sparse_remove_one_section(int nid, struct mem_section *ms,
unsigned long map_offset, struct vmem_altmap *altmap);
extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 7a832b844f24..95df37686196 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -119,6 +119,8 @@ static void devm_memremap_pages_release(void *data)
struct dev_pagemap *pgmap = data;
struct device *dev = pgmap->dev;
struct resource *res = &pgmap->res;
+ struct vmem_altmap *altmap = pgmap->altmap_valid ?
+ &pgmap->altmap : NULL;
resource_size_t align_start, align_size;
unsigned long pfn;
int nid;
@@ -138,8 +140,7 @@ static void devm_memremap_pages_release(void *data)
nid = dev_to_node(dev);

mem_hotplug_begin();
- arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ?
- &pgmap->altmap : NULL);
+ del_device_memory(nid, align_start, align_size, altmap, true);
kasan_remove_zero_shadow(__va(align_start), align_size);
mem_hotplug_done();

@@ -175,7 +176,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
{
resource_size_t align_start, align_size, align_end;
struct vmem_altmap *altmap = pgmap->altmap_valid ?
- &pgmap->altmap : NULL;
+ &pgmap->altmap : NULL;
struct resource *res = &pgmap->res;
unsigned long pfn, pgoff, order;
pgprot_t pgprot = PAGE_KERNEL;
@@ -249,11 +250,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
goto err_kasan;
}

- error = arch_add_memory(nid, align_start, align_size, altmap, false);
- if (!error)
- move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
- align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, altmap);
+ error = add_device_memory(nid, align_start, align_size, altmap, true);
+
mem_hotplug_done();
if (error)
goto err_add_memory;
diff --git a/kernel/resource.c b/kernel/resource.c
index 30e1bc68503b..8e68b5ca67ca 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1262,6 +1262,22 @@ int release_mem_region_adjustable(struct resource *parent,
continue;
}

+ /*
+ * All memory regions added from memory-hotplug path
+ * have the flag IORESOURCE_SYSTEM_RAM.
+ * IORESOURCE_SYSTEM_RAM = (IORESOURCE_MEM|IORESOURCE_SYSRAM)
+ * If the resource does not have this flag, we know that
+ * we are dealing with a resource coming from HMM/devm.
+ * HMM/devm use another mechanism to add/release a resource.
+ * This goes via devm_request_mem_region/devm_release_mem_region.
+ * HMM/devm take care to release their resources when they want, so
+ * if we are dealing with them, let us just back off nicely
+ */
+ if (!(res->flags & IORESOURCE_SYSRAM)) {
+ ret = 0;
+ break;
+ }
+
if (!(res->flags & IORESOURCE_MEM))
break;

diff --git a/mm/hmm.c b/mm/hmm.c
index 21787e480b4a..23ce7fbdb651 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -996,6 +996,7 @@ static void hmm_devmem_release(struct device *dev, void *data)
struct zone *zone;
struct page *page;
int nid;
+ bool mapping;

if (percpu_ref_tryget_live(&devmem->ref)) {
dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
@@ -1012,12 +1013,14 @@ static void hmm_devmem_release(struct device *dev, void *data)

mem_hotplug_begin();
if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
- __remove_pages(zone, start_pfn, npages, NULL);
+ mapping = false;
else
- arch_remove_memory(nid, start_pfn << PAGE_SHIFT,
- npages << PAGE_SHIFT, NULL);
- mem_hotplug_done();
+ mapping = true;

+ del_device_memory(nid, start_pfn << PAGE_SHIFT, npages << PAGE_SHIFT,
+ NULL,
+ mapping);
+ mem_hotplug_done();
hmm_devmem_radix_release(resource);
}

@@ -1027,6 +1030,7 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
struct device *device = devmem->device;
int ret, nid, is_ram;
unsigned long pfn;
+ bool mapping;

align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
align_size = ALIGN(devmem->resource->start +
@@ -1085,7 +1089,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
if (nid < 0)
nid = numa_mem_id();

- mem_hotplug_begin();
/*
* For device private memory we call add_pages() as we only need to
* allocate and initialize struct page for the device memory. More-
@@ -1096,21 +1099,18 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
* For device public memory, which is accesible by the CPU, we do
* want the linear mapping and thus use arch_add_memory().
*/
+ mem_hotplug_begin();
if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
- ret = arch_add_memory(nid, align_start, align_size, NULL,
- false);
+ mapping = true;
else
- ret = add_pages(nid, align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, NULL, false);
- if (ret) {
- mem_hotplug_done();
- goto error_add_memory;
- }
- move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
- align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, NULL);
+ mapping = false;
+
+ ret = add_device_memory(nid, align_start, align_size, NULL, mapping);
mem_hotplug_done();

+ if (ret)
+ goto error_add_memory;
+
for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
struct page *page = pfn_to_page(pfn);

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9bd629944c91..60b67f09956e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -320,12 +320,10 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
unsigned long start_pfn,
unsigned long end_pfn)
{
- struct mem_section *ms;
-
for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
- ms = __pfn_to_section(start_pfn);
+ struct mem_section *ms = __pfn_to_section(start_pfn);

- if (unlikely(!valid_section(ms)))
+ if (unlikely(!online_section(ms)))
continue;

if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -345,15 +343,14 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
unsigned long start_pfn,
unsigned long end_pfn)
{
- struct mem_section *ms;
unsigned long pfn;

/* pfn is the end pfn of a memory section. */
pfn = end_pfn - 1;
for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
- ms = __pfn_to_section(pfn);
+ struct mem_section *ms = __pfn_to_section(pfn);

- if (unlikely(!valid_section(ms)))
+ if (unlikely(!online_section(ms)))
continue;

if (unlikely(pfn_to_nid(pfn) != nid))
@@ -415,7 +412,7 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
ms = __pfn_to_section(pfn);

- if (unlikely(!valid_section(ms)))
+ if (unlikely(!online_section(ms)))
continue;

if (page_zone(pfn_to_page(pfn)) != zone)
@@ -502,23 +499,33 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
pgdat->node_spanned_pages = 0;
}

-static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+static void shrink_pages(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn,
+ unsigned long offlined_pages)
{
struct pglist_data *pgdat = zone->zone_pgdat;
int nr_pages = PAGES_PER_SECTION;
unsigned long flags;
+ unsigned long pfn;

- pgdat_resize_lock(zone->zone_pgdat, &flags);
- shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
- shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
+ zone->present_pages -= offlined_pages;
+
+ clear_zone_contiguous(zone);
+ pgdat_resize_lock(pgdat, &flags);
+
+ for(pfn = start_pfn; pfn < end_pfn; pfn += nr_pages) {
+ shrink_zone_span(zone, pfn, pfn + nr_pages);
+ shrink_pgdat_span(pgdat, pfn, pfn + nr_pages);
+ }
+ pgdat->node_present_pages -= offlined_pages;
+
+ pgdat_resize_unlock(pgdat, &flags);
+ set_zone_contiguous(zone);
}

-static int __remove_section(struct zone *zone, struct mem_section *ms,
+static int __remove_section(int nid, struct mem_section *ms,
unsigned long map_offset, struct vmem_altmap *altmap)
{
- unsigned long start_pfn;
- int scn_nr;
int ret = -EINVAL;

if (!valid_section(ms))
@@ -528,17 +535,13 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
if (ret)
return ret;

- scn_nr = __section_nr(ms);
- start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
- __remove_zone(zone, start_pfn);
-
- sparse_remove_one_section(zone, ms, map_offset, altmap);
+ sparse_remove_one_section(nid, ms, map_offset, altmap);
return 0;
}

/**
* __remove_pages() - remove sections of pages from a zone
- * @zone: zone from which pages need to be removed
+ * @nid: nid from which pages belong to
* @phys_start_pfn: starting pageframe (must be aligned to start of a section)
* @nr_pages: number of pages to remove (must be multiple of section size)
* @altmap: alternative device page map or %NULL if default memmap is used
@@ -548,35 +551,27 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
* sure that pages are marked reserved and zones are adjust properly by
* calling offline_pages().
*/
-int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
+int __remove_pages(int nid, unsigned long phys_start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap)
{
unsigned long i;
unsigned long map_offset = 0;
int sections_to_remove, ret = 0;
+ resource_size_t start, size;

- /* In the ZONE_DEVICE case device driver owns the memory region */
- if (is_dev_zone(zone)) {
- if (altmap)
- map_offset = vmem_altmap_offset(altmap);
- } else {
- resource_size_t start, size;
-
- start = phys_start_pfn << PAGE_SHIFT;
- size = nr_pages * PAGE_SIZE;
+ start = phys_start_pfn << PAGE_SHIFT;
+ size = nr_pages * PAGE_SIZE;

- ret = release_mem_region_adjustable(&iomem_resource, start,
- size);
- if (ret) {
- resource_size_t endres = start + size - 1;
+ if (altmap)
+ map_offset = vmem_altmap_offset(altmap);

- pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
- &start, &endres, ret);
- }
+ ret = release_mem_region_adjustable(&iomem_resource, start, size);
+ if (ret) {
+ resource_size_t endres = start + size - 1;
+ pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
+ &start, &endres, ret);
}

- clear_zone_contiguous(zone);
-
/*
* We can only remove entire sections
*/
@@ -587,15 +582,13 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
for (i = 0; i < sections_to_remove; i++) {
unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;

- ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
- altmap);
+ ret = __remove_section(nid, __pfn_to_section(pfn), map_offset,
+ altmap);
map_offset = 0;
if (ret)
break;
}

- set_zone_contiguous(zone);
-
return ret;
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -1595,7 +1588,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
unsigned long pfn, nr_pages;
long offlined_pages;
int ret, node;
- unsigned long flags;
unsigned long valid_start, valid_end;
struct zone *zone;
struct memory_notify arg;
@@ -1665,11 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
- zone->present_pages -= offlined_pages;

- pgdat_resize_lock(zone->zone_pgdat, &flags);
- zone->zone_pgdat->node_present_pages -= offlined_pages;
- pgdat_resize_unlock(zone->zone_pgdat, &flags);
+ /* Here we will shrink zone/node's spanned/present_pages */
+ shrink_pages(zone, valid_start, valid_end, offlined_pages);

init_per_zone_wmark_min();

@@ -1902,4 +1892,57 @@ void __ref remove_memory(int nid, u64 start, u64 size)
mem_hotplug_done();
}
EXPORT_SYMBOL_GPL(remove_memory);
+
+static int __del_device_memory(int nid, unsigned long start, unsigned long size,
+ struct vmem_altmap *altmap, bool mapping)
+{
+ int ret;
+ unsigned long start_pfn = PHYS_PFN(start);
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+
+ offline_mem_sections(start_pfn, start_pfn + nr_pages);
+ shrink_pages(zone, start_pfn, start_pfn + nr_pages, 0);
+
+ if (mapping)
+ ret = arch_remove_memory(nid, start, size, altmap);
+ else
+ ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
+
+ return ret;
+}
+
+int del_device_memory(int nid, unsigned long start, unsigned long size,
+ struct vmem_altmap *altmap, bool mapping)
+{
+ return __del_device_memory(nid, start, size, altmap, mapping);
+}
#endif /* CONFIG_MEMORY_HOTREMOVE */
+
+static int __add_device_memory(int nid, unsigned long start, unsigned long size,
+ struct vmem_altmap *altmap, bool mapping)
+{
+ int ret;
+ unsigned long start_pfn = PHYS_PFN(start);
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+
+ if (mapping)
+ ret = arch_add_memory(nid, start, size, altmap, false);
+ else
+ ret = add_pages(nid, start_pfn, nr_pages, altmap, false);
+
+ if (!ret) {
+ struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+
+ online_mem_sections(start_pfn, start_pfn + nr_pages);
+ move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);
+ }
+
+ return ret;
+}
+
+int add_device_memory(int nid, unsigned long start, unsigned long size,
+ struct vmem_altmap *altmap, bool mapping)
+{
+ return __add_device_memory(nid, start, size, altmap, mapping);
+}
diff --git a/mm/sparse.c b/mm/sparse.c
index 10b07eea9a6e..016020bd20b5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -766,12 +766,12 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
free_map_bootmem(memmap);
}

-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+void sparse_remove_one_section(int nid, struct mem_section *ms,
unsigned long map_offset, struct vmem_altmap *altmap)
{
struct page *memmap = NULL;
unsigned long *usemap = NULL, flags;
- struct pglist_data *pgdat = zone->zone_pgdat;
+ struct pglist_data *pgdat = NODE_DATA(nid);

pgdat_resize_lock(pgdat, &flags);
if (ms->section_mem_map) {
--
2.13.6


2018-08-17 15:43:13

by Oscar Salvador

Subject: [RFC v2 1/2] mm/memory_hotplug: Add nid parameter to arch_remove_memory

From: Oscar Salvador <[email protected]>

This patch is only a preparation for the follow-up patches.
The idea of passing the nid is that it will allow us to get rid
of the zone parameter in the patches that follow.

Signed-off-by: Oscar Salvador <[email protected]>
---
arch/ia64/mm/init.c | 2 +-
arch/powerpc/mm/mem.c | 2 +-
arch/s390/mm/init.c | 2 +-
arch/sh/mm/init.c | 2 +-
arch/x86/mm/init_32.c | 2 +-
arch/x86/mm/init_64.c | 2 +-
include/linux/memory_hotplug.h | 2 +-
kernel/memremap.c | 4 +++-
mm/hmm.c | 4 +++-
mm/memory_hotplug.c | 2 +-
10 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 3b85c3ecac38..b54d0ee74b53 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -662,7 +662,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
}

#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5c8530d0c611..9e17d57a9948 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -139,7 +139,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
}

#ifdef CONFIG_MEMORY_HOTREMOVE
-int __meminit arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int __meminit arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 3fa3e5323612..bc49b560625e 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -240,7 +240,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
}

#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
/*
* There is no hardware or firmware interface which could trigger a
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 7713c084d040..5c91bb6abc35 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -444,7 +444,7 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
#endif

#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 979e0a02cbe1..9fa503f2f56c 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -861,7 +861,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
}

#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..f90fa077b0c4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1148,7 +1148,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
remove_pagetable(start, end, true, NULL);
}

-int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int __ref arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 34a28227068d..c68b03fd87bd 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -107,7 +107,7 @@ static inline bool movable_node_is_enabled(void)
}

#ifdef CONFIG_MEMORY_HOTREMOVE
-extern int arch_remove_memory(u64 start, u64 size,
+extern int arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap);
extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 5b8600d39931..7a832b844f24 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -121,6 +121,7 @@ static void devm_memremap_pages_release(void *data)
struct resource *res = &pgmap->res;
resource_size_t align_start, align_size;
unsigned long pfn;
+ int nid;

for_each_device_pfn(pfn, pgmap)
put_page(pfn_to_page(pfn));
@@ -134,9 +135,10 @@ static void devm_memremap_pages_release(void *data)
align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
- align_start;
+ nid = dev_to_node(dev);

mem_hotplug_begin();
- arch_remove_memory(align_start, align_size, pgmap->altmap_valid ?
+ arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ?
&pgmap->altmap : NULL);
kasan_remove_zero_shadow(__va(align_start), align_size);
mem_hotplug_done();
diff --git a/mm/hmm.c b/mm/hmm.c
index c968e49f7a0c..21787e480b4a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -995,6 +995,7 @@ static void hmm_devmem_release(struct device *dev, void *data)
unsigned long start_pfn, npages;
struct zone *zone;
struct page *page;
+ int nid;

if (percpu_ref_tryget_live(&devmem->ref)) {
dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
@@ -1007,12 +1008,13 @@ static void hmm_devmem_release(struct device *dev, void *data)

page = pfn_to_page(start_pfn);
zone = page_zone(page);
+ nid = zone->zone_pgdat->node_id;

mem_hotplug_begin();
if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
__remove_pages(zone, start_pfn, npages, NULL);
else
- arch_remove_memory(start_pfn << PAGE_SHIFT,
+ arch_remove_memory(nid, start_pfn << PAGE_SHIFT,
npages << PAGE_SHIFT, NULL);
mem_hotplug_done();

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9eea6e809a4e..9bd629944c91 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1895,7 +1895,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
memblock_free(start, size);
memblock_remove(start, size);

- arch_remove_memory(start, size, NULL);
+ arch_remove_memory(nid, start, size, NULL);

try_offline_node(nid);

--
2.13.6


2018-08-21 13:37:01

by David Hildenbrand

Subject: Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

> add_device_memory is in charge of

I wouldn't use the terminology of onlining/offlining here. That rather
applies to memory that is exposed to the rest of the system (e.g. to the
buddy allocator, with underlying memory block devices). I guess this is
rather a pure setup/teardown of that device memory.

>
> a) calling either arch_add_memory() or add_pages(), depending on whether
> we want a linear mapping
> b) online the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
> expand zone/pgdat spanned pages and initialize its pages
>
> del_device_memory, on the other hand, is in charge of
>
> a) offline the memory sections that correspond to the pfn range
> b) calling shrink_pages(), which shrinks node/zone spanned pages.
> c) calling either arch_remove_memory() or __remove_pages(), depending on
> whether we need to tear down the linear mapping or not
>
> These two functions are called from:
>
> add_device_memory:
> - devm_memremap_pages()
> - hmm_devmem_pages_create()
>
> del_device_memory:
> - devm_memremap_pages_release()
> - hmm_devmem_release()
>
> I think that this will get easier as soon as [1] gets merged.
>
> Finally, shrink_pages() is moved to offline_pages(), so now,
> all pages/zone handling is being taken care in online/offline_pages stage.
>
> [1] https://lkml.org/lkml/2018/6/19/110
>

[...]

> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index f90fa077b0c4..d04338ffabec 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1152,15 +1152,9 @@ int __ref arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *a
> {
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct page *page = pfn_to_page(start_pfn);
> - struct zone *zone;
> int ret;
>
> - /* With altmap the first mapped page is offset from @start */
> - if (altmap)
> - page += vmem_altmap_offset(altmap);
> - zone = page_zone(page);
> - ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
> + ret = __remove_pages(nid, start_pfn, nr_pages, altmap);

These changes certainly look nice.

[...]

> index 7a832b844f24..95df37686196 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -119,6 +119,8 @@ static void devm_memremap_pages_release(void *data)
> struct dev_pagemap *pgmap = data;
> struct device *dev = pgmap->dev;
> struct resource *res = &pgmap->res;
> + struct vmem_altmap *altmap = pgmap->altmap_valid ?
> + &pgmap->altmap : NULL;
> resource_size_t align_start, align_size;
> unsigned long pfn;
> int nid;
> @@ -138,8 +140,7 @@ static void devm_memremap_pages_release(void *data)
> nid = dev_to_node(dev);
>
> mem_hotplug_begin();

I would really like to see the mem_hotplug_begin/end also getting moved
inside add_device_memory()/del_device_memory(). (just like for
add/remove_memory)

I wonder if kasan_ stuff actually requires this lock, or if it could
also be somehow moved inside add_device_memory/del_device_memory.

> - arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ?
> - &pgmap->altmap : NULL);
> + del_device_memory(nid, align_start, align_size, altmap, true);
> kasan_remove_zero_shadow(__va(align_start), align_size);
> mem_hotplug_done();
>
> @@ -175,7 +176,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
> {
> resource_size_t align_start, align_size, align_end;
> struct vmem_altmap *altmap = pgmap->altmap_valid ?
> - &pgmap->altmap : NULL;
> + &pgmap->altmap : NULL;
> struct resource *res = &pgmap->res;
> unsigned long pfn, pgoff, order;
> pgprot_t pgprot = PAGE_KERNEL;
> @@ -249,11 +250,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
> goto err_kasan;
> }
>
> - error = arch_add_memory(nid, align_start, align_size, altmap, false);
> - if (!error)
> - move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> - align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, altmap);
> + error = add_device_memory(nid, align_start, align_size, altmap, true);
> +
> mem_hotplug_done();
> if (error)
> goto err_add_memory;
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 30e1bc68503b..8e68b5ca67ca 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1262,6 +1262,22 @@ int release_mem_region_adjustable(struct resource *parent,
> continue;
> }
>
> + /*
> + * All memory regions added from memory-hotplug path
> + * have the flag IORESOURCE_SYSTEM_RAM.
> + * IORESOURCE_SYSTEM_RAM = (IORESOURCE_MEM|IORESOURCE_SYSRAM)
> + * If the resource does not have this flag, we know that
> + * we are dealing with a resource coming from HMM/devm.
> + * HMM/devm use another mechanism to add/release a resource.
> + * This goes via devm_request_mem_region/devm_release_mem_region.
> + * HMM/devm take care to release their resources when they want, so
> + * if we are dealing with them, let us just back off nicely
> + */

Maybe shorten that a bit

"HMM/devm memory does not have IORESOURCE_SYSTEM_RAM set. They use
devm_request_mem_region/devm_release_mem_region to add/release a
resource. Just back off here."

> + if (!(res->flags & IORESOURCE_SYSRAM)) {
> + ret = 0;
> + break;
> + }
> +
> if (!(res->flags & IORESOURCE_MEM))
> break;
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 21787e480b4a..23ce7fbdb651 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -996,6 +996,7 @@ static void hmm_devmem_release(struct device *dev, void *data)
> struct zone *zone;
> struct page *page;
> int nid;
> + bool mapping;
>
> if (percpu_ref_tryget_live(&devmem->ref)) {
> dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
> @@ -1012,12 +1013,14 @@ static void hmm_devmem_release(struct device *dev, void *data)
>
> mem_hotplug_begin();
> if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
> - __remove_pages(zone, start_pfn, npages, NULL);
> + mapping = false;
> else
> - arch_remove_memory(nid, start_pfn << PAGE_SHIFT,
> - npages << PAGE_SHIFT, NULL);
> - mem_hotplug_done();
> + mapping = true;
>
> + del_device_memory(nid, start_pfn << PAGE_SHIFT, npages << PAGE_SHIFT,
> + NULL,
> + mapping);
> + mem_hotplug_done();
> hmm_devmem_radix_release(resource);
> }
>
> @@ -1027,6 +1030,7 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> struct device *device = devmem->device;
> int ret, nid, is_ram;
> unsigned long pfn;
> + bool mapping;
>
> align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
> align_size = ALIGN(devmem->resource->start +
> @@ -1085,7 +1089,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> if (nid < 0)
> nid = numa_mem_id();
>
> - mem_hotplug_begin();
> /*
> * For device private memory we call add_pages() as we only need to
> * allocate and initialize struct page for the device memory. More-
> @@ -1096,21 +1099,18 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> * For device public memory, which is accessible by the CPU, we do
> * want the linear mapping and thus use arch_add_memory().
> */
> + mem_hotplug_begin();
> if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
> - ret = arch_add_memory(nid, align_start, align_size, NULL,
> - false);
> + mapping = true;
> else
> - ret = add_pages(nid, align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, NULL, false);
> - if (ret) {
> - mem_hotplug_done();
> - goto error_add_memory;
> - }
> - move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> - align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, NULL);
> + mapping = false;
> +
> + ret = add_device_memory(nid, align_start, align_size, NULL, mapping);
> mem_hotplug_done();
>
> + if (ret)
> + goto error_add_memory;
> +
> for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
> struct page *page = pfn_to_page(pfn);
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9bd629944c91..60b67f09956e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -320,12 +320,10 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
> unsigned long start_pfn,
> unsigned long end_pfn)
> {
> - struct mem_section *ms;
> -
> for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
> - ms = __pfn_to_section(start_pfn);
> + struct mem_section *ms = __pfn_to_section(start_pfn);
>
> - if (unlikely(!valid_section(ms)))
> + if (unlikely(!online_section(ms)))
> continue;
>
> if (unlikely(pfn_to_nid(start_pfn) != nid))
> @@ -345,15 +343,14 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> unsigned long start_pfn,
> unsigned long end_pfn)
> {
> - struct mem_section *ms;
> unsigned long pfn;
>
> /* pfn is the end pfn of a memory section. */
> pfn = end_pfn - 1;
> for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
> - ms = __pfn_to_section(pfn);
> + struct mem_section *ms = __pfn_to_section(pfn);
>
> - if (unlikely(!valid_section(ms)))
> + if (unlikely(!online_section(ms)))
> continue;
>
> if (unlikely(pfn_to_nid(pfn) != nid))
> @@ -415,7 +412,7 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
> ms = __pfn_to_section(pfn);
>
> - if (unlikely(!valid_section(ms)))
> + if (unlikely(!online_section(ms)))
> continue;
>
> if (page_zone(pfn_to_page(pfn)) != zone)
> @@ -502,23 +499,33 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
> pgdat->node_spanned_pages = 0;
> }
>
> -static void __remove_zone(struct zone *zone, unsigned long start_pfn)
> +static void shrink_pages(struct zone *zone, unsigned long start_pfn,
> + unsigned long end_pfn,
> + unsigned long offlined_pages)
> {
> struct pglist_data *pgdat = zone->zone_pgdat;
> int nr_pages = PAGES_PER_SECTION;
> unsigned long flags;
> + unsigned long pfn;
>
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> - shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
> - shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> + zone->present_pages -= offlined_pages;
> +
> + clear_zone_contiguous(zone);
> + pgdat_resize_lock(pgdat, &flags);
> +
> + for(pfn = start_pfn; pfn < end_pfn; pfn += nr_pages) {
> + shrink_zone_span(zone, pfn, pfn + nr_pages);
> + shrink_pgdat_span(pgdat, pfn, pfn + nr_pages);
> + }
> + pgdat->node_present_pages -= offlined_pages;
> +
> + pgdat_resize_unlock(pgdat, &flags);
> + set_zone_contiguous(zone);
> }
>
> -static int __remove_section(struct zone *zone, struct mem_section *ms,
> +static int __remove_section(int nid, struct mem_section *ms,
> unsigned long map_offset, struct vmem_altmap *altmap)
> {
> - unsigned long start_pfn;
> - int scn_nr;
> int ret = -EINVAL;
>
> if (!valid_section(ms))
> @@ -528,17 +535,13 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
> if (ret)
> return ret;
>
> - scn_nr = __section_nr(ms);
> - start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
> - __remove_zone(zone, start_pfn);
> -
> - sparse_remove_one_section(zone, ms, map_offset, altmap);
> + sparse_remove_one_section(nid, ms, map_offset, altmap);
> return 0;
> }
>
> /**
> * __remove_pages() - remove sections of pages from a zone
> - * @zone: zone from which pages need to be removed
> * @nid: nid to which the pages belong
> * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
> * @nr_pages: number of pages to remove (must be multiple of section size)
> * @altmap: alternative device page map or %NULL if default memmap is used
> @@ -548,35 +551,27 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
> * sure that pages are marked reserved and zones are adjusted properly by
> * calling offline_pages().
> */
> -int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> +int __remove_pages(int nid, unsigned long phys_start_pfn,
> unsigned long nr_pages, struct vmem_altmap *altmap)
> {
> unsigned long i;
> unsigned long map_offset = 0;
> int sections_to_remove, ret = 0;
> + resource_size_t start, size;
>
> - /* In the ZONE_DEVICE case device driver owns the memory region */
> - if (is_dev_zone(zone)) {
> - if (altmap)
> - map_offset = vmem_altmap_offset(altmap);
> - } else {
> - resource_size_t start, size;
> -
> - start = phys_start_pfn << PAGE_SHIFT;
> - size = nr_pages * PAGE_SIZE;
> + start = phys_start_pfn << PAGE_SHIFT;
> + size = nr_pages * PAGE_SIZE;
>
> - ret = release_mem_region_adjustable(&iomem_resource, start,
> - size);
> - if (ret) {
> - resource_size_t endres = start + size - 1;
> + if (altmap)
> + map_offset = vmem_altmap_offset(altmap);
>
> - pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> - &start, &endres, ret);
> - }
> + ret = release_mem_region_adjustable(&iomem_resource, start, size);
> + if (ret) {
> + resource_size_t endres = start + size - 1;
> + pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> + &start, &endres, ret);
> }
>
> - clear_zone_contiguous(zone);
> -
> /*
> * We can only remove entire sections
> */
> @@ -587,15 +582,13 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> for (i = 0; i < sections_to_remove; i++) {
> unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
>
> - ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
> - altmap);
> + ret = __remove_section(nid, __pfn_to_section(pfn), map_offset,
> + altmap);
> map_offset = 0;
> if (ret)
> break;
> }
>
> - set_zone_contiguous(zone);
> -
> return ret;
> }
> #endif /* CONFIG_MEMORY_HOTREMOVE */
> @@ -1595,7 +1588,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
> unsigned long pfn, nr_pages;
> long offlined_pages;
> int ret, node;
> - unsigned long flags;
> unsigned long valid_start, valid_end;
> struct zone *zone;
> struct memory_notify arg;
> @@ -1665,11 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> /* removal success */
> adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> - zone->present_pages -= offlined_pages;
>
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> - zone->zone_pgdat->node_present_pages -= offlined_pages;
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> + /* Here we will shrink zone/node's spanned/present_pages */
> + shrink_pages(zone, valid_start, valid_end, offlined_pages);
>
> init_per_zone_wmark_min();
>
> @@ -1902,4 +1892,57 @@ void __ref remove_memory(int nid, u64 start, u64 size)
> mem_hotplug_done();
> }
> EXPORT_SYMBOL_GPL(remove_memory);
> +
> +static int __del_device_memory(int nid, unsigned long start, unsigned long size,
> + struct vmem_altmap *altmap, bool mapping)
> +{
> + int ret;
> + unsigned long start_pfn = PHYS_PFN(start);
> + unsigned long nr_pages = size >> PAGE_SHIFT;
> + struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> +
> + offline_mem_sections(start_pfn, start_pfn + nr_pages);
> + shrink_pages(zone, start_pfn, start_pfn + nr_pages, 0);
> +
> + if (mapping)
> + ret = arch_remove_memory(nid, start, size, altmap);
> + else
> + ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
> +
> + return ret;
> +}
> +
> +int del_device_memory(int nid, unsigned long start, unsigned long size,
> + struct vmem_altmap *altmap, bool mapping)
> +{
> + return __del_device_memory(nid, start, size, altmap, mapping);
> +}
> #endif /* CONFIG_MEMORY_HOTREMOVE */
> +
> +static int __add_device_memory(int nid, unsigned long start, unsigned long size,
> + struct vmem_altmap *altmap, bool mapping)
> +{
> + int ret;
> + unsigned long start_pfn = PHYS_PFN(start);
> + unsigned long nr_pages = size >> PAGE_SHIFT;
> +
> + if (mapping)
> + ret = arch_add_memory(nid, start, size, altmap, false);
> + else
> + ret = add_pages(nid, start_pfn, nr_pages, altmap, false);
> +
> + if (!ret) {
> + struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> +
> + online_mem_sections(start_pfn, start_pfn + nr_pages);
> + move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);
> + }
> +
> + return ret;
> +}
> +
> +int add_device_memory(int nid, unsigned long start, unsigned long size,
> + struct vmem_altmap *altmap, bool mapping)
> +{
> + return __add_device_memory(nid, start, size, altmap, mapping);
> +}

Any reason for these indirections?

> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..016020bd20b5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -766,12 +766,12 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
> free_map_bootmem(memmap);
> }


I guess for readability, this patch could be split up into several
patches, e.g. factoring out add_device_memory/del_device_memory,
the release_mem_region_adjustable change ...

--

Thanks,

David / dhildenb

2018-08-22 08:01:26

by Oscar Salvador

[permalink] [raw]
Subject: Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

On Tue, Aug 21, 2018 at 03:17:10PM +0200, David Hildenbrand wrote:
> > add_device_memory is in charge of
>
> I wouldn't use the terminology of onlining/offlining here. That applies
> rather to memory that is exposed to the rest of the system (e.g. buddy
> allocator, has underlying memory block devices). I guess it is rather a
> pure setup/teardown of that device memory.

Hi David,

I am not sure if you are referring to:

"
a) calling either arch_add_memory() or add_pages(), depending on whether
we want a linear mapping
b) onlining the memory sections that correspond to the pfn range
c) calling move_pfn_range_to_zone() with zone ZONE_DEVICE to
expand zone/pgdat spanned pages and initialize its pages
"

Well, that is partially true.
I mean, in order to make this work, we need to offline/online the memory
sections, because shrink_pages() will rely on that from now on.
That is what we do when we online/offline pages, but since device memory
does not go through the "official" channels, we need to do it there
as well.

Sure, I can use different terminology, but since that is what
offline/online_mem_sections do, I just went with that.

> I would really like to see the mem_hotplug_begin/end also getting moved
> inside add_device_memory()/del_device_memory(). (just like for
> add/remove_memory)
>
> I wonder if kasan_ stuff actually requires this lock, or if it could
> also be somehow moved inside add_device_memory/del_device_memory.

Yes, that was my first approach, but then I saw that the kasan stuff is being
handled within those locks, so I was not sure and I backed off, leaving the
mem_hotplug_begin/end where they were.

Maybe Jerome can shed some light on this, and we can just handle the kasan
stuff outside the locks.

> Maybe shorten that a bit
>
> "HMM/devm memory does not have IORESOURCE_SYSTEM_RAM set. They use
> devm_request_mem_region/devm_release_mem_region to add/release a
> resource. Just back off here."

Uhm, fair enough.

> Any reason for these indirections?

I wanted to hide the internals in the memory_hotplug code.
I thought about removing them, but in the end I left them.
If people think that we are better off without them, I can just
remove them.

> I guess for readability, this patch could be split up into several
> patches, e.g. factoring out add_device_memory/del_device_memory,
> the release_mem_region_adjustable change ...

Yes, very true.
But I first wanted to gather feedback, mainly from the HMM/devm people, to see
if they saw an outright bug in the series, because I am not so
familiar with that part of the code.

Feedback from Jerome/Dan would be appreciated as well, to see if this is a good
direction.

But you are right; in the end, this will have to be split up into several
parts to ease the review.

Thanks for reviewing this David!
I will try to address your concerns.

Thanks
--
Oscar Salvador
SUSE L3

2018-08-22 08:21:39

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

On 22.08.2018 09:50, Oscar Salvador wrote:
> On Tue, Aug 21, 2018 at 03:17:10PM +0200, David Hildenbrand wrote:
>>> add_device_memory is in charge of
>>
>> I wouldn't use the terminology of onlining/offlining here. That applies
>> rather to memory that is exposed to the rest of the system (e.g. buddy
>> allocator, has underlying memory block devices). I guess it is rather a
>> pure setup/teardown of that device memory.
>
> Hi David,
>
> I am not sure if you are referring to:
>
> "
> a) calling either arch_add_memory() or add_pages(), depending on whether
> we want a linear mapping
> b) onlining the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() with zone ZONE_DEVICE to
> expand zone/pgdat spanned pages and initialize its pages
> "
>
> Well, that is partially true.
> I mean, in order to make this work, we need to offline/online the memory
> sections, because shrink_pages() will rely on that from now on.
> That is what we do when we online/offline pages, but since device memory
> does not go through the "official" channels, we need to do it there
> as well.
>
> Sure, I can use different terminology, but since that is what
> offline/online_mem_sections do, I just went with that.
>

Okay, got it, so it is basically "mark the sections as online/offline".

>> I would really like to see the mem_hotplug_begin/end also getting moved
>> inside add_device_memory()/del_device_memory(). (just like for
>> add/remove_memory)
>>
>> I wonder if kasan_ stuff actually requires this lock, or if it could
>> also be somehow moved inside add_device_memory/del_device_memory.
>
> Yes, that was my first approach, but then I saw that the kasan stuff is being
> handled within those locks, so I was not sure and I backed off, leaving the
> mem_hotplug_begin/end where they were.
>
> Maybe Jerome can shed some light on this, and we can just handle the kasan
> stuff outside the locks.
>
>> Maybe shorten that a bit
>>
>> "HMM/devm memory does not have IORESOURCE_SYSTEM_RAM set. They use
>> devm_request_mem_region/devm_release_mem_region to add/release a
>> resource. Just back off here."
>
> Uhm, fair enough.
>
>> Any reason for these indirections?
>
> I wanted to hide the internals in the memory_hotplug code.
> I thought about removing them, but in the end I left them.
> If people think that we are better off without them, I can just
> remove them.

I don't see a need for that (everyone following the functions has to go
via one indirection that just passes on parameters). It is also not done
for other functions (e.g. add_memory).

>
>> I guess for readability, this patch could be split up into several
>> patches, e.g. factoring out add_device_memory/del_device_memory,
>> the release_mem_region_adjustable change ...
>
> Yes, very true.
> But I first wanted to gather feedback, mainly from the HMM/devm people, to see
> if they saw an outright bug in the series, because I am not so
> familiar with that part of the code.
>
> Feedback from Jerome/Dan would be appreciated as well, to see if this is a good
> direction.

Yes, they probably know best how this all fits together.

>
> But you are right; in the end, this will have to be split up into several
> parts to ease the review.
>
> Thanks for reviewing this David!
> I will try to address your concerns.
>
> Thanks
>


--

Thanks,

David / dhildenb

2018-08-28 11:49:08

by Oscar Salvador

[permalink] [raw]
Subject: Re: [RFC v2 0/2] Do not touch pages in remove_memory path

On Fri, Aug 17, 2018 at 05:41:25PM +0200, Oscar Salvador wrote:
> From: Oscar Salvador <[email protected]>
[...]
>
> The main difficulty I faced here was in regard to HMM/devm, as it handles
> hot-add/remove of memory in a particular way, and, what is more important,
> it also handles the resources.
>
> I really scratched my head for ideas about how to handle this case, and
> after some fails I came up with the idea that we could check for the
> res->flags.
>
> Memory resources that go through the "official" memory-hotplug channels
> have the IORESOURCE_SYSTEM_RAM flag.
> This flag is made of (IORESOURCE_MEM|IORESOURCE_SYSRAM).
>
> HMM/devm, on the other hand, request and release the resources
> through devm_request_mem_region/devm_release_mem_region, and
> these resources do not contain the IORESOURCE_SYSRAM flag.
>
> So what I ended up doing is to check for IORESOURCE_SYSRAM
> in release_mem_region_adjustable.
> If we see that a resource does not have such a flag, we know that
> we are dealing with a resource coming from HMM/devm, and so,
> we do not need to do anything, as HMM/devm will take care of that part.
>

Jerome/Dan, now that the merge window is closed, and before I send RFCv3, could you
please check this and see if you spot something flagrantly wrong (about devm/HMM)?

If you prefer, I can send v3 split up even more.
Maybe that will ease the review.

Thanks
--
Oscar Salvador
SUSE L3

2018-08-29 17:06:09

by Jerome Glisse

[permalink] [raw]
Subject: Re: [RFC v2 0/2] Do not touch pages in remove_memory path

On Tue, Aug 28, 2018 at 01:47:09PM +0200, Oscar Salvador wrote:
> On Fri, Aug 17, 2018 at 05:41:25PM +0200, Oscar Salvador wrote:
> > From: Oscar Salvador <[email protected]>
> [...]
> >
> > The main difficulty I faced here was in regard to HMM/devm, as it handles
> > hot-add/remove of memory in a particular way, and, what is more important,
> > it also handles the resources.
> >
> > I really scratched my head for ideas about how to handle this case, and
> > after some fails I came up with the idea that we could check for the
> > res->flags.
> >
> > Memory resources that go through the "official" memory-hotplug channels
> > have the IORESOURCE_SYSTEM_RAM flag.
> > This flag is made of (IORESOURCE_MEM|IORESOURCE_SYSRAM).
> >
> > HMM/devm, on the other hand, request and release the resources
> > through devm_request_mem_region/devm_release_mem_region, and
> > these resources do not contain the IORESOURCE_SYSRAM flag.
> >
> > So what I ended up doing is to check for IORESOURCE_SYSRAM
> > in release_mem_region_adjustable.
> > If we see that a resource does not have such a flag, we know that
> > we are dealing with a resource coming from HMM/devm, and so,
> > we do not need to do anything, as HMM/devm will take care of that part.
> >
>
> Jerome/Dan, now that the merge window is closed, and before I send RFCv3, could you
> please check this and see if you spot something flagrantly wrong (about devm/HMM)?
>
> If you prefer, I can send v3 split up even more.
> Maybe that will ease the review.
>

This looks good to me; you can add Reviewed-by: Jérôme Glisse <[email protected]>

2018-08-29 23:10:38

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory


On 8/17/18 11:41 AM, Oscar Salvador wrote:
> From: Oscar Salvador <[email protected]>
>
> Currently, we decrement zone/node spanned_pages when we
> remove memory and not when we offline it.
>
> This, besides being inconsistent with the current code,
> implies that we can access stale pages if we never get to online
> that memory.
>
> In order to prevent that, we have to move all zone/pages handling to
> the memory-offlining stage.
> The memory-removal path should only care about memory sections and memory
> blocks.
>
> Another thing to notice here is that this is not so easy to do,
> as HMM/devm have a particular handling of memory-hotplug.
> They do not go through the common path, and so, they call
> neither offline_pages() nor online_pages().
>
> All they care about is to add the sections, move the pages to
> ZONE_DEVICE, and in some cases, to create the linear mapping.
>
> In order to do this more smoothly, two new functions are created
> to deal with these particular cases:
>
> del_device_memory
> add_device_memory
>
> add_device_memory is in charge of
>
> a) calling either arch_add_memory() or add_pages(), depending on whether
> we want a linear mapping
> b) onlining the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() with zone ZONE_DEVICE to
> expand zone/pgdat spanned pages and initialize its pages
>
> del_device_memory, on the other hand, is in charge of
>
> a) offlining the memory sections that correspond to the pfn range
> b) calling shrink_pages(), which shrinks node/zone spanned pages.
> c) calling either arch_remove_memory() or __remove_pages(), depending on
> whether we need to tear down the linear mapping or not
>
> These two functions are called from:
>
> add_device_memory:
> - devm_memremap_pages()
> - hmm_devmem_pages_create()
>
> del_device_memory:
> - devm_memremap_pages_release()
> - hmm_devmem_release()
>
> I think that this will get easier as soon as [1] gets merged.
>
> Finally, shrink_pages() is moved to offline_pages(), so now
> all pages/zone handling is taken care of in the online/offline_pages stage.
>
> [1] https://lkml.org/lkml/2018/6/19/110
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> arch/ia64/mm/init.c | 4 +-
> arch/powerpc/mm/mem.c | 10 +--
> arch/sh/mm/init.c | 4 +-
> arch/x86/mm/init_32.c | 4 +-
> arch/x86/mm/init_64.c | 8 +--
> include/linux/memory_hotplug.h | 9 ++-
> kernel/memremap.c | 14 ++--
> kernel/resource.c | 16 +++++
> mm/hmm.c | 32 ++++-----
> mm/memory_hotplug.c | 143 +++++++++++++++++++++++++++--------------
> mm/sparse.c | 4 +-
> 11 files changed, 145 insertions(+), 103 deletions(-)

Hi Oscar,

I have been studying this patch, and do not see anything bad about it
except that it begs to be split into smaller patches. I think you can
send this work as a series without RFC if this patch is split into 3 or
so patches. I will review that series.

Thank you,
Pavel

2018-08-31 20:52:28

by Oscar Salvador

[permalink] [raw]
Subject: Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

On Wed, Aug 29, 2018 at 11:09:01PM +0000, Pasha Tatashin wrote:
> Hi Oscar,
>
> I have been studying this patch, and do not see anything bad about it
> except that it begs to be split into smaller patches. I think you can
> send this work as a series without RFC if this patch is split into 3 or
> so patches. I will review that series.

Thanks, Pavel, for taking a look at this.
I will split up the patch and re-send it without the RFC tag.

Thanks!
--
Oscar Salvador
SUSE L3