I am currently working on a paravirtualized memory device ("virtio-mem").
These devices control a memory region and the amount of memory available
via it. Memory will not be indicated/added/onlined via ACPI and friends;
the device driver is responsible for it.
When the device driver starts up, it will add and online the requested
amount of memory from its assigned physical memory region. On request, it
can either add (online) more memory or try to remove (offline) memory. As
it will be a virtio module, we also want to be able to build it as a
loadable kernel module.
Such a device can be thought of as a "resizable DIMM" or a "huge
number of 4MB DIMMs" that can be automatically managed.
As we want to be able to add/remove small chunks of memory to a VM without
fragmenting guest memory ("it's not what the guest pays for" and "what if
the hypervisor wants to use huge pages"), it looks like we can do that
under Linux at a 4MB granularity by using online_pages()/offline_pages().
We add a segment and online only 4MB blocks of it on demand, so the other
memory might not be accessible. For kdump and the offlining code, we have
to mark pages as offline before a new segment is visible to the system
(e.g. because these pages might not be backed by real memory in the
hypervisor).
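To make this concrete, the core of such a driver could look roughly like
the following sketch. It only uses interfaces introduced/exported by this
series; the variable names and the chunk selection are made up:

/* sketch: executed once when the driver probes its device */
rc = add_memory_driver_managed(nid, region_start, region_size);

/* sketch: executed whenever a 4MB chunk shall be plugged */
mem_hotplug_begin();
rc = online_pages(PFN_DOWN(chunk_start), chunk_size >> PAGE_SHIFT,
                  MMOP_ONLINE_KEEP);
mem_hotplug_done();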
This is not a balloon driver. Main differences:
- We can add more memory to a VM without having to use a mixture of
technologies - e.g. ACPI for plugging, balloon for unplugging (in contrast
to virtio-balloon).
- The device is responsible for its own memory only - it will not inflate
on any system memory. (in contrast to all balloons)
- Works on a coarser granularity (e.g. 4MB, because that's what we can
online/offline in Linux). We do not use the buddy allocator when unplugging
but instead search for chunks of memory we can offline. We can actually
support arbitrary block sizes. (in contrast to all balloons)
- That's why we don't fragment guest memory.
- A device can belong to exactly one NUMA node. This way we can
online/offline memory at a fine granularity in a NUMA-aware fashion - even
if the guest does not even know how to spell NUMA. (in contrast to all balloons)
- Architectures that don't have proper memory hotplug interfaces (e.g. s390x)
get memory hotplug support. I have a prototype for s390x.
- Once all 4MB chunks of a memory block are offline, we can remove the
memory block and therefore the struct pages. (in contrast to all balloons)
This essentially allows us to add/remove 4MB chunks to/from a VM,
especially without having to care about the future when adding memory
("If I add a 128GB DIMM, I can only unplug 128GB again") or running into
limits ("If I want my VM to grow to 4TB, I have to plug at least 16GB per
DIMM").
Future work:
- Performance improvements
- Be smarter about which blocks to offline first (e.g. free ones)
- Automatically manage assignment to the NORMAL/MOVABLE zone to make
unplug more likely to succeed.
I will post the next prototype of virtio-mem shortly.
RFC -> RFCv2:
- "mm: introduce and use PageOffline()"
-> Use a mapcount value instead of a page flag
-> Rework to not require to revert a patch completely
- "kdump: include PAGE_OFFLINE_MAPCOUNT_VALUE in ELF info$"
-> Export the mapcount value instead
- "mm/memory_hotplug: limit offline_pages() to sizes we can actually .."
-> Make this look a bit nicer and allow drivers to also use the size
- "mm/memory_hotplug: print only with DEBUG_VM in offline_pages()"
-> offlining is right now fairly noisy when dealing with small chunks
- "mm/memory_hotplug: teach offline_pages() to not try forever"
-> We need offline_pages() to fail fast and not loop forever on persistent
errors (e.g. -ENOMEM)
- "mm/memory_hotplug: allow online/offline memory by a kernel module"
-> Actually compiled it as a module and noticed that a lot was still missing
David Hildenbrand (7):
mm: introduce and use PageOffline()
kdump: include PAGE_OFFLINE_MAPCOUNT_VALUE in ELF info
mm/memory_hotplug: limit offline_pages() to sizes we can actually
handle
mm/memory_hotplug: allow to control onlining/offlining of memory by a
driver
mm/memory_hotplug: print only with DEBUG_VM in offline_pages()
mm/memory_hotplug: teach offline_pages() to not try forever
mm/memory_hotplug: allow online/offline memory by a kernel module
arch/powerpc/platforms/powernv/memtrace.c | 2 +-
drivers/base/memory.c | 25 ++++--
drivers/base/node.c | 1 -
drivers/xen/balloon.c | 2 +-
include/linux/memory.h | 2 +-
include/linux/memory_hotplug.h | 20 +++--
include/linux/mm.h | 2 +
include/linux/page-flags.h | 9 ++
kernel/crash_core.c | 1 +
mm/memory_hotplug.c | 131 +++++++++++++++++++++++++-----
mm/page_alloc.c | 22 +++--
mm/sparse.c | 25 +++++-
12 files changed, 195 insertions(+), 47 deletions(-)
--
2.14.3
It can easily happen that we get stuck forever trying to offline pages -
e.g. on persistent errors.
Let's add a way to change this behavior and fail fast.
This is interesting if offline_pages() is called from a driver and we
just want to find some block to offline.
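Such a driver could then e.g. scan candidate ranges and fail fast on
persistent errors (illustrative sketch only; the loop bounds and variable
names are made up):

/* sketch: try candidate ranges, failing fast instead of looping forever */
for (pfn = start_pfn; pfn + nr_pages <= end_pfn; pfn += nr_pages) {
        mem_hotplug_begin();
        rc = offline_pages(pfn, nr_pages, false /* retry_forever */);
        mem_hotplug_done();
        if (!rc)
                break;  /* found and offlined a suitable block */
}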
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Reza Arbab <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/platforms/powernv/memtrace.c | 2 +-
drivers/base/memory.c | 2 +-
include/linux/memory_hotplug.h | 8 ++++----
mm/memory_hotplug.c | 14 ++++++++++----
4 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index fc222a0c2ac4..8ce71f7e1558 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -110,7 +110,7 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
change_memblock_state);
- if (offline_pages(start_pfn, nr_pages)) {
+ if (offline_pages(start_pfn, nr_pages, true)) {
walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE,
change_memblock_state);
return false;
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 3b8616551561..c785e4c01b23 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -248,7 +248,7 @@ memory_block_action(struct memory_block *mem, unsigned long action)
ret = online_pages(start_pfn, nr_pages, mem->online_type);
break;
case MEM_OFFLINE:
- ret = offline_pages(start_pfn, nr_pages);
+ ret = offline_pages(start_pfn, nr_pages, true);
break;
default:
WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 497e28f5b000..ae53017b54df 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -303,7 +303,8 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+ bool retry_forever);
extern void remove_memory(int nid, u64 start, u64 size);
#else
@@ -315,7 +316,8 @@ static inline bool is_mem_section_removable(unsigned long pfn,
static inline void try_offline_node(int nid) {}
-static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+ bool retry_forever)
{
return -EINVAL;
}
@@ -333,9 +335,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap, bool want_memblock);
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages, struct vmem_altmap *altmap);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
extern bool is_memblock_offlined(struct memory_block *mem);
-extern void remove_memory(int nid, u64 start, u64 size);
extern int sparse_add_one_section(struct pglist_data *pgdat,
unsigned long start_pfn, struct vmem_altmap *altmap);
extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d8f127754c2e..c47cc68341fc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1618,8 +1618,8 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
node_clear_state(node, N_MEMORY);
}
-static int __ref __offline_pages(unsigned long start_pfn,
- unsigned long end_pfn)
+static int __ref __offline_pages(unsigned long start_pfn, unsigned long end_pfn,
+ bool retry_forever)
{
unsigned long pfn, nr_pages;
long offlined_pages;
@@ -1671,6 +1671,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
pfn = scan_movable_pages(start_pfn, end_pfn);
if (pfn) { /* We have movable pages */
ret = do_migrate_range(pfn, end_pfn);
+ if (ret && !retry_forever) {
+ ret = -EBUSY;
+ goto failed_removal;
+ }
goto repeat;
}
@@ -1737,6 +1741,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
* offline_pages - offline pages in a given range (that are currently online)
* @start_pfn: start pfn of the memory range
* @nr_pages: the number of pages
+ * @retry_forever: whether to retry (possibly) forever
*
* This function tries to offline the given pages. The alignment/size that
* can be used is given by offline_nr_pages.
@@ -1749,9 +1754,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
*
* Must be protected by mem_hotplug_begin() or a device_lock
*/
-int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+ bool retry_forever)
{
- return __offline_pages(start_pfn, start_pfn + nr_pages);
+ return __offline_pages(start_pfn, start_pfn + nr_pages, retry_forever);
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
--
2.14.3
Let's try to minimize the noise.
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Reza Arbab <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory_hotplug.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4c7e0efff079..d8f127754c2e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1685,7 +1685,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
offlined_pages = check_pages_isolated(start_pfn, end_pfn);
if (offlined_pages < 0)
goto repeat;
+#ifdef CONFIG_DEBUG_VM
pr_info("Offlined Pages %ld\n", offlined_pages);
+#endif
/* Ok, all of our target is isolated.
We cannot do rollback at this point. */
offline_isolated_pages(start_pfn, end_pfn);
@@ -1720,9 +1722,11 @@ static int __ref __offline_pages(unsigned long start_pfn,
return 0;
failed_removal:
+#ifdef CONFIG_DEBUG_VM
pr_debug("memory offlining [mem %#010llx-%#010llx] failed\n",
(unsigned long long) start_pfn << PAGE_SHIFT,
((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
+#endif
memory_notify(MEM_CANCEL_OFFLINE, &arg);
/* pushback to free area */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
--
2.14.3
We have to take care of MAX_ORDER. Page blocks might contain references
to the next page block, so sometimes a page block cannot be offlined
independently. E.g. on x86: the page block size is 2MB, but MAX_ORDER - 1
(10) allows for 4MB allocations.
E.g. a buddy page could either overlap at the beginning or the end of the
range to offline. While the end case could be handled easily (shrink the
buddy page), overlaps at the beginning are hard to handle (unknown page
order).
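Concretely, on x86-64 with 4k pages and the default MAX_ORDER of 11, the
new limit introduced below evaluates to

offline_nr_pages = max(pageblock_nr_pages, MAX_ORDER_NR_PAGES)
                 = max(512, 1 << (11 - 1))
                 = 1024 pages = 4MB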
Let's document offline_pages() while at it.
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Reza Arbab <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/memory_hotplug.h | 6 ++++++
mm/memory_hotplug.c | 22 ++++++++++++++++++----
2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index e0e49b5b1ee1..d71829d54360 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -294,6 +294,12 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
#endif /* !(CONFIG_MEMORY_HOTPLUG || CONFIG_DEFERRED_STRUCT_PAGE_INIT) */
#ifdef CONFIG_MEMORY_HOTREMOVE
+/*
+ * Isolation and offlining code cannot deal with pages (e.g. buddy)
+ * overlapping with the range to be offlined yet.
+ */
+#define offline_nr_pages max((unsigned long)pageblock_nr_pages, \
+ (unsigned long)MAX_ORDER_NR_PAGES)
extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
extern void try_offline_node(int nid);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7f7bd2acb55b..c971295a1100 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1599,10 +1599,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
struct zone *zone;
struct memory_notify arg;
- /* at least, alignment against pageblock is necessary */
- if (!IS_ALIGNED(start_pfn, pageblock_nr_pages))
+ if (!IS_ALIGNED(start_pfn, offline_nr_pages))
return -EINVAL;
- if (!IS_ALIGNED(end_pfn, pageblock_nr_pages))
+ if (!IS_ALIGNED(end_pfn, offline_nr_pages))
return -EINVAL;
/* This makes hotplug much easier...and readable.
we assume this for now. .*/
@@ -1700,7 +1699,22 @@ static int __ref __offline_pages(unsigned long start_pfn,
return ret;
}
-/* Must be protected by mem_hotplug_begin() or a device_lock */
+/**
+ * offline_pages - offline pages in a given range (that are currently online)
+ * @start_pfn: start pfn of the memory range
+ * @nr_pages: the number of pages
+ *
+ * This function tries to offline the given pages. The alignment/size that
+ * can be used is given by offline_nr_pages.
+ *
+ * Returns 0 if successful, -EBUSY if the pages cannot be offlined and
+ * -EINVAL if start_pfn/nr_pages is not properly aligned or not in a zone.
+ * -EINTR is returned if interrupted by a signal.
+ *
+ * Bad things will happen if pages in the range are already offline.
+ *
+ * Must be protected by mem_hotplug_begin() or a device_lock
+ */
int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
{
return __offline_pages(start_pfn, start_pfn + nr_pages);
--
2.14.3
This allows dump tools to skip pages that are offline.
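A dump tool can then e.g. test the mapcount value of each page against the
exported constant (rough sketch of the logic only, not actual dump tool
code; read_page_mapcount() is a made-up helper):

/* sketch: exclude offline pages from the dump */
if (read_page_mapcount(pfn) == PAGE_OFFLINE_MAPCOUNT_VALUE)
        return false;   /* don't dump, backing memory might be gone */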
Cc: Andrew Morton <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
kernel/crash_core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index f7674d676889..c0a45e9ba84e 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void)
#ifdef CONFIG_HUGETLB_PAGE
VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
#endif
+ VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
arch_crash_save_vmcoreinfo();
update_vmcoreinfo_note();
--
2.14.3
Kernel modules that want to control how/when memory is onlined/offlined
need a proper interface to these functions. Also, for adding memory
properly, memory_block_size_bytes is required.
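E.g. a module that added memory itself could then online it via the memory
block devices (sketch, locking and error handling omitted):

/* sketch: add memory and online it block by block via the device layer */
rc = add_memory(nid, start, size);
if (!rc)
        rc = online_memory_blocks(start, size);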
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 1 +
include/linux/memory_hotplug.h | 2 ++
mm/memory_hotplug.c | 27 +++++++++++++++++++++++++--
3 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c785e4c01b23..0a7c79cfaaf8 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_bytes(void)
{
return MIN_MEMORY_BLOCK_SIZE;
}
+EXPORT_SYMBOL(memory_block_size_bytes);
static unsigned long get_memory_block_size(void)
{
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ae53017b54df..0e3e48410415 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -97,6 +97,8 @@ extern void __online_page_increment_counters(struct page *page);
extern void __online_page_free(struct page *page);
extern int try_online_node(int nid);
+extern int online_memory_blocks(uint64_t start, uint64_t size);
+extern int offline_memory_blocks(uint64_t start, uint64_t size);
extern bool memhp_auto_online;
/* If movable_node boot option specified */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c47cc68341fc..849bf0543fb1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -89,12 +89,14 @@ void mem_hotplug_begin(void)
cpus_read_lock();
percpu_down_write(&mem_hotplug_lock);
}
+EXPORT_SYMBOL(mem_hotplug_begin);
void mem_hotplug_done(void)
{
percpu_up_write(&mem_hotplug_lock);
cpus_read_unlock();
}
+EXPORT_SYMBOL(mem_hotplug_done);
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
@@ -980,6 +982,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
memory_notify(MEM_CANCEL_ONLINE, &arg);
return ret;
}
+EXPORT_SYMBOL(online_pages);
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
static void reset_node_present_pages(pg_data_t *pgdat)
@@ -1109,6 +1112,25 @@ static int online_memory_block(struct memory_block *mem, void *arg)
return device_online(&mem->dev);
}
+static int offline_memory_block(struct memory_block *mem, void *arg)
+{
+ return device_offline(&mem->dev);
+}
+
+int online_memory_blocks(uint64_t start, uint64_t size)
+{
+ return walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
+ NULL, online_memory_block);
+}
+EXPORT_SYMBOL(online_memory_blocks);
+
+int offline_memory_blocks(uint64_t start, uint64_t size)
+{
+ return walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
+ NULL, offline_memory_block);
+}
+EXPORT_SYMBOL(offline_memory_blocks);
+
static int mark_memory_block_driver_managed(struct memory_block *mem, void *arg)
{
mem->driver_managed = true;
@@ -1197,8 +1219,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online,
/* online pages if requested */
if (online)
- walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
- NULL, online_memory_block);
+ online_memory_blocks(start, size);
else if (driver_managed)
walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
NULL, mark_memory_block_driver_managed);
@@ -1297,6 +1318,7 @@ bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
/* All pageblocks in the memory block are likely to be hot-removable */
return true;
}
+EXPORT_SYMBOL(is_mem_section_removable);
/*
* Confirm all pages in a range [start, end) belong to the same zone.
@@ -1759,6 +1781,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
{
return __offline_pages(start_pfn, start_pfn + nr_pages, retry_forever);
}
+EXPORT_SYMBOL(offline_pages);
#endif /* CONFIG_MEMORY_HOTREMOVE */
/**
--
2.14.3
Some devices (esp. paravirtualized) might want to control
- when to online/offline a memory block
- how to online memory (MOVABLE/NORMAL)
- in which granularity to online/offline memory
So let's add a new flag "driver_managed" and disallow changing the
state from user space. Device onlining/offlining will still work, however
the memory will not actually be onlined/offlined. That has to be handled
by the device driver that owns the memory.
Please note that we have to create user-visible memory blocks after all,
since this is required to trigger the right udev events in order to
reload kexec/kdump. Also, it allows us to see what is going on in the
system (e.g. which memory blocks are still around).
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Reza Arbab <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/memory.c | 22 ++++++++++++++--------
drivers/xen/balloon.c | 2 +-
include/linux/memory.h | 1 +
include/linux/memory_hotplug.h | 4 +++-
mm/memory_hotplug.c | 34 ++++++++++++++++++++++++++++++++--
5 files changed, 51 insertions(+), 12 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index bffe8616bd55..3b8616551561 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -231,27 +231,28 @@ static bool pages_correctly_probed(unsigned long start_pfn)
* Must already be protected by mem_hotplug_begin().
*/
static int
-memory_block_action(unsigned long phys_index, unsigned long action, int online_type)
+memory_block_action(struct memory_block *mem, unsigned long action)
{
- unsigned long start_pfn;
+ unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
- int ret;
+ int ret = 0;
- start_pfn = section_nr_to_pfn(phys_index);
+ if (mem->driver_managed)
+ return 0;
switch (action) {
case MEM_ONLINE:
if (!pages_correctly_probed(start_pfn))
return -EBUSY;
- ret = online_pages(start_pfn, nr_pages, online_type);
+ ret = online_pages(start_pfn, nr_pages, mem->online_type);
break;
case MEM_OFFLINE:
ret = offline_pages(start_pfn, nr_pages);
break;
default:
WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
- "%ld\n", __func__, phys_index, action, action);
+ "%ld\n", __func__, mem->start_section_nr, action, action);
ret = -EINVAL;
}
@@ -269,8 +270,7 @@ static int memory_block_change_state(struct memory_block *mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;
- ret = memory_block_action(mem->start_section_nr, to_state,
- mem->online_type);
+ ret = memory_block_action(mem, to_state);
mem->state = ret ? from_state_req : to_state;
@@ -350,6 +350,11 @@ store_mem_state(struct device *dev,
*/
mem_hotplug_begin();
+ if (mem->driver_managed) {
+ ret = -EINVAL;
+ goto out;
+ }
+
switch (online_type) {
case MMOP_ONLINE_KERNEL:
case MMOP_ONLINE_MOVABLE:
@@ -364,6 +369,7 @@ store_mem_state(struct device *dev,
ret = -EINVAL; /* should never happen */
}
+out:
mem_hotplug_done();
err:
unlock_device_hotplug();
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 065f0b607373..89981d573c06 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -401,7 +401,7 @@ static enum bp_state reserve_additional_memory(void)
* callers drop the mutex before trying again.
*/
mutex_unlock(&balloon_mutex);
- rc = add_memory_resource(nid, resource, memhp_auto_online);
+ rc = add_memory_resource(nid, resource, memhp_auto_online, false);
mutex_lock(&balloon_mutex);
if (rc) {
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 9f8cd856ca1e..018c5e5ecde1 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -29,6 +29,7 @@ struct memory_block {
unsigned long state; /* serialized by the dev->lock */
int section_count; /* serialized by mem_sysfs_mutex */
int online_type; /* for passing data to online routine */
+ bool driver_managed; /* driver handles online/offline */
int phys_device; /* to which fru does this belong? */
void *hw; /* optional pointer to fw/hw data */
int (*phys_callback)(struct memory_block *);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d71829d54360..497e28f5b000 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -326,7 +326,9 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}
extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
void *arg, int (*func)(struct memory_block *, void *));
extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource, bool online);
+extern int add_memory_driver_managed(int nid, u64 start, u64 size);
+extern int add_memory_resource(int nid, struct resource *resource, bool online,
+ bool driver_managed);
extern int arch_add_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap, bool want_memblock);
extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c971295a1100..4c7e0efff079 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1109,8 +1109,15 @@ static int online_memory_block(struct memory_block *mem, void *arg)
return device_online(&mem->dev);
}
+static int mark_memory_block_driver_managed(struct memory_block *mem, void *arg)
+{
+ mem->driver_managed = true;
+ return 0;
+}
+
/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-int __ref add_memory_resource(int nid, struct resource *res, bool online)
+int __ref add_memory_resource(int nid, struct resource *res, bool online,
+ bool driver_managed)
{
u64 start, size;
pg_data_t *pgdat = NULL;
@@ -1118,6 +1125,9 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
bool new_node;
int ret;
+ if (online && driver_managed)
+ return -EINVAL;
+
start = res->start;
size = resource_size(res);
@@ -1189,6 +1199,9 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
if (online)
walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
NULL, online_memory_block);
+ else if (driver_managed)
+ walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
+ NULL, mark_memory_block_driver_managed);
goto out;
@@ -1213,13 +1226,30 @@ int __ref add_memory(int nid, u64 start, u64 size)
if (IS_ERR(res))
return PTR_ERR(res);
- ret = add_memory_resource(nid, res, memhp_auto_online);
+ ret = add_memory_resource(nid, res, memhp_auto_online, false);
if (ret < 0)
release_memory_resource(res);
return ret;
}
EXPORT_SYMBOL_GPL(add_memory);
+int __ref add_memory_driver_managed(int nid, u64 start, u64 size)
+{
+ struct resource *res;
+ int ret;
+
+ res = register_memory_resource(start, size);
+ if (IS_ERR(res))
+ return PTR_ERR(res);
+
+ ret = add_memory_resource(nid, res, false, true);
+ if (ret < 0)
+ release_memory_resource(res);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(add_memory_driver_managed);
+
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
--
2.14.3
offline_pages() theoretically works on sub-section sizes. The problem is
that we have no way to know which pages are actually offline. So right
now, offline_pages() will always mark the whole section as offline.
In addition, in virtualized environments, we might soon have pages that
are logically offline and shall no longer be read or written - e.g.
because we offlined a subsection and told our hypervisor to remove it. We
need a way (e.g. for kdump) to flag these pages (like PG_hwpoison),
otherwise kdump will happily access all memory and crash the system when
accessing memory that is not meant to be accessed.
Marking pages as offline will later also allow us to give kdump that
information and to mark a section as offline once all pages are offline.
It is safe to use the mapcount as all pages are logically removed from the
system (offline_pages()).
This e.g. allows us to add/remove memory to/from Linux in a VM in 4MB
chunks.
Please note that we can't use PG_reserved for this. PG_reserved does not
imply that
- a page should not be dumped
- a page is offline and we should mark the section offline
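For reference, PAGE_MAPCOUNT_OPS(Offline, OFFLINE) expands to roughly the
following helpers (simplified):

static __always_inline int PageOffline(struct page *page)
{
        return atomic_read(&page->_mapcount) == PAGE_OFFLINE_MAPCOUNT_VALUE;
}

static __always_inline void __SetPageOffline(struct page *page)
{
        /* only pages with an unused mapcount (-1) can be marked */
        VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
        atomic_set(&page->_mapcount, PAGE_OFFLINE_MAPCOUNT_VALUE);
}

static __always_inline void __ClearPageOffline(struct page *page)
{
        VM_BUG_ON_PAGE(!PageOffline(page), page);
        atomic_set(&page->_mapcount, -1);
}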
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Philippe Ombredanne <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Miles Chen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Reza Arbab <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
drivers/base/node.c | 1 -
include/linux/memory.h | 1 -
include/linux/mm.h | 2 ++
include/linux/page-flags.h | 9 +++++++++
mm/memory_hotplug.c | 32 +++++++++++++++++++++++---------
mm/page_alloc.c | 22 ++++++++++++++--------
mm/sparse.c | 25 ++++++++++++++++++++++++-
7 files changed, 72 insertions(+), 20 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7a3a580821e0..58a889b2b2f4 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -408,7 +408,6 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
if (!mem_blk)
return -EFAULT;
- mem_blk->nid = nid;
if (!node_online(nid))
return 0;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 31ca3e28b0eb..9f8cd856ca1e 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -33,7 +33,6 @@ struct memory_block {
void *hw; /* optional pointer to fw/hw data */
int (*phys_callback)(struct memory_block *);
struct device dev;
- int nid; /* NID for this memory block */
};
int arch_get_memory_phys_device(unsigned long start_pfn);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06a4be6..30c56665c327 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2063,6 +2063,8 @@ extern unsigned long find_min_pfn_with_active_regions(void);
extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
extern void sparse_memory_present_with_active_regions(int nid);
+extern void __meminit init_single_page(struct page *page, unsigned long pfn,
+ unsigned long zone, int nid);
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e34a27727b9a..07ec6e48073b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -686,6 +686,15 @@ PAGE_MAPCOUNT_OPS(Balloon, BALLOON)
#define PAGE_KMEMCG_MAPCOUNT_VALUE (-512)
PAGE_MAPCOUNT_OPS(Kmemcg, KMEMCG)
+/*
+ * PageOffline() indicates that a page is offline (either never online via
+ * online_pages() or offlined via offline_pages()). Nobody in the system
+ * should have a reference to these pages. In virtual environments,
+ * the backing storage might already have been removed. Don't touch!
+ */
+#define PAGE_OFFLINE_MAPCOUNT_VALUE (-1024)
+PAGE_MAPCOUNT_OPS(Offline, OFFLINE)
+
extern bool is_free_buddy_page(struct page *page);
__PAGEFLAG(Isolated, isolated, PF_ANY);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f74826cdceea..7f7bd2acb55b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,6 +250,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
struct vmem_altmap *altmap, bool want_memblock)
{
int ret;
+ int i;
if (pfn_valid(phys_start_pfn))
return -EEXIST;
@@ -258,6 +259,25 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
if (ret < 0)
return ret;
+ /*
+ * Mark all the pages in the section as offline before creating the
+ * memblock and onlining any sub-sections (and therefore marking the
+ * whole section as online). Mark them reserved so nobody will stumble
+ * over a half-initialized state.
+ */
+ for (i = 0; i < PAGES_PER_SECTION; i++) {
+ unsigned long pfn = phys_start_pfn + i;
+ struct page *page;
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+
+ /* dummy zone, the actual one will be set when onlining pages */
+ init_single_page(page, pfn, ZONE_NORMAL, nid);
+ SetPageReserved(page);
+ __SetPageOffline(page);
+ }
+
if (!want_memblock)
return 0;
@@ -651,6 +671,7 @@ EXPORT_SYMBOL_GPL(__online_page_increment_counters);
void __online_page_free(struct page *page)
{
+ __ClearPageOffline(page);
__free_reserved_page(page);
}
EXPORT_SYMBOL_GPL(__online_page_free);
@@ -891,15 +912,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
int nid;
int ret;
struct memory_notify arg;
- struct memory_block *mem;
-
- /*
- * We can't use pfn_to_nid() because nid might be stored in struct page
- * which is not yet initialized. Instead, we find nid from memory block.
- */
- mem = find_memory_block(__pfn_to_section(pfn));
- nid = mem->nid;
+ nid = pfn_to_nid(pfn);
/* associate pfn range with the zone */
zone = move_pfn_range(online_type, nid, pfn, nr_pages);
@@ -1426,7 +1440,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
}
/*
- * remove from free_area[] and mark all as Reserved.
+ * remove from free_area[] and mark all as Reserved and Offline.
*/
static int
offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d7962f..567278f28188 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1171,7 +1171,7 @@ static void free_one_page(struct zone *zone,
spin_unlock(&zone->lock);
}
-static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+extern void __meminit init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
mm_zero_struct_page(page);
@@ -1206,7 +1206,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
break;
}
- __init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+ init_single_page(pfn_to_page(pfn), pfn, zid, nid);
}
#else
static inline void init_reserved_page(unsigned long pfn)
@@ -1523,7 +1523,7 @@ static unsigned long __init deferred_init_pages(int nid, int zid,
} else {
page++;
}
- __init_single_page(page, pfn, zid, nid);
+ init_single_page(page, pfn, zid, nid);
nr_pages++;
}
return (nr_pages);
@@ -5514,9 +5514,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
not_early:
page = pfn_to_page(pfn);
- __init_single_page(page, pfn, zone, nid);
if (context == MEMMAP_HOTPLUG)
- SetPageReserved(page);
+ /* everything but the zone was initialized */
+ set_page_zone(page, zone);
+ else
+ init_single_page(page, pfn, zone, nid);
/*
* Mark the block movable so that blocks are reserved for
@@ -6404,7 +6406,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
#ifdef CONFIG_HAVE_MEMBLOCK
/*
* Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
+ * initialized by going through init_single_page(). But, there are some
* struct pages which are reserved in memblock allocator and their fields
* may be accessed (for example page_to_pfn() on some configuration accesses
* flags). We must explicitly zero those struct pages.
@@ -8005,7 +8007,6 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
break;
if (pfn == end_pfn)
return;
- offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
spin_lock_irqsave(&zone->lock, flags);
pfn = start_pfn;
@@ -8022,11 +8023,13 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
pfn++;
SetPageReserved(page);
+ __SetPageOffline(page);
continue;
}
BUG_ON(page_count(page));
BUG_ON(!PageBuddy(page));
+ BUG_ON(PageOffline(page));
order = page_order(page);
#ifdef CONFIG_DEBUG_VM
pr_info("remove from free list %lx %d %lx\n",
@@ -8035,11 +8038,14 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
list_del(&page->lru);
rmv_page_order(page);
zone->free_area[order].nr_free--;
- for (i = 0; i < (1 << order); i++)
+ for (i = 0; i < (1 << order); i++) {
SetPageReserved((page+i));
+ __SetPageOffline(page + i);
+ }
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
+ offline_mem_sections(start_pfn, end_pfn);
}
#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index 62eef264a7bd..693e8ba2ad0c 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -623,7 +623,24 @@ void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
}
#ifdef CONFIG_MEMORY_HOTREMOVE
-/* Mark all memory sections within the pfn range as online */
+static bool all_pages_in_section_offline(unsigned long section_nr)
+{
+ unsigned long pfn = section_nr_to_pfn(section_nr);
+ struct page *page;
+ int i;
+
+ for (i = 0; i < PAGES_PER_SECTION; i++, pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+ if (!PageOffline(page))
+ return false;
+ }
+ return true;
+}
+
+/* Try to mark all memory sections within the pfn range as offline */
void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
{
unsigned long pfn;
@@ -639,6 +656,12 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
if (WARN_ON(!valid_section_nr(section_nr)))
continue;
+ /* if we don't cover whole sections, check all pages */
+ if ((section_nr_to_pfn(section_nr) != start_pfn ||
+ start_pfn + PAGES_PER_SECTION >= end_pfn) &&
+ !all_pages_in_section_offline(section_nr))
+ continue;
+
ms = __nr_to_section(section_nr);
ms->section_mem_map &= ~SECTION_IS_ONLINE;
}
--
2.14.3
Hi Dave,
A few comments below:
> + for (i = 0; i < PAGES_PER_SECTION; i++) {
Performance-wise, it is unfortunate that we have to add this loop for every hot-plug. But I do like the finer hot-plug granularity that you achieve, and I do not have a better suggestion for how to avoid this loop. What I also like is that you call init_single_page() only one time.
> + unsigned long pfn = phys_start_pfn + i;
> + struct page *page;
> + if (!pfn_valid(pfn))
> + continue;
> + page = pfn_to_page(pfn);
> +
> + /* dummy zone, the actual one will be set when onlining pages */
> + init_single_page(page, pfn, ZONE_NORMAL, nid);
Is there a reason to use ZONE_NORMAL as a dummy zone? Maybe define some non-existent zone id for that, i.e. __MAX_NR_ZONES? That might trigger some debugging checks of course...
In init_single_page(), if WANT_PAGE_VIRTUAL is defined, it is used to set the virtual address, which is broken if we do not belong to ZONE_NORMAL.
1186 if (!is_highmem_idx(zone))
1187 set_page_address(page, __va(pfn << PAGE_SHIFT));
Otherwise, if you want to keep ZONE_NORMAL here, you could add a new function:
#ifdef WANT_PAGE_VIRTUAL
static void set_page_virtual(struct page *page, unsigned long pfn, enum zone_type zone)
{
        /* The shift won't overflow because ZONE_NORMAL is below 4G. */
        if (!is_highmem_idx(zone))
                set_page_address(page, __va(pfn << PAGE_SHIFT));
}
#else
static inline void set_page_virtual(struct page *page, unsigned long pfn, enum zone_type zone)
{}
#endif
And call it from init_single_page(), and from __meminit memmap_init_zone() in the "context == MEMMAP_HOTPLUG" case.
>
> -static void __meminit __init_single_page(struct page *page, unsigned long pfn,
> +extern void __meminit init_single_page(struct page *page, unsigned long pfn,
I've seen it in other places, but what is the point of having an "extern" function in a .c file?
> #ifdef CONFIG_MEMORY_HOTREMOVE
> -/* Mark all memory sections within the pfn range as online */
> +static bool all_pages_in_section_offline(unsigned long section_nr)
> +{
> + unsigned long pfn = section_nr_to_pfn(section_nr);
> + struct page *page;
> + int i;
> +
> + for (i = 0; i < PAGES_PER_SECTION; i++, pfn++) {
> + if (!pfn_valid(pfn))
> + continue;
> +
> + page = pfn_to_page(pfn);
> + if (!PageOffline(page))
> + return false;
> + }
> + return true;
> +}
Perhaps we could use some counter to keep track of the number of subsections that are currently offlined? If a section covers 128M of memory, and offline/online works at 4M granularity, there are up to 32 subsections in a section, so a small counter (6 bits) would suffice. I'm not sure if there is space in mem_section for this counter, but that would eliminate the loop above.
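Something like this completely untested sketch (nr_offline_chunks and
OFFLINE_CHUNK_PAGES are made-up names):

static bool all_pages_in_section_offline(unsigned long section_nr)
{
        struct mem_section *ms = __nr_to_section(section_nr);

        /* e.g. 32 x 4MB chunks per 128MB section on x86-64 */
        return ms->nr_offline_chunks == PAGES_PER_SECTION / OFFLINE_CHUNK_PAGES;
}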
Thank you,
Pavel
On 30.04.2018 16:35, Pavel Tatashin wrote:
> Hi Dave,
>
> A few comments below:
>
>> + for (i = 0; i < PAGES_PER_SECTION; i++) {
>
> Performance-wise, it is unfortunate that we have to add this loop for every hot-plug. But I do like the finer hot-plug granularity that you achieve, and I do not have a better suggestion for how to avoid this loop. What I also like is that you call init_single_page() only one time.
Thanks! Yes, unfortunately we cannot live with the single loop when
onlining pages for this feature.
>
>> + unsigned long pfn = phys_start_pfn + i;
>> + struct page *page;
>> + if (!pfn_valid(pfn))
>> + continue;
>> + page = pfn_to_page(pfn);
>> +
>> + /* dummy zone, the actual one will be set when onlining pages */
>> + init_single_page(page, pfn, ZONE_NORMAL, nid);
>
> Is there a reason to use ZONE_NORMAL as a dummy zone? Maybe define some non-existent zone id for that, i.e. __MAX_NR_ZONES? That might trigger some debugging checks of course...
Then it could happen that we consume more bits in the page flags than we
actually need. But it could be an opt-in debugging option later on, right?
>
> In init_single_page(), if WANT_PAGE_VIRTUAL is defined, it is used to set the virtual address, which is broken if we do not belong to ZONE_NORMAL.
>
Grr, missed that. Thanks for your very good eyes!
> 1186 if (!is_highmem_idx(zone))
> 1187 set_page_address(page, __va(pfn << PAGE_SHIFT));
>
> Otherwise, if you want to keep ZONE_NORMAL here, you could add a new function:
>
> #ifdef WANT_PAGE_VIRTUAL
> static void set_page_virtual(struct page *page, unsigned long pfn, enum zone_type zone)
> {
> /* The shift won't overflow because ZONE_NORMAL is below 4G. */
> if (!is_highmem_idx(zone))
> set_page_address(page, __va(pfn << PAGE_SHIFT));
> }
> #else
> static inline void set_page_virtual(struct page *page, unsigned long pfn, enum zone_type zone)
> {}
> #endif
>
> And call it from init_single_page(), and from __meminit memmap_init_zone() in the "context == MEMMAP_HOTPLUG" case.
I was thinking about moving it into set_page_zone() and conditionally
setting the address to 0 or to set_page_address(page,
__va(pfn << PAGE_SHIFT)), roughly like the untested sketch below. What
do you prefer?
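Untested sketch; only the WANT_PAGE_VIRTUAL part would be new:

static inline void set_page_zone(struct page *page, enum zone_type zone)
{
        page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
        page->flags |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
#ifdef WANT_PAGE_VIRTUAL
        /* fix up the virtual address once the final zone is known */
        if (!is_highmem_idx(zone))
                set_page_address(page, __va(page_to_pfn(page) << PAGE_SHIFT));
        else
                set_page_address(page, NULL);
#endif
}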
>
>>
>> -static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>> +extern void __meminit init_single_page(struct page *page, unsigned long pfn,
>
> I've seen it in other places, but what is the point of having an "extern" function in a .c file?
I've seen it all over the place, that's why I am using it :) (as I
basically had the same question). Can somebody answer that?
>
>
>> #ifdef CONFIG_MEMORY_HOTREMOVE
>> -/* Mark all memory sections within the pfn range as online */
>> +static bool all_pages_in_section_offline(unsigned long section_nr)
>> +{
>> + unsigned long pfn = section_nr_to_pfn(section_nr);
>> + struct page *page;
>> + int i;
>> +
>> + for (i = 0; i < PAGES_PER_SECTION; i++, pfn++) {
>> + if (!pfn_valid(pfn))
>> + continue;
>> +
>> + page = pfn_to_page(pfn);
>> + if (!PageOffline(page))
>> + return false;
>> + }
>> + return true;
>> +}
>
> Perhaps we could use some counter to keep track of the number of subsections that are currently offlined? If a section covers 128M of memory, and offline/online works at 4M granularity, there are up to 32 subsections in a section, so a small counter (6 bits) would suffice. I'm not sure if there is space in mem_section for this counter, but that would eliminate the loop above.
Yes, that would also be an optimization. At least I optimized it for now
so ordinary offline/online is not harmed. As we need PageOffline() also
for kdump (and maybe later also for safety checks when
onlining/offlining pages), we would right now store duplicate
information, so I would like to defer that.
Thanks a lot Pavel!
>
> Thank you,
> Pavel
>
--
Thanks,
David / dhildenb
>>
>>>
>>> -static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>>> +extern void __meminit init_single_page(struct page *page, unsigned long pfn,
>>
>> I've seen it in other places, but what is the point of having an "extern" function in a .c file?
>
> I've seen it all over the place, that's why I am using it :) (as I
> basically had the same question). Can somebody answer that?
BTW I was looking at the wrong file (header). This of course has to go!
--
Thanks,
David / dhildenb
On 30.04.2018 11:42, David Hildenbrand wrote:
> [...]
If there are no further comments, I'll send a v1 (!RFC) version, along
with the virtio-mem prototype after rebasing (assuming that nothing
breaks :) ).
--
Thanks,
David / dhildenb