Subject: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. This reserved
memory is used for saving the contents of old crashed kernel's memory before
fadump capture kernel uses old kernel's memory area to boot. However, This
reserved memory area stays unused until system crash and isn't available
for production kernel to use.

Instead of setting aside a significant chunk of memory that nobody can use,
take advantage ZONE_MOVABLE to mark a significant chunk of reserved memory
as ZONE_MOVABLE, so that the kernel is prevented from using, but
applications are free to use it.

Patch 1 introduces an interface to mark reserved memory as ZONE_MOVABLE.
Patch 2 uses the above interface to mark reserved memory movable so that
it can be used for applications usage, making fadump reservationless.
Patch 3 and 4 fixes minor issues.

Changes in V6:
- Introduce an interface to mark reserved memory as ZONE_MOVABLE. Hence
sending this series as RFC again.
- Mark reserved area as ZONE_MOVABLE instead of CMA.
- Add fadump=nonmovable parameter for user who don't want to use ZONE_MOVABLE.

Changes in V5:
- Drop the patch that does metadata movement.
- Move the kexec fix patch to top (patch 1)
- Fold CMA documenation patch into patch 2
- Fix the compilation issues when CONFIG_CMA is not set reported by Hari.
- Use the approach of using boot memory size for CMA as suggested by Hari
except the movement of sections. Thanks to Hari.

Changes in V4:
- patch 1: Make fadump compatible irrespective of kernel versions.
- patch 4: moved out of the series and been posted seperatly at
http://patchwork.ozlabs.org/patch/896716/
- Documentation update about CMA reservation.

Changes in V3:
- patch 1 & 2: move metadata region and documentation update.
- patch 7: Un-register the faudmp on kexec path


---

Mahesh Salgaonkar (4):
mm/page_alloc: Introduce an interface to mark reserved memory as ZONE_MOVABLE
powerpc/fadump: Reservationless firmware assisted dump
powerpc/fadump: throw proper error message on fadump registration failure.
powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.


Documentation/powerpc/firmware-assisted-dump.txt | 18 +++
arch/powerpc/include/asm/fadump.h | 7 +
arch/powerpc/kernel/fadump.c | 123 +++++++++++++++++--
arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +
include/linux/mmzone.h | 2
mm/page_alloc.c | 146 ++++++++++++++++++++++
6 files changed, 290 insertions(+), 13 deletions(-)

--
Signature



Subject: [RFC PATCH v6 1/4] mm/page_alloc: Introduce an interface to mark reserved memory as ZONE_MOVABLE

From: Mahesh Salgaonkar <[email protected]>

Add an interface to allow a custom reserved memory to be marked as
ZONE_MOVABLE. This will help some subsystem's to convert their reserved
memory region into ZONE_MOVABLE so that the memory can still be available
to user applications.

The approach is based on Joonsoo Kim's commit bad8c6c0
(https://github.com/torvalds/linux/commit/bad8c6c0) that
uses ZONE_MOVABLE to manage CMA area. Majority of the code has been taken
from the Joonsoo Kim's commit mentioned above. But I see above commit
has been reverted due to some issues reported on i386. I believe this
patch is being reworked and re-posted soon.

Like CMA, the other user of ZONE_MOVABLE can be fadump on powerpc, which
reserves significant chunk of memory that is used only after system
is crashed. Until then the reserved memory is unused. By marking that
memory to ZONE_MOVABLE, it can be at least utilized by user applications.

This patch proposes a RFC implementation of an interface to mark
specified reserved area as ZONE_MOVABLE. Comments are welcome.

Signed-off-by: Mahesh Salgaonkar <[email protected]>
---
include/linux/mmzone.h | 2 +
mm/page_alloc.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 148 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..2519dd690572 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1288,6 +1288,8 @@ struct mminit_pfnnid_cache {
#endif

void memory_present(int nid, unsigned long start, unsigned long end);
+extern int __init zone_movable_init_reserved_mem(phys_addr_t base,
+ phys_addr_t size);

/*
* If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..0817ed8843cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7687,6 +7687,152 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
return true;
}

+static __init void mark_zone_movable(struct page *page)
+{
+ unsigned i = pageblock_nr_pages;
+ struct page *p = page;
+ struct zone *zone;
+ unsigned long pfn = page_to_pfn(page);
+ int nid = page_to_nid(page);
+
+ zone = page_zone(page);
+ zone->present_pages -= pageblock_nr_pages;
+
+ do {
+ __ClearPageReserved(p);
+ set_page_count(p, 0);
+
+ /* Steal pages from other zones */
+ set_page_links(p, ZONE_MOVABLE, nid, pfn);
+ } while (++p, ++pfn, --i);
+
+ zone = page_zone(page);
+ zone->present_pages += pageblock_nr_pages;
+
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+ if (pageblock_order >= MAX_ORDER) {
+ i = pageblock_nr_pages;
+ p = page;
+ do {
+ set_page_refcounted(p);
+ __free_pages(p, MAX_ORDER - 1);
+ p += MAX_ORDER_NR_PAGES;
+ } while (i -= MAX_ORDER_NR_PAGES);
+ } else {
+ set_page_refcounted(page);
+ __free_pages(page, pageblock_order);
+ }
+
+ adjust_managed_page_count(page, pageblock_nr_pages);
+}
+
+static int __init zone_movable_activate_area(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned long base_pfn = start_pfn, pfn = start_pfn;
+ struct zone *zone;
+ unsigned i = (end_pfn - start_pfn) >> pageblock_order;
+
+ zone = page_zone(pfn_to_page(base_pfn));
+ while (pfn < end_pfn) {
+ if (!pfn_valid(pfn))
+ goto err;
+
+ if (page_zone(pfn_to_page(pfn)) != zone)
+ goto err;
+ pfn++;
+ }
+
+ do {
+ mark_zone_movable(pfn_to_page(base_pfn));
+ base_pfn += pageblock_nr_pages;
+ } while (--i);
+
+ return 0;
+err:
+ pr_err("Zone movable could not be activated\n");
+ return -EINVAL;
+}
+
+/**
+ * zone_movable_init_reserved_mem() - create custom zone movable area from
+ * reserved memory
+ * @base: Base address of the reserved area
+ * @size: Size of the reserved area (in bytes),
+ *
+ * This function creates custom zone movable area from already reserved memory.
+ */
+int __init zone_movable_init_reserved_mem(phys_addr_t base, phys_addr_t size)
+{
+ struct zone *zone;
+ pg_data_t *pgdat;
+ unsigned long start_pfn = PHYS_PFN(base);
+ unsigned long end_pfn = PHYS_PFN(base + size);
+ phys_addr_t alignment;
+ int ret;
+
+ if (!size || !memblock_is_region_reserved(base, size))
+ return -EINVAL;
+
+ /* ensure minimal alignment required by mm core */
+ alignment = PAGE_SIZE <<
+ max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
+
+ if (ALIGN(base, alignment) != base || ALIGN(size, alignment) != size)
+ return -EINVAL;
+
+ for_each_online_pgdat(pgdat) {
+ zone = &pgdat->node_zones[ZONE_MOVABLE];
+
+ /*
+ * Continue if zone is already populated.
+ * Should we at least bump up the zone->spanned_pages
+ * for existing populated zone ?
+ */
+ if (populated_zone(zone))
+ continue;
+
+ /*
+ * Is it possible to allow memory region across nodes to
+ * be marked as ZONE_MOVABLE ?
+ */
+ if (pfn_to_nid(start_pfn) != pgdat->node_id)
+ continue;
+
+ /* Not sure if this is a right place to init empty zone. */
+ if (zone_is_empty(zone)) {
+ init_currently_empty_zone(zone, start_pfn,
+ end_pfn - start_pfn);
+ zone->spanned_pages = end_pfn - start_pfn;
+ }
+ }
+
+ ret = zone_movable_activate_area(start_pfn, end_pfn);
+
+ if (ret)
+ return ret;
+
+ /*
+ * Reserved pages for ZONE_MOVABLE are now activated and
+ * this would change ZONE_MOVABLE's managed page counter and
+ * the other zones' present counter. We need to re-calculate
+ * various zone information that depends on this initialization.
+ */
+ build_all_zonelists(NULL);
+ for_each_populated_zone(zone) {
+ if (zone_idx(zone) == ZONE_MOVABLE) {
+ zone_pcp_reset(zone);
+ setup_zone_pageset(zone);
+ } else
+ zone_pcp_update(zone);
+
+ set_zone_contiguous(zone);
+ }
+
+ return 0;
+}
+
#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || defined(CONFIG_CMA)

static unsigned long pfn_max_align_down(unsigned long pfn)


Subject: [RFC PATCH v6 2/4] powerpc/fadump: Reservationless firmware assisted dump

From: Mahesh Salgaonkar <[email protected]>

One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. On large
systems with TeraBytes of memory, this reservation can be quite
significant.

In some cases, fadump fails if the memory reserved is insufficient, or
if the reserved memory was DLPAR hot-removed.

In the normal case, post reboot, the preserved memory is filtered to
extract only relevant areas of interest using the makedumpfile tool.
While the tool provides flexibility to determine what needs to be part
of the dump and what memory to filter out, all supported distributions
default this to "Capture only kernel data and nothing else".

We take advantage of this default and the Linux kernel's zone movable
feature to fundamentally change the memory reservation model for fadump.

Instead of setting aside a significant chunk of memory nobody can use,
this patch marks a significant chunk of reserved memory as ZONE_MOVABLE
that the kernel is prevented from using (due to MIGRATE_MOVABLE),
but applications are free to use it. With this fadump will still be able
to capture all of the kernel memory and most of the user space memory
except the user pages that were present in ZONE_MOVABLE zone. But if
someone wants to capture all of user space memory and ok with reserved
memory not available to production system, then 'fadump=nonmovable' kernel
parameter can be used to fallback to old behaviour.

Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:
[root@zzxx-yy10 ~]# free -m
total used free shared buff/cache available
Mem: 7557 193 6822 12 541 6725
Swap: 4095 0 4095

With this patch:
[root@zzxx-yy10 ~]# free -m
total used free shared buff/cache available
Mem: 8133 194 7464 12 475 7338
Swap: 4095 0 4095

Changes made here are completely transparent to how fadump has
traditionally worked.

Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Hari Bathini <[email protected]>
---
Documentation/powerpc/firmware-assisted-dump.txt | 18 +++++
arch/powerpc/include/asm/fadump.h | 5 +
arch/powerpc/kernel/fadump.c | 80 ++++++++++++++++++++--
3 files changed, 95 insertions(+), 8 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
index bdd344aa18d9..f8a6343a1dcf 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -113,7 +113,16 @@ header, is usually reserved at an offset greater than boot memory
size (see Fig. 1). This area is *not* released: this region will
be kept permanently reserved, so that it can act as a receptacle
for a copy of the boot memory content in addition to CPU state
-and HPTE region, in the case a crash does occur.
+and HPTE region, in the case a crash does occur. Since this reserved
+memory area is used only after the system crash, there is no point in
+blocking this significant chunk of memory from production kernel.
+Hence, the implementation marks the memory reserved for fadump as
+ZONE_MOVABLE. With ZONE_MOVABLE this memory will be available for
+applications to use it, while kernel is prevented from using it. With
+this fadump will still be able to capture all of the kernel memory and
+most of the user space memory except the user pages that were present
+in ZONE_MOVABLE region.
+

o Memory Reservation during first kernel

@@ -162,6 +171,9 @@ How to enable firmware-assisted dump (fadump):

1. Set config option CONFIG_FA_DUMP=y and build kernel.
2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
+ By default, the reserved memory will be marked as zone movable.
+ Alternatively, user can boot linux kernel with 'fadump=nonmovable' to
+ prevent fadump to mark reserved memory as zone movable.
3. Optionally, user can also set 'crashkernel=' kernel cmdline
to specify size of the memory to reserve for boot memory dump
preservation.
@@ -172,6 +184,10 @@ NOTE: 1. 'fadump_reserve_mem=' parameter has been deprecated. Instead
2. If firmware-assisted dump fails to reserve memory then it
will fallback to existing kdump mechanism if 'crashkernel='
option is set at kernel cmdline.
+ 3. if user wants to capture all of user space memory and ok with
+ reserved memory not available to production system, then
+ 'fadump=nonmovable' kernel parameter can be used to fallback to
+ old behaviour.

Sysfs/debugfs files:
------------
diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 5a23010af600..5c0de4508aab 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -48,6 +48,10 @@

#define memblock_num_regions(memblock_type) (memblock.memblock_type.cnt)

+/* Alignement per core mm requirement. */
+#define FADUMP_PAGEBLOCK_ALIGNMENT (PAGE_SIZE << \
+ max_t(unsigned long, MAX_ORDER - 1, pageblock_order))
+
/* Firmware provided dump sections */
#define FADUMP_CPU_STATE_DATA 0x0001
#define FADUMP_HPTE_REGION 0x0002
@@ -141,6 +145,7 @@ struct fw_dump {
unsigned long fadump_supported:1;
unsigned long dump_active:1;
unsigned long dump_registered:1;
+ unsigned long nonmovable:1; /* !ZONE_MOVABLE */
};

/*
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 07e8396d472b..ce333c1d4cb8 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -34,6 +34,7 @@
#include <linux/crash_dump.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
+#include <linux/mmzone.h>

#include <asm/debugfs.h>
#include <asm/page.h>
@@ -375,8 +376,11 @@ int __init fadump_reserve_mem(void)
*/
if (fdm_active)
fw_dump.boot_memory_size = be64_to_cpu(fdm_active->rmr_region.source_len);
- else
+ else {
fw_dump.boot_memory_size = fadump_calculate_reserve_size();
+ fw_dump.boot_memory_size = ALIGN(fw_dump.boot_memory_size,
+ FADUMP_PAGEBLOCK_ALIGNMENT);
+ }

/*
* Calculate the memory boundary.
@@ -423,8 +427,7 @@ int __init fadump_reserve_mem(void)
fw_dump.fadumphdr_addr =
be64_to_cpu(fdm_active->rmr_region.destination_address) +
be64_to_cpu(fdm_active->rmr_region.source_len);
- pr_debug("fadumphdr_addr = %p\n",
- (void *) fw_dump.fadumphdr_addr);
+ pr_debug("fadumphdr_addr = %pa\n", &fw_dump.fadumphdr_addr);
} else {
size = get_fadump_area_size();

@@ -474,6 +477,10 @@ static int __init early_fadump_param(char *p)
fw_dump.fadump_enabled = 1;
else if (strncmp(p, "off", 3) == 0)
fw_dump.fadump_enabled = 0;
+ else if (strncmp(p, "nonmovable", 10) == 0) {
+ fw_dump.fadump_enabled = 1;
+ fw_dump.nonmovable = 1;
+ }

return 0;
}
@@ -1146,7 +1153,7 @@ static int fadump_unregister_dump(struct fadump_mem_struct *fdm)
return 0;
}

-static int fadump_invalidate_dump(struct fadump_mem_struct *fdm)
+static int fadump_invalidate_dump(const struct fadump_mem_struct *fdm)
{
int rc = 0;
unsigned int wait_time;
@@ -1177,9 +1184,8 @@ void fadump_cleanup(void)
{
/* Invalidate the registration only if dump is active. */
if (fw_dump.dump_active) {
- init_fadump_mem_struct(&fdm,
- be64_to_cpu(fdm_active->cpu_state_data.destination_address));
- fadump_invalidate_dump(&fdm);
+ /* pass the same memory dump structure provided by platform */
+ fadump_invalidate_dump(fdm_active);
} else if (fw_dump.dump_registered) {
/* Un-register Firmware-assisted dump if it was registered. */
fadump_unregister_dump(&fdm);
@@ -1525,3 +1531,63 @@ int __init setup_fadump(void)
return 1;
}
subsys_initcall(setup_fadump);
+
+/*
+ * Mark the fadump reserved area as ZONE_MOVABLE.
+ * The total size of fadump reserved memory covers for boot memory size
+ * + cpu data size + hpte size and metadata. Initialize only the area
+ * equivalent to boot memory size as zone movable. The reamining portion
+ * of fadump reserved memory will be not given to movable zone and pages
+ * for thoes will stay reserved. boot memory size is aligned per core mm
+ * requirement to satisy zone_movable_init_reserved_mem() call.
+ * But for some reason even if it fails we still have the memory reservation
+ * with us and we can still continue doing fadump.
+ */
+static int __init fadump_init_reserved_mem(void)
+{
+ unsigned long long base, size;
+ int rc;
+
+ if (!fw_dump.fadump_enabled)
+ return 0;
+
+ /* Ignore if booted with fadump=nonmovable */
+ if (fw_dump.nonmovable)
+ return 0;
+
+ if (fw_dump.dump_active)
+ return 0;
+
+ /*
+ * Mark only the size equivalent to boot memory size as movable
+ * zone.
+ */
+ base = fw_dump.reserve_dump_area_start;
+ size = fw_dump.boot_memory_size;
+
+ if (!size)
+ return 0;
+
+ rc = zone_movable_init_reserved_mem(base, size);
+ if (rc) {
+ pr_err("Failed to init zone movable area for firmware-assisted dump,%d\n", rc);
+ /*
+ * Though the zone movable init has failed, we still have memory
+ * reservation with us. The reserved memory will be
+ * blocked from production system usage. Hence return 1,
+ * so that we can continue with fadump.
+ */
+ return 1;
+ }
+
+ /*
+ * So we now have successfully initialized reserved area as
+ * ZONE_MOVABLE for fadump.
+ */
+ pr_info("Initialized 0x%llx bytes as zone movable area at %ldMB from "
+ "0x%lx bytes of memory reserved for firmware-assisted dump\n",
+ size, (unsigned long)base >> 20,
+ fw_dump.reserve_dump_area_size);
+ return 1;
+}
+core_initcall(fadump_init_reserved_mem);


Subject: [RFC PATCH v6 3/4] powerpc/fadump: throw proper error message on fadump registration failure.

From: Mahesh Salgaonkar <[email protected]>

fadump fails to register when there are holes in reserved memory area.
This can happen if user has hot-removed a memory that falls in the fadump
reserved memory area. Throw a meaningful error message to the user in
such case.

Signed-off-by: Mahesh Salgaonkar <[email protected]>
---
arch/powerpc/kernel/fadump.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index ce333c1d4cb8..d1375f3f48c3 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -170,6 +170,36 @@ static int is_boot_memory_area_contiguous(void)
return ret;
}

+/*
+ * Returns 1, if there are no holes in reserved memory area,
+ * 0 otherwise.
+ */
+static int is_reserved_memory_area_contiguous(void)
+{
+ struct memblock_region *reg;
+ unsigned long start, end;
+ unsigned long d_start = fw_dump.reserve_dump_area_start;
+ unsigned long d_end = d_start + fw_dump.reserve_dump_area_size;
+ int ret = 0;
+
+ for_each_memblock(memory, reg) {
+ start = max(d_start, (unsigned long)reg->base);
+ end = min(d_end, (unsigned long)(reg->base + reg->size));
+ if (d_start < end) {
+ /* Memory hole from d_start to start */
+ if (start > d_start)
+ break;
+
+ if (end == d_end) {
+ ret = 1;
+ break;
+ }
+ d_start = end + 1;
+ }
+ }
+ return ret;
+}
+
/* Print firmware assisted dump configurations for debugging purpose. */
static void fadump_show_config(void)
{
@@ -531,6 +561,9 @@ static int register_fw_dump(struct fadump_mem_struct *fdm)
if (!is_boot_memory_area_contiguous())
pr_err("Can't have holes in boot memory area while "
"registering fadump\n");
+ else if (!is_reserved_memory_area_contiguous())
+ pr_err("Can't have holes in reserved memory area while"
+ " registering fadump\n");

printk(KERN_ERR "Failed to register firmware-assisted kernel"
" dump. Parameter Error(%d).\n", rc);


Subject: [RFC PATCH v6 4/4] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.

From: Mahesh Salgaonkar <[email protected]>

For fadump to work successfully there should not be any holes in reserved
memory ranges where kernel has asked firmware to move the content of old
kernel memory in event of crash. Now that fadump reserved memory is marked
as movable zone, this memory area is now not protected from hot-remove
operations. Hence, fadump service can fail to re-register after the
hot-remove operation, if hot-removed memory belongs to fadump reserved
region. To avoid this make sure that memory from fadump reserved area is
not hot-removable if fadump is registered.

However, if user still wants to remove that memory, he can do so by
manually stopping fadump service before hot-remove operation.

Signed-off-by: Mahesh Salgaonkar <[email protected]>
---
arch/powerpc/include/asm/fadump.h | 2 +-
arch/powerpc/kernel/fadump.c | 10 ++++++++--
arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +++++--
3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 5c0de4508aab..cd28e9b59057 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -208,7 +208,7 @@ struct fad_crash_memory_ranges {
unsigned long long size;
};

-extern int is_fadump_boot_memory_area(u64 addr, ulong size);
+extern int is_fadump_memory_area(u64 addr, ulong size);
extern int early_init_dt_scan_fw_dump(unsigned long node,
const char *uname, int depth, void *data);
extern int fadump_reserve_mem(void);
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d1375f3f48c3..18a35f12ffb5 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -116,13 +116,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,

/*
* If fadump is registered, check if the memory provided
- * falls within boot memory area.
+ * falls within boot memory area and reserved memory area.
*/
-int is_fadump_boot_memory_area(u64 addr, ulong size)
+int is_fadump_memory_area(u64 addr, ulong size)
{
+ u64 d_start = fw_dump.reserve_dump_area_start;
+ u64 d_end = d_start + fw_dump.reserve_dump_area_size;
+
if (!fw_dump.dump_registered)
return 0;

+ if (((addr + size) > d_start) && (addr <= d_end))
+ return 1;
+
return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
}

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c1578f54c626..e4c658cda3a7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
phys_addr = lmb->base_addr;

#ifdef CONFIG_FA_DUMP
- /* Don't hot-remove memory that falls in fadump boot memory area */
- if (is_fadump_boot_memory_area(phys_addr, block_sz))
+ /*
+ * Don't hot-remove memory that falls in fadump boot memory area
+ * and memory that is reserved for capturing old kernel memory.
+ */
+ if (is_fadump_memory_area(phys_addr, block_sz))
return false;
#endif



2018-07-16 08:28:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote:
> One of the primary issues with Firmware Assisted Dump (fadump) on Power
> is that it needs a large amount of memory to be reserved. This reserved
> memory is used for saving the contents of old crashed kernel's memory before
> fadump capture kernel uses old kernel's memory area to boot. However, This
> reserved memory area stays unused until system crash and isn't available
> for production kernel to use.

How much memory are we talking about. Regular kernel dump process needs
some reserved memory as well. Why that is not a big problem?

> Instead of setting aside a significant chunk of memory that nobody can use,
> take advantage ZONE_MOVABLE to mark a significant chunk of reserved memory
> as ZONE_MOVABLE, so that the kernel is prevented from using, but
> applications are free to use it.

Why kernel cannot use that memory while userspace can?
[...]
> Documentation/powerpc/firmware-assisted-dump.txt | 18 +++
> arch/powerpc/include/asm/fadump.h | 7 +
> arch/powerpc/kernel/fadump.c | 123 +++++++++++++++++--
> arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +
> include/linux/mmzone.h | 2
> mm/page_alloc.c | 146 ++++++++++++++++++++++
> 6 files changed, 290 insertions(+), 13 deletions(-)

This is quite a large change and you didn't seem to explain why we need
it.
--
Michal Hocko
SUSE Labs

Subject: Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

On 07/16/2018 01:56 PM, Michal Hocko wrote:
> On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote:
>> One of the primary issues with Firmware Assisted Dump (fadump) on Power
>> is that it needs a large amount of memory to be reserved. This reserved
>> memory is used for saving the contents of old crashed kernel's memory before
>> fadump capture kernel uses old kernel's memory area to boot. However, This
>> reserved memory area stays unused until system crash and isn't available
>> for production kernel to use.
>
> How much memory are we talking about. Regular kernel dump process needs
> some reserved memory as well. Why that is not a big problem?

We reserve around 5% of total system RAM. On large systems with
TeraBytes of memory, this reservation can be quite significant.

The regular kernel dump uses the kexec method to boot into capture
kernel and it can control the parameters that are being passed to
capture kernel. This allows a capability to strip down the parameters
that can help lowering down the memory requirement for capture kernel to
boot. This allows regular kdump to reserve less memory to start with.

Where as fadump depends on power firmware (pHyp) to load the capture
kernel after full reset and boots like a regular kernel. It needs same
amount of memory to boot as the production kernel. On large systems
production kernel needs significant amount of memory to boot. Hence
fadump needs to reserve enough memory for capture kernel to boot
successfully and execute dump capturing operations. By default fadump
reserves 5% of total system RAM and in most cases this has worked
flawlessly on variety of system configurations. Optionally,
'crashkernel=X' can also be used to specify more fine-tuned memory size
for reservation.


>
>> Instead of setting aside a significant chunk of memory that nobody can use,
>> take advantage ZONE_MOVABLE to mark a significant chunk of reserved memory
>> as ZONE_MOVABLE, so that the kernel is prevented from using, but
>> applications are free to use it.
>
> Why kernel cannot use that memory while userspace can?

fadump needs to reserve memory to be able to save crashing kernel's
memory, with help from power firmware, before the capture kernel loads
into crashing kernel's memory area. Any contents present in this
reserved memory will be over-written. If kernel is allowed to use this
memory, then we loose that kernel data and won't be part of captured
dump, which could be critical to debug root cause of system crash.

Kdump and fadump both uses same infrastructure/tool (makedumpfile) to
capture the memory dump. While the tool provides flexibility to
determine what needs to be part of the dump and what memory to filter
out, all supported distributions defaults to "Capture only kernel data
and nothing else". Taking advantage of this default we can at least make
the reserved memory available for userspace to use.

If someone wants to capture userspace data as well then
'fadump=nonmovable' option can be used where reserved pages won't be
marked zone movable.

Advantage of movable method is the reserved memory chunk is also
available for use.

> [...]
>> Documentation/powerpc/firmware-assisted-dump.txt | 18 +++
>> arch/powerpc/include/asm/fadump.h | 7 +
>> arch/powerpc/kernel/fadump.c | 123 +++++++++++++++++--
>> arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +
>> include/linux/mmzone.h | 2
>> mm/page_alloc.c | 146 ++++++++++++++++++++++
>> 6 files changed, 290 insertions(+), 13 deletions(-)
>
> This is quite a large change and you didn't seem to explain why we need
> it.
>

In fadump case, the reserved memory stays unused until system is
crashed. fadump uses very small portion of this reserved memory, few
KBs, for storing fadump metadata. Otherwise, the significant chunk of
memory is completely unused. Hence, instead of blocking a memory that is
un-utilized through out the lifetime of system, it's better to give it
back to production kernel to use. But at the same time we don't want
kernel to use that memory. While exploring we found 1) Linux kernel's
Contiguous Memory Allocator (CMA) feature and 2) ZONE_MOVABLE, that
suites the requirement. Initial 5 revisions of this patchset () was
using CMA feature. However, fadump does not do any cma allocations,
hence it will be more appropriate to use zone movable to achieve the same.

But unlike CMA, there is no interface available to mark a custom
reserved memory area as ZONE_MOVABLE. Hence patch 1/4 proposes the same.

Thanks,
-Mahesh.



2018-07-17 11:53:25

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

On Tue 17-07-18 16:58:10, Mahesh Jagannath Salgaonkar wrote:
> On 07/16/2018 01:56 PM, Michal Hocko wrote:
> > On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote:
> >> One of the primary issues with Firmware Assisted Dump (fadump) on Power
> >> is that it needs a large amount of memory to be reserved. This reserved
> >> memory is used for saving the contents of old crashed kernel's memory before
> >> fadump capture kernel uses old kernel's memory area to boot. However, This
> >> reserved memory area stays unused until system crash and isn't available
> >> for production kernel to use.
> >
> > How much memory are we talking about. Regular kernel dump process needs
> > some reserved memory as well. Why that is not a big problem?
>
> We reserve around 5% of total system RAM. On large systems with
> TeraBytes of memory, this reservation can be quite significant.
>
> The regular kernel dump uses the kexec method to boot into capture
> kernel and it can control the parameters that are being passed to
> capture kernel. This allows a capability to strip down the parameters
> that can help lowering down the memory requirement for capture kernel to
> boot. This allows regular kdump to reserve less memory to start with.
>
> Where as fadump depends on power firmware (pHyp) to load the capture
> kernel after full reset and boots like a regular kernel. It needs same
> amount of memory to boot as the production kernel. On large systems
> production kernel needs significant amount of memory to boot. Hence
> fadump needs to reserve enough memory for capture kernel to boot
> successfully and execute dump capturing operations. By default fadump
> reserves 5% of total system RAM and in most cases this has worked
> flawlessly on variety of system configurations. Optionally,
> 'crashkernel=X' can also be used to specify more fine-tuned memory size
> for reservation.

So why do we even care about fadump when regular kexec provides
(presumably) same functionality with a smaller memory footprint? Or is
there any reason why kexec doesn't work well on ppc?

> >> Instead of setting aside a significant chunk of memory that nobody can use,
> >> take advantage ZONE_MOVABLE to mark a significant chunk of reserved memory
> >> as ZONE_MOVABLE, so that the kernel is prevented from using, but
> >> applications are free to use it.
> >
> > Why kernel cannot use that memory while userspace can?
>
> fadump needs to reserve memory to be able to save crashing kernel's
> memory, with help from power firmware, before the capture kernel loads
> into crashing kernel's memory area. Any contents present in this
> reserved memory will be over-written. If kernel is allowed to use this
> memory, then we loose that kernel data and won't be part of captured
> dump, which could be critical to debug root cause of system crash.

But then you simply screw user memory sitting there. This might be not
so critical as the kernel memory but still it sounds like you are
reducing the usefulness of the dump just because of inherent limitations
of fadump.

> Kdump and fadump both uses same infrastructure/tool (makedumpfile) to
> capture the memory dump. While the tool provides flexibility to
> determine what needs to be part of the dump and what memory to filter
> out, all supported distributions defaults to "Capture only kernel data
> and nothing else". Taking advantage of this default we can at least make
> the reserved memory available for userspace to use.
>
> If someone wants to capture userspace data as well then
> 'fadump=nonmovable' option can be used where reserved pages won't be
> marked zone movable.

Ohh, so you have an unclutter thing to support the case above.

> Advantage of movable method is the reserved memory chunk is also
> available for use.
>
> > [...]
> >> Documentation/powerpc/firmware-assisted-dump.txt | 18 +++
> >> arch/powerpc/include/asm/fadump.h | 7 +
> >> arch/powerpc/kernel/fadump.c | 123 +++++++++++++++++--
> >> arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +
> >> include/linux/mmzone.h | 2
> >> mm/page_alloc.c | 146 ++++++++++++++++++++++
> >> 6 files changed, 290 insertions(+), 13 deletions(-)
> >
> > This is quite a large change and you didn't seem to explain why we need
> > it.
> >
>
> In fadump case, the reserved memory stays unused until system is
> crashed. fadump uses very small portion of this reserved memory, few
> KBs, for storing fadump metadata. Otherwise, the significant chunk of
> memory is completely unused. Hence, instead of blocking a memory that is
> un-utilized through out the lifetime of system, it's better to give it
> back to production kernel to use. But at the same time we don't want
> kernel to use that memory. While exploring we found 1) Linux kernel's
> Contiguous Memory Allocator (CMA) feature and 2) ZONE_MOVABLE, that
> suites the requirement. Initial 5 revisions of this patchset () was
> using CMA feature. However, fadump does not do any cma allocations,
> hence it will be more appropriate to use zone movable to achieve the same.
>
> But unlike CMA, there is no interface available to mark a custom
> reserved memory area as ZONE_MOVABLE. Hence patch 1/4 proposes the same.

Well, you are adding a significant amount of code so you should be much
better in explaining why does the generic code care about a ppc specific
kdump method.
--
Michal Hocko
SUSE Labs

Subject: Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

On 07/17/2018 05:22 PM, Michal Hocko wrote:
> On Tue 17-07-18 16:58:10, Mahesh Jagannath Salgaonkar wrote:
>> On 07/16/2018 01:56 PM, Michal Hocko wrote:
>>> On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote:
>>>> One of the primary issues with Firmware Assisted Dump (fadump) on Power
>>>> is that it needs a large amount of memory to be reserved. This reserved
>>>> memory is used for saving the contents of old crashed kernel's memory before
>>>> fadump capture kernel uses old kernel's memory area to boot. However, This
>>>> reserved memory area stays unused until system crash and isn't available
>>>> for production kernel to use.
>>>
>>> How much memory are we talking about. Regular kernel dump process needs
>>> some reserved memory as well. Why that is not a big problem?
>>
>> We reserve around 5% of total system RAM. On large systems with
>> TeraBytes of memory, this reservation can be quite significant.
>>
>> The regular kernel dump uses the kexec method to boot into capture
>> kernel and it can control the parameters that are being passed to
>> capture kernel. This allows a capability to strip down the parameters
>> that can help lowering down the memory requirement for capture kernel to
>> boot. This allows regular kdump to reserve less memory to start with.
>>
>> Where as fadump depends on power firmware (pHyp) to load the capture
>> kernel after full reset and boots like a regular kernel. It needs same
>> amount of memory to boot as the production kernel. On large systems
>> production kernel needs significant amount of memory to boot. Hence
>> fadump needs to reserve enough memory for capture kernel to boot
>> successfully and execute dump capturing operations. By default fadump
>> reserves 5% of total system RAM and in most cases this has worked
>> flawlessly on variety of system configurations. Optionally,
>> 'crashkernel=X' can also be used to specify more fine-tuned memory size
>> for reservation.
>
> So why do we even care about fadump when regular kexec provides
> (presumably) same functionality with a smaller memory footprint? Or is
> there any reason why kexec doesn't work well on ppc?

Kexec based kdump is loaded by crashing kernel. When OS crashes, the
system is in an inconsistent state, especially the devices. In some
cases, a rogue DMA or ill-behaving device drivers can cause the kdump
capture to fail.

On power platform, fadump solves these issues by taking help from power
firmware, to fully-reset the system, load the fresh copy of same kernel
to capture the dump with PCI and I/O devices reinitialized, making it
more reliable.

Fadump does full system reset, booting system through the regular boot
options i.e the dump capture kernel is booted in the same fashion and
doesn't have specialized kernel command line option. This implies, we
need to give more memory for the system boot. Since the new kernel boots
from the same memory location as crashed kernel, we reserve 5% of memory
where power firmware moves the crashed kernel's memory content. This
reserved memory is completely removed from the available memory. For
large memory systems like 64TB systems, this account to ~ 3TB, which is
a significant chunk of memory production kernel is deprived of. Hence,
this patch adds an improvement to exiting fadump feature to make the
reserved memory available to system for use, using zone movable.

Thanks,
-Mahesh.

>
>>>> Instead of setting aside a significant chunk of memory that nobody can use,
>>>> take advantage ZONE_MOVABLE to mark a significant chunk of reserved memory
>>>> as ZONE_MOVABLE, so that the kernel is prevented from using, but
>>>> applications are free to use it.
>>>
>>> Why kernel cannot use that memory while userspace can?
>>
>> fadump needs to reserve memory to be able to save crashing kernel's
>> memory, with help from power firmware, before the capture kernel loads
>> into crashing kernel's memory area. Any contents present in this
>> reserved memory will be over-written. If kernel is allowed to use this
>> memory, then we loose that kernel data and won't be part of captured
>> dump, which could be critical to debug root cause of system crash.
>
> But then you simply screw user memory sitting there. This might be not
> so critical as the kernel memory but still it sounds like you are
> reducing the usefulness of the dump just because of inherent limitations
> of fadump.
>
>> Kdump and fadump both uses same infrastructure/tool (makedumpfile) to
>> capture the memory dump. While the tool provides flexibility to
>> determine what needs to be part of the dump and what memory to filter
>> out, all supported distributions defaults to "Capture only kernel data
>> and nothing else". Taking advantage of this default we can at least make
>> the reserved memory available for userspace to use.
>>
>> If someone wants to capture userspace data as well then
>> 'fadump=nonmovable' option can be used where reserved pages won't be
>> marked zone movable.
>
> Ohh, so you have an unclutter thing to support the case above.
>
>> Advantage of movable method is the reserved memory chunk is also
>> available for use.
>>
>>> [...]
>>>> Documentation/powerpc/firmware-assisted-dump.txt | 18 +++
>>>> arch/powerpc/include/asm/fadump.h | 7 +
>>>> arch/powerpc/kernel/fadump.c | 123 +++++++++++++++++--
>>>> arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +
>>>> include/linux/mmzone.h | 2
>>>> mm/page_alloc.c | 146 ++++++++++++++++++++++
>>>> 6 files changed, 290 insertions(+), 13 deletions(-)
>>>
>>> This is quite a large change and you didn't seem to explain why we need
>>> it.
>>>
>>
>> In fadump case, the reserved memory stays unused until system is
>> crashed. fadump uses very small portion of this reserved memory, few
>> KBs, for storing fadump metadata. Otherwise, the significant chunk of
>> memory is completely unused. Hence, instead of blocking a memory that is
>> un-utilized through out the lifetime of system, it's better to give it
>> back to production kernel to use. But at the same time we don't want
>> kernel to use that memory. While exploring we found 1) Linux kernel's
>> Contiguous Memory Allocator (CMA) feature and 2) ZONE_MOVABLE, that
>> suites the requirement. Initial 5 revisions of this patchset () was
>> using CMA feature. However, fadump does not do any cma allocations,
>> hence it will be more appropriate to use zone movable to achieve the same.
>>
>> But unlike CMA, there is no interface available to mark a custom
>> reserved memory area as ZONE_MOVABLE. Hence patch 1/4 proposes the same.
>
> Well, you are adding a significant amount of code so you should be much
> better in explaining why does the generic code care about a ppc specific
> kdump method.
>


2018-07-19 08:09:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

On Wed 18-07-18 21:52:17, Mahesh Jagannath Salgaonkar wrote:
> On 07/17/2018 05:22 PM, Michal Hocko wrote:
> > On Tue 17-07-18 16:58:10, Mahesh Jagannath Salgaonkar wrote:
> >> On 07/16/2018 01:56 PM, Michal Hocko wrote:
> >>> On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote:
> >>>> One of the primary issues with Firmware Assisted Dump (fadump) on Power
> >>>> is that it needs a large amount of memory to be reserved. This reserved
> >>>> memory is used for saving the contents of old crashed kernel's memory before
> >>>> fadump capture kernel uses old kernel's memory area to boot. However, This
> >>>> reserved memory area stays unused until system crash and isn't available
> >>>> for production kernel to use.
> >>>
> >>> How much memory are we talking about. Regular kernel dump process needs
> >>> some reserved memory as well. Why that is not a big problem?
> >>
> >> We reserve around 5% of total system RAM. On large systems with
> >> TeraBytes of memory, this reservation can be quite significant.
> >>
> >> The regular kernel dump uses the kexec method to boot into capture
> >> kernel and it can control the parameters that are being passed to
> >> capture kernel. This allows a capability to strip down the parameters
> >> that can help lowering down the memory requirement for capture kernel to
> >> boot. This allows regular kdump to reserve less memory to start with.
> >>
> >> Where as fadump depends on power firmware (pHyp) to load the capture
> >> kernel after full reset and boots like a regular kernel. It needs same
> >> amount of memory to boot as the production kernel. On large systems
> >> production kernel needs significant amount of memory to boot. Hence
> >> fadump needs to reserve enough memory for capture kernel to boot
> >> successfully and execute dump capturing operations. By default fadump
> >> reserves 5% of total system RAM and in most cases this has worked
> >> flawlessly on variety of system configurations. Optionally,
> >> 'crashkernel=X' can also be used to specify more fine-tuned memory size
> >> for reservation.
> >
> > So why do we even care about fadump when regular kexec provides
> > (presumably) same functionality with a smaller memory footprint? Or is
> > there any reason why kexec doesn't work well on ppc?
>
> Kexec based kdump is loaded by crashing kernel. When OS crashes, the
> system is in an inconsistent state, especially the devices. In some
> cases, a rogue DMA or ill-behaving device drivers can cause the kdump
> capture to fail.
>
> On power platform, fadump solves these issues by taking help from power
> firmware, to fully-reset the system, load the fresh copy of same kernel
> to capture the dump with PCI and I/O devices reinitialized, making it
> more reliable.

Thanks for the clarification.

> Fadump does full system reset, booting system through the regular boot
> options i.e the dump capture kernel is booted in the same fashion and
> doesn't have specialized kernel command line option. This implies, we
> need to give more memory for the system boot. Since the new kernel boots
> from the same memory location as crashed kernel, we reserve 5% of memory
> where power firmware moves the crashed kernel's memory content. This
> reserved memory is completely removed from the available memory. For
> large memory systems like 64TB systems, this account to ~ 3TB, which is
> a significant chunk of memory production kernel is deprived of. Hence,
> this patch adds an improvement to exiting fadump feature to make the
> reserved memory available to system for use, using zone movable.

Is the 5% a reasonable estimate or more a ballpark number? I find it a
bit strange to require 3TB of memory to boot a kernel just to dump the
crashed kernel image. Shouldn't you rather look into this estimate than
spreading ZONE_MOVABLE abuse? Larger systems need more memory to dump
even with the regular kexec kdump but I have never seen any to use more
than 1G or something like that.
--
Michal Hocko
SUSE Labs