Hello folks,
Compared with the V2[1] I posted a long time ago, this time it is a
completely new proposal design.
### Background and motivate overview ###
---
Crash dump is an important feature for troubleshooting the kernel. It is the
final way to chase what happened at the kernel panic, slowdown, and so on. It
is one of the most important tools for customer support.
Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
configure the dumpable regions. Generally, (A)iomem resources registered with
flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
(B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
are dumpable.
The pmem use cases including fsdax and devdax, could map their vmemmap to
their own devices. In this case, these part of vmemmap will not be dumped when
crash happened since these regions are satisfied with neither the above (A)
nor (B).
In fsdax, the vmemmap(struct page array) becomes very important, it is one of
the key data to find status of reverse map. Lacking of the information may
cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
That means troubleshooters are unable to check more details about pmem from
the dumpfile.
### Proposal ###
---
In this proposal, register the device backed vmemmap as a separate resource.
This resource has its own new flag and name, and then teaches kexec_file_load(2)
and kexec_load(2) to mark it as dumpable.
Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
Proposed name: "Device Backed Vmemmap"
NOTE: crash-utils also needs to adapt to this new name for kexec_load()
With current proposal, the /proc/iomem should show as following for device
backed vmemmap
# cat /proc/iomem
..
fffc0000-ffffffff : Reserved
100000000-13fffffff : Persistent Memory
100000000-10fffffff : namespace0.0
100000000-1005fffff : Device Backed Vmemmap # fsdax
a80000000-b7fffffff : CXL Window 0
a80000000-affffffff : Persistent Memory
a80000000-affffffff : region1
a80000000-a811fffff : namespace1.0
a80000000-a811fffff : Device Backed Vmemmap # devdax
a81200000-abfffffff : dax1.0
b80000000-c7fffffff : CXL Window 1
c80000000-147fffffff : PCI Bus 0000:00
c80000000-c801fffff : PCI Bus 0000:01
..
### Kdump service reloading ###
---
Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr should also
be updated. There are 2 approaches to make the reloading more efficient.
1) Use udev rules to watch CPU and memory events, then reload kdump
2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
This reloading also needed when device backed vmemmap layouts change, Similar
to what 1) does now, one could add the following as the first lines to the
RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
# namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
# devdax <-> system-ram updated: watch daxX.Y of dax
SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
Regarding 2), my idea is that it would need to call the memory_notify() in
devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
hotplug. This part is not yet mature, but it does not affect the whole feature
because we can still use method 1) alternatively.
[1] https://lore.kernel.org/lkml/[email protected]/T/
--------------------------------------------
changes from V2[1]
- new proposal design
CC: Alison Schofield <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Baoquan He <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: Dan Williams <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Dave Jiang <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Ira Weiny <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Vishal Verma <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
Li Zhijian (7):
mm: memremap: register/unregister altmap region to a separate resource
mm: memremap: add pgmap_parent_resource() helper
nvdimm: pmem: assign a parent resource for vmemmap region for the
fsdax
dax: pmem: assign a parent resource for vmemmap region for the devdax
resource: Introduce walk device_backed_vmemmap res() helper
x86/crash: make device backed vmemmap dumpable for kexec_file_load
nvdimm: set force_raw=1 in kdump kernel
arch/x86/kernel/crash.c | 5 +++++
drivers/dax/pmem.c | 8 ++++++--
drivers/nvdimm/namespace_devs.c | 3 +++
drivers/nvdimm/pmem.c | 9 ++++++---
include/linux/ioport.h | 4 ++++
include/linux/memremap.h | 4 ++++
kernel/resource.c | 13 +++++++++++++
mm/memremap.c | 30 +++++++++++++++++++++++++++++-
8 files changed, 70 insertions(+), 6 deletions(-)
--
2.29.2
It walks resources registered with flags
(IORESOURCE_DEVICE_BACKED_VMEMMAP | IORESOURCE_BUSY), usually used by
device backed vmemmap region. currently, it only sticks to the
persistent memory type since it is only one user.
CC: Andrew Morton <[email protected]>
CC: Baoquan He <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: Alison Schofield <[email protected]>
CC: Dave Jiang <[email protected]>
CC: Dan Williams <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
---
include/linux/ioport.h | 3 +++
kernel/resource.c | 13 +++++++++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 3b59e924f531..10a60227d6c2 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -332,6 +332,9 @@ extern int
walk_system_ram_res(u64 start, u64 end, void *arg,
int (*func)(struct resource *, void *));
extern int
+walk_device_backed_vmemmap_res(u64 start, u64 end, void *arg,
+ int (*func)(struct resource *, void *));
+extern int
walk_system_ram_res_rev(u64 start, u64 end, void *arg,
int (*func)(struct resource *, void *));
extern int
diff --git a/kernel/resource.c b/kernel/resource.c
index fcbca39dbc45..5f484266af07 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -431,6 +431,19 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
func);
}
+/*
+ * This function calls the @func callback against all memory ranges, which
+ * are ranges marked as (IORESOURCE_DEVICE_BACKED_VMEMMAP | IORESOURCE_BUSY)
+ * and IORES_DESC_PERSISTENT_MEMORY.
+ */
+int walk_device_backed_vmemmap_res(u64 start, u64 end, void *arg,
+ int (*func)(struct resource *, void *))
+{
+ return __walk_iomem_res_desc(start, end,
+ IORESOURCE_DEVICE_BACKED_VMEMMAP | IORESOURCE_BUSY,
+ IORES_DESC_PERSISTENT_MEMORY, arg, func);
+}
+
/*
* This function, being a variant of walk_system_ram_res(), calls the @func
* callback against all memory ranges of type System RAM which are marked as
--
2.29.2
When the pmem is configured as fsdax, set the vmemmap region as a child
of the namespace region so that it can be registered as a separate
resource later.
CC: Dan Williams <[email protected]>
CC: Vishal Verma <[email protected]>
CC: Dave Jiang <[email protected]>
CC: Ira Weiny <[email protected]>
CC: Baoquan He <[email protected]>
CC: [email protected]
Signed-off-by: Li Zhijian <[email protected]>
---
drivers/nvdimm/pmem.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..b2640a3fb693 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -452,7 +452,7 @@ static int pmem_attach_disk(struct device *dev,
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
struct nd_region *nd_region = to_nd_region(dev->parent);
int nid = dev_to_node(dev), fua;
- struct resource *res = &nsio->res;
+ struct resource *res = &nsio->res, *parent;
struct range bb_range;
struct nd_pfn *nd_pfn = NULL;
struct dax_device *dax_dev;
@@ -491,12 +491,15 @@ static int pmem_attach_disk(struct device *dev,
fua = 0;
}
- if (!devm_request_mem_region(dev, res->start, resource_size(res),
- dev_name(&ndns->dev))) {
+ parent = devm_request_mem_region(dev, res->start, resource_size(res),
+ dev_name(&ndns->dev));
+ if (!res) {
dev_warn(dev, "could not reserve region %pR\n", res);
return -EBUSY;
}
+ pgmap_parent_resource(&pmem->pgmap, parent);
+
disk = blk_alloc_disk(nid);
if (!disk)
return -ENOMEM;
--
2.29.2
The virtually mapped memory map allows storing struct page objects for
persistent memory devices in pre-allocated storage on those devices.
These 'struct page objects' on devices are also known as metadata.
During libnvdimm/nd_pmem are loading, the previous metadata will
be re-constructed to fit the current running kernel. For kdump purpose,
these metadata should not be touched until the dumping is done so that
the metadata is identical.
To achieve this, we have some options
1. Don't provide libnvdimm driver in kdump kernel rootfs/initramfs
2. Disable libnvdimm driver by specific comline parameters:
(initcall_blacklist=libnvdimm_init libnvdimm.blacklist=1 rd.driver.blacklist=libnvdimm)
3. Enforce force_raw=1 for nvdimm namespace, because when force_raw=1,
metadata will not be re-constructed again. This may also result in
the pmem doesn't work before a few extra configurations.
Here apply the 3rd option.
CC: Dan Williams <[email protected]>
CC: Vishal Verma <[email protected]>
CC: Dave Jiang <[email protected]>
CC: Ira Weiny <[email protected]>
CC: [email protected]
Signed-off-by: Li Zhijian <[email protected]>
---
drivers/nvdimm/namespace_devs.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index d6d558f94d6b..04f855c7f0b1 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -9,6 +9,7 @@
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/nd.h>
+#include <linux/crash_dump.h>
#include "nd-core.h"
#include "pmem.h"
#include "pfn.h"
@@ -1513,6 +1514,8 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
return ERR_PTR(-ENODEV);
}
+ if (is_kdump_kernel())
+ ndns->force_raw = true;
return ndns;
}
EXPORT_SYMBOL(nvdimm_namespace_common_probe);
--
2.29.2
When the pmem is configured as devdax, set the vmemmap region as a child
of the namespace region so that it can be registered as a separate
resource later.
CC: Dan Williams <[email protected]>
CC: Vishal Verma <[email protected]>
CC: Dave Jiang <[email protected]>
CC: Baoquan He <[email protected]>
CC: [email protected]
CC: [email protected]
Signed-off-by: Li Zhijian <[email protected]>
---
drivers/dax/pmem.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index f3c6c67b8412..6ffeb81e6c7c 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -21,6 +21,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
struct nd_dax *nd_dax = to_nd_dax(dev);
struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;
struct nd_region *nd_region = to_nd_region(dev->parent);
+ struct resource *parent;
ndns = nvdimm_namespace_common_probe(dev);
if (IS_ERR(ndns))
@@ -39,8 +40,9 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
pfn_sb = nd_pfn->pfn_sb;
offset = le64_to_cpu(pfn_sb->dataoff);
nsio = to_nd_namespace_io(&ndns->dev);
- if (!devm_request_mem_region(dev, nsio->res.start, offset,
- dev_name(&ndns->dev))) {
+ parent = devm_request_mem_region(dev, nsio->res.start, offset,
+ dev_name(&ndns->dev));
+ if (!parent) {
dev_warn(dev, "could not reserve metadata\n");
return ERR_PTR(-EBUSY);
}
@@ -66,6 +68,8 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
.memmap_on_memory = false,
};
+ pgmap_parent_resource(&pgmap, parent);
+
return devm_create_dev_dax(&data);
}
--
2.29.2
The elfcorehdr descirbes the dumpable region in PT_LOADs.
Generally, an iomem resource registered with flags
(IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) will be added to PT_LOADs by
kexe_file_load(2). An iomem resource with name prefix "System RAM" will
be added to PT_LOADs in kexec-tools by calling kexe_load(2).
So a simple way to make the altmap dumpable is to register altmap region
as a separate resource with the proper name and resource flags.
Here naming it as "Device Backed Vmemmap" plus resource flags
(IORESOURCE_DEVICE_BACKED_VMEMMAP and IORESOUCE_BUSY) to make it work first.
A /proc/iomem example is as following:
$ sudo cat /proc/iomem
..
fffc0000-ffffffff : Reserved
100000000-13fffffff : Persistent Memory
100000000-10fffffff : namespace0.0
100000000-1005fffff : Device Backed Vmemmap # fsdax
a80000000-b7fffffff : CXL Window 0
a80000000-affffffff : Persistent Memory
a80000000-affffffff : region1
a80000000-a811fffff : namespace1.0
a80000000-a811fffff : Device Backed Vmemmap # devdax
a81200000-abfffffff : dax1.0
b80000000-c7fffffff : CXL Window 1
c80000000-147fffffff : PCI Bus 0000:00
c80000000-c801fffff : PCI Bus 0000:01
..
CC: Andrew Morton <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: Baoquan He <[email protected]>
CC: Dan Williams <[email protected]>
CC: [email protected]
Signed-off-by: Li Zhijian <[email protected]>
---
include/linux/ioport.h | 1 +
include/linux/memremap.h | 3 +++
mm/memremap.c | 23 ++++++++++++++++++++++-
3 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index db7fe25f3370..3b59e924f531 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -69,6 +69,7 @@ struct resource {
#define IORESOURCE_UNSET 0x20000000 /* No address assigned yet */
#define IORESOURCE_AUTO 0x40000000
#define IORESOURCE_BUSY 0x80000000 /* Driver has marked this resource busy */
+#define IORESOURCE_DEVICE_BACKED_VMEMMAP 0xa0000000 /* device backed vmemmap resource */
/* I/O resource extended types */
#define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM|IORESOURCE_SYSRAM)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 744c830f4b13..ca1f12353008 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,6 +17,8 @@ struct device;
* @free: free pages set aside in the mapping for memmap storage
* @align: pages reserved to meet allocation alignments
* @alloc: track pages consumed, private to vmemmap_populate()
+ * @parent: the parent resource that altmap region belongs to
+ * @res: altmap region resource
*/
struct vmem_altmap {
unsigned long base_pfn;
@@ -25,6 +27,7 @@ struct vmem_altmap {
unsigned long free;
unsigned long align;
unsigned long alloc;
+ struct resource *parent, *res;
};
/*
diff --git a/mm/memremap.c b/mm/memremap.c
index 9e9fb1972fff..78047157b0ee 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -157,7 +157,17 @@ EXPORT_SYMBOL_GPL(memunmap_pages);
static void devm_memremap_pages_release(void *data)
{
- memunmap_pages(data);
+ struct dev_pagemap *pgmap = data;
+
+ if (pgmap->flags & PGMAP_ALTMAP_VALID && pgmap->altmap.res) {
+ resource_size_t start = pgmap->altmap.res->start;
+ resource_size_t size = pgmap->altmap.res->end -
+ pgmap->altmap.res->start + 1;
+
+ __release_region(pgmap->altmap.parent, start, size);
+ }
+
+ memunmap_pages(pgmap);
}
static void dev_pagemap_percpu_release(struct percpu_ref *ref)
@@ -404,11 +414,22 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
{
int error;
void *ret;
+ struct vmem_altmap *altmap = &pgmap->altmap;
ret = memremap_pages(pgmap, dev_to_node(dev));
if (IS_ERR(ret))
return ret;
+ if (pgmap->flags & PGMAP_ALTMAP_VALID && altmap->parent) {
+ unsigned long start = altmap->base_pfn << PAGE_SHIFT;
+ unsigned long size = vmem_altmap_offset(altmap) << PAGE_SHIFT;
+ int flags = IORESOURCE_DEVICE_BACKED_VMEMMAP | IORESOURCE_BUSY;
+
+ altmap->res = __request_region(altmap->parent, start, size,
+ "Device Backed Vmemmap", flags);
+ pr_debug("Insert a separate resource for altmap, %lx-%lx\n",
+ start, start + size);
+ }
error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
pgmap);
if (error)
--
2.29.2
Add resources with specific flags to PT_LOADs of the elfcorehdr so that
these resources can be dumpable. This change is for kexec_file_load(2)
while kexec_load(2) setups the PT_LOADs according to its parameters
by the callers which usually rely on resources' name from /proc/iomem
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Baoquan He <[email protected]>
CC: Andrew Morton <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: [email protected]
Signed-off-by: Li Zhijian <[email protected]>
---
arch/x86/kernel/crash.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index b6b044356f1b..b8426fedd2cd 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -146,6 +146,8 @@ static struct crash_mem *fill_up_crash_elf_data(void)
if (!nr_ranges)
return NULL;
+ walk_device_backed_vmemmap_res(0, -1, &nr_ranges,
+ get_nr_ram_ranges_callback);
/*
* Exclusion of crash region and/or crashk_low_res may cause
* another range split. So add extra two slots here.
@@ -212,6 +214,9 @@ static int prepare_elf_headers(void **addr, unsigned long *sz,
if (ret)
goto out;
+ walk_device_backed_vmemmap_res(0, -1, cmem,
+ prepare_elf64_ram_headers_callback);
+
/* Exclude unwanted mem ranges */
ret = elf_header_exclude_ranges(cmem);
if (ret)
--
2.29.2
Hi Li,
kernel test robot noticed the following build errors:
[auto build test ERROR on nvdimm/libnvdimm-for-next]
[also build test ERROR on tip/x86/core linus/master v6.8-rc7]
[cannot apply to akpm-mm/mm-everything nvdimm/dax-misc next-20240306]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Li-Zhijian/mm-memremap-register-unregister-altmap-region-to-a-separate-resource/20240306-183118
base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
patch link: https://lore.kernel.org/r/20240306102846.1020868-4-lizhijian%40fujitsu.com
patch subject: [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax
config: riscv-defconfig (https://download.01.org/0day-ci/archive/20240307/[email protected]/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 325f51237252e6dab8e4e1ea1fa7acbb4faee1cd)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240307/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
In file included from drivers/nvdimm/pmem.c:10:
In file included from include/linux/blkdev.h:9:
In file included from include/linux/blk_types.h:10:
In file included from include/linux/bvec.h:10:
In file included from include/linux/highmem.h:8:
In file included from include/linux/cacheflush.h:5:
In file included from arch/riscv/include/asm/cacheflush.h:9:
In file included from include/linux/mm.h:2188:
include/linux/vmstat.h:522:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
522 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
>> drivers/nvdimm/pmem.c:501:2: error: call to undeclared function 'pgmap_parent_resource'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
501 | pgmap_parent_resource(&pmem->pgmap, parent);
| ^
1 warning and 1 error generated.
vim +/pgmap_parent_resource +501 drivers/nvdimm/pmem.c
448
449 static int pmem_attach_disk(struct device *dev,
450 struct nd_namespace_common *ndns)
451 {
452 struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
453 struct nd_region *nd_region = to_nd_region(dev->parent);
454 int nid = dev_to_node(dev), fua;
455 struct resource *res = &nsio->res, *parent;
456 struct range bb_range;
457 struct nd_pfn *nd_pfn = NULL;
458 struct dax_device *dax_dev;
459 struct nd_pfn_sb *pfn_sb;
460 struct pmem_device *pmem;
461 struct request_queue *q;
462 struct gendisk *disk;
463 void *addr;
464 int rc;
465
466 pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
467 if (!pmem)
468 return -ENOMEM;
469
470 rc = devm_namespace_enable(dev, ndns, nd_info_block_reserve());
471 if (rc)
472 return rc;
473
474 /* while nsio_rw_bytes is active, parse a pfn info block if present */
475 if (is_nd_pfn(dev)) {
476 nd_pfn = to_nd_pfn(dev);
477 rc = nvdimm_setup_pfn(nd_pfn, &pmem->pgmap);
478 if (rc)
479 return rc;
480 }
481
482 /* we're attaching a block device, disable raw namespace access */
483 devm_namespace_disable(dev, ndns);
484
485 dev_set_drvdata(dev, pmem);
486 pmem->phys_addr = res->start;
487 pmem->size = resource_size(res);
488 fua = nvdimm_has_flush(nd_region);
489 if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) || fua < 0) {
490 dev_warn(dev, "unable to guarantee persistence of writes\n");
491 fua = 0;
492 }
493
494 parent = devm_request_mem_region(dev, res->start, resource_size(res),
495 dev_name(&ndns->dev));
496 if (!res) {
497 dev_warn(dev, "could not reserve region %pR\n", res);
498 return -EBUSY;
499 }
500
> 501 pgmap_parent_resource(&pmem->pgmap, parent);
502
503 disk = blk_alloc_disk(nid);
504 if (!disk)
505 return -ENOMEM;
506 q = disk->queue;
507
508 pmem->disk = disk;
509 pmem->pgmap.owner = pmem;
510 pmem->pfn_flags = PFN_DEV;
511 if (is_nd_pfn(dev)) {
512 pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
513 pmem->pgmap.ops = &fsdax_pagemap_ops;
514 addr = devm_memremap_pages(dev, &pmem->pgmap);
515 pfn_sb = nd_pfn->pfn_sb;
516 pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
517 pmem->pfn_pad = resource_size(res) -
518 range_len(&pmem->pgmap.range);
519 pmem->pfn_flags |= PFN_MAP;
520 bb_range = pmem->pgmap.range;
521 bb_range.start += pmem->data_offset;
522 } else if (pmem_should_map_pages(dev)) {
523 pmem->pgmap.range.start = res->start;
524 pmem->pgmap.range.end = res->end;
525 pmem->pgmap.nr_range = 1;
526 pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
527 pmem->pgmap.ops = &fsdax_pagemap_ops;
528 addr = devm_memremap_pages(dev, &pmem->pgmap);
529 pmem->pfn_flags |= PFN_MAP;
530 bb_range = pmem->pgmap.range;
531 } else {
532 addr = devm_memremap(dev, pmem->phys_addr,
533 pmem->size, ARCH_MEMREMAP_PMEM);
534 bb_range.start = res->start;
535 bb_range.end = res->end;
536 }
537
538 if (IS_ERR(addr)) {
539 rc = PTR_ERR(addr);
540 goto out;
541 }
542 pmem->virt_addr = addr;
543
544 blk_queue_write_cache(q, true, fua);
545 blk_queue_physical_block_size(q, PAGE_SIZE);
546 blk_queue_logical_block_size(q, pmem_sector_size(ndns));
547 blk_queue_max_hw_sectors(q, UINT_MAX);
548 blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
549 blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
550 if (pmem->pfn_flags & PFN_MAP)
551 blk_queue_flag_set(QUEUE_FLAG_DAX, q);
552
553 disk->fops = &pmem_fops;
554 disk->private_data = pmem;
555 nvdimm_namespace_disk_name(ndns, disk->disk_name);
556 set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
557 / 512);
558 if (devm_init_badblocks(dev, &pmem->bb))
559 return -ENOMEM;
560 nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range);
561 disk->bb = &pmem->bb;
562
563 dax_dev = alloc_dax(pmem, &pmem_dax_ops);
564 if (IS_ERR(dax_dev)) {
565 rc = PTR_ERR(dax_dev);
566 goto out;
567 }
568 set_dax_nocache(dax_dev);
569 set_dax_nomc(dax_dev);
570 if (is_nvdimm_sync(nd_region))
571 set_dax_synchronous(dax_dev);
572 rc = dax_add_host(dax_dev, disk);
573 if (rc)
574 goto out_cleanup_dax;
575 dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
576 pmem->dax_dev = dax_dev;
577
578 rc = device_add_disk(dev, disk, pmem_attribute_groups);
579 if (rc)
580 goto out_remove_host;
581 if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
582 return -ENOMEM;
583
584 nvdimm_check_and_set_ro(disk);
585
586 pmem->bb_state = sysfs_get_dirent(disk_to_dev(disk)->kobj.sd,
587 "badblocks");
588 if (!pmem->bb_state)
589 dev_warn(dev, "'badblocks' notification disabled\n");
590 return 0;
591
592 out_remove_host:
593 dax_remove_host(pmem->disk);
594 out_cleanup_dax:
595 kill_dax(pmem->dax_dev);
596 put_dax(pmem->dax_dev);
597 out:
598 put_disk(pmem->disk);
599 return rc;
600 }
601
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Li,
kernel test robot noticed the following build errors:
[auto build test ERROR on nvdimm/libnvdimm-for-next]
[also build test ERROR on tip/x86/core linus/master v6.8-rc7]
[cannot apply to akpm-mm/mm-everything nvdimm/dax-misc next-20240306]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Li-Zhijian/mm-memremap-register-unregister-altmap-region-to-a-separate-resource/20240306-183118
base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
patch link: https://lore.kernel.org/r/20240306102846.1020868-4-lizhijian%40fujitsu.com
patch subject: [PATCH v3 3/7] nvdimm: pmem: assign a parent resource for vmemmap region for the fsdax
config: loongarch-allmodconfig (https://download.01.org/0day-ci/archive/20240307/[email protected]/config)
compiler: loongarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240307/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
drivers/nvdimm/pmem.c: In function 'pmem_attach_disk':
>> drivers/nvdimm/pmem.c:501:9: error: implicit declaration of function 'pgmap_parent_resource' [-Werror=implicit-function-declaration]
501 | pgmap_parent_resource(&pmem->pgmap, parent);
| ^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/pgmap_parent_resource +501 drivers/nvdimm/pmem.c
448
449 static int pmem_attach_disk(struct device *dev,
450 struct nd_namespace_common *ndns)
451 {
452 struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
453 struct nd_region *nd_region = to_nd_region(dev->parent);
454 int nid = dev_to_node(dev), fua;
455 struct resource *res = &nsio->res, *parent;
456 struct range bb_range;
457 struct nd_pfn *nd_pfn = NULL;
458 struct dax_device *dax_dev;
459 struct nd_pfn_sb *pfn_sb;
460 struct pmem_device *pmem;
461 struct request_queue *q;
462 struct gendisk *disk;
463 void *addr;
464 int rc;
465
466 pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
467 if (!pmem)
468 return -ENOMEM;
469
470 rc = devm_namespace_enable(dev, ndns, nd_info_block_reserve());
471 if (rc)
472 return rc;
473
474 /* while nsio_rw_bytes is active, parse a pfn info block if present */
475 if (is_nd_pfn(dev)) {
476 nd_pfn = to_nd_pfn(dev);
477 rc = nvdimm_setup_pfn(nd_pfn, &pmem->pgmap);
478 if (rc)
479 return rc;
480 }
481
482 /* we're attaching a block device, disable raw namespace access */
483 devm_namespace_disable(dev, ndns);
484
485 dev_set_drvdata(dev, pmem);
486 pmem->phys_addr = res->start;
487 pmem->size = resource_size(res);
488 fua = nvdimm_has_flush(nd_region);
489 if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) || fua < 0) {
490 dev_warn(dev, "unable to guarantee persistence of writes\n");
491 fua = 0;
492 }
493
494 parent = devm_request_mem_region(dev, res->start, resource_size(res),
495 dev_name(&ndns->dev));
496 if (!res) {
497 dev_warn(dev, "could not reserve region %pR\n", res);
498 return -EBUSY;
499 }
500
> 501 pgmap_parent_resource(&pmem->pgmap, parent);
502
503 disk = blk_alloc_disk(nid);
504 if (!disk)
505 return -ENOMEM;
506 q = disk->queue;
507
508 pmem->disk = disk;
509 pmem->pgmap.owner = pmem;
510 pmem->pfn_flags = PFN_DEV;
511 if (is_nd_pfn(dev)) {
512 pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
513 pmem->pgmap.ops = &fsdax_pagemap_ops;
514 addr = devm_memremap_pages(dev, &pmem->pgmap);
515 pfn_sb = nd_pfn->pfn_sb;
516 pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
517 pmem->pfn_pad = resource_size(res) -
518 range_len(&pmem->pgmap.range);
519 pmem->pfn_flags |= PFN_MAP;
520 bb_range = pmem->pgmap.range;
521 bb_range.start += pmem->data_offset;
522 } else if (pmem_should_map_pages(dev)) {
523 pmem->pgmap.range.start = res->start;
524 pmem->pgmap.range.end = res->end;
525 pmem->pgmap.nr_range = 1;
526 pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
527 pmem->pgmap.ops = &fsdax_pagemap_ops;
528 addr = devm_memremap_pages(dev, &pmem->pgmap);
529 pmem->pfn_flags |= PFN_MAP;
530 bb_range = pmem->pgmap.range;
531 } else {
532 addr = devm_memremap(dev, pmem->phys_addr,
533 pmem->size, ARCH_MEMREMAP_PMEM);
534 bb_range.start = res->start;
535 bb_range.end = res->end;
536 }
537
538 if (IS_ERR(addr)) {
539 rc = PTR_ERR(addr);
540 goto out;
541 }
542 pmem->virt_addr = addr;
543
544 blk_queue_write_cache(q, true, fua);
545 blk_queue_physical_block_size(q, PAGE_SIZE);
546 blk_queue_logical_block_size(q, pmem_sector_size(ndns));
547 blk_queue_max_hw_sectors(q, UINT_MAX);
548 blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
549 blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
550 if (pmem->pfn_flags & PFN_MAP)
551 blk_queue_flag_set(QUEUE_FLAG_DAX, q);
552
553 disk->fops = &pmem_fops;
554 disk->private_data = pmem;
555 nvdimm_namespace_disk_name(ndns, disk->disk_name);
556 set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
557 / 512);
558 if (devm_init_badblocks(dev, &pmem->bb))
559 return -ENOMEM;
560 nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range);
561 disk->bb = &pmem->bb;
562
563 dax_dev = alloc_dax(pmem, &pmem_dax_ops);
564 if (IS_ERR(dax_dev)) {
565 rc = PTR_ERR(dax_dev);
566 goto out;
567 }
568 set_dax_nocache(dax_dev);
569 set_dax_nomc(dax_dev);
570 if (is_nvdimm_sync(nd_region))
571 set_dax_synchronous(dax_dev);
572 rc = dax_add_host(dax_dev, disk);
573 if (rc)
574 goto out_cleanup_dax;
575 dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
576 pmem->dax_dev = dax_dev;
577
578 rc = device_add_disk(dev, disk, pmem_attribute_groups);
579 if (rc)
580 goto out_remove_host;
581 if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
582 return -ENOMEM;
583
584 nvdimm_check_and_set_ro(disk);
585
586 pmem->bb_state = sysfs_get_dirent(disk_to_dev(disk)->kobj.sd,
587 "badblocks");
588 if (!pmem->bb_state)
589 dev_warn(dev, "'badblocks' notification disabled\n");
590 return 0;
591
592 out_remove_host:
593 dax_remove_host(pmem->disk);
594 out_cleanup_dax:
595 kill_dax(pmem->dax_dev);
596 put_dax(pmem->dax_dev);
597 out:
598 put_disk(pmem->disk);
599 return rc;
600 }
601
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
ping
Any comment is welcome.
On 06/03/2024 18:28, Li Zhijian wrote:
> Hello folks,
>
> Compared with the V2[1] I posted a long time ago, this time it is a
> completely new proposal design.
>
> ### Background and motivate overview ###
> ---
> Crash dump is an important feature for troubleshooting the kernel. It is the
> final way to chase what happened at the kernel panic, slowdown, and so on. It
> is one of the most important tools for customer support.
>
> Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
> configure the dumpable regions. Generally, (A)iomem resources registered with
> flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
> (B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
> are dumpable.
>
> The pmem use cases including fsdax and devdax, could map their vmemmap to
> their own devices. In this case, these part of vmemmap will not be dumped when
> crash happened since these regions are satisfied with neither the above (A)
> nor (B).
>
> In fsdax, the vmemmap(struct page array) becomes very important, it is one of
> the key data to find status of reverse map. Lacking of the information may
> cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
> That means troubleshooters are unable to check more details about pmem from
> the dumpfile.
>
> ### Proposal ###
> ---
> In this proposal, register the device backed vmemmap as a separate resource.
> This resource has its own new flag and name, and then teaches kexec_file_load(2)
> and kexec_load(2) to mark it as dumpable.
>
> Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
> Proposed name: "Device Backed Vmemmap"
>
> NOTE: crash-utils also needs to adapt to this new name for kexec_load()
>
> With current proposal, the /proc/iomem should show as following for device
> backed vmemmap
> # cat /proc/iomem
> ...
> fffc0000-ffffffff : Reserved
> 100000000-13fffffff : Persistent Memory
> 100000000-10fffffff : namespace0.0
> 100000000-1005fffff : Device Backed Vmemmap # fsdax
> a80000000-b7fffffff : CXL Window 0
> a80000000-affffffff : Persistent Memory
> a80000000-affffffff : region1
> a80000000-a811fffff : namespace1.0
> a80000000-a811fffff : Device Backed Vmemmap # devdax
> a81200000-abfffffff : dax1.0
> b80000000-c7fffffff : CXL Window 1
> c80000000-147fffffff : PCI Bus 0000:00
> c80000000-c801fffff : PCI Bus 0000:01
> ...
>
> ### Kdump service reloading ###
> ---
> Once the kdump service is loaded, if changes to CPUs or memory occur,
> either by hot un/plug or off/onlining, the crash elfcorehdr should also
> be updated. There are 2 approaches to make the reloading more efficient.
> 1) Use udev rules to watch CPU and memory events, then reload kdump
> 2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
>
> This reloading also needed when device backed vmemmap layouts change, Similar
> to what 1) does now, one could add the following as the first lines to the
> RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
>
> # namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
> # devdax <-> system-ram updated: watch daxX.Y of dax
> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>
> Regarding 2), my idea is that it would need to call the memory_notify() in
> devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
> hotplug. This part is not yet mature, but it does not affect the whole feature
> because we can still use method 1) alternatively.
>
> [1] https://lore.kernel.org/lkml/[email protected]/T/
> --------------------------------------------
> changes from V2[1]
> - new proposal design
>
> CC: Alison Schofield <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Baoquan He <[email protected]>
> CC: Borislav Petkov <[email protected]>
> CC: Dan Williams <[email protected]>
> CC: Dave Hansen <[email protected]>
> CC: Dave Jiang <[email protected]>
> CC: Greg Kroah-Hartman <[email protected]>
> CC: "H. Peter Anvin" <[email protected]>
> CC: Ingo Molnar <[email protected]>
> CC: Ira Weiny <[email protected]>
> CC: Thomas Gleixner <[email protected]>
> CC: Vishal Verma <[email protected]>
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
>
> Li Zhijian (7):
> mm: memremap: register/unregister altmap region to a separate resource
> mm: memremap: add pgmap_parent_resource() helper
> nvdimm: pmem: assign a parent resource for vmemmap region for the
> fsdax
> dax: pmem: assign a parent resource for vmemmap region for the devdax
> resource: Introduce walk device_backed_vmemmap res() helper
> x86/crash: make device backed vmemmap dumpable for kexec_file_load
> nvdimm: set force_raw=1 in kdump kernel
>
> arch/x86/kernel/crash.c | 5 +++++
> drivers/dax/pmem.c | 8 ++++++--
> drivers/nvdimm/namespace_devs.c | 3 +++
> drivers/nvdimm/pmem.c | 9 ++++++---
> include/linux/ioport.h | 4 ++++
> include/linux/memremap.h | 4 ++++
> kernel/resource.c | 13 +++++++++++++
> mm/memremap.c | 30 +++++++++++++++++++++++++++++-
> 8 files changed, 70 insertions(+), 6 deletions(-)
>
On 03/21/24 at 05:40am, Zhijian Li (Fujitsu) wrote:
> ping
>
>
> Any comment is welcome.
I will have a look at this from kdump side. How do you test your code?
By the way, there's issue reported by test robot.
Thanks
Baoquan
>
>
> On 06/03/2024 18:28, Li Zhijian wrote:
> > Hello folks,
> >
> > Compared with the V2[1] I posted a long time ago, this time it is a
> > completely new proposal design.
> >
> > ### Background and motivate overview ###
> > ---
> > Crash dump is an important feature for troubleshooting the kernel. It is the
> > final way to chase what happened at the kernel panic, slowdown, and so on. It
> > is one of the most important tools for customer support.
> >
> > Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
> > configure the dumpable regions. Generally, (A)iomem resources registered with
> > flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
> > (B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
> > are dumpable.
> >
> > The pmem use cases including fsdax and devdax, could map their vmemmap to
> > their own devices. In this case, these part of vmemmap will not be dumped when
> > crash happened since these regions are satisfied with neither the above (A)
> > nor (B).
> >
> > In fsdax, the vmemmap(struct page array) becomes very important, it is one of
> > the key data to find status of reverse map. Lacking of the information may
> > cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
> > That means troubleshooters are unable to check more details about pmem from
> > the dumpfile.
> >
> > ### Proposal ###
> > ---
> > In this proposal, register the device backed vmemmap as a separate resource.
> > This resource has its own new flag and name, and then teaches kexec_file_load(2)
> > and kexec_load(2) to mark it as dumpable.
> >
> > Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
> > Proposed name: "Device Backed Vmemmap"
> >
> > NOTE: crash-utils also needs to adapt to this new name for kexec_load()
> >
> > With current proposal, the /proc/iomem should show as following for device
> > backed vmemmap
> > # cat /proc/iomem
> > ...
> > fffc0000-ffffffff : Reserved
> > 100000000-13fffffff : Persistent Memory
> > 100000000-10fffffff : namespace0.0
> > 100000000-1005fffff : Device Backed Vmemmap # fsdax
> > a80000000-b7fffffff : CXL Window 0
> > a80000000-affffffff : Persistent Memory
> > a80000000-affffffff : region1
> > a80000000-a811fffff : namespace1.0
> > a80000000-a811fffff : Device Backed Vmemmap # devdax
> > a81200000-abfffffff : dax1.0
> > b80000000-c7fffffff : CXL Window 1
> > c80000000-147fffffff : PCI Bus 0000:00
> > c80000000-c801fffff : PCI Bus 0000:01
> > ...
> >
> > ### Kdump service reloading ###
> > ---
> > Once the kdump service is loaded, if changes to CPUs or memory occur,
> > either by hot un/plug or off/onlining, the crash elfcorehdr should also
> > be updated. There are 2 approaches to make the reloading more efficient.
> > 1) Use udev rules to watch CPU and memory events, then reload kdump
> > 2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
> >
> > This reloading also needed when device backed vmemmap layouts change, Similar
> > to what 1) does now, one could add the following as the first lines to the
> > RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
> >
> > # namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
> > SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
> > SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
> > # devdax <-> system-ram updated: watch daxX.Y of dax
> > SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
> > SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
> >
> > Regarding 2), my idea is that it would need to call the memory_notify() in
> > devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
> > hotplug. This part is not yet mature, but it does not affect the whole feature
> > because we can still use method 1) alternatively.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/T/
> > --------------------------------------------
> > changes from V2[1]
> > - new proposal design
> >
> > CC: Alison Schofield <[email protected]>
> > CC: Andrew Morton <[email protected]>
> > CC: Baoquan He <[email protected]>
> > CC: Borislav Petkov <[email protected]>
> > CC: Dan Williams <[email protected]>
> > CC: Dave Hansen <[email protected]>
> > CC: Dave Jiang <[email protected]>
> > CC: Greg Kroah-Hartman <[email protected]>
> > CC: "H. Peter Anvin" <[email protected]>
> > CC: Ingo Molnar <[email protected]>
> > CC: Ira Weiny <[email protected]>
> > CC: Thomas Gleixner <[email protected]>
> > CC: Vishal Verma <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > CC: [email protected]
> > CC: [email protected]
> >
> > Li Zhijian (7):
> > mm: memremap: register/unregister altmap region to a separate resource
> > mm: memremap: add pgmap_parent_resource() helper
> > nvdimm: pmem: assign a parent resource for vmemmap region for the
> > fsdax
> > dax: pmem: assign a parent resource for vmemmap region for the devdax
> > resource: Introduce walk device_backed_vmemmap res() helper
> > x86/crash: make device backed vmemmap dumpable for kexec_file_load
> > nvdimm: set force_raw=1 in kdump kernel
> >
> > arch/x86/kernel/crash.c | 5 +++++
> > drivers/dax/pmem.c | 8 ++++++--
> > drivers/nvdimm/namespace_devs.c | 3 +++
> > drivers/nvdimm/pmem.c | 9 ++++++---
> > include/linux/ioport.h | 4 ++++
> > include/linux/memremap.h | 4 ++++
> > kernel/resource.c | 13 +++++++++++++
> > mm/memremap.c | 30 +++++++++++++++++++++++++++++-
> > 8 files changed, 70 insertions(+), 6 deletions(-)
> >
Baoquan,
On 21/03/2024 14:17, Baoquan He wrote:
> On 03/21/24 at 05:40am, Zhijian Li (Fujitsu) wrote:
>> ping
>>
>>
>> Any comment is welcome.
>
> I will have a look at this from kdump side. How do you test your code?
Thanks for your support.
- nothing change is required for makedumpfile and crash-utils
- nothing change is required for kexe-utils if you are using kexec_file_load(2)
- kexec-tool needs apply below patch if you are using kexec_load(2),
After the coredump is collected, crash-utils is able to inspect the device backed vmemmap memory.
From 4a2f0f91cdd8b084bb4ebafdc4f71b8e2a333720 Mon Sep 17 00:00:00 2001
From: Li Zhijian <[email protected]>
Date: Fri, 1 Mar 2024 13:44:36 +0800
Subject: [PATCH] consider device_backed_vmemmap dumpable
Signed-off-by: Li Zhijian <[email protected]>
---
kexec/arch/i386/crashdump-x86.c | 2 ++
kexec/arch/i386/kexec-x86-common.c | 3 +++
kexec/crashdump-elf.c | 2 +-
kexec/kexec.h | 1 +
4 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/kexec/arch/i386/crashdump-x86.c b/kexec/arch/i386/crashdump-x86.c
index a01031e570aa..757922dc3a57 100644
--- a/kexec/arch/i386/crashdump-x86.c
+++ b/kexec/arch/i386/crashdump-x86.c
@@ -284,6 +284,8 @@ static int get_crash_memory_ranges(struct memory_range **range, int *ranges,
type = RANGE_PRAM;
} else if(memcmp(str,"Persistent Memory\n",18) == 0 ) {
type = RANGE_PMEM;
+ } else if (memcmp(str, "Device Backed Vmemmap\n", 22)) {
+ type = RANGE_DEVICE_BACKED_VMEMMAP;
} else if(memcmp(str,"reserved\n",9) == 0 ) {
type = RANGE_RESERVED;
} else if (memcmp(str, "Reserved\n", 9) == 0) {
diff --git a/kexec/arch/i386/kexec-x86-common.c b/kexec/arch/i386/kexec-x86-common.c
index ffc95a9e43f8..0dd76903e7fc 100644
--- a/kexec/arch/i386/kexec-x86-common.c
+++ b/kexec/arch/i386/kexec-x86-common.c
@@ -111,6 +111,9 @@ static int get_memory_ranges_proc_iomem(struct memory_range **range, int *ranges
else if (memcmp(str, "Persistent Memory\n", 18) == 0) {
type = RANGE_PMEM;
}
+ else if (memcmp(str, "Device Backed Vmemmap\n", 22)) {
+ type = RANGE_DEVICE_BACKED_VMEMMAP;
+ }
else {
continue;
}
diff --git a/kexec/crashdump-elf.c b/kexec/crashdump-elf.c
index b8bb686a17ca..98ad854e3f8d 100644
--- a/kexec/crashdump-elf.c
+++ b/kexec/crashdump-elf.c
@@ -199,7 +199,7 @@ int FUNC(struct kexec_info *info,
* A seprate program header for Backup Region*/
for (i = 0; i < ranges; i++, range++) {
unsigned long long mstart, mend;
- if (range->type != RANGE_RAM)
+ if (range->type != RANGE_RAM || range->type != RANGE_DEVICE_BACKED_VMEMMAP)
continue;
mstart = range->start;
mend = range->end;
diff --git a/kexec/kexec.h b/kexec/kexec.h
index 1004aff15945..c0481bb2727d 100644
--- a/kexec/kexec.h
+++ b/kexec/kexec.h
@@ -139,6 +139,7 @@ struct memory_range {
#define RANGE_UNCACHED 4
#define RANGE_PMEM 6
#define RANGE_PRAM 11
+#define RANGE_DEVICE_BACKED_VMEMMAP 12
};
struct memory_ranges {
--
2.29.2
>
> By the way, there's issue reported by test robot.
I see, i have fixed it locally which doesn't matter for this feature(it fails in !ZONE_DEVICE)
Thanks
Zhijian
>
> Thanks
> Baoquan
>
>>
>>
>> On 06/03/2024 18:28, Li Zhijian wrote:
>>> Hello folks,
>>>
>>> Compared with the V2[1] I posted a long time ago, this time it is a
>>> completely new proposal design.
>>>
>>> ### Background and motivate overview ###
>>> ---
>>> Crash dump is an important feature for troubleshooting the kernel. It is the
>>> final way to chase what happened at the kernel panic, slowdown, and so on. It
>>> is one of the most important tools for customer support.
>>>
>>> Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
>>> configure the dumpable regions. Generally, (A)iomem resources registered with
>>> flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
>>> (B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
>>> are dumpable.
>>>
>>> The pmem use cases including fsdax and devdax, could map their vmemmap to
>>> their own devices. In this case, these part of vmemmap will not be dumped when
>>> crash happened since these regions are satisfied with neither the above (A)
>>> nor (B).
>>>
>>> In fsdax, the vmemmap(struct page array) becomes very important, it is one of
>>> the key data to find status of reverse map. Lacking of the information may
>>> cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
>>> That means troubleshooters are unable to check more details about pmem from
>>> the dumpfile.
>>>
>>> ### Proposal ###
>>> ---
>>> In this proposal, register the device backed vmemmap as a separate resource.
>>> This resource has its own new flag and name, and then teaches kexec_file_load(2)
>>> and kexec_load(2) to mark it as dumpable.
>>>
>>> Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
>>> Proposed name: "Device Backed Vmemmap"
>>>
>>> NOTE: crash-utils also needs to adapt to this new name for kexec_load()
>>>
>>> With current proposal, the /proc/iomem should show as following for device
>>> backed vmemmap
>>> # cat /proc/iomem
>>> ...
>>> fffc0000-ffffffff : Reserved
>>> 100000000-13fffffff : Persistent Memory
>>> 100000000-10fffffff : namespace0.0
>>> 100000000-1005fffff : Device Backed Vmemmap # fsdax
>>> a80000000-b7fffffff : CXL Window 0
>>> a80000000-affffffff : Persistent Memory
>>> a80000000-affffffff : region1
>>> a80000000-a811fffff : namespace1.0
>>> a80000000-a811fffff : Device Backed Vmemmap # devdax
>>> a81200000-abfffffff : dax1.0
>>> b80000000-c7fffffff : CXL Window 1
>>> c80000000-147fffffff : PCI Bus 0000:00
>>> c80000000-c801fffff : PCI Bus 0000:01
>>> ...
>>>
>>> ### Kdump service reloading ###
>>> ---
>>> Once the kdump service is loaded, if changes to CPUs or memory occur,
>>> either by hot un/plug or off/onlining, the crash elfcorehdr should also
>>> be updated. There are 2 approaches to make the reloading more efficient.
>>> 1) Use udev rules to watch CPU and memory events, then reload kdump
>>> 2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
>>>
>>> This reloading also needed when device backed vmemmap layouts change, Similar
>>> to what 1) does now, one could add the following as the first lines to the
>>> RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
>>>
>>> # namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
>>> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
>>> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>>> # devdax <-> system-ram updated: watch daxX.Y of dax
>>> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
>>> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>>>
>>> Regarding 2), my idea is that it would need to call the memory_notify() in
>>> devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
>>> hotplug. This part is not yet mature, but it does not affect the whole feature
>>> because we can still use method 1) alternatively.
>>>
>>> [1] https://lore.kernel.org/lkml/[email protected]/T/
>>> --------------------------------------------
>>> changes from V2[1]
>>> - new proposal design
>>>
>>> CC: Alison Schofield <[email protected]>
>>> CC: Andrew Morton <[email protected]>
>>> CC: Baoquan He <[email protected]>
>>> CC: Borislav Petkov <[email protected]>
>>> CC: Dan Williams <[email protected]>
>>> CC: Dave Hansen <[email protected]>
>>> CC: Dave Jiang <[email protected]>
>>> CC: Greg Kroah-Hartman <[email protected]>
>>> CC: "H. Peter Anvin" <[email protected]>
>>> CC: Ingo Molnar <[email protected]>
>>> CC: Ira Weiny <[email protected]>
>>> CC: Thomas Gleixner <[email protected]>
>>> CC: Vishal Verma <[email protected]>
>>> CC: [email protected]
>>> CC: [email protected]
>>> CC: [email protected]
>>> CC: [email protected]
>>>
>>> Li Zhijian (7):
>>> mm: memremap: register/unregister altmap region to a separate resource
>>> mm: memremap: add pgmap_parent_resource() helper
>>> nvdimm: pmem: assign a parent resource for vmemmap region for the
>>> fsdax
>>> dax: pmem: assign a parent resource for vmemmap region for the devdax
>>> resource: Introduce walk device_backed_vmemmap res() helper
>>> x86/crash: make device backed vmemmap dumpable for kexec_file_load
>>> nvdimm: set force_raw=1 in kdump kernel
>>>
>>> arch/x86/kernel/crash.c | 5 +++++
>>> drivers/dax/pmem.c | 8 ++++++--
>>> drivers/nvdimm/namespace_devs.c | 3 +++
>>> drivers/nvdimm/pmem.c | 9 ++++++---
>>> include/linux/ioport.h | 4 ++++
>>> include/linux/memremap.h | 4 ++++
>>> kernel/resource.c | 13 +++++++++++++
>>> mm/memremap.c | 30 +++++++++++++++++++++++++++++-
>>> 8 files changed, 70 insertions(+), 6 deletions(-)
>>>
>
ping again,
I would appreciate feedback from the nvdimm and linux-mm communities
Thanks
Zhijian
On 21/03/2024 13:40, Li Zhijian wrote:
> ping
>
>
> Any comment is welcome.
>
>
> On 06/03/2024 18:28, Li Zhijian wrote:
>> Hello folks,
>>
>> Compared with the V2[1] I posted a long time ago, this time it is a
>> completely new proposal design.
>>
>> ### Background and motivate overview ###
>> ---
>> Crash dump is an important feature for troubleshooting the kernel. It is the
>> final way to chase what happened at the kernel panic, slowdown, and so on. It
>> is one of the most important tools for customer support.
>>
>> Currently, there are 2 syscalls(kexec_file_load(2) and kexec_load(2)) to
>> configure the dumpable regions. Generally, (A)iomem resources registered with
>> flags (IORESOURCE_SYSTEM_RAM | IORESOUCE_BUSY) for kexec_file_load(2) or
>> (B)iomem resources registered with "System RAM" name prefix for kexec_load(2)
>> are dumpable.
>>
>> The pmem use cases including fsdax and devdax, could map their vmemmap to
>> their own devices. In this case, these part of vmemmap will not be dumped when
>> crash happened since these regions are satisfied with neither the above (A)
>> nor (B).
>>
>> In fsdax, the vmemmap(struct page array) becomes very important, it is one of
>> the key data to find status of reverse map. Lacking of the information may
>> cause difficulty to analyze trouble around pmem (especially Filesystem-DAX).
>> That means troubleshooters are unable to check more details about pmem from
>> the dumpfile.
>>
>> ### Proposal ###
>> ---
>> In this proposal, register the device backed vmemmap as a separate resource.
>> This resource has its own new flag and name, and then teaches kexec_file_load(2)
>> and kexec_load(2) to mark it as dumpable.
>>
>> Proposed flag: IORESOURCE_DEVICE_BACKED_VMEMMAP
>> Proposed name: "Device Backed Vmemmap"
>>
>> NOTE: crash-utils also needs to adapt to this new name for kexec_load()
>>
>> With current proposal, the /proc/iomem should show as following for device
>> backed vmemmap
>> # cat /proc/iomem
>> ...
>> fffc0000-ffffffff : Reserved
>> 100000000-13fffffff : Persistent Memory
>> 100000000-10fffffff : namespace0.0
>> 100000000-1005fffff : Device Backed Vmemmap # fsdax
>> a80000000-b7fffffff : CXL Window 0
>> a80000000-affffffff : Persistent Memory
>> a80000000-affffffff : region1
>> a80000000-a811fffff : namespace1.0
>> a80000000-a811fffff : Device Backed Vmemmap # devdax
>> a81200000-abfffffff : dax1.0
>> b80000000-c7fffffff : CXL Window 1
>> c80000000-147fffffff : PCI Bus 0000:00
>> c80000000-c801fffff : PCI Bus 0000:01
>> ...
>>
>> ### Kdump service reloading ###
>> ---
>> Once the kdump service is loaded, if changes to CPUs or memory occur,
>> either by hot un/plug or off/onlining, the crash elfcorehdr should also
>> be updated. There are 2 approaches to make the reloading more efficient.
>> 1) Use udev rules to watch CPU and memory events, then reload kdump
>> 2) Enable kernel crash hotplug to automatically reload elfcorehdr (>= 6.5)
>>
>> This reloading also needed when device backed vmemmap layouts change, Similar
>> to what 1) does now, one could add the following as the first lines to the
>> RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:
>>
>> # namespace updated: watch daxX.Y(devdax) and pfnX.Y(fsdax) of nd
>> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="bind", GOTO="kdump_reload"
>> SUBSYSTEM=="nd", KERNEL=="[dp][af][xn][0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>> # devdax <-> system-ram updated: watch daxX.Y of dax
>> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="bind", GOTO="kdump_reload"
>> SUBSYSTEM=="dax", KERNEL=="dax[0-9].*", ACTION=="unbind", GOTO="kdump_reload"
>>
>> Regarding 2), my idea is that it would need to call the memory_notify() in
>> devm_memremap_pages_release() and devm_memremap_pages() to trigger the crash
>> hotplug. This part is not yet mature, but it does not affect the whole feature
>> because we can still use method 1) alternatively.
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/T/
>> --------------------------------------------
>> changes from V2[1]
>> - new proposal design
>>
>> CC: Alison Schofield <[email protected]>
>> CC: Andrew Morton <[email protected]>
>> CC: Baoquan He <[email protected]>
>> CC: Borislav Petkov <[email protected]>
>> CC: Dan Williams <[email protected]>
>> CC: Dave Hansen <[email protected]>
>> CC: Dave Jiang <[email protected]>
>> CC: Greg Kroah-Hartman <[email protected]>
>> CC: "H. Peter Anvin" <[email protected]>
>> CC: Ingo Molnar <[email protected]>
>> CC: Ira Weiny <[email protected]>
>> CC: Thomas Gleixner <[email protected]>
>> CC: Vishal Verma <[email protected]>
>> CC: [email protected]
>> CC: [email protected]
>> CC: [email protected]
>> CC: [email protected]
>>
>> Li Zhijian (7):
>> mm: memremap: register/unregister altmap region to a separate resource
>> mm: memremap: add pgmap_parent_resource() helper
>> nvdimm: pmem: assign a parent resource for vmemmap region for the
>> fsdax
>> dax: pmem: assign a parent resource for vmemmap region for the devdax
>> resource: Introduce walk device_backed_vmemmap res() helper
>> x86/crash: make device backed vmemmap dumpable for kexec_file_load
>> nvdimm: set force_raw=1 in kdump kernel
>>
>> arch/x86/kernel/crash.c | 5 +++++
>> drivers/dax/pmem.c | 8 ++++++--
>> drivers/nvdimm/namespace_devs.c | 3 +++
>> drivers/nvdimm/pmem.c | 9 ++++++---
>> include/linux/ioport.h | 4 ++++
>> include/linux/memremap.h | 4 ++++
>> kernel/resource.c | 13 +++++++++++++
>> mm/memremap.c | 30 +++++++++++++++++++++++++++++-
>> 8 files changed, 70 insertions(+), 6 deletions(-)
>>