2019-05-02 18:44:51

by Pasha Tatashin

[permalink] [raw]
Subject: [v5 0/3] "Hotremove" persistent memory

Changelog:
v5
- Addressed comments from Dan Williams: made remove_memory() to return
an error code, and use this function from dax.

v4
- Addressed comments from Dave Hansen

v3
- Addressed comments from David Hildenbrand. Don't release
lock_device_hotplug after checking memory status, and rename
memblock_offlined_cb() to check_memblock_offlined_cb()

v2
- Dan Williams mentioned that drv->remove() return is ignored
by unbind. Unbind always succeeds. Because we cannot guarantee
that memory can be offlined from the driver, don't even
attempt to do so. Simply check that every section is offlined
beforehand and only then proceed with removing dax memory.

---

Recently, adding a persistent memory to be used like a regular RAM was
added to Linux. This work extends this functionality to also allow hot
removing persistent memory.

We (Microsoft) have an important use case for this functionality.

The requirement is for physical machines with small amount of RAM (~8G)
to be able to reboot in a very short period of time (<1s). Yet, there is
a userland state that is expensive to recreate (~2G).

The solution is to boot machines with 2G preserved for persistent
memory.

Copy the state, and hotadd the persistent memory so machine still has
all 8G available for runtime. Before reboot, offline and hotremove
device-dax 2G, copy the memory that is needed to be preserved to pmem0
device, and reboot.

The series of operations look like this:

1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
and free ramdisk.
2. Convert raw pmem0 to devdax
ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
3. Hotadd to System RAM
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo online_movable > /sys/devices/system/memoryXXX/state
4. Before reboot hotremove device-dax memory from System RAM
echo offline > /sys/devices/system/memoryXXX/state
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
ndctl create-namespace --mode raw -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do kexec reboot or reboot through firmware if firmware does not
zero memory in pmem0 region (These machines have only regular
volatile memory). So to have pmem0 device either memmap kernel
parameter is used, or devices nodes in dtb are specified.

Pavel Tatashin (3):
device-dax: fix memory and resource leak if hotplug fails
mm/hotplug: make remove_memory() interface useable
device-dax: "Hotremove" persistent memory that is used like normal RAM

drivers/dax/dax-private.h | 2 ++
drivers/dax/kmem.c | 46 ++++++++++++++++++++++---
include/linux/memory_hotplug.h | 8 +++--
mm/memory_hotplug.c | 61 ++++++++++++++++++++++------------
4 files changed, 89 insertions(+), 28 deletions(-)

--
2.21.0


2019-05-02 18:45:01

by Pasha Tatashin

[permalink] [raw]
Subject: [v5 2/3] mm/hotplug: make remove_memory() interface useable

As of right now remove_memory() interface is inherently broken. It tries
to remove memory but panics if some memory is not offline. The problem
is that it is impossible to ensure that all memory blocks are offline as
this function also takes lock_device_hotplug that is required to
change memory state via sysfs.

So, between calling this function and offlining all memory blocks there
is always a window when lock_device_hotplug is released, and therefore,
there is always a chance for a panic during this window.

Make this interface to return an error if memory removal fails. This way
it is safe to call this function without panicking machine, and also
makes it symmetric to add_memory() which already returns an error.

Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/memory_hotplug.h | 8 +++--
mm/memory_hotplug.c | 61 ++++++++++++++++++++++------------
2 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8ade08c50d26..5438a2d92560 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -304,7 +304,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
extern void try_offline_node(int nid);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
extern void __remove_memory(int nid, u64 start, u64 size);

#else
@@ -321,7 +321,11 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
return -EINVAL;
}

-static inline void remove_memory(int nid, u64 start, u64 size) {}
+static inline bool remove_memory(int nid, u64 start, u64 size)
+{
+ return -EBUSY;
+}
+
static inline void __remove_memory(int nid, u64 start, u64 size) {}
#endif /* CONFIG_MEMORY_HOTREMOVE */

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c454e82d4f6..a826aededa1a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1778,9 +1778,10 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
&beginpa, &endpa);
- }

- return ret;
+ return -EBUSY;
+ }
+ return 0;
}

static int check_cpu_on_node(pg_data_t *pgdat)
@@ -1843,19 +1844,9 @@ void try_offline_node(int nid)
}
EXPORT_SYMBOL(try_offline_node);

-/**
- * remove_memory
- * @nid: the node ID
- * @start: physical address of the region to remove
- * @size: size of the region to remove
- *
- * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
- * and online/offline operations before this call, as required by
- * try_offline_node().
- */
-void __ref __remove_memory(int nid, u64 start, u64 size)
+static int __ref try_remove_memory(int nid, u64 start, u64 size)
{
- int ret;
+ int rc = 0;

BUG_ON(check_hotplug_memory_range(start, size));

@@ -1863,13 +1854,13 @@ void __ref __remove_memory(int nid, u64 start, u64 size)

/*
* All memory blocks must be offlined before removing memory. Check
- * whether all memory blocks in question are offline and trigger a BUG()
+ * whether all memory blocks in question are offline and return error
* if this is not the case.
*/
- ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
- check_memblock_offlined_cb);
- if (ret)
- BUG();
+ rc = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
+ check_memblock_offlined_cb);
+ if (rc)
+ goto done;

/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
@@ -1879,14 +1870,42 @@ void __ref __remove_memory(int nid, u64 start, u64 size)

try_offline_node(nid);

+done:
mem_hotplug_done();
+ return rc;
}

-void remove_memory(int nid, u64 start, u64 size)
+/**
+ * remove_memory
+ * @nid: the node ID
+ * @start: physical address of the region to remove
+ * @size: size of the region to remove
+ *
+ * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
+ * and online/offline operations before this call, as required by
+ * try_offline_node().
+ */
+void __remove_memory(int nid, u64 start, u64 size)
{
+
+ /*
+ * trigger BUG() is some memory is not offlined prior to calling this
+ * function
+ */
+ if (try_remove_memory(nid, start, size))
+ BUG();
+}
+
+/* Remove memory if every memory block is offline, otherwise return false */
+int remove_memory(int nid, u64 start, u64 size)
+{
+ int rc;
+
lock_device_hotplug();
- __remove_memory(nid, start, size);
+ rc = try_remove_memory(nid, start, size);
unlock_device_hotplug();
+
+ return rc;
}
EXPORT_SYMBOL_GPL(remove_memory);
#endif /* CONFIG_MEMORY_HOTREMOVE */
--
2.21.0

2019-05-02 18:45:59

by Pasha Tatashin

[permalink] [raw]
Subject: [v5 1/3] device-dax: fix memory and resource leak if hotplug fails

When add_memory() function fails, the resource and the memory should be
freed.

Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
---
drivers/dax/kmem.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a02318c6d28a..4c0131857133 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -66,8 +66,11 @@ int dev_dax_kmem_probe(struct device *dev)
new_res->name = dev_name(dev);

rc = add_memory(numa_node, new_res->start, resource_size(new_res));
- if (rc)
+ if (rc) {
+ release_resource(new_res);
+ kfree(new_res);
return rc;
+ }

return 0;
}
--
2.21.0

2019-05-02 18:46:23

by Pasha Tatashin

[permalink] [raw]
Subject: [v5 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM

It is now allowed to use persistent memory like a regular RAM, but
currently there is no way to remove this memory until machine is
rebooted.

This work expands the functionality to also allows hotremoving
previously hotplugged persistent memory, and recover the device for use
for other purposes.

To hotremove persistent memory, the management software must first
offline all memory blocks of dax region, and than unbind it from
device-dax/kmem driver. So, operations should look like this:

echo offline > /sys/devices/system/memory/memoryN/state
...
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Note: if unbind is done without offlining memory beforehand, it won't be
possible to do dax0.0 hotremove, and dax's memory is going to be part of
System RAM until reboot.

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
drivers/dax/dax-private.h | 2 ++
drivers/dax/kmem.c | 41 +++++++++++++++++++++++++++++++++++----
2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index a45612148ca0..999aaf3a29b3 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -53,6 +53,7 @@ struct dax_region {
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
* @ref: pgmap reference count (driver owned)
* @cmp: @ref final put completion (driver owned)
+ * @dax_mem_res: physical address range of hotadded DAX memory
*/
struct dev_dax {
struct dax_region *region;
@@ -62,6 +63,7 @@ struct dev_dax {
struct dev_pagemap pgmap;
struct percpu_ref ref;
struct completion cmp;
+ struct resource *dax_kmem_res;
};

static inline struct dev_dax *to_dev_dax(struct device *dev)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 4c0131857133..3d0a7e702c94 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -71,21 +71,54 @@ int dev_dax_kmem_probe(struct device *dev)
kfree(new_res);
return rc;
}
+ dev_dax->dax_kmem_res = new_res;

return 0;
}

+#ifdef CONFIG_MEMORY_HOTREMOVE
+static int dev_dax_kmem_remove(struct device *dev)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct resource *res = dev_dax->dax_kmem_res;
+ resource_size_t kmem_start = res->start;
+ resource_size_t kmem_size = resource_size(res);
+ int rc;
+
+ /*
+ * We have one shot for removing memory, if some memory blocks were not
+ * offline prior to calling this function remove_memory() will fail, and
+ * there is no way to hotremove this memory until reboot because device
+ * unbind will succeed even if we return failure.
+ */
+ rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+ if (rc) {
+ dev_err(dev,
+ "DAX region %pR cannot be hotremoved until the next reboot\n",
+ res);
+ return rc;
+ }
+
+ /* Release and free dax resources */
+ release_resource(res);
+ kfree(res);
+ dev_dax->dax_kmem_res = NULL;
+
+ return 0;
+}
+#else
static int dev_dax_kmem_remove(struct device *dev)
{
/*
- * Purposely leak the request_mem_region() for the device-dax
- * range and return '0' to ->remove() attempts. The removal of
- * the device from the driver always succeeds, but the region
- * is permanently pinned as reserved by the unreleased
+ * Without hotremove purposely leak the request_mem_region() for the
+ * device-dax range and return '0' to ->remove() attempts. The removal
+ * of the device from the driver always succeeds, but the region is
+ * permanently pinned as reserved by the unreleased
* request_mem_region().
*/
return 0;
}
+#endif /* CONFIG_MEMORY_HOTREMOVE */

static struct dax_device_driver device_dax_kmem_driver = {
.drv = {
--
2.21.0

2019-05-02 20:52:15

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> The series of operations look like this:
>
> 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
> and free ramdisk.
> 2. Convert raw pmem0 to devdax
> ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> 3. Hotadd to System RAM
> echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> echo online_movable > /sys/devices/system/memoryXXX/state
> 4. Before reboot hotremove device-dax memory from System RAM
> echo offline > /sys/devices/system/memoryXXX/state
> echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Hi Pavel,

I am working on adding this sort of a workflow into a new daxctl command
(daxctl-reconfigure-device)- this will allow changing the 'mode' of a
dax device to kmem, online the resulting memory, and with your patches,
also attempt to offline the memory, and change back to device-dax.

In running with these patches, and testing the offlining part, I ran
into the following lockdep below.

This is with just these three patches on top of -rc7.


[ +0.004886] ======================================================
[ +0.001576] WARNING: possible circular locking dependency detected
[ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
[ +0.000929] ------------------------------------------------------
[ +0.000708] daxctl/22950 is trying to acquire lock:
[ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
[ +0.000922]
but task is already holding lock:
[ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[ +0.000960]
which lock already depends on the new lock.

[ +0.001001]
the existing dependency chain (in reverse order) is:
[ +0.000837]
-> #3 (mem_sysfs_mutex){+.+.}:
[ +0.000631] __mutex_lock+0x82/0x9a0
[ +0.000477] unregister_memory_section+0x22/0xa0
[ +0.000582] __remove_pages+0xe9/0x520
[ +0.000489] arch_remove_memory+0x81/0xc0
[ +0.000510] devm_memremap_pages_release+0x180/0x270
[ +0.000633] release_nodes+0x234/0x280
[ +0.000483] device_release_driver_internal+0xf4/0x1d0
[ +0.000701] bus_remove_device+0xfc/0x170
[ +0.000529] device_del+0x16a/0x380
[ +0.000459] unregister_dev_dax+0x23/0x50
[ +0.000526] release_nodes+0x234/0x280
[ +0.000487] device_release_driver_internal+0xf4/0x1d0
[ +0.000646] unbind_store+0x9b/0x130
[ +0.000467] kernfs_fop_write+0xf0/0x1a0
[ +0.000510] vfs_write+0xba/0x1c0
[ +0.000438] ksys_write+0x5a/0xe0
[ +0.000521] do_syscall_64+0x60/0x210
[ +0.000489] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ +0.000637]
-> #2 (mem_hotplug_lock.rw_sem){++++}:
[ +0.000717] get_online_mems+0x3e/0x80
[ +0.000491] kmem_cache_create_usercopy+0x2e/0x270
[ +0.000609] kmem_cache_create+0x12/0x20
[ +0.000507] ptlock_cache_init+0x20/0x28
[ +0.000506] start_kernel+0x240/0x4d0
[ +0.000480] secondary_startup_64+0xa4/0xb0
[ +0.000539]
-> #1 (cpu_hotplug_lock.rw_sem){++++}:
[ +0.000784] cpus_read_lock+0x3e/0x80
[ +0.000511] online_pages+0x37/0x310
[ +0.000469] memory_subsys_online+0x34/0x60
[ +0.000611] device_online+0x60/0x80
[ +0.000611] state_store+0x66/0xd0
[ +0.000552] kernfs_fop_write+0xf0/0x1a0
[ +0.000649] vfs_write+0xba/0x1c0
[ +0.000487] ksys_write+0x5a/0xe0
[ +0.000459] do_syscall_64+0x60/0x210
[ +0.000482] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ +0.000646]
-> #0 (kn->count#424){++++}:
[ +0.000669] lock_acquire+0x9e/0x180
[ +0.000471] __kernfs_remove+0x26a/0x310
[ +0.000518] kernfs_remove_by_name_ns+0x40/0x80
[ +0.000583] remove_files.isra.1+0x30/0x70
[ +0.000555] sysfs_remove_group+0x3d/0x80
[ +0.000524] sysfs_remove_groups+0x29/0x40
[ +0.000532] device_remove_attrs+0x42/0x80
[ +0.000522] device_del+0x162/0x380
[ +0.000464] device_unregister+0x16/0x60
[ +0.000505] unregister_memory_section+0x6e/0xa0
[ +0.000591] __remove_pages+0xe9/0x520
[ +0.000492] arch_remove_memory+0x81/0xc0
[ +0.000568] try_remove_memory+0xba/0xd0
[ +0.000510] remove_memory+0x23/0x40
[ +0.000483] dev_dax_kmem_remove+0x29/0x57 [kmem]
[ +0.000608] device_release_driver_internal+0xe4/0x1d0
[ +0.000637] unbind_store+0x9b/0x130
[ +0.000464] kernfs_fop_write+0xf0/0x1a0
[ +0.000685] vfs_write+0xba/0x1c0
[ +0.000594] ksys_write+0x5a/0xe0
[ +0.000449] do_syscall_64+0x60/0x210
[ +0.000481] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ +0.000619]
other info that might help us debug this:

[ +0.000889] Chain exists of:
kn->count#424 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex

[ +0.001269] Possible unsafe locking scenario:

[ +0.000652] CPU0 CPU1
[ +0.000505] ---- ----
[ +0.000523] lock(mem_sysfs_mutex);
[ +0.000422] lock(mem_hotplug_lock.rw_sem);
[ +0.000905] lock(mem_sysfs_mutex);
[ +0.000793] lock(kn->count#424);
[ +0.000394]
*** DEADLOCK ***

[ +0.000665] 7 locks held by daxctl/22950:
[ +0.000458] #0: 000000005f6d3c13 (sb_writers#4){.+.+}, at: vfs_write+0x159/0x1c0
[ +0.000943] #1: 00000000e468825d (&of->mutex){+.+.}, at: kernfs_fop_write+0xbd/0x1a0
[ +0.000895] #2: 00000000caa17dbb (&dev->mutex){....}, at: device_release_driver_internal+0x1a/0x1d0
[ +0.001019] #3: 000000002119b22c (device_hotplug_lock){+.+.}, at: remove_memory+0x16/0x40
[ +0.000942] #4: 00000000150c8efe (cpu_hotplug_lock.rw_sem){++++}, at: try_remove_memory+0x2e/0xd0
[ +0.001019] #5: 000000003d6b2a0f (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x25/0x120
[ +0.001118] #6: 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[ +0.001033]
stack backtrace:
[ +0.000507] CPU: 5 PID: 22950 Comm: daxctl Tainted: G O 5.1.0-rc7+ #13
[ +0.000896] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[ +0.001360] Call Trace:
[ +0.000293] dump_stack+0x85/0xc0
[ +0.000390] print_circular_bug.isra.41.cold.60+0x15c/0x195
[ +0.000651] check_prev_add.constprop.50+0x5fd/0xbe0
[ +0.000563] ? call_rcu_zapped+0x80/0x80
[ +0.000449] __lock_acquire+0xcee/0xfd0
[ +0.000437] lock_acquire+0x9e/0x180
[ +0.000428] ? kernfs_remove_by_name_ns+0x40/0x80
[ +0.000531] __kernfs_remove+0x26a/0x310
[ +0.000451] ? kernfs_remove_by_name_ns+0x40/0x80
[ +0.000529] ? kernfs_name_hash+0x12/0x80
[ +0.000462] kernfs_remove_by_name_ns+0x40/0x80
[ +0.000513] remove_files.isra.1+0x30/0x70
[ +0.000483] sysfs_remove_group+0x3d/0x80
[ +0.000458] sysfs_remove_groups+0x29/0x40
[ +0.000477] device_remove_attrs+0x42/0x80
[ +0.000461] device_del+0x162/0x380
[ +0.000399] device_unregister+0x16/0x60
[ +0.000442] unregister_memory_section+0x6e/0xa0
[ +0.001232] __remove_pages+0xe9/0x520
[ +0.000443] arch_remove_memory+0x81/0xc0
[ +0.000459] try_remove_memory+0xba/0xd0
[ +0.000460] remove_memory+0x23/0x40
[ +0.000461] dev_dax_kmem_remove+0x29/0x57 [kmem]
[ +0.000603] device_release_driver_internal+0xe4/0x1d0
[ +0.000590] unbind_store+0x9b/0x130
[ +0.000409] kernfs_fop_write+0xf0/0x1a0
[ +0.000448] vfs_write+0xba/0x1c0
[ +0.000395] ksys_write+0x5a/0xe0
[ +0.000382] do_syscall_64+0x60/0x210
[ +0.000418] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ +0.000573] RIP: 0033:0x7fd1f7442fa8
[ +0.000407] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 75 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ +0.002119] RSP: 002b:00007ffd48f58e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ +0.000833] RAX: ffffffffffffffda RBX: 000000000210c817 RCX: 00007fd1f7442fa8
[ +0.000795] RDX: 0000000000000007 RSI: 000000000210c817 RDI: 0000000000000003
[ +0.000816] RBP: 0000000000000007 R08: 000000000210c7d0 R09: 00007fd1f74d4e80
[ +0.000808] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ +0.000819] R13: 00007fd1f72b9ce8 R14: 0000000000000000 R15: 00007ffd48f58e70

2019-05-02 21:45:56

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

On Thu, May 2, 2019 at 4:50 PM Verma, Vishal L <[email protected]> wrote:
>
> On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> > The series of operations look like this:
> >
> > 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
> > and free ramdisk.
> > 2. Convert raw pmem0 to devdax
> > ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> > 3. Hotadd to System RAM
> > echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> > echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> > echo online_movable > /sys/devices/system/memoryXXX/state
> > 4. Before reboot hotremove device-dax memory from System RAM
> > echo offline > /sys/devices/system/memoryXXX/state
> > echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>
> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.

Hi Verma,

Thank you for testing. I wonder if there is a command sequence that I
could run to reproduce it?
Also, could you please send your config and qemu arguments.

Thank you,
Pasha

>
>
> [ +0.004886] ======================================================
> [ +0.001576] WARNING: possible circular locking dependency detected
> [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
> [ +0.000929] ------------------------------------------------------
> [ +0.000708] daxctl/22950 is trying to acquire lock:
> [ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000922]
> but task is already holding lock:
> [ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
> [ +0.000960]
> which lock already depends on the new lock.
>
> [ +0.001001]
> the existing dependency chain (in reverse order) is:
> [ +0.000837]
> -> #3 (mem_sysfs_mutex){+.+.}:
> [ +0.000631] __mutex_lock+0x82/0x9a0
> [ +0.000477] unregister_memory_section+0x22/0xa0
> [ +0.000582] __remove_pages+0xe9/0x520
> [ +0.000489] arch_remove_memory+0x81/0xc0
> [ +0.000510] devm_memremap_pages_release+0x180/0x270
> [ +0.000633] release_nodes+0x234/0x280
> [ +0.000483] device_release_driver_internal+0xf4/0x1d0
> [ +0.000701] bus_remove_device+0xfc/0x170
> [ +0.000529] device_del+0x16a/0x380
> [ +0.000459] unregister_dev_dax+0x23/0x50
> [ +0.000526] release_nodes+0x234/0x280
> [ +0.000487] device_release_driver_internal+0xf4/0x1d0
> [ +0.000646] unbind_store+0x9b/0x130
> [ +0.000467] kernfs_fop_write+0xf0/0x1a0
> [ +0.000510] vfs_write+0xba/0x1c0
> [ +0.000438] ksys_write+0x5a/0xe0
> [ +0.000521] do_syscall_64+0x60/0x210
> [ +0.000489] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ +0.000637]
> -> #2 (mem_hotplug_lock.rw_sem){++++}:
> [ +0.000717] get_online_mems+0x3e/0x80
> [ +0.000491] kmem_cache_create_usercopy+0x2e/0x270
> [ +0.000609] kmem_cache_create+0x12/0x20
> [ +0.000507] ptlock_cache_init+0x20/0x28
> [ +0.000506] start_kernel+0x240/0x4d0
> [ +0.000480] secondary_startup_64+0xa4/0xb0
> [ +0.000539]
> -> #1 (cpu_hotplug_lock.rw_sem){++++}:
> [ +0.000784] cpus_read_lock+0x3e/0x80
> [ +0.000511] online_pages+0x37/0x310
> [ +0.000469] memory_subsys_online+0x34/0x60
> [ +0.000611] device_online+0x60/0x80
> [ +0.000611] state_store+0x66/0xd0
> [ +0.000552] kernfs_fop_write+0xf0/0x1a0
> [ +0.000649] vfs_write+0xba/0x1c0
> [ +0.000487] ksys_write+0x5a/0xe0
> [ +0.000459] do_syscall_64+0x60/0x210
> [ +0.000482] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ +0.000646]
> -> #0 (kn->count#424){++++}:
> [ +0.000669] lock_acquire+0x9e/0x180
> [ +0.000471] __kernfs_remove+0x26a/0x310
> [ +0.000518] kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000583] remove_files.isra.1+0x30/0x70
> [ +0.000555] sysfs_remove_group+0x3d/0x80
> [ +0.000524] sysfs_remove_groups+0x29/0x40
> [ +0.000532] device_remove_attrs+0x42/0x80
> [ +0.000522] device_del+0x162/0x380
> [ +0.000464] device_unregister+0x16/0x60
> [ +0.000505] unregister_memory_section+0x6e/0xa0
> [ +0.000591] __remove_pages+0xe9/0x520
> [ +0.000492] arch_remove_memory+0x81/0xc0
> [ +0.000568] try_remove_memory+0xba/0xd0
> [ +0.000510] remove_memory+0x23/0x40
> [ +0.000483] dev_dax_kmem_remove+0x29/0x57 [kmem]
> [ +0.000608] device_release_driver_internal+0xe4/0x1d0
> [ +0.000637] unbind_store+0x9b/0x130
> [ +0.000464] kernfs_fop_write+0xf0/0x1a0
> [ +0.000685] vfs_write+0xba/0x1c0
> [ +0.000594] ksys_write+0x5a/0xe0
> [ +0.000449] do_syscall_64+0x60/0x210
> [ +0.000481] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ +0.000619]
> other info that might help us debug this:
>
> [ +0.000889] Chain exists of:
> kn->count#424 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
>
> [ +0.001269] Possible unsafe locking scenario:
>
> [ +0.000652] CPU0 CPU1
> [ +0.000505] ---- ----
> [ +0.000523] lock(mem_sysfs_mutex);
> [ +0.000422] lock(mem_hotplug_lock.rw_sem);
> [ +0.000905] lock(mem_sysfs_mutex);
> [ +0.000793] lock(kn->count#424);
> [ +0.000394]
> *** DEADLOCK ***
>
> [ +0.000665] 7 locks held by daxctl/22950:
> [ +0.000458] #0: 000000005f6d3c13 (sb_writers#4){.+.+}, at: vfs_write+0x159/0x1c0
> [ +0.000943] #1: 00000000e468825d (&of->mutex){+.+.}, at: kernfs_fop_write+0xbd/0x1a0
> [ +0.000895] #2: 00000000caa17dbb (&dev->mutex){....}, at: device_release_driver_internal+0x1a/0x1d0
> [ +0.001019] #3: 000000002119b22c (device_hotplug_lock){+.+.}, at: remove_memory+0x16/0x40
> [ +0.000942] #4: 00000000150c8efe (cpu_hotplug_lock.rw_sem){++++}, at: try_remove_memory+0x2e/0xd0
> [ +0.001019] #5: 000000003d6b2a0f (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x25/0x120
> [ +0.001118] #6: 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
> [ +0.001033]
> stack backtrace:
> [ +0.000507] CPU: 5 PID: 22950 Comm: daxctl Tainted: G O 5.1.0-rc7+ #13
> [ +0.000896] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
> [ +0.001360] Call Trace:
> [ +0.000293] dump_stack+0x85/0xc0
> [ +0.000390] print_circular_bug.isra.41.cold.60+0x15c/0x195
> [ +0.000651] check_prev_add.constprop.50+0x5fd/0xbe0
> [ +0.000563] ? call_rcu_zapped+0x80/0x80
> [ +0.000449] __lock_acquire+0xcee/0xfd0
> [ +0.000437] lock_acquire+0x9e/0x180
> [ +0.000428] ? kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000531] __kernfs_remove+0x26a/0x310
> [ +0.000451] ? kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000529] ? kernfs_name_hash+0x12/0x80
> [ +0.000462] kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000513] remove_files.isra.1+0x30/0x70
> [ +0.000483] sysfs_remove_group+0x3d/0x80
> [ +0.000458] sysfs_remove_groups+0x29/0x40
> [ +0.000477] device_remove_attrs+0x42/0x80
> [ +0.000461] device_del+0x162/0x380
> [ +0.000399] device_unregister+0x16/0x60
> [ +0.000442] unregister_memory_section+0x6e/0xa0
> [ +0.001232] __remove_pages+0xe9/0x520
> [ +0.000443] arch_remove_memory+0x81/0xc0
> [ +0.000459] try_remove_memory+0xba/0xd0
> [ +0.000460] remove_memory+0x23/0x40
> [ +0.000461] dev_dax_kmem_remove+0x29/0x57 [kmem]
> [ +0.000603] device_release_driver_internal+0xe4/0x1d0
> [ +0.000590] unbind_store+0x9b/0x130
> [ +0.000409] kernfs_fop_write+0xf0/0x1a0
> [ +0.000448] vfs_write+0xba/0x1c0
> [ +0.000395] ksys_write+0x5a/0xe0
> [ +0.000382] do_syscall_64+0x60/0x210
> [ +0.000418] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ +0.000573] RIP: 0033:0x7fd1f7442fa8
> [ +0.000407] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 75 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> [ +0.002119] RSP: 002b:00007ffd48f58e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ +0.000833] RAX: ffffffffffffffda RBX: 000000000210c817 RCX: 00007fd1f7442fa8
> [ +0.000795] RDX: 0000000000000007 RSI: 000000000210c817 RDI: 0000000000000003
> [ +0.000816] RBP: 0000000000000007 R08: 000000000210c7d0 R09: 00007fd1f74d4e80
> [ +0.000808] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> [ +0.000819] R13: 00007fd1f72b9ce8 R14: 0000000000000000 R15: 00007ffd48f58e70

2019-05-02 22:32:38

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

On Thu, 2019-05-02 at 17:44 -0400, Pavel Tatashin wrote:

> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
>
> Hi Verma,
>
> Thank you for testing. I wonder if there is a command sequence that I
> could run to reproduce it?
> Also, could you please send your config and qemu arguments.
>
Yes, here is the qemu config:

qemu-system-x86_64
-machine accel=kvm
-machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
-cpu Haswell-noTSX
-m 12G,slots=3,maxmem=44G
-realtime mlock=off
-smp 8,sockets=2,cores=4,threads=1
-numa node,nodeid=0,cpus=0-3,mem=6G
-numa node,nodeid=1,cpus=4-7,mem=6G
-numa node,nodeid=2
-numa node,nodeid=3
-drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
-object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
-device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
-object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
-device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
-serial stdio
-display none

For the command list - I'm using WIP patches to ndctl/daxctl to add the
command I mentioned earlier. Using this command, I can reproduce the
lockdep issue. I thought I should be able to reproduce the issue by
onlining/offlining through sysfs directly too - something like:

node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
for mem in /sys/devices/system/node/node"$node"/memory*; do
echo "offline" > $mem/state
done

But with that I can't reproduce the problem.

I'll try to dig a bit deeper into what might be happening, the daxctl
modifications simply amount to doing the same thing as above in C, so
I'm not immediately sure what might be happening.

If you're interested, I can post the ndctl patches - maybe as an RFC -
to test with.

Thanks,
-Vishal



2019-05-02 22:39:25

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

On Thu, May 2, 2019 at 6:29 PM Verma, Vishal L <[email protected]> wrote:
>
> On Thu, 2019-05-02 at 17:44 -0400, Pavel Tatashin wrote:
>
> > > In running with these patches, and testing the offlining part, I ran
> > > into the following lockdep below.
> > >
> > > This is with just these three patches on top of -rc7.
> >
> > Hi Verma,
> >
> > Thank you for testing. I wonder if there is a command sequence that I
> > could run to reproduce it?
> > Also, could you please send your config and qemu arguments.
> >
> Yes, here is the qemu config:
>
> qemu-system-x86_64
> -machine accel=kvm
> -machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
> -cpu Haswell-noTSX
> -m 12G,slots=3,maxmem=44G
> -realtime mlock=off
> -smp 8,sockets=2,cores=4,threads=1
> -numa node,nodeid=0,cpus=0-3,mem=6G
> -numa node,nodeid=1,cpus=4-7,mem=6G
> -numa node,nodeid=2
> -numa node,nodeid=3
> -drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
> -object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
> -device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
> -object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
> -device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
> -serial stdio
> -display none
>
> For the command list - I'm using WIP patches to ndctl/daxctl to add the
> command I mentioned earlier. Using this command, I can reproduce the
> lockdep issue. I thought I should be able to reproduce the issue by
> onlining/offlining through sysfs directly too - something like:
>
> node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
> for mem in /sys/devices/system/node/node"$node"/memory*; do
> echo "offline" > $mem/state
> done
>
> But with that I can't reproduce the problem.
>
> I'll try to dig a bit deeper into what might be happening, the daxctl
> modifications simply amount to doing the same thing as above in C, so
> I'm not immediately sure what might be happening.
>
> If you're interested, I can post the ndctl patches - maybe as an RFC -
> to test with.

I could apply the patches and test with them. Also, could you please
send your kernel config.

Thank you,
Pasha

>
> Thanks,
> -Vishal
>
>
>

2019-05-03 10:29:15

by David Hildenbrand

[permalink] [raw]
Subject: Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

On 02.05.19 20:43, Pavel Tatashin wrote:
> As of right now remove_memory() interface is inherently broken. It tries
> to remove memory but panics if some memory is not offline. The problem
> is that it is impossible to ensure that all memory blocks are offline as
> this function also takes lock_device_hotplug that is required to
> change memory state via sysfs.
>

The existing interface can actually work today by registering a hotplug
notifier and rejecting any onlining attempts. But I agree that this way,
the interface becomes more usable.

> So, between calling this function and offlining all memory blocks there
> is always a window when lock_device_hotplug is released, and therefore,
> there is always a chance for a panic during this window.
>
> Make this interface to return an error if memory removal fails. This way
> it is safe to call this function without panicking machine, and also
> makes it symmetric to add_memory() which already returns an error.
>
> Signed-off-by: Pavel Tatashin <[email protected]>> ---
> include/linux/memory_hotplug.h | 8 +++--
> mm/memory_hotplug.c | 61 ++++++++++++++++++++++------------
> 2 files changed, 46 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 8ade08c50d26..5438a2d92560 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -304,7 +304,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
> extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
> extern void try_offline_node(int nid);
> extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
> extern void __remove_memory(int nid, u64 start, u64 size);
>
> #else
> @@ -321,7 +321,11 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> return -EINVAL;
> }
>
> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline bool remove_memory(int nid, u64 start, u64 size)
> +{
> + return -EBUSY;
> +}
> +
> static inline void __remove_memory(int nid, u64 start, u64 size) {}
> #endif /* CONFIG_MEMORY_HOTREMOVE */
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8c454e82d4f6..a826aededa1a 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1778,9 +1778,10 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
> endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
> pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
> &beginpa, &endpa);
> - }
>
> - return ret;
> + return -EBUSY;
> + }
> + return 0;
> }
>
> static int check_cpu_on_node(pg_data_t *pgdat)
> @@ -1843,19 +1844,9 @@ void try_offline_node(int nid)
> }
> EXPORT_SYMBOL(try_offline_node);
>
> -/**
> - * remove_memory
> - * @nid: the node ID
> - * @start: physical address of the region to remove
> - * @size: size of the region to remove
> - *
> - * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> - * and online/offline operations before this call, as required by
> - * try_offline_node().
> - */
> -void __ref __remove_memory(int nid, u64 start, u64 size)
> +static int __ref try_remove_memory(int nid, u64 start, u64 size)
> {
> - int ret;
> + int rc = 0;
>
> BUG_ON(check_hotplug_memory_range(start, size));
>
> @@ -1863,13 +1854,13 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>
> /*
> * All memory blocks must be offlined before removing memory. Check
> - * whether all memory blocks in question are offline and trigger a BUG()
> + * whether all memory blocks in question are offline and return error
> * if this is not the case.
> */
> - ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> - check_memblock_offlined_cb);
> - if (ret)
> - BUG();
> + rc = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> + check_memblock_offlined_cb);
> + if (rc)
> + goto done;
>
> /* remove memmap entry */
> firmware_map_remove(start, start + size, "System RAM");
> @@ -1879,14 +1870,42 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>
> try_offline_node(nid);
>
> +done:
> mem_hotplug_done();
> + return rc;
> }
>
> -void remove_memory(int nid, u64 start, u64 size)
> +/**
> + * remove_memory
> + * @nid: the node ID
> + * @start: physical address of the region to remove
> + * @size: size of the region to remove
> + *
> + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> + * and online/offline operations before this call, as required by
> + * try_offline_node().
> + */
> +void __remove_memory(int nid, u64 start, u64 size)
> {
> +
> + /*
> + * trigger BUG() is some memory is not offlined prior to calling this
> + * function
> + */
> + if (try_remove_memory(nid, start, size))
> + BUG();
> +}
> +
> +/* Remove memory if every memory block is offline, otherwise return false */

Comment is wrong "return false"

> +int remove_memory(int nid, u64 start, u64 size)
> +{
> + int rc;
> +
> lock_device_hotplug();
> - __remove_memory(nid, start, size);
> + rc = try_remove_memory(nid, start, size);
> unlock_device_hotplug();
> +
> + return rc;
> }
> EXPORT_SYMBOL_GPL(remove_memory);
> #endif /* CONFIG_MEMORY_HOTREMOVE */
>


Looks sane to me

Reviewed-by: David Hildenbrand <[email protected]>

--

Thanks,

David / dhildenb

2019-05-03 22:30:34

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

On Thu, 2019-05-02 at 18:36 -0400, Pavel Tatashin wrote:
> > Yes, here is the qemu config:
> >
> > qemu-system-x86_64
> > -machine accel=kvm
> > -machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
> > -cpu Haswell-noTSX
> > -m 12G,slots=3,maxmem=44G
> > -realtime mlock=off
> > -smp 8,sockets=2,cores=4,threads=1
> > -numa node,nodeid=0,cpus=0-3,mem=6G
> > -numa node,nodeid=1,cpus=4-7,mem=6G
> > -numa node,nodeid=2
> > -numa node,nodeid=3
> > -drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
> > -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
> > -object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
> > -device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
> > -object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
> > -device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
> > -serial stdio
> > -display none
> >
> > For the command list - I'm using WIP patches to ndctl/daxctl to add the
> > command I mentioned earlier. Using this command, I can reproduce the
> > lockdep issue. I thought I should be able to reproduce the issue by
> > onlining/offlining through sysfs directly too - something like:
> >
> > node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
> > for mem in /sys/devices/system/node/node"$node"/memory*; do
> > echo "offline" > $mem/state
> > done
> >
> > But with that I can't reproduce the problem.
> >
> > I'll try to dig a bit deeper into what might be happening, the daxctl
> > modifications simply amount to doing the same thing as above in C, so
> > I'm not immediately sure what might be happening.
> >
> > If you're interested, I can post the ndctl patches - maybe as an RFC -
> > to test with.
>
> I could apply the patches and test with them. Also, could you please
> send your kernel config.
>
Hi Pavel,

I've CC'd you on the patches mentioned above, and also pushed them to a
'kmem-pending' branch on github:

https://github.com/pmem/ndctl/tree/kmem-pending

After building ndctl from the above, you will want to run:

# daxctl reconfigure-device --mode=system-ram dax0.0
(this will also have onlined the memory sections)

# daxctl reconfigure-device --mode=devdax --attempt-offline dax0.0
(this triggers the lockdep warnings)

I've attached the kernel config here too (gzipped).

Thanks,
-Vishal


Attachments:
config.gz (29.66 kB)
smime.p7s (3.19 kB)
Download all attachments

2019-05-06 17:58:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline bool remove_memory(int nid, u64 start, u64 size)
> +{
> + return -EBUSY;
> +}

This seems like an appropriate place for a WARN_ONCE(), if someone
manages to call remove_memory() with hotplug disabled.

BTW, I looked and can't think of a better errno, but -EBUSY probably
isn't the best error code, right?

> -void remove_memory(int nid, u64 start, u64 size)
> +/**
> + * remove_memory
> + * @nid: the node ID
> + * @start: physical address of the region to remove
> + * @size: size of the region to remove
> + *
> + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> + * and online/offline operations before this call, as required by
> + * try_offline_node().
> + */
> +void __remove_memory(int nid, u64 start, u64 size)
> {
> +
> + /*
> + * trigger BUG() is some memory is not offlined prior to calling this
> + * function
> + */
> + if (try_remove_memory(nid, start, size))
> + BUG();
> +}

Could we call this remove_offline_memory()? That way, it makes _some_
sense why we would BUG() if the memory isn't offline.

2019-05-06 18:02:42

by Dan Williams

[permalink] [raw]
Subject: Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

On Mon, May 6, 2019 at 10:57 AM Dave Hansen <[email protected]> wrote:
>
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > +{
> > + return -EBUSY;
> > +}
>
> This seems like an appropriate place for a WARN_ONCE(), if someone
> manages to call remove_memory() with hotplug disabled.
>
> BTW, I looked and can't think of a better errno, but -EBUSY probably
> isn't the best error code, right?
>
> > -void remove_memory(int nid, u64 start, u64 size)
> > +/**
> > + * remove_memory
> > + * @nid: the node ID
> > + * @start: physical address of the region to remove
> > + * @size: size of the region to remove
> > + *
> > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > + * and online/offline operations before this call, as required by
> > + * try_offline_node().
> > + */
> > +void __remove_memory(int nid, u64 start, u64 size)
> > {
> > +
> > + /*
> > + * trigger BUG() is some memory is not offlined prior to calling this
> > + * function
> > + */
> > + if (try_remove_memory(nid, start, size))
> > + BUG();
> > +}
>
> Could we call this remove_offline_memory()? That way, it makes _some_
> sense why we would BUG() if the memory isn't offline.

Please WARN() instead of BUG() because failing to remove memory should
not be system fatal.

2019-05-06 18:06:37

by Dave Hansen

[permalink] [raw]
Subject: Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

On 5/6/19 11:01 AM, Dan Williams wrote:
>>> +void __remove_memory(int nid, u64 start, u64 size)
>>> {
>>> +
>>> + /*
>>> + * trigger BUG() is some memory is not offlined prior to calling this
>>> + * function
>>> + */
>>> + if (try_remove_memory(nid, start, size))
>>> + BUG();
>>> +}
>> Could we call this remove_offline_memory()? That way, it makes _some_
>> sense why we would BUG() if the memory isn't offline.
> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.

That is my preference as well. But, the existing code BUG()s, so I'm
OK-ish with this staying for the moment until we have a better handle on
what all the callers do if this fails.

2019-05-06 18:15:08

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

On Mon, May 6, 2019 at 1:57 PM Dave Hansen <[email protected]> wrote:
>
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > +{
> > + return -EBUSY;
> > +}
>
> This seems like an appropriate place for a WARN_ONCE(), if someone
> manages to call remove_memory() with hotplug disabled.
>
> BTW, I looked and can't think of a better errno, but -EBUSY probably
> isn't the best error code, right?

Same here, I looked and did not find any better then -EBUSY. Also, it
is close to check_cpu_on_node() in the same file.

>
> > -void remove_memory(int nid, u64 start, u64 size)
> > +/**
> > + * remove_memory
> > + * @nid: the node ID
> > + * @start: physical address of the region to remove
> > + * @size: size of the region to remove
> > + *
> > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > + * and online/offline operations before this call, as required by
> > + * try_offline_node().
> > + */
> > +void __remove_memory(int nid, u64 start, u64 size)
> > {
> > +
> > + /*
> > + * trigger BUG() is some memory is not offlined prior to calling this
> > + * function
> > + */
> > + if (try_remove_memory(nid, start, size))
> > + BUG();
> > +}
>
> Could we call this remove_offline_memory()? That way, it makes _some_
> sense why we would BUG() if the memory isn't offline.

Sure, I will rename this function.

Thank you,
Pasha

2019-05-06 18:19:45

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

On Mon, May 6, 2019 at 2:04 PM Dave Hansen <[email protected]> wrote:
>
> On 5/6/19 11:01 AM, Dan Williams wrote:
> >>> +void __remove_memory(int nid, u64 start, u64 size)
> >>> {
> >>> +
> >>> + /*
> >>> + * trigger BUG() is some memory is not offlined prior to calling this
> >>> + * function
> >>> + */
> >>> + if (try_remove_memory(nid, start, size))
> >>> + BUG();
> >>> +}
> >> Could we call this remove_offline_memory()? That way, it makes _some_
> >> sense why we would BUG() if the memory isn't offline.
> > Please WARN() instead of BUG() because failing to remove memory should
> > not be system fatal.
>
> That is my preference as well. But, the existing code BUG()s, so I'm
> OK-ish with this staying for the moment until we have a better handle on
> what all the callers do if this fails.

Yes, this is the reason why I BUG() here. The current code does this,
and I was not sure what would happen if we simply continue executing.
Of course, I would prefer to return failure, so the callers can act
appropriately, but let's make one thing at a time, this should not be
part of this series.

Thank you,
Pasha

2019-05-15 18:13:34

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.
>
>
> [ +0.004886] ======================================================
> [ +0.001576] WARNING: possible circular locking dependency detected
> [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
> [ +0.000929] ------------------------------------------------------
> [ +0.000708] daxctl/22950 is trying to acquire lock:
> [ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000922]
> but task is already holding lock:
> [ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0

I have studied this issue, and now have a clear understanding why it
happens, I am not yet sure how to fix it, so suggestions are welcomed
:)

Here is the problem:

When we offline pages we have the following call stack:

# echo offline > /sys/devices/system/memory/memory8/state
ksys_write
vfs_write
__vfs_write
kernfs_fop_write
kernfs_get_active
lock_acquire kn->count#122 (lock for
"memory8/state" kn)
sysfs_kf_write
dev_attr_store
state_store
device_offline
memory_subsys_offline
memory_block_action
offline_pages
__offline_pages
percpu_down_write
down_write
lock_acquire mem_hotplug_lock.rw_sem

When we unbind dax0.0 we have the following stack:
# echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
drv_attr_store
unbind_store
device_driver_detach
device_release_driver_internal
dev_dax_kmem_remove
remove_memory device_hotplug_lock
try_remove_memory mem_hotplug_lock.rw_sem
arch_remove_memory
__remove_pages
__remove_section
unregister_memory_section
remove_memory_section mem_sysfs_mutex
unregister_memory
device_unregister
device_del
device_remove_attrs
sysfs_remove_groups
sysfs_remove_group
remove_files
kernfs_remove_by_name
kernfs_remove_by_name_ns
__kernfs_remove kn->count#122

So, lockdep found the ordering issue with the above two stacks:

1. kn->count#122 -> mem_hotplug_lock.rw_sem
2. mem_hotplug_lock.rw_sem -> kn->count#122

2019-05-16 01:51:57

by Dan Williams

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
<[email protected]> wrote:
>
> > Hi Pavel,
> >
> > I am working on adding this sort of a workflow into a new daxctl command
> > (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> > dax device to kmem, online the resulting memory, and with your patches,
> > also attempt to offline the memory, and change back to device-dax.
> >
> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
> >
> >
> > [ +0.004886] ======================================================
> > [ +0.001576] WARNING: possible circular locking dependency detected
> > [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
> > [ +0.000929] ------------------------------------------------------
> > [ +0.000708] daxctl/22950 is trying to acquire lock:
> > [ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> > [ +0.000922]
> > but task is already holding lock:
> > [ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>
> I have studied this issue, and now have a clear understanding why it
> happens, I am not yet sure how to fix it, so suggestions are welcomed
> :)

I would think that ACPI hotplug would have a similar problem, but it does this:

acpi_unbind_memory_blocks(info);
__remove_memory(nid, info->start_addr, info->length);

I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.


>
> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
> vfs_write
> __vfs_write
> kernfs_fop_write
> kernfs_get_active
> lock_acquire kn->count#122 (lock for
> "memory8/state" kn)
> sysfs_kf_write
> dev_attr_store
> state_store
> device_offline
> memory_subsys_offline
> memory_block_action
> offline_pages
> __offline_pages
> percpu_down_write
> down_write
> lock_acquire mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following stack:
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
> unbind_store
> device_driver_detach
> device_release_driver_internal
> dev_dax_kmem_remove
> remove_memory device_hotplug_lock
> try_remove_memory mem_hotplug_lock.rw_sem
> arch_remove_memory
> __remove_pages
> __remove_section
> unregister_memory_section
> remove_memory_section mem_sysfs_mutex
> unregister_memory
> device_unregister
> device_del
> device_remove_attrs
> sysfs_remove_groups
> sysfs_remove_group
> remove_files
> kernfs_remove_by_name
> kernfs_remove_by_name_ns
> __kernfs_remove kn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122

2019-05-16 07:12:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

On 16.05.19 02:42, Dan Williams wrote:
> On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
> <[email protected]> wrote:
>>
>>> Hi Pavel,
>>>
>>> I am working on adding this sort of a workflow into a new daxctl command
>>> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
>>> dax device to kmem, online the resulting memory, and with your patches,
>>> also attempt to offline the memory, and change back to device-dax.
>>>
>>> In running with these patches, and testing the offlining part, I ran
>>> into the following lockdep below.
>>>
>>> This is with just these three patches on top of -rc7.
>>>
>>>
>>> [ +0.004886] ======================================================
>>> [ +0.001576] WARNING: possible circular locking dependency detected
>>> [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
>>> [ +0.000929] ------------------------------------------------------
>>> [ +0.000708] daxctl/22950 is trying to acquire lock:
>>> [ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
>>> [ +0.000922]
>>> but task is already holding lock:
>>> [ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>>
>> I have studied this issue, and now have a clear understanding why it
>> happens, I am not yet sure how to fix it, so suggestions are welcomed
>> :)
>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);
>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
>

If that doesn't help, after we have

[PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling

we could probably pull the memory device removal phase out from the
mem_hotplug_lock protection and let it be protected by the
device_hotplug_lock only. Might require some more work, though.

--

Thanks,

David / dhildenb

2019-05-17 14:29:21

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);

ACPI does have exactly the same problem, so this is not a bug for this
series, I will submit a new version of my series with comments
addressed, but without fix for this issue.

I was able to reproduce this issue on the current mainline kernel.
Also, I been thinking more about how to fix it, and there is no easy
fix without a major hotplug redesign. Basically, we have to remove
sysfs memory entries before or after memory is hotplugged/hotremoved.
But, we also have to guarantee that hotplug/hotremove will succeed or
reinstate sysfs entries.

Qemu script:

qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-parallel none \
-echr 1 \
-serial none \
-chardev stdio,id=console,signal=off,mux=on \
-serial chardev:console \
-mon chardev=console \
-vga none \
-display none \
-kernel pmem/native/arch/x86/boot/bzImage \
-m 8G,slots=1,maxmem=16G \
-smp 8 \
-fsdev local,id=virtfs1,path=/,security_model=none \
-device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
-append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)

# echo online_movable > /sys/devices/system/memory/memory79/state
[ 23.029552] Built 1 zonelists, mobility grouping on. Total pages: 2045370
[ 23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [ 32.013950] Offlined Pages 32768
[ 32.014307] Built 1 zonelists, mobility grouping on. Total pages: 2031022
[ 32.014843] Policy zone: Normal
[ 32.015733]
[ 32.015881] ======================================================
[ 32.016390] WARNING: possible circular locking dependency detected
[ 32.016881] 5.1.0_pt_pmem #38 Not tainted
[ 32.017202] ------------------------------------------------------
[ 32.017680] kworker/u16:4/380 is trying to acquire lock:
[ 32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
kernfs_remove_by_name_ns+0x3b/0x80
[ 32.018745]
[ 32.018745] but task is already holding lock:
[ 32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[ 32.019859]
[ 32.019859] which lock already depends on the new lock.
[ 32.019859]
[ 32.020499]
[ 32.020499] the existing dependency chain (in reverse order) is:
[ 32.021080]
[ 32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[ 32.021522] __mutex_lock+0x8b/0x900
[ 32.021843] hotplug_memory_register+0x26/0xa0
[ 32.022231] __add_pages+0xe7/0x160
[ 32.022545] add_pages+0xd/0x60
[ 32.022835] add_memory_resource+0xc3/0x1d0
[ 32.023207] __add_memory+0x57/0x80
[ 32.023530] acpi_memory_device_add+0x13a/0x2d0
[ 32.023928] acpi_bus_attach+0xf1/0x200
[ 32.024272] acpi_bus_scan+0x3e/0x90
[ 32.024597] acpi_device_hotplug+0x284/0x3e0
[ 32.024972] acpi_hotplug_work_fn+0x15/0x20
[ 32.025342] process_one_work+0x2a0/0x650
[ 32.025755] worker_thread+0x34/0x3d0
[ 32.026077] kthread+0x118/0x130
[ 32.026442] ret_from_fork+0x3a/0x50
[ 32.026766]
[ 32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
[ 32.027261] get_online_mems+0x39/0x80
[ 32.027600] kmem_cache_create_usercopy+0x29/0x2c0
[ 32.028019] kmem_cache_create+0xd/0x10
[ 32.028367] ptlock_cache_init+0x1b/0x23
[ 32.028724] start_kernel+0x1d2/0x4b8
[ 32.029060] secondary_startup_64+0xa4/0xb0
[ 32.029447]
[ 32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[ 32.030007] cpus_read_lock+0x39/0x80
[ 32.030360] __offline_pages+0x32/0x790
[ 32.030709] memory_subsys_offline+0x3a/0x60
[ 32.031089] device_offline+0x7e/0xb0
[ 32.031425] acpi_bus_offline+0xd8/0x140
[ 32.031821] acpi_device_hotplug+0x1b2/0x3e0
[ 32.032202] acpi_hotplug_work_fn+0x15/0x20
[ 32.032576] process_one_work+0x2a0/0x650
[ 32.032942] worker_thread+0x34/0x3d0
[ 32.033283] kthread+0x118/0x130
[ 32.033588] ret_from_fork+0x3a/0x50
[ 32.033919]
[ 32.033919] -> #1 (&device->physical_node_lock){+.+.}:
[ 32.034450] __mutex_lock+0x8b/0x900
[ 32.034784] acpi_get_first_physical_node+0x16/0x60
[ 32.035217] acpi_companion_match+0x3b/0x60
[ 32.035594] acpi_device_uevent_modalias+0x9/0x20
[ 32.036012] platform_uevent+0xd/0x40
[ 32.036352] dev_uevent+0x85/0x1c0
[ 32.036674] kobject_uevent_env+0x1e2/0x640
[ 32.037044] kobject_synth_uevent+0x2b7/0x324
[ 32.037428] uevent_store+0x17/0x30
[ 32.037752] kernfs_fop_write+0xeb/0x1a0
[ 32.038112] vfs_write+0xb2/0x1b0
[ 32.038417] ksys_write+0x57/0xd0
[ 32.038721] do_syscall_64+0x4b/0x1a0
[ 32.039053] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 32.039491]
[ 32.039491] -> #0 (kn->count#18){++++}:
[ 32.039913] lock_acquire+0xaa/0x180
[ 32.040242] __kernfs_remove+0x244/0x2d0
[ 32.040593] kernfs_remove_by_name_ns+0x3b/0x80
[ 32.040991] device_del+0x14a/0x370
[ 32.041309] device_unregister+0x9/0x20
[ 32.041653] unregister_memory_section+0x69/0xa0
[ 32.042059] __remove_pages+0x112/0x460
[ 32.042402] arch_remove_memory+0x6f/0xa0
[ 32.042758] __remove_memory+0xab/0x130
[ 32.043103] acpi_memory_device_remove+0x67/0xe0
[ 32.043537] acpi_bus_trim+0x50/0x90
[ 32.043889] acpi_device_hotplug+0x2fa/0x3e0
[ 32.044300] acpi_hotplug_work_fn+0x15/0x20
[ 32.044686] process_one_work+0x2a0/0x650
[ 32.045044] worker_thread+0x34/0x3d0
[ 32.045381] kthread+0x118/0x130
[ 32.045679] ret_from_fork+0x3a/0x50
[ 32.046005]
[ 32.046005] other info that might help us debug this:
[ 32.046005]
[ 32.046636] Chain exists of:
[ 32.046636] kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[ 32.046636]
[ 32.047514] Possible unsafe locking scenario:
[ 32.047514]
[ 32.047976] CPU0 CPU1
[ 32.048337] ---- ----
[ 32.048697] lock(mem_sysfs_mutex);
[ 32.048983] lock(mem_hotplug_lock.rw_sem);
[ 32.049519] lock(mem_sysfs_mutex);
[ 32.050004] lock(kn->count#18);
[ 32.050270]
[ 32.050270] *** DEADLOCK ***
[ 32.050270]
[ 32.050736] 7 locks held by kworker/u16:4/380:
[ 32.051087] #0: 00000000a22fe78e
((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
[ 32.051830] #1: 00000000944f2dca
((work_completion)(&hpw->work)){+.+.}, at:
process_one_work+0x21e/0x650
[ 32.052577] #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
acpi_device_hotplug+0x2e/0x3e0
[ 32.053271] #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
acpi_device_hotplug+0x3c/0x3e0
[ 32.053916] #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
at: __remove_memory+0x3b/0x130
[ 32.054602] #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
at: percpu_down_write+0x1d/0x110
[ 32.055315] #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[ 32.056016]
[ 32.056016] stack backtrace:
[ 32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
[ 32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[ 32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 32.058144] Call Trace:
[ 32.058344] dump_stack+0x67/0x90
[ 32.058604] print_circular_bug.cold.60+0x15c/0x195
[ 32.058989] __lock_acquire+0x17de/0x1d30
[ 32.059308] ? find_held_lock+0x2d/0x90
[ 32.059611] ? __kernfs_remove+0x199/0x2d0
[ 32.059937] lock_acquire+0xaa/0x180
[ 32.060223] ? kernfs_remove_by_name_ns+0x3b/0x80
[ 32.060596] __kernfs_remove+0x244/0x2d0
[ 32.060908] ? kernfs_remove_by_name_ns+0x3b/0x80
[ 32.061283] ? kernfs_name_hash+0xd/0x80
[ 32.061596] ? kernfs_find_ns+0x68/0xf0
[ 32.061907] kernfs_remove_by_name_ns+0x3b/0x80
[ 32.062266] device_del+0x14a/0x370
[ 32.062548] ? unregister_mem_sect_under_nodes+0x4f/0xc0
[ 32.062973] device_unregister+0x9/0x20
[ 32.063285] unregister_memory_section+0x69/0xa0
[ 32.063651] __remove_pages+0x112/0x460
[ 32.063949] arch_remove_memory+0x6f/0xa0
[ 32.064271] __remove_memory+0xab/0x130
[ 32.064579] ? walk_memory_range+0xa1/0xe0
[ 32.064907] acpi_memory_device_remove+0x67/0xe0
[ 32.065274] acpi_bus_trim+0x50/0x90
[ 32.065560] acpi_device_hotplug+0x2fa/0x3e0
[ 32.065900] acpi_hotplug_work_fn+0x15/0x20
[ 32.066249] process_one_work+0x2a0/0x650
[ 32.066591] worker_thread+0x34/0x3d0
[ 32.066925] ? process_one_work+0x650/0x650
[ 32.067275] kthread+0x118/0x130
[ 32.067542] ? kthread_create_on_node+0x60/0x60
[ 32.067909] ret_from_fork+0x3a/0x50

>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
>
>
> >
> > Here is the problem:
> >
> > When we offline pages we have the following call stack:
> >
> > # echo offline > /sys/devices/system/memory/memory8/state
> > ksys_write
> > vfs_write
> > __vfs_write
> > kernfs_fop_write
> > kernfs_get_active
> > lock_acquire kn->count#122 (lock for
> > "memory8/state" kn)
> > sysfs_kf_write
> > dev_attr_store
> > state_store
> > device_offline
> > memory_subsys_offline
> > memory_block_action
> > offline_pages
> > __offline_pages
> > percpu_down_write
> > down_write
> > lock_acquire mem_hotplug_lock.rw_sem
> >
> > When we unbind dax0.0 we have the following stack:
> > # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> > drv_attr_store
> > unbind_store
> > device_driver_detach
> > device_release_driver_internal
> > dev_dax_kmem_remove
> > remove_memory device_hotplug_lock
> > try_remove_memory mem_hotplug_lock.rw_sem
> > arch_remove_memory
> > __remove_pages
> > __remove_section
> > unregister_memory_section
> > remove_memory_section mem_sysfs_mutex
> > unregister_memory
> > device_unregister
> > device_del
> > device_remove_attrs
> > sysfs_remove_groups
> > sysfs_remove_group
> > remove_files
> > kernfs_remove_by_name
> > kernfs_remove_by_name_ns
> > __kernfs_remove kn->count#122
> >
> > So, lockdep found the ordering issue with the above two stacks:
> >
> > 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> > 2. mem_hotplug_lock.rw_sem -> kn->count#122


Attachments:
x86.config.bz2 (24.12 kB)

2019-05-17 19:18:27

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

Hi Dan,

Thank you very much for your review, my comments below:

On Mon, May 6, 2019 at 2:01 PM Dan Williams <[email protected]> wrote:
>
> On Mon, May 6, 2019 at 10:57 AM Dave Hansen <[email protected]> wrote:
> >
> > > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > > +{
> > > + return -EBUSY;
> > > +}
> >
> > This seems like an appropriate place for a WARN_ONCE(), if someone
> > manages to call remove_memory() with hotplug disabled.

I decided not to do WARN_ONCE(), in all likelihood compiler will
simply optimize this function out, but with WARN_ONCE() some traces of
it will remain.

> >
> > BTW, I looked and can't think of a better errno, but -EBUSY probably
> > isn't the best error code, right?

-EBUSY is the only error that is returned in case of error by real
remove_memory(), so I think it is OK to keep it here.

> >
> > > -void remove_memory(int nid, u64 start, u64 size)
> > > +/**
> > > + * remove_memory
> > > + * @nid: the node ID
> > > + * @start: physical address of the region to remove
> > > + * @size: size of the region to remove
> > > + *
> > > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > > + * and online/offline operations before this call, as required by
> > > + * try_offline_node().
> > > + */
> > > +void __remove_memory(int nid, u64 start, u64 size)
> > > {
> > > +
> > > + /*
> > > + * trigger BUG() is some memory is not offlined prior to calling this
> > > + * function
> > > + */
> > > + if (try_remove_memory(nid, start, size))
> > > + BUG();
> > > +}
> >
> > Could we call this remove_offline_memory()? That way, it makes _some_
> > sense why we would BUG() if the memory isn't offline.

It is this particular code path, the second one: remove_memory(),
actually tries to remove memory and returns failure if it can't. So, I
think the current name is OK.

>
> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.

As mentioned earlier, I will keep BUG(), because existing code does
that, and there is no good handling of this code to return on error.

Thank you,
Pavel

2019-05-20 09:20:54

by David Hildenbrand

[permalink] [raw]
Subject: Re: [v5 0/3] "Hotremove" persistent memory

On 17.05.19 16:09, Pavel Tatashin wrote:
>>
>> I would think that ACPI hotplug would have a similar problem, but it does this:
>>
>> acpi_unbind_memory_blocks(info);
>> __remove_memory(nid, info->start_addr, info->length);
>
> ACPI does have exactly the same problem, so this is not a bug for this
> series, I will submit a new version of my series with comments
> addressed, but without fix for this issue.
>
> I was able to reproduce this issue on the current mainline kernel.
> Also, I been thinking more about how to fix it, and there is no easy
> fix without a major hotplug redesign. Basically, we have to remove
> sysfs memory entries before or after memory is hotplugged/hotremoved.
> But, we also have to guarantee that hotplug/hotremove will succeed or
> reinstate sysfs entries.
>
> Qemu script:
>
> qemu-system-x86_64 \
> -enable-kvm \
> -cpu host \
> -parallel none \
> -echr 1 \
> -serial none \
> -chardev stdio,id=console,signal=off,mux=on \
> -serial chardev:console \
> -mon chardev=console \
> -vga none \
> -display none \
> -kernel pmem/native/arch/x86/boot/bzImage \
> -m 8G,slots=1,maxmem=16G \
> -smp 8 \
> -fsdev local,id=virtfs1,path=/,security_model=none \
> -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
> -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
> TERM=xterm ip=dhcp loglevel=7'
>
> Config is attached.
>
> Steps to reproduce:
> #
> # QEMU 4.0.0 monitor - type 'help' for more information
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> (qemu)
>
> # echo online_movable > /sys/devices/system/memory/memory79/state
> [ 23.029552] Built 1 zonelists, mobility grouping on. Total pages: 2045370
> [ 23.032591] Policy zone: Normal
> # (qemu) device_del dimm1
> (qemu) [ 32.013950] Offlined Pages 32768
> [ 32.014307] Built 1 zonelists, mobility grouping on. Total pages: 2031022
> [ 32.014843] Policy zone: Normal
> [ 32.015733]
> [ 32.015881] ======================================================
> [ 32.016390] WARNING: possible circular locking dependency detected
> [ 32.016881] 5.1.0_pt_pmem #38 Not tainted
> [ 32.017202] ------------------------------------------------------
> [ 32.017680] kworker/u16:4/380 is trying to acquire lock:
> [ 32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
> kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.018745]
> [ 32.018745] but task is already holding lock:
> [ 32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [ 32.019859]
> [ 32.019859] which lock already depends on the new lock.
> [ 32.019859]
> [ 32.020499]
> [ 32.020499] the existing dependency chain (in reverse order) is:
> [ 32.021080]
> [ 32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
> [ 32.021522] __mutex_lock+0x8b/0x900
> [ 32.021843] hotplug_memory_register+0x26/0xa0
> [ 32.022231] __add_pages+0xe7/0x160
> [ 32.022545] add_pages+0xd/0x60
> [ 32.022835] add_memory_resource+0xc3/0x1d0
> [ 32.023207] __add_memory+0x57/0x80
> [ 32.023530] acpi_memory_device_add+0x13a/0x2d0
> [ 32.023928] acpi_bus_attach+0xf1/0x200
> [ 32.024272] acpi_bus_scan+0x3e/0x90
> [ 32.024597] acpi_device_hotplug+0x284/0x3e0
> [ 32.024972] acpi_hotplug_work_fn+0x15/0x20
> [ 32.025342] process_one_work+0x2a0/0x650
> [ 32.025755] worker_thread+0x34/0x3d0
> [ 32.026077] kthread+0x118/0x130
> [ 32.026442] ret_from_fork+0x3a/0x50
> [ 32.026766]
> [ 32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
> [ 32.027261] get_online_mems+0x39/0x80
> [ 32.027600] kmem_cache_create_usercopy+0x29/0x2c0
> [ 32.028019] kmem_cache_create+0xd/0x10
> [ 32.028367] ptlock_cache_init+0x1b/0x23
> [ 32.028724] start_kernel+0x1d2/0x4b8
> [ 32.029060] secondary_startup_64+0xa4/0xb0
> [ 32.029447]
> [ 32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
> [ 32.030007] cpus_read_lock+0x39/0x80
> [ 32.030360] __offline_pages+0x32/0x790
> [ 32.030709] memory_subsys_offline+0x3a/0x60
> [ 32.031089] device_offline+0x7e/0xb0
> [ 32.031425] acpi_bus_offline+0xd8/0x140
> [ 32.031821] acpi_device_hotplug+0x1b2/0x3e0
> [ 32.032202] acpi_hotplug_work_fn+0x15/0x20
> [ 32.032576] process_one_work+0x2a0/0x650
> [ 32.032942] worker_thread+0x34/0x3d0
> [ 32.033283] kthread+0x118/0x130
> [ 32.033588] ret_from_fork+0x3a/0x50
> [ 32.033919]
> [ 32.033919] -> #1 (&device->physical_node_lock){+.+.}:
> [ 32.034450] __mutex_lock+0x8b/0x900
> [ 32.034784] acpi_get_first_physical_node+0x16/0x60
> [ 32.035217] acpi_companion_match+0x3b/0x60
> [ 32.035594] acpi_device_uevent_modalias+0x9/0x20
> [ 32.036012] platform_uevent+0xd/0x40
> [ 32.036352] dev_uevent+0x85/0x1c0
> [ 32.036674] kobject_uevent_env+0x1e2/0x640
> [ 32.037044] kobject_synth_uevent+0x2b7/0x324
> [ 32.037428] uevent_store+0x17/0x30
> [ 32.037752] kernfs_fop_write+0xeb/0x1a0
> [ 32.038112] vfs_write+0xb2/0x1b0
> [ 32.038417] ksys_write+0x57/0xd0
> [ 32.038721] do_syscall_64+0x4b/0x1a0
> [ 32.039053] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 32.039491]
> [ 32.039491] -> #0 (kn->count#18){++++}:
> [ 32.039913] lock_acquire+0xaa/0x180
> [ 32.040242] __kernfs_remove+0x244/0x2d0
> [ 32.040593] kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.040991] device_del+0x14a/0x370
> [ 32.041309] device_unregister+0x9/0x20
> [ 32.041653] unregister_memory_section+0x69/0xa0
> [ 32.042059] __remove_pages+0x112/0x460
> [ 32.042402] arch_remove_memory+0x6f/0xa0
> [ 32.042758] __remove_memory+0xab/0x130
> [ 32.043103] acpi_memory_device_remove+0x67/0xe0
> [ 32.043537] acpi_bus_trim+0x50/0x90
> [ 32.043889] acpi_device_hotplug+0x2fa/0x3e0
> [ 32.044300] acpi_hotplug_work_fn+0x15/0x20
> [ 32.044686] process_one_work+0x2a0/0x650
> [ 32.045044] worker_thread+0x34/0x3d0
> [ 32.045381] kthread+0x118/0x130
> [ 32.045679] ret_from_fork+0x3a/0x50
> [ 32.046005]
> [ 32.046005] other info that might help us debug this:
> [ 32.046005]
> [ 32.046636] Chain exists of:
> [ 32.046636] kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
> [ 32.046636]
> [ 32.047514] Possible unsafe locking scenario:
> [ 32.047514]
> [ 32.047976] CPU0 CPU1
> [ 32.048337] ---- ----
> [ 32.048697] lock(mem_sysfs_mutex);
> [ 32.048983] lock(mem_hotplug_lock.rw_sem);
> [ 32.049519] lock(mem_sysfs_mutex);
> [ 32.050004] lock(kn->count#18);
> [ 32.050270]
> [ 32.050270] *** DEADLOCK ***
> [ 32.050270]
> [ 32.050736] 7 locks held by kworker/u16:4/380:
> [ 32.051087] #0: 00000000a22fe78e
> ((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
> [ 32.051830] #1: 00000000944f2dca
> ((work_completion)(&hpw->work)){+.+.}, at:
> process_one_work+0x21e/0x650
> [ 32.052577] #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
> acpi_device_hotplug+0x2e/0x3e0
> [ 32.053271] #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
> acpi_device_hotplug+0x3c/0x3e0
> [ 32.053916] #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
> at: __remove_memory+0x3b/0x130
> [ 32.054602] #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
> at: percpu_down_write+0x1d/0x110
> [ 32.055315] #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [ 32.056016]
> [ 32.056016] stack backtrace:
> [ 32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
> [ 32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.12.0-20181126_142135-anatol 04/01/2014
> [ 32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [ 32.058144] Call Trace:
> [ 32.058344] dump_stack+0x67/0x90
> [ 32.058604] print_circular_bug.cold.60+0x15c/0x195
> [ 32.058989] __lock_acquire+0x17de/0x1d30
> [ 32.059308] ? find_held_lock+0x2d/0x90
> [ 32.059611] ? __kernfs_remove+0x199/0x2d0
> [ 32.059937] lock_acquire+0xaa/0x180
> [ 32.060223] ? kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.060596] __kernfs_remove+0x244/0x2d0
> [ 32.060908] ? kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.061283] ? kernfs_name_hash+0xd/0x80
> [ 32.061596] ? kernfs_find_ns+0x68/0xf0
> [ 32.061907] kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.062266] device_del+0x14a/0x370
> [ 32.062548] ? unregister_mem_sect_under_nodes+0x4f/0xc0
> [ 32.062973] device_unregister+0x9/0x20
> [ 32.063285] unregister_memory_section+0x69/0xa0
> [ 32.063651] __remove_pages+0x112/0x460
> [ 32.063949] arch_remove_memory+0x6f/0xa0
> [ 32.064271] __remove_memory+0xab/0x130
> [ 32.064579] ? walk_memory_range+0xa1/0xe0
> [ 32.064907] acpi_memory_device_remove+0x67/0xe0
> [ 32.065274] acpi_bus_trim+0x50/0x90
> [ 32.065560] acpi_device_hotplug+0x2fa/0x3e0
> [ 32.065900] acpi_hotplug_work_fn+0x15/0x20
> [ 32.066249] process_one_work+0x2a0/0x650
> [ 32.066591] worker_thread+0x34/0x3d0
> [ 32.066925] ? process_one_work+0x650/0x650
> [ 32.067275] kthread+0x118/0x130
> [ 32.067542] ? kthread_create_on_node+0x60/0x60
> [ 32.067909] ret_from_fork+0x3a/0x50
>
>>
>> I wonder if that ordering prevents going too deep into the
>> device_unregister() call stack that you highlighted below.
>>
>>
>>>
>>> Here is the problem:
>>>
>>> When we offline pages we have the following call stack:
>>>
>>> # echo offline > /sys/devices/system/memory/memory8/state
>>> ksys_write
>>> vfs_write
>>> __vfs_write
>>> kernfs_fop_write
>>> kernfs_get_active
>>> lock_acquire kn->count#122 (lock for
>>> "memory8/state" kn)
>>> sysfs_kf_write
>>> dev_attr_store
>>> state_store
>>> device_offline
>>> memory_subsys_offline
>>> memory_block_action
>>> offline_pages
>>> __offline_pages
>>> percpu_down_write
>>> down_write
>>> lock_acquire mem_hotplug_lock.rw_sem
>>>
>>> When we unbind dax0.0 we have the following stack:
>>> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>>> drv_attr_store
>>> unbind_store
>>> device_driver_detach
>>> device_release_driver_internal
>>> dev_dax_kmem_remove
>>> remove_memory device_hotplug_lock
>>> try_remove_memory mem_hotplug_lock.rw_sem
>>> arch_remove_memory
>>> __remove_pages
>>> __remove_section
>>> unregister_memory_section
>>> remove_memory_section mem_sysfs_mutex
>>> unregister_memory
>>> device_unregister
>>> device_del
>>> device_remove_attrs
>>> sysfs_remove_groups
>>> sysfs_remove_group
>>> remove_files
>>> kernfs_remove_by_name
>>> kernfs_remove_by_name_ns
>>> __kernfs_remove kn->count#122
>>>
>>> So, lockdep found the ordering issue with the above two stacks:
>>>
>>> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
>>> 2. mem_hotplug_lock.rw_sem -> kn->count#122

I once documented locking behavior in

Documentation/core-api/memory-hotplug.rst

Both, device_online() and __remove_memory() always have to be called
holding the device_hotplug_lock(), to avoid such races.

# echo offline > /sys/devices/system/memory/memory8/state
and
# echo 0 > /sys/devices/system/memory/memory8/online

either end up in:

-> online_store()
-- lock_device_hotplug_sysfs();
-- device_offline(dev)

-> state_store()
-- lock_device_hotplug_sysfs();
-- device_online(&mem->dev);

So the device_hotplug_lock prohibits the race you describe.

BUT

There is a possible race between the device_hotplug_lock and the
kn->count#122. However, that race can never trigger, as userspace
properly backs off in case it cannot get grip of the device_hotplug_lock.

(lock_device_hotplug_sysfs does a mutex_trylock(&device_hotplug_lock))

--

Thanks,

David / dhildenb