v3 spurred a bunch of really good discussion. Thanks to everybody
that made comments and suggestions!
I would still love some Acks on this from the folks on cc, even if it
is on just the patch touching your area.
Note: these are based on commit d2f33c19644 in:
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git libnvdimm-pending
Changes since v3:
* Move HMM-related resource warning instead of removing it
* Use __request_resource() directly instead of devm.
* Create a separate DAX_PMEM Kconfig option, complete with help text
* Update patch descriptions and cover letter to give a better
overview of use-cases and hardware where this might be useful.
Changes since v2:
* Updates to dev_dax_kmem_probe() in patch 5:
* Reject probes for devices with bad NUMA nodes. Keeps slow
memory from being added to node 0.
* Use raw request_mem_region()
* Add comments about permanent reservation
* use dev_*() instead of printk's
* Add references to nvdimm documentation in descriptions
* Remove unneeded GPL export
* Add Kconfig prompt and help text
Changes since v1:
* Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
* Use binding/unbinding from "dax bus" code
* Move over to a "dax bus" driver from being an nvdimm driver
--
Persistent memory is cool. But, currently, you have to rewrite
your applications to use it. Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory? Well... have I got the patch
series for you!
== Background / Use Cases ==
Persistent Memory (aka Non-Volatile DIMMs / NVDIMMS) themselves
are described in detail in Documentation/nvdimm/nvdimm.txt.
However, this documentation focuses on actually using them as
storage. This set is focused on using NVDIMMs as DRAM replacement.
This is intended for Intel-style NVDIMMs (aka. Intel Optane DC
persistent memory) NVDIMMs. These DIMMs are physically persistent,
more akin to flash than traditional RAM. They are also expected to
be more cost-effective than using RAM, which is why folks want this
set in the first place.
This set is not intended for RAM-based NVDIMMs. Those are not
cost-effective vs. plain RAM, and this using them here would simply
be a waste.
But, why would you bother with this approach? Intel itself [1]
has announced a hardware feature that does something very similar:
"Memory Mode" which turns DRAM into a cache in front of persistent
memory, which is then as a whole used as normal "RAM"?
Here are a few reasons:
1. The capacity of memory mode is the size of your persistent
memory that you dedicate. DRAM capacity is "lost" because it
is used for cache. With this, you get PMEM+DRAM capacity for
memory.
2. DRAM acts as a cache with memory mode, and caches can lead to
unpredictable latencies. Since memory mode is all-or-nothing
(either all your DRAM is used as cache or none is), your entire
memory space is exposed to these unpredictable latencies. This
solution lets you guarantee DRAM latencies if you need them.
3. The new "tier" of memory is exposed to software. That means
that you can build tiered applications or infrastructure. A
cloud provider could sell cheaper VMs that use more PMEM and
more expensive ones that use DRAM. That's impossible with
memory mode.
Don't take this as criticism of memory mode. Memory mode is
awesome, and doesn't strictly require *any* software changes (we
have software changes proposed for optimizing it though). It has
tons of other advantages over *this* approach. Basically, we
believe that the approach in these patches is complementary to
memory mode and that both can live side-by-side in harmony.
== Patch Set Overview ==
This series adds a new "driver" to which pmem devices can be
attached. Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory. On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory so
this newly-added memory can be selected by its unique NUMA
node.
== Testing Overview ==
Here's how I set up a system to test this thing:
1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory. Reserving a spot a 2GB
physical seems to work: memmap=512M!0x0000000080000000
This will end up looking like a pmem device at boot.
3. When booted, convert fsdax device to "device dax":
ndctl create-namespace -fe namespace0.0 -m dax
4. See patch 4 for instructions on binding the kmem driver
to a device.
5. Now, online the new memory sections. Perhaps:
grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
echo $f: `cat $f`
echo online_movable > $f
grep ^MemTotal /proc/meminfo
done
1. https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/#gs.RKG7BeIu
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Yaowei Bai <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Jerome Glisse <[email protected]>
From: Dave Hansen <[email protected]>
HMM consumes physical address space for its own use, even
though nothing is mapped or accessible there. It uses a
special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
to uniquely identify these areas.
When HMM consumes address space, it makes a best guess about
what to consume. However, it is possible that a future memory
or device hotplug can collide with the reserved area. In the
case of these conflicts, there is an error message in
register_memory_resource().
Later patches in this series move register_memory_resource()
from using request_resource_conflict() to __request_region().
Unfortunately, __request_region() does not return the conflict
like the previous function did, which makes it impossible to
check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
resource.
Instead of warning in register_memory_resource(), move the
check into the core resource code itself (__request_region())
where the conflicting resource _is_ available. This has the
added bonus of producing a warning in case of HMM conflicts
with devices *or* RAM address space, as opposed to the RAM-
only warnings that were there previously.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Jerome Glisse <[email protected]>
---
b/kernel/resource.c | 10 ++++++++++
b/mm/memory_hotplug.c | 5 -----
2 files changed, 10 insertions(+), 5 deletions(-)
diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
--- a/kernel/resource.c~move-request_region-check 2019-01-24 15:13:14.453199539 -0800
+++ b/kernel/resource.c 2019-01-24 15:13:14.458199539 -0800
@@ -1123,6 +1123,16 @@ struct resource * __request_region(struc
conflict = __request_resource(parent, res);
if (!conflict)
break;
+ /*
+ * mm/hmm.c reserves physical addresses which then
+ * become unavailable to other users. Conflicts are
+ * not expected. Be verbose if one is encountered.
+ */
+ if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
+ pr_debug("Resource conflict with unaddressable "
+ "device memory at %#010llx !\n",
+ (unsigned long long)start);
+ }
if (conflict != parent) {
if (!(conflict->flags & IORESOURCE_BUSY)) {
parent = conflict;
diff -puN mm/memory_hotplug.c~move-request_region-check mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~move-request_region-check 2019-01-24 15:13:14.455199539 -0800
+++ b/mm/memory_hotplug.c 2019-01-24 15:13:14.459199539 -0800
@@ -109,11 +109,6 @@ static struct resource *register_memory_
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
conflict = request_resource_conflict(&iomem_resource, res);
if (conflict) {
- if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
- pr_debug("Device unaddressable memory block "
- "memory hotplug at %#010llx !\n",
- (unsigned long long)start);
- }
pr_debug("System RAM resource %pR cannot be added\n", res);
kfree(res);
return ERR_PTR(-EEXIST);
_
From: Dave Hansen <[email protected]>
This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement. Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.
Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).
However, this limits persistent memory use to applications which
*have* been modified. To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.
To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:
echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
and then bind it to this new driver:
echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.
This rebinding procedure is currently a one-way trip. Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.
The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver. There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly. Two, the kmem driver destroys data on the
device. Think of if you had good data on a pmem device. It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver. kmem would take over the device and
write volatile data all over your good data.
This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware. On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node. That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.
Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.
There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches. But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().
Signed-off-by: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Yaowei Bai <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Jerome Glisse <[email protected]>
---
b/drivers/base/memory.c | 1
b/drivers/dax/Kconfig | 16 +++++++
b/drivers/dax/Makefile | 1
b/drivers/dax/kmem.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 126 insertions(+)
diff -puN drivers/base/memory.c~dax-kmem-try-4 drivers/base/memory.c
--- a/drivers/base/memory.c~dax-kmem-try-4 2019-01-24 15:13:15.987199535 -0800
+++ b/drivers/base/memory.c 2019-01-24 15:13:15.994199535 -0800
@@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_b
{
return MIN_MEMORY_BLOCK_SIZE;
}
+EXPORT_SYMBOL_GPL(memory_block_size_bytes);
static unsigned long get_memory_block_size(void)
{
diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-24 15:13:15.988199535 -0800
+++ b/drivers/dax/Kconfig 2019-01-24 15:13:15.994199535 -0800
@@ -32,6 +32,22 @@ config DEV_DAX_PMEM
Say M if unsure
+config DEV_DAX_KMEM
+ tristate "KMEM DAX: volatile-use of persistent memory"
+ default DEV_DAX
+ depends on DEV_DAX
+ depends on MEMORY_HOTPLUG # for add_memory() and friends
+ help
+ Support access to persistent memory as if it were RAM. This
+ allows easier use of persistent memory by unmodified
+ applications.
+
+ To use this feature, a DAX device must be unbound from the
+ device_dax driver (PMEM DAX) and bound to this kmem driver
+ on each boot.
+
+ Say N if unsure.
+
config DEV_DAX_PMEM_COMPAT
tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
depends on DEV_DAX_PMEM
diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null 2018-12-03 08:41:47.355756491 -0800
+++ b/drivers/dax/kmem.c 2019-01-24 15:13:15.994199535 -0800
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pfn_t.h>
+#include <linux/slab.h>
+#include <linux/dax.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include "dax-private.h"
+#include "bus.h"
+
+int dev_dax_kmem_probe(struct device *dev)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct resource *res = &dev_dax->region->res;
+ resource_size_t kmem_start;
+ resource_size_t kmem_size;
+ resource_size_t kmem_end;
+ struct resource *new_res;
+ int numa_node;
+ int rc;
+
+ /*
+ * Ensure good NUMA information for the persistent memory.
+ * Without this check, there is a risk that slow memory
+ * could be mixed in a node with faster memory, causing
+ * unavoidable performance issues.
+ */
+ numa_node = dev_dax->target_node;
+ if (numa_node < 0) {
+ dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
+ res, numa_node);
+ return -EINVAL;
+ }
+
+ /* Hotplug starting at the beginning of the next block: */
+ kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
+ kmem_size = resource_size(res);
+ /* Adjust the size down to compensate for moving up kmem_start: */
+ kmem_size -= kmem_start - res->start;
+ /* Align the size down to cover only complete blocks: */
+ kmem_size &= ~(memory_block_size_bytes() - 1);
+ kmem_end = kmem_start+kmem_size;
+
+ /* Region is permanently reserved. Hot-remove not yet implemented. */
+ new_res = request_mem_region(kmem_start, kmem_size, dev_name(dev));
+ if (!new_res) {
+ dev_warn(dev, "could not reserve region [%pa-%pa]\n",
+ &kmem_start, &kmem_end);
+ return -EBUSY;
+ }
+
+ /*
+ * Set flags appropriate for System RAM. Leave ..._BUSY clear
+ * so that add_memory() can add a child resource. Do not
+ * inherit flags from the parent since it may set new flags
+ * unknown to us that will break add_memory() below.
+ */
+ new_res->flags = IORESOURCE_SYSTEM_RAM;
+ new_res->name = dev_name(dev);
+
+ rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+ if (rc)
+ return rc;
+
+ return 0;
+}
+
+static int dev_dax_kmem_remove(struct device *dev)
+{
+ /*
+ * Purposely leak the request_mem_region() for the device-dax
+ * range and return '0' to ->remove() attempts. The removal of
+ * the device from the driver always succeeds, but the region
+ * is permanently pinned as reserved by the unreleased
+ * request_mem_region().
+ */
+ return -EBUSY;
+}
+
+static struct dax_device_driver device_dax_kmem_driver = {
+ .drv = {
+ .probe = dev_dax_kmem_probe,
+ .remove = dev_dax_kmem_remove,
+ },
+};
+
+static int __init dax_kmem_init(void)
+{
+ return dax_driver_register(&device_dax_kmem_driver);
+}
+
+static void __exit dax_kmem_exit(void)
+{
+ dax_driver_unregister(&device_dax_kmem_driver);
+}
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
+module_init(dax_kmem_init);
+module_exit(dax_kmem_exit);
+MODULE_ALIAS_DAX_DEVICE(0);
diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
--- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-24 15:13:15.990199535 -0800
+++ b/drivers/dax/Makefile 2019-01-24 15:13:15.994199535 -0800
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_DAX) += dax.o
obj-$(CONFIG_DEV_DAX) += device_dax.o
+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
dax-y := super.o
dax-y += bus.o
_
From: Dave Hansen <[email protected]>
In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.
However, it currently only finds memory resources which are
"top-level" iomem_resources. Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):
a0000000-bfffffff : Persistent Memory (legacy)
a0000000-afffffff : System RAM
Changing the true->false here allows children to be searched
as well. We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Yaowei Bai <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Jerome Glisse <[email protected]>
---
b/kernel/resource.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff -puN kernel/resource.c~mm-walk_system_ram_range-search-child-resources kernel/resource.c
--- a/kernel/resource.c~mm-walk_system_ram_range-search-child-resources 2019-01-24 15:13:15.482199536 -0800
+++ b/kernel/resource.c 2019-01-24 15:13:15.486199536 -0800
@@ -445,6 +445,9 @@ int walk_mem_res(u64 start, u64 end, voi
* This function calls the @func callback against all memory ranges of type
* System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
* It is to be used only for System RAM.
+ *
+ * This will find System RAM ranges that are children of top-level resources
+ * in addition to top-level System RAM resources.
*/
int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
void *arg, int (*func)(unsigned long, unsigned long, void *))
@@ -460,7 +463,7 @@ int walk_system_ram_range(unsigned long
flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
while (start < end &&
!find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
- true, &res)) {
+ false, &res)) {
pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
end_pfn = (res.end + 1) >> PAGE_SHIFT;
if (end_pfn > pfn)
_
From: Dave Hansen <[email protected]>
The mm/resource.c code is used to manage the physical address
space. The current resource configuration can be viewed in
/proc/iomem. An example of this is at the bottom of this
description.
The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory". The best way to repurpose this for volatile
use is to leave the existing resource in place, but add a "System
RAM" resource underneath it. This clearly communicates the
ownership relationship of this memory.
The request_resource_conflict() API only deals with the
top-level resources. Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.
We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space. But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.
This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.
00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f0000-000fffff : Reserved
000f0000-000fffff : System ROM
00100000-9fffffff : System RAM
01000000-01e071d0 : Kernel code
01e071d1-027dfdff : Kernel data
02dc6000-0305dfff : Kernel bss
a0000000-afffffff : Persistent Memory (legacy)
a0000000-a7ffffff : System RAM
b0000000-bffdffff : System RAM
bffe0000-bfffffff : Reserved
c0000000-febfffff : PCI Bus 0000:00
Signed-off-by: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Yaowei Bai <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Jerome Glisse <[email protected]>
---
b/mm/memory_hotplug.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 2019-01-24 15:13:14.979199537 -0800
+++ b/mm/memory_hotplug.c 2019-01-24 15:13:14.983199537 -0800
@@ -98,19 +98,21 @@ void mem_hotplug_done(void)
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
- struct resource *res, *conflict;
- res = kzalloc(sizeof(struct resource), GFP_KERNEL);
- if (!res)
- return ERR_PTR(-ENOMEM);
+ struct resource *res;
+ unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+ char *resource_name = "System RAM";
- res->name = "System RAM";
- res->start = start;
- res->end = start + size - 1;
- res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
- conflict = request_resource_conflict(&iomem_resource, res);
- if (conflict) {
- pr_debug("System RAM resource %pR cannot be added\n", res);
- kfree(res);
+ /*
+ * Request ownership of the new memory range. This might be
+ * a child of an existing resource that was present but
+ * not marked as busy.
+ */
+ res = __request_region(&iomem_resource, start, size,
+ resource_name, flags);
+
+ if (!res) {
+ pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+ start, start + size);
return ERR_PTR(-EEXIST);
}
return res;
_
From: Dave Hansen <[email protected]>
walk_system_ram_range() can return an error code either becuase *it*
failed, or because the 'func' that it calls returned an error. The
memory hotplug does the following:
ret = walk_system_ram_range(..., func);
if (ret)
return ret;
and 'ret' makes it out to userspace, eventually. The problem is,
walk_system_ram_range() failues that result from *it* failing (as
opposed to 'func') return -1. That leads to a very odd -EPERM (-1)
return code out to userspace.
Make walk_system_ram_range() return -EINVAL for internal failures to
keep userspace less confused.
This return code is compatible with all the callers that I audited.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Yaowei Bai <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Jerome Glisse <[email protected]>
---
b/kernel/resource.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 2019-01-24 15:13:13.950199540 -0800
+++ b/kernel/resource.c 2019-01-24 15:13:13.954199540 -0800
@@ -375,7 +375,7 @@ static int __walk_iomem_res_desc(resourc
int (*func)(struct resource *, void *))
{
struct resource res;
- int ret = -1;
+ int ret = -EINVAL;
while (start < end &&
!find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
@@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
unsigned long flags;
struct resource res;
unsigned long pfn, end_pfn;
- int ret = -1;
+ int ret = -EINVAL;
start = (u64) start_pfn << PAGE_SHIFT;
end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
_
Hi, Dave,
While chatting with my colleague Erwin about the patchset, it occurred
that we're not clear about the error handling part. Specifically,
1. If an uncorrectable error is detected during a 'load' in the hot
plugged pmem region, how will the error be handled? will it be
handled like PMEM or DRAM?
2. If a poison is set, and is persistent, which entity should clear
the poison, and badblock(if applicable)? If it's user's responsibility,
does ndctl support the clearing in this mode?
thanks!
-jane
On 1/24/2019 3:14 PM, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> This is intended for use with NVDIMMs that are physically persistent
> (physically like flash) so that they can be used as a cost-effective
> RAM replacement. Intel Optane DC persistent memory is one
> implementation of this kind of NVDIMM.
>
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries. (DIMM-based
> persistent memory hardware/software is described in great detail
> here: Documentation/nvdimm/nvdimm.txt).
>
> However, this limits persistent memory use to applications which
> *have* been modified. To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed and used just like
> normal RAM would be.
>
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
> and then bind it to this new driver:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
>
> This rebinding procedure is currently a one-way trip. Once memory
> is bound to "kmem", it's there permanently and can not be
> unbound and assigned back to device_dax.
>
> The kmem driver will never bind to a dax device unless the device
> is *explicitly* bound to the driver. There are two reasons for
> this: One, since it is a one-way trip, it can not be undone if
> bound incorrectly. Two, the kmem driver destroys data on the
> device. Think of if you had good data on a pmem device. It
> would be catastrophic if you compile-in "kmem", but leave out
> the "device_dax" driver. kmem would take over the device and
> write volatile data all over your good data.
>
> This inherits any existing NUMA information for the newly-added
> memory from the persistent memory device that came from the
> firmware. On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node. That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
>
> Because NUMA nodes are created, the existing NUMA APIs and tools
> are sufficient to create policies for applications or memory areas
> to have affinity for or an aversion to using this memory.
>
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches. But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
> This calculation is also the reason we export
> memory_block_size_bytes().
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: Yaowei Bai <[email protected]>
> Cc: Takashi Iwai <[email protected]>
> Cc: Jerome Glisse <[email protected]>
> ---
>
> b/drivers/base/memory.c | 1
> b/drivers/dax/Kconfig | 16 +++++++
> b/drivers/dax/Makefile | 1
> b/drivers/dax/kmem.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 126 insertions(+)
>
> diff -puN drivers/base/memory.c~dax-kmem-try-4 drivers/base/memory.c
> --- a/drivers/base/memory.c~dax-kmem-try-4 2019-01-24 15:13:15.987199535 -0800
> +++ b/drivers/base/memory.c 2019-01-24 15:13:15.994199535 -0800
> @@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_b
> {
> return MIN_MEMORY_BLOCK_SIZE;
> }
> +EXPORT_SYMBOL_GPL(memory_block_size_bytes);
>
> static unsigned long get_memory_block_size(void)
> {
> diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
> --- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-24 15:13:15.988199535 -0800
> +++ b/drivers/dax/Kconfig 2019-01-24 15:13:15.994199535 -0800
> @@ -32,6 +32,22 @@ config DEV_DAX_PMEM
>
> Say M if unsure
>
> +config DEV_DAX_KMEM
> + tristate "KMEM DAX: volatile-use of persistent memory"
> + default DEV_DAX
> + depends on DEV_DAX
> + depends on MEMORY_HOTPLUG # for add_memory() and friends
> + help
> + Support access to persistent memory as if it were RAM. This
> + allows easier use of persistent memory by unmodified
> + applications.
> +
> + To use this feature, a DAX device must be unbound from the
> + device_dax driver (PMEM DAX) and bound to this kmem driver
> + on each boot.
> +
> + Say N if unsure.
> +
> config DEV_DAX_PMEM_COMPAT
> tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
> depends on DEV_DAX_PMEM
> diff -puN /dev/null drivers/dax/kmem.c
> --- /dev/null 2018-12-03 08:41:47.355756491 -0800
> +++ b/drivers/dax/kmem.c 2019-01-24 15:13:15.994199535 -0800
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/memory.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pfn_t.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include "dax-private.h"
> +#include "bus.h"
> +
> +int dev_dax_kmem_probe(struct device *dev)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct resource *res = &dev_dax->region->res;
> + resource_size_t kmem_start;
> + resource_size_t kmem_size;
> + resource_size_t kmem_end;
> + struct resource *new_res;
> + int numa_node;
> + int rc;
> +
> + /*
> + * Ensure good NUMA information for the persistent memory.
> + * Without this check, there is a risk that slow memory
> + * could be mixed in a node with faster memory, causing
> + * unavoidable performance issues.
> + */
> + numa_node = dev_dax->target_node;
> + if (numa_node < 0) {
> + dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
> + res, numa_node);
> + return -EINVAL;
> + }
> +
> + /* Hotplug starting at the beginning of the next block: */
> + kmem_start = ALIGN(res->start, memory_block_size_bytes());
> +
> + kmem_size = resource_size(res);
> + /* Adjust the size down to compensate for moving up kmem_start: */
> + kmem_size -= kmem_start - res->start;
> + /* Align the size down to cover only complete blocks: */
> + kmem_size &= ~(memory_block_size_bytes() - 1);
> + kmem_end = kmem_start+kmem_size;
> +
> + /* Region is permanently reserved. Hot-remove not yet implemented. */
> + new_res = request_mem_region(kmem_start, kmem_size, dev_name(dev));
> + if (!new_res) {
> + dev_warn(dev, "could not reserve region [%pa-%pa]\n",
> + &kmem_start, &kmem_end);
> + return -EBUSY;
> + }
> +
> + /*
> + * Set flags appropriate for System RAM. Leave ..._BUSY clear
> + * so that add_memory() can add a child resource. Do not
> + * inherit flags from the parent since it may set new flags
> + * unknown to us that will break add_memory() below.
> + */
> + new_res->flags = IORESOURCE_SYSTEM_RAM;
> + new_res->name = dev_name(dev);
> +
> + rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> + if (rc)
> + return rc;
> +
> + return 0;
> +}
> +
> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> + /*
> + * Purposely leak the request_mem_region() for the device-dax
> + * range and return '0' to ->remove() attempts. The removal of
> + * the device from the driver always succeeds, but the region
> + * is permanently pinned as reserved by the unreleased
> + * request_mem_region().
> + */
> + return -EBUSY;
> +}
> +
> +static struct dax_device_driver device_dax_kmem_driver = {
> + .drv = {
> + .probe = dev_dax_kmem_probe,
> + .remove = dev_dax_kmem_remove,
> + },
> +};
> +
> +static int __init dax_kmem_init(void)
> +{
> + return dax_driver_register(&device_dax_kmem_driver);
> +}
> +
> +static void __exit dax_kmem_exit(void)
> +{
> + dax_driver_unregister(&device_dax_kmem_driver);
> +}
> +
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_LICENSE("GPL v2");
> +module_init(dax_kmem_init);
> +module_exit(dax_kmem_exit);
> +MODULE_ALIAS_DAX_DEVICE(0);
> diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
> --- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-24 15:13:15.990199535 -0800
> +++ b/drivers/dax/Makefile 2019-01-24 15:13:15.994199535 -0800
> @@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_DAX) += dax.o
> obj-$(CONFIG_DEV_DAX) += device_dax.o
> +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>
> dax-y := super.o
> dax-y += bus.o
> _
> _______________________________________________
> Linux-nvdimm mailing list
> [email protected]
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
On Thu, Jan 24, 2019 at 10:13 PM Jane Chu <[email protected]> wrote:
>
> Hi, Dave,
>
> While chatting with my colleague Erwin about the patchset, it occurred
> that we're not clear about the error handling part. Specifically,
>
> 1. If an uncorrectable error is detected during a 'load' in the hot
> plugged pmem region, how will the error be handled? will it be
> handled like PMEM or DRAM?
DRAM.
> 2. If a poison is set, and is persistent, which entity should clear
> the poison, and badblock(if applicable)? If it's user's responsibility,
> does ndctl support the clearing in this mode?
With persistent memory advertised via a static logical-to-physical
storage/dax device mapping, once an error develops it destroys a
physical *and* logical part of a device address space. That loss of
logical address space makes error clearing a necessity. However, with
the DRAM / "System RAM" error handling model, the OS can just offline
the page and map a different one to repair the logical address space.
So, no, ndctl will not have explicit enabling to clear volatile
errors, the OS will just dynamically offline problematic pages.
Dan
Thanks for the insights!
Can I say, the UCE is delivered from h/w to OS in a single way in case of machine
check, only PMEM/DAX stuff filter out UC address and managed in its own way by
badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will kick in,
right?
And how about when ARS is involved but no machine check fired for the function
of this patchset?
>-----Original Message-----
>From: Linux-nvdimm [mailto:[email protected]] On Behalf
>Of Dan Williams
>Sent: Friday, January 25, 2019 2:28 PM
>To: Jane Chu <[email protected]>
>Cc: Tom Lendacky <[email protected]>; Michal Hocko
><[email protected]>; linux-nvdimm <[email protected]>; Takashi
>Iwai <[email protected]>; Dave Hansen <[email protected]>; Huang,
>Ying <[email protected]>; Linux Kernel Mailing List
><[email protected]>; Linux MM <[email protected]>; J?r?me
>Glisse <[email protected]>; Borislav Petkov <[email protected]>; Yaowei Bai
><[email protected]>; Ross Zwisler <[email protected]>;
>Bjorn Helgaas <[email protected]>; Andrew Morton
><[email protected]>; Wu, Fengguang <[email protected]>
>Subject: Re: [PATCH 5/5] dax: "Hotplug" persistent memory for use like
>normal RAM
>
>On Thu, Jan 24, 2019 at 10:13 PM Jane Chu <[email protected]> wrote:
>>
>> Hi, Dave,
>>
>> While chatting with my colleague Erwin about the patchset, it occurred
>> that we're not clear about the error handling part. Specifically,
>>
>> 1. If an uncorrectable error is detected during a 'load' in the hot
>> plugged pmem region, how will the error be handled? will it be
>> handled like PMEM or DRAM?
>
>DRAM.
>
>> 2. If a poison is set, and is persistent, which entity should clear
>> the poison, and badblock(if applicable)? If it's user's responsibility,
>> does ndctl support the clearing in this mode?
>
>With persistent memory advertised via a static logical-to-physical
>storage/dax device mapping, once an error develops it destroys a
>physical *and* logical part of a device address space. That loss of
>logical address space makes error clearing a necessity. However, with
>the DRAM / "System RAM" error handling model, the OS can just offline
>the page and map a different one to repair the logical address space.
>So, no, ndctl will not have explicit enabling to clear volatile
>errors, the OS will just dynamically offline problematic pages.
>_______________________________________________
>Linux-nvdimm mailing list
>[email protected]
>https://lists.01.org/mailman/listinfo/linux-nvdimm
On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <[email protected]> wrote:
>
> Dan
>
> Thanks for the insights!
>
> Can I say, the UCE is delivered from h/w to OS in a single way in case of machine
> check, only PMEM/DAX stuff filter out UC address and managed in its own way by
> badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will kick in,
> right?
The common RAS workflow always kicks in, it's just the page state
presented by a DAX mapping needs distinct handling. Once it is
hot-plugged it no longer needs to be treated differently than "System
RAM".
> And how about when ARS is involved but no machine check fired for the function
> of this patchset?
The hotplug effectively disconnects this address range from the ARS
results. They will still be reported in the libnvdimm "region" level
badblocks instance, but there's no safe / coordinated way to go clear
those errors without additional kernel enabling. There is no "clear
error" semantic for "System RAM".
On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <[email protected]> wrote:
> > Dan
> >
> > Thanks for the insights!
> >
> > Can I say, the UCE is delivered from h/w to OS in a single way in
> > case of machine
> > check, only PMEM/DAX stuff filter out UC address and managed in its
> > own way by
> > badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
> > kick in,
> > right?
>
> The common RAS workflow always kicks in, it's just the page state
> presented by a DAX mapping needs distinct handling. Once it is
> hot-plugged it no longer needs to be treated differently than "System
> RAM".
>
> > And how about when ARS is involved but no machine check fired for
> > the function
> > of this patchset?
>
> The hotplug effectively disconnects this address range from the ARS
> results. They will still be reported in the libnvdimm "region" level
> badblocks instance, but there's no safe / coordinated way to go clear
> those errors without additional kernel enabling. There is no "clear
> error" semantic for "System RAM".
>
Perhaps as future enabling, the kernel can go perform "clear error" for
offlined pages, and make them usable again. But I'm not sure how
prepared mm is to re-accept pages previously offlined.
On Thu, Jan 24, 2019 at 03:14:41PM -0800, Dave Hansen wrote:
> v3 spurred a bunch of really good discussion. Thanks to everybody
> that made comments and suggestions!
>
> I would still love some Acks on this from the folks on cc, even if it
> is on just the patch touching your area.
>
> Note: these are based on commit d2f33c19644 in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git libnvdimm-pending
>
> Changes since v3:
> * Move HMM-related resource warning instead of removing it
> * Use __request_resource() directly instead of devm.
> * Create a separate DAX_PMEM Kconfig option, complete with help text
> * Update patch descriptions and cover letter to give a better
> overview of use-cases and hardware where this might be useful.
This one looks good to me, i will give it a go on monday to
test against nouveau and HMM.
Cheers,
J?r?me
On Thu, Jan 24, 2019 at 03:14:44PM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> HMM consumes physical address space for its own use, even
> though nothing is mapped or accessible there. It uses a
> special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
> to uniquely identify these areas.
>
> When HMM consumes address space, it makes a best guess about
> what to consume. However, it is possible that a future memory
> or device hotplug can collide with the reserved area. In the
> case of these conflicts, there is an error message in
> register_memory_resource().
>
> Later patches in this series move register_memory_resource()
> from using request_resource_conflict() to __request_region().
> Unfortunately, __request_region() does not return the conflict
> like the previous function did, which makes it impossible to
> check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
> resource.
>
> Instead of warning in register_memory_resource(), move the
> check into the core resource code itself (__request_region())
> where the conflicting resource _is_ available. This has the
> added bonus of producing a warning in case of HMM conflicts
> with devices *or* RAM address space, as opposed to the RAM-
> only warnings that were there previously.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
> ---
>
> b/kernel/resource.c | 10 ++++++++++
> b/mm/memory_hotplug.c | 5 -----
> 2 files changed, 10 insertions(+), 5 deletions(-)
>
> diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
> --- a/kernel/resource.c~move-request_region-check 2019-01-24 15:13:14.453199539 -0800
> +++ b/kernel/resource.c 2019-01-24 15:13:14.458199539 -0800
> @@ -1123,6 +1123,16 @@ struct resource * __request_region(struc
> conflict = __request_resource(parent, res);
> if (!conflict)
> break;
> + /*
> + * mm/hmm.c reserves physical addresses which then
> + * become unavailable to other users. Conflicts are
> + * not expected. Be verbose if one is encountered.
> + */
> + if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> + pr_debug("Resource conflict with unaddressable "
> + "device memory at %#010llx !\n",
> + (unsigned long long)start);
> + }
> if (conflict != parent) {
> if (!(conflict->flags & IORESOURCE_BUSY)) {
> parent = conflict;
> diff -puN mm/memory_hotplug.c~move-request_region-check mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~move-request_region-check 2019-01-24 15:13:14.455199539 -0800
> +++ b/mm/memory_hotplug.c 2019-01-24 15:13:14.459199539 -0800
> @@ -109,11 +109,6 @@ static struct resource *register_memory_
> res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> conflict = request_resource_conflict(&iomem_resource, res);
> if (conflict) {
> - if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> - pr_debug("Device unaddressable memory block "
> - "memory hotplug at %#010llx !\n",
> - (unsigned long long)start);
> - }
> pr_debug("System RAM resource %pR cannot be added\n", res);
> kfree(res);
> return ERR_PTR(-EEXIST);
> _
On 1/25/2019 10:20 AM, Verma, Vishal L wrote:
>
> On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
>> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <[email protected]> wrote:
>>> Dan
>>>
>>> Thanks for the insights!
>>>
>>> Can I say, the UCE is delivered from h/w to OS in a single way in
>>> case of machine
>>> check, only PMEM/DAX stuff filter out UC address and managed in its
>>> own way by
>>> badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
>>> kick in,
>>> right?
>>
>> The common RAS workflow always kicks in, it's just the page state
>> presented by a DAX mapping needs distinct handling. Once it is
>> hot-plugged it no longer needs to be treated differently than "System
>> RAM".
>>
>>> And how about when ARS is involved but no machine check fired for
>>> the function
>>> of this patchset?
>>
>> The hotplug effectively disconnects this address range from the ARS
>> results. They will still be reported in the libnvdimm "region" level
>> badblocks instance, but there's no safe / coordinated way to go clear
>> those errors without additional kernel enabling. There is no "clear
>> error" semantic for "System RAM".
>>
> Perhaps as future enabling, the kernel can go perform "clear error" for
> offlined pages, and make them usable again. But I'm not sure how
> prepared mm is to re-accept pages previously offlined.
>
Offlining a DRAM backed page due to an UC makes sense because
a. the physical DRAM cell might still have an error
b. power cycle, scrubing could potentially 'repair' the DRAM cell,
making the page usable again.
But for a PMEM backed page, neither is true. If a poison bit is set in
a page, that indicates the underlying hardware has completed the repair
work, all that's left is for software to recover. Secondly, because
poison is persistent, unless software explicitly clear the bit,
the page is permanently unusable.
thanks,
-jane
On Fri, Jan 25, 2019 at 11:10 AM Jane Chu <[email protected]> wrote:
>
>
> On 1/25/2019 10:20 AM, Verma, Vishal L wrote:
> >
> > On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
> >> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <[email protected]> wrote:
> >>> Dan
> >>>
> >>> Thanks for the insights!
> >>>
> >>> Can I say, the UCE is delivered from h/w to OS in a single way in
> >>> case of machine
> >>> check, only PMEM/DAX stuff filter out UC address and managed in its
> >>> own way by
> >>> badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
> >>> kick in,
> >>> right?
> >>
> >> The common RAS workflow always kicks in, it's just the page state
> >> presented by a DAX mapping needs distinct handling. Once it is
> >> hot-plugged it no longer needs to be treated differently than "System
> >> RAM".
> >>
> >>> And how about when ARS is involved but no machine check fired for
> >>> the function
> >>> of this patchset?
> >>
> >> The hotplug effectively disconnects this address range from the ARS
> >> results. They will still be reported in the libnvdimm "region" level
> >> badblocks instance, but there's no safe / coordinated way to go clear
> >> those errors without additional kernel enabling. There is no "clear
> >> error" semantic for "System RAM".
> >>
> > Perhaps as future enabling, the kernel can go perform "clear error" for
> > offlined pages, and make them usable again. But I'm not sure how
> > prepared mm is to re-accept pages previously offlined.
> >
>
> Offlining a DRAM backed page due to an UC makes sense because
> a. the physical DRAM cell might still have an error
> b. power cycle, scrubing could potentially 'repair' the DRAM cell,
> making the page usable again.
>
> But for a PMEM backed page, neither is true. If a poison bit is set in
> a page, that indicates the underlying hardware has completed the repair
> work, all that's left is for software to recover. Secondly, because
> poison is persistent, unless software explicitly clear the bit,
> the page is permanently unusable.
Not permanently... system-owner always has the option to use the
device-DAX and ARS mechanisms to clear errors at the next boot.
There's just no kernel enabling to do that automatically as a part of
this patch set.
However, we should consider this along with the userspace enabling to
control which device-dax instances are set aside for hotplug. It would
make sense to have a "clear errors before hotplug" configuration
option.
On Thu, Jan 24, 2019 at 5:21 PM Dave Hansen <[email protected]> wrote:
>
>
> From: Dave Hansen <[email protected]>
>
> walk_system_ram_range() can return an error code either becuase *it*
> failed, or because the 'func' that it calls returned an error. The
> memory hotplug does the following:
>
> ret = walk_system_ram_range(..., func);
> if (ret)
> return ret;
>
> and 'ret' makes it out to userspace, eventually. The problem is,
> walk_system_ram_range() failues that result from *it* failing (as
> opposed to 'func') return -1. That leads to a very odd -EPERM (-1)
> return code out to userspace.
>
> Make walk_system_ram_range() return -EINVAL for internal failures to
> keep userspace less confused.
>
> This return code is compatible with all the callers that I audited.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: Yaowei Bai <[email protected]>
> Cc: Takashi Iwai <[email protected]>
> Cc: Jerome Glisse <[email protected]>
> ---
>
> b/kernel/resource.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 kernel/resource.c
> --- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 2019-01-24 15:13:13.950199540 -0800
> +++ b/kernel/resource.c 2019-01-24 15:13:13.954199540 -0800
> @@ -375,7 +375,7 @@ static int __walk_iomem_res_desc(resourc
> int (*func)(struct resource *, void *))
> {
> struct resource res;
> - int ret = -1;
> + int ret = -EINVAL;
>
> while (start < end &&
> !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
> @@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
> unsigned long flags;
> struct resource res;
> unsigned long pfn, end_pfn;
> - int ret = -1;
> + int ret = -EINVAL;
Can you either make a similar change to the powerpc version of
walk_system_ram_range() in arch/powerpc/mm/mem.c or explain why it's
not needed? It *seems* like we'd want both versions of
walk_system_ram_range() to behave similarly in this respect.
> start = (u64) start_pfn << PAGE_SHIFT;
> end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
> _
On 1/25/19 1:02 PM, Bjorn Helgaas wrote:
>> @@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
>> unsigned long flags;
>> struct resource res;
>> unsigned long pfn, end_pfn;
>> - int ret = -1;
>> + int ret = -EINVAL;
> Can you either make a similar change to the powerpc version of
> walk_system_ram_range() in arch/powerpc/mm/mem.c or explain why it's
> not needed? It *seems* like we'd want both versions of
> walk_system_ram_range() to behave similarly in this respect.
Sure. A quick grep shows powerpc being the only other implementation.
I'll just add this hunk:
> diff -puN arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 arch/powerpc/mm/mem.c
> --- a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 2019-01-25 12:57:00.000004446 -0800
> +++ b/arch/powerpc/mm/mem.c 2019-01-25 12:58:13.215004263 -0800
> @@ -188,7 +188,7 @@ walk_system_ram_range(unsigned long star
> struct memblock_region *reg;
> unsigned long end_pfn = start_pfn + nr_pages;
> unsigned long tstart, tend;
> - int ret = -1;
> + int ret = -EINVAL;
I'll also dust off the ol' cross-compiler and make sure I didn't
fat-finger anything.
On Fri, Jan 25, 2019 at 3:10 PM Dave Hansen <[email protected]> wrote:
>
> On 1/25/19 1:02 PM, Bjorn Helgaas wrote:
> >> @@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
> >> unsigned long flags;
> >> struct resource res;
> >> unsigned long pfn, end_pfn;
> >> - int ret = -1;
> >> + int ret = -EINVAL;
> > Can you either make a similar change to the powerpc version of
> > walk_system_ram_range() in arch/powerpc/mm/mem.c or explain why it's
> > not needed? It *seems* like we'd want both versions of
> > walk_system_ram_range() to behave similarly in this respect.
>
> Sure. A quick grep shows powerpc being the only other implementation.
> I'll just add this hunk:
>
> > diff -puN arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 arch/powerpc/mm/mem.c
> > --- a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 2019-01-25 12:57:00.000004446 -0800
> > +++ b/arch/powerpc/mm/mem.c 2019-01-25 12:58:13.215004263 -0800
> > @@ -188,7 +188,7 @@ walk_system_ram_range(unsigned long star
> > struct memblock_region *reg;
> > unsigned long end_pfn = start_pfn + nr_pages;
> > unsigned long tstart, tend;
> > - int ret = -1;
> > + int ret = -EINVAL;
>
> I'll also dust off the ol' cross-compiler and make sure I didn't
> fat-finger anything.
Sounds good. Then add my
Reviewed-by: Bjorn Helgaas <[email protected]>
On Thu, Jan 24, 2019 at 5:21 PM Dave Hansen <[email protected]> wrote:
>
>
> From: Dave Hansen <[email protected]>
>
> HMM consumes physical address space for its own use, even
> though nothing is mapped or accessible there. It uses a
> special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
> to uniquely identify these areas.
>
> When HMM consumes address space, it makes a best guess about
> what to consume. However, it is possible that a future memory
> or device hotplug can collide with the reserved area. In the
> case of these conflicts, there is an error message in
> register_memory_resource().
>
> Later patches in this series move register_memory_resource()
> from using request_resource_conflict() to __request_region().
> Unfortunately, __request_region() does not return the conflict
> like the previous function did, which makes it impossible to
> check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
> resource.
>
> Instead of warning in register_memory_resource(), move the
> check into the core resource code itself (__request_region())
> where the conflicting resource _is_ available. This has the
> added bonus of producing a warning in case of HMM conflicts
> with devices *or* RAM address space, as opposed to the RAM-
> only warnings that were there previously.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
> Cc: Jerome Glisse <[email protected]>
> ---
>
> b/kernel/resource.c | 10 ++++++++++
> b/mm/memory_hotplug.c | 5 -----
> 2 files changed, 10 insertions(+), 5 deletions(-)
>
> diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
> --- a/kernel/resource.c~move-request_region-check 2019-01-24 15:13:14.453199539 -0800
> +++ b/kernel/resource.c 2019-01-24 15:13:14.458199539 -0800
> @@ -1123,6 +1123,16 @@ struct resource * __request_region(struc
> conflict = __request_resource(parent, res);
> if (!conflict)
> break;
> + /*
> + * mm/hmm.c reserves physical addresses which then
> + * become unavailable to other users. Conflicts are
> + * not expected. Be verbose if one is encountered.
> + */
> + if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> + pr_debug("Resource conflict with unaddressable "
> + "device memory at %#010llx !\n",
> + (unsigned long long)start);
I don't object to the change, but are you really OK with this being a
pr_debug() message that is only emitted when enabled via either the
dynamic debug mechanism or DEBUG being defined? From the comments, it
seems more like a KERN_INFO sort of message.
Also, maybe the message would be more useful if it included the
conflicting resource as well as the region you're requesting? Many of
the other callers of request_resource_conflict() have something like
this:
dev_err(dev, "resource collision: %pR conflicts with %s %pR\n",
new, conflict->name, conflict);
> + }
> if (conflict != parent) {
> if (!(conflict->flags & IORESOURCE_BUSY)) {
> parent = conflict;
> diff -puN mm/memory_hotplug.c~move-request_region-check mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~move-request_region-check 2019-01-24 15:13:14.455199539 -0800
> +++ b/mm/memory_hotplug.c 2019-01-24 15:13:14.459199539 -0800
> @@ -109,11 +109,6 @@ static struct resource *register_memory_
> res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> conflict = request_resource_conflict(&iomem_resource, res);
> if (conflict) {
> - if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> - pr_debug("Device unaddressable memory block "
> - "memory hotplug at %#010llx !\n",
> - (unsigned long long)start);
> - }
> pr_debug("System RAM resource %pR cannot be added\n", res);
> kfree(res);
> return ERR_PTR(-EEXIST);
> _
>
On 1/25/19 1:18 PM, Bjorn Helgaas wrote:
> On Thu, Jan 24, 2019 at 5:21 PM Dave Hansen <[email protected]> wrote:
>> diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
>> --- a/kernel/resource.c~move-request_region-check 2019-01-24 15:13:14.453199539 -0800
>> +++ b/kernel/resource.c 2019-01-24 15:13:14.458199539 -0800
>> @@ -1123,6 +1123,16 @@ struct resource * __request_region(struc
>> conflict = __request_resource(parent, res);
>> if (!conflict)
>> break;
>> + /*
>> + * mm/hmm.c reserves physical addresses which then
>> + * become unavailable to other users. Conflicts are
>> + * not expected. Be verbose if one is encountered.
>> + */
>> + if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
>> + pr_debug("Resource conflict with unaddressable "
>> + "device memory at %#010llx !\n",
>> + (unsigned long long)start);
>
> I don't object to the change, but are you really OK with this being a
> pr_debug() message that is only emitted when enabled via either the
> dynamic debug mechanism or DEBUG being defined? From the comments, it
> seems more like a KERN_INFO sort of message.
I left it consistent with the original message that was in the code.
I'm happy to change it, though, if the consumers of it (Jerome,
basically) want something different.
> Also, maybe the message would be more useful if it included the
> conflicting resource as well as the region you're requesting? Many of
> the other callers of request_resource_conflict() have something like
> this:
>
> dev_err(dev, "resource collision: %pR conflicts with %s %pR\n",
> new, conflict->name, conflict);
Seems sane. I was just trying to change as little as possible.
On 1/25/2019 11:15 AM, Dan Williams wrote:
> On Fri, Jan 25, 2019 at 11:10 AM Jane Chu <[email protected]> wrote:
>>
>>
>> On 1/25/2019 10:20 AM, Verma, Vishal L wrote:
>>>
>>> On Fri, 2019-01-25 at 09:18 -0800, Dan Williams wrote:
>>>> On Fri, Jan 25, 2019 at 12:20 AM Du, Fan <[email protected]> wrote:
>>>>> Dan
>>>>>
>>>>> Thanks for the insights!
>>>>>
>>>>> Can I say, the UCE is delivered from h/w to OS in a single way in
>>>>> case of machine
>>>>> check, only PMEM/DAX stuff filter out UC address and managed in its
>>>>> own way by
>>>>> badblocks, if PMEM/DAX doesn't do so, then common RAS workflow will
>>>>> kick in,
>>>>> right?
>>>>
>>>> The common RAS workflow always kicks in, it's just the page state
>>>> presented by a DAX mapping needs distinct handling. Once it is
>>>> hot-plugged it no longer needs to be treated differently than "System
>>>> RAM".
>>>>
>>>>> And how about when ARS is involved but no machine check fired for
>>>>> the function
>>>>> of this patchset?
>>>>
>>>> The hotplug effectively disconnects this address range from the ARS
>>>> results. They will still be reported in the libnvdimm "region" level
>>>> badblocks instance, but there's no safe / coordinated way to go clear
>>>> those errors without additional kernel enabling. There is no "clear
>>>> error" semantic for "System RAM".
>>>>
>>> Perhaps as future enabling, the kernel can go perform "clear error" for
>>> offlined pages, and make them usable again. But I'm not sure how
>>> prepared mm is to re-accept pages previously offlined.
>>>
>>
>> Offlining a DRAM backed page due to an UC makes sense because
>> a. the physical DRAM cell might still have an error
>> b. power cycle, scrubing could potentially 'repair' the DRAM cell,
>> making the page usable again.
>>
>> But for a PMEM backed page, neither is true. If a poison bit is set in
>> a page, that indicates the underlying hardware has completed the repair
>> work, all that's left is for software to recover. Secondly, because
>> poison is persistent, unless software explicitly clear the bit,
>> the page is permanently unusable.
>
> Not permanently... system-owner always has the option to use the
> device-DAX and ARS mechanisms to clear errors at the next boot.
> There's just no kernel enabling to do that automatically as a part of
> this patch set.
>
> However, we should consider this along with the userspace enabling to
> control which device-dax instances are set aside for hotplug. It would
> make sense to have a "clear errors before hotplug" configuration
> option.
>
Agreed, it would be nice to clear error prior to the hotplug operation,
better if that can be handled by the kernel.
thanks,
-jane
On Fri 25-01-19 11:15:08, Dan Williams wrote:
[...]
> However, we should consider this along with the userspace enabling to
> control which device-dax instances are set aside for hotplug. It would
> make sense to have a "clear errors before hotplug" configuration
> option.
I am not sure I understand. Do you mean to clear HWPoison when the
memory is hotadded (add_pages) or onlined (resp. move_pfn_range_to_zone)?
--
Michal Hocko
SUSE Labs
On Thu, Jan 24, 2019 at 03:14:41PM -0800, Dave Hansen wrote:
> v3 spurred a bunch of really good discussion. Thanks to everybody
> that made comments and suggestions!
>
> I would still love some Acks on this from the folks on cc, even if it
> is on just the patch touching your area.
>
> Note: these are based on commit d2f33c19644 in:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git libnvdimm-pending
>
> Changes since v3:
> * Move HMM-related resource warning instead of removing it
> * Use __request_resource() directly instead of devm.
> * Create a separate DAX_PMEM Kconfig option, complete with help text
> * Update patch descriptions and cover letter to give a better
> overview of use-cases and hardware where this might be useful.
>
> Changes since v2:
> * Updates to dev_dax_kmem_probe() in patch 5:
> * Reject probes for devices with bad NUMA nodes. Keeps slow
> memory from being added to node 0.
> * Use raw request_mem_region()
> * Add comments about permanent reservation
> * use dev_*() instead of printk's
> * Add references to nvdimm documentation in descriptions
> * Remove unneeded GPL export
> * Add Kconfig prompt and help text
>
> Changes since v1:
> * Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
> * Use binding/unbinding from "dax bus" code
> * Move over to a "dax bus" driver from being an nvdimm driver
>
> --
>
> Persistent memory is cool. But, currently, you have to rewrite
> your applications to use it. Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory? Well... have I got the patch
> series for you!
>
> == Background / Use Cases ==
>
> Persistent Memory (aka Non-Volatile DIMMs / NVDIMMS) themselves
> are described in detail in Documentation/nvdimm/nvdimm.txt.
> However, this documentation focuses on actually using them as
> storage. This set is focused on using NVDIMMs as DRAM replacement.
>
> This is intended for Intel-style NVDIMMs (aka. Intel Optane DC
> persistent memory) NVDIMMs. These DIMMs are physically persistent,
> more akin to flash than traditional RAM. They are also expected to
> be more cost-effective than using RAM, which is why folks want this
> set in the first place.
What variant of NVDIMM's F/P or both?
>
> This set is not intended for RAM-based NVDIMMs. Those are not
> cost-effective vs. plain RAM, and this using them here would simply
> be a waste.
>
Sounds like NVDIMM (P)
> But, why would you bother with this approach? Intel itself [1]
> has announced a hardware feature that does something very similar:
> "Memory Mode" which turns DRAM into a cache in front of persistent
> memory, which is then as a whole used as normal "RAM"?
>
> Here are a few reasons:
> 1. The capacity of memory mode is the size of your persistent
> memory that you dedicate. DRAM capacity is "lost" because it
> is used for cache. With this, you get PMEM+DRAM capacity for
> memory.
> 2. DRAM acts as a cache with memory mode, and caches can lead to
> unpredictable latencies. Since memory mode is all-or-nothing
> (either all your DRAM is used as cache or none is), your entire
> memory space is exposed to these unpredictable latencies. This
> solution lets you guarantee DRAM latencies if you need them.
> 3. The new "tier" of memory is exposed to software. That means
> that you can build tiered applications or infrastructure. A
> cloud provider could sell cheaper VMs that use more PMEM and
> more expensive ones that use DRAM. That's impossible with
> memory mode.
>
> Don't take this as criticism of memory mode. Memory mode is
> awesome, and doesn't strictly require *any* software changes (we
> have software changes proposed for optimizing it though). It has
> tons of other advantages over *this* approach. Basically, we
> believe that the approach in these patches is complementary to
> memory mode and that both can live side-by-side in harmony.
>
> == Patch Set Overview ==
>
> This series adds a new "driver" to which pmem devices can be
> attached. Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory. On
> systems with an HMAT (a new ACPI table), each socket (roughly)
> will have a separate NUMA node for its persistent memory so
> this newly-added memory can be selected by its unique NUMA
> node.
NUMA is distance based topology, does HMAT solve these problems?
How do we prevent fallback nodes of normal nodes being pmem nodes?
On an unexpected crash/failure is there a scrubbing mechanism
or do we rely on the allocator to do the right thing prior to
reallocating any memory. Will frequent zero'ing hurt NVDIMM/pmem's
life times?
Balbir Singh.
On Mon, Jan 28, 2019 at 1:26 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 25-01-19 11:15:08, Dan Williams wrote:
> [...]
> > However, we should consider this along with the userspace enabling to
> > control which device-dax instances are set aside for hotplug. It would
> > make sense to have a "clear errors before hotplug" configuration
> > option.
>
> I am not sure I understand. Do you mean to clear HWPoison when the
> memory is hotadded (add_pages) or onlined (resp. move_pfn_range_to_zone)?
Before the memory is hot-added via the kmem driver it shows up as an
independent persistent memory namespace. A namespace can be configured
as a block device and errors cleared by writing to the given "bad
block". Once all media errors are cleared the namespace can be
assigned as volatile memory to the core-kernel mm. The memory range
starts as a namespace each boot and must be hotplugged via startup
scripts, those scripts can be made to handle the bad block pruning.
On 1/28/19 3:09 AM, Balbir Singh wrote:
>> This is intended for Intel-style NVDIMMs (aka. Intel Optane DC
>> persistent memory) NVDIMMs. These DIMMs are physically persistent,
>> more akin to flash than traditional RAM. They are also expected to
>> be more cost-effective than using RAM, which is why folks want this
>> set in the first place.
> What variant of NVDIMM's F/P or both?
I'd expect this to get used in any cases where the NVDIMM is
cost-effective vs. DRAM. Today, I think that's only NVDIMM-P. At least
from what Wikipedia tells me about F vs. P vs. N:
https://en.wikipedia.org/wiki/NVDIMM
>> == Patch Set Overview ==
>>
>> This series adds a new "driver" to which pmem devices can be
>> attached. Once attached, the memory "owned" by the device is
>> hot-added to the kernel and managed like any other memory. On
>> systems with an HMAT (a new ACPI table), each socket (roughly)
>> will have a separate NUMA node for its persistent memory so
>> this newly-added memory can be selected by its unique NUMA
>> node.
>
> NUMA is distance based topology, does HMAT solve these problems?
NUMA is no longer just distance-based. Any memory with different
properties, like different memory-side caches or bandwidth properties
can be in its own, discrete NUMA node.
> How do we prevent fallback nodes of normal nodes being pmem nodes?
NUMA policies.
> On an unexpected crash/failure is there a scrubbing mechanism
> or do we rely on the allocator to do the right thing prior to
> reallocating any memory.
Yes, but this is not unique to persistent memory. On a kexec-based
crash, there might be old, sensitive data in *RAM* when the kernel comes
up. We depend on the allocator to zero things there. We also just
plain depend on the allocator to zero things so we don't leak
information when recycling pages in the first place.
I can't think of a scenario where some kind of "leak" of old data
wouldn't also be a bug with normal, volatile RAM.
> Will frequent zero'ing hurt NVDIMM/pmem's life times?
Everybody reputable that sells things with limited endurance quantifies
the endurance. I'd suggest that folks know what the endurance of their
media is before enabling this.
Dave Hansen <[email protected]> writes:
> On 1/25/19 1:02 PM, Bjorn Helgaas wrote:
>>> @@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
>>> unsigned long flags;
>>> struct resource res;
>>> unsigned long pfn, end_pfn;
>>> - int ret = -1;
>>> + int ret = -EINVAL;
>> Can you either make a similar change to the powerpc version of
>> walk_system_ram_range() in arch/powerpc/mm/mem.c or explain why it's
>> not needed? It *seems* like we'd want both versions of
>> walk_system_ram_range() to behave similarly in this respect.
>
> Sure. A quick grep shows powerpc being the only other implementation.
Ugh gross, why are we reimplementing it? ...
Oh right, memblock vs iomem. We should fix that one day :/
> I'll just add this hunk:
>
>> diff -puN arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 arch/powerpc/mm/mem.c
>> --- a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 2019-01-25 12:57:00.000004446 -0800
>> +++ b/arch/powerpc/mm/mem.c 2019-01-25 12:58:13.215004263 -0800
>> @@ -188,7 +188,7 @@ walk_system_ram_range(unsigned long star
>> struct memblock_region *reg;
>> unsigned long end_pfn = start_pfn + nr_pages;
>> unsigned long tstart, tend;
>> - int ret = -1;
>> + int ret = -EINVAL;
>
> I'll also dust off the ol' cross-compiler and make sure I didn't
> fat-finger anything.
Modern Fedora & Ubuntu have packaged cross toolchains. Otherwise there's
the kernel.org ones, or bootlin has versions with libc if you need it.
Patch looks fine. That value could only get to userspace if we have no
memory, which would be interesting.
Acked-by: Michael Ellerman <[email protected]> (powerpc)
cheers
Dave Hansen <[email protected]> writes:
> On 1/25/19 1:18 PM, Bjorn Helgaas wrote:
>> On Thu, Jan 24, 2019 at 5:21 PM Dave Hansen <[email protected]> wrote:
>>> diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
>>> --- a/kernel/resource.c~move-request_region-check 2019-01-24 15:13:14.453199539 -0800
>>> +++ b/kernel/resource.c 2019-01-24 15:13:14.458199539 -0800
>>> @@ -1123,6 +1123,16 @@ struct resource * __request_region(struc
>>> conflict = __request_resource(parent, res);
>>> if (!conflict)
>>> break;
>>> + /*
>>> + * mm/hmm.c reserves physical addresses which then
>>> + * become unavailable to other users. Conflicts are
>>> + * not expected. Be verbose if one is encountered.
>>> + */
>>> + if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
>>> + pr_debug("Resource conflict with unaddressable "
>>> + "device memory at %#010llx !\n",
>>> + (unsigned long long)start);
>>
>> I don't object to the change, but are you really OK with this being a
>> pr_debug() message that is only emitted when enabled via either the
>> dynamic debug mechanism or DEBUG being defined? From the comments, it
>> seems more like a KERN_INFO sort of message.
>
> I left it consistent with the original message that was in the code.
> I'm happy to change it, though, if the consumers of it (Jerome,
> basically) want something different.
At least using pr_debug() doesn't match the comment, ie. the comment
says "Be verbose" but pr_debug() is silent by default.
cheers
On 2/9/19 3:00 AM, Brice Goglin wrote:
> I've used your patches on fake hardware (memmap=xx!yy) with an older
> nvdimm-pending branch (without Keith's patches). It worked fine. This
> time I am running on real Intel hardware. Any idea where to look ?
I've run them on real Intel hardware too.
Could you share the exact sequence of commands you're issuing to
reproduce the hang? My guess would be that there's some odd interaction
between Dan's latest branch and my now (slightly) stale patches.
I'll refresh them this week and see if I can reproduce what you're seeing.
Le 11/02/2019 à 17:22, Dave Hansen a écrit :
> On 2/9/19 3:00 AM, Brice Goglin wrote:
>> I've used your patches on fake hardware (memmap=xx!yy) with an older
>> nvdimm-pending branch (without Keith's patches). It worked fine. This
>> time I am running on real Intel hardware. Any idea where to look ?
> I've run them on real Intel hardware too.
>
> Could you share the exact sequence of commands you're issuing to
> reproduce the hang? My guess would be that there's some odd interaction
> between Dan's latest branch and my now (slightly) stale patches.
>
> I'll refresh them this week and see if I can reproduce what you're seeing.
# ndctl disable-region all
# ndctl zero-labels all
# ndctl enable-region region0
# ndctl create-namespace -r region0 -t pmem -m devdax
{
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":"1488.37 GiB (1598.13 GB)",
"uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
"daxregion":{
"id":0,
"size":"1488.37 GiB (1598.13 GB)",
"align":2097152,
"devices":[
{
"chardev":"dax0.0",
"size":"1488.37 GiB (1598.13 GB)"
}
]
},
"align":2097152
}
# ndctl enable-namespace namespace0.0
# echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
<hang>
I tried with and without dax_pmem_compat loaded, but it doesn't help.
Brice
On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <[email protected]> wrote:
>
> Le 11/02/2019 à 17:22, Dave Hansen a écrit :
>
> > On 2/9/19 3:00 AM, Brice Goglin wrote:
> >> I've used your patches on fake hardware (memmap=xx!yy) with an older
> >> nvdimm-pending branch (without Keith's patches). It worked fine. This
> >> time I am running on real Intel hardware. Any idea where to look ?
> > I've run them on real Intel hardware too.
> >
> > Could you share the exact sequence of commands you're issuing to
> > reproduce the hang? My guess would be that there's some odd interaction
> > between Dan's latest branch and my now (slightly) stale patches.
> >
> > I'll refresh them this week and see if I can reproduce what you're seeing.
>
> # ndctl disable-region all
> # ndctl zero-labels all
> # ndctl enable-region region0
> # ndctl create-namespace -r region0 -t pmem -m devdax
> {
> "dev":"namespace0.0",
> "mode":"devdax",
> "map":"dev",
> "size":"1488.37 GiB (1598.13 GB)",
> "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
> "daxregion":{
> "id":0,
> "size":"1488.37 GiB (1598.13 GB)",
> "align":2097152,
> "devices":[
> {
> "chardev":"dax0.0",
> "size":"1488.37 GiB (1598.13 GB)"
> }
> ]
> },
> "align":2097152
> }
> # ndctl enable-namespace namespace0.0
> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> <hang>
>
> I tried with and without dax_pmem_compat loaded, but it doesn't help.
I think this is due to:
a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
I missed that this path is also called in the remove_id path. Thanks
for the bug report! I'll get this fixed up.
Le 13/02/2019 à 01:30, Dan Williams a écrit :
> On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <[email protected]> wrote:
>> # ndctl disable-region all
>> # ndctl zero-labels all
>> # ndctl enable-region region0
>> # ndctl create-namespace -r region0 -t pmem -m devdax
>> {
>> "dev":"namespace0.0",
>> "mode":"devdax",
>> "map":"dev",
>> "size":"1488.37 GiB (1598.13 GB)",
>> "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
>> "daxregion":{
>> "id":0,
>> "size":"1488.37 GiB (1598.13 GB)",
>> "align":2097152,
>> "devices":[
>> {
>> "chardev":"dax0.0",
>> "size":"1488.37 GiB (1598.13 GB)"
>> }
>> ]
>> },
>> "align":2097152
>> }
>> # ndctl enable-namespace namespace0.0
>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>> <hang>
>>
>> I tried with and without dax_pmem_compat loaded, but it doesn't help.
> I think this is due to:
>
> a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
>
> I missed that this path is also called in the remove_id path. Thanks
> for the bug report! I'll get this fixed up.
Now that remove_id is fixed, things fails later in Dave's procedure:
# echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
# echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
# echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
# echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
-bash: echo: write error: No such device
(And nothing seems to have changed in /sys/devices/system/memory/*/state)
Brice
On Wed, Feb 13, 2019 at 12:12 AM Brice Goglin <[email protected]> wrote:
>
> Le 13/02/2019 à 01:30, Dan Williams a écrit :
> > On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <[email protected]> wrote:
> >> # ndctl disable-region all
> >> # ndctl zero-labels all
> >> # ndctl enable-region region0
> >> # ndctl create-namespace -r region0 -t pmem -m devdax
> >> {
> >> "dev":"namespace0.0",
> >> "mode":"devdax",
> >> "map":"dev",
> >> "size":"1488.37 GiB (1598.13 GB)",
> >> "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
> >> "daxregion":{
> >> "id":0,
> >> "size":"1488.37 GiB (1598.13 GB)",
> >> "align":2097152,
> >> "devices":[
> >> {
> >> "chardev":"dax0.0",
> >> "size":"1488.37 GiB (1598.13 GB)"
> >> }
> >> ]
> >> },
> >> "align":2097152
> >> }
> >> # ndctl enable-namespace namespace0.0
> >> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> >> <hang>
> >>
> >> I tried with and without dax_pmem_compat loaded, but it doesn't help.
> > I think this is due to:
> >
> > a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
> >
> > I missed that this path is also called in the remove_id path. Thanks
> > for the bug report! I'll get this fixed up.
>
>
> Now that remove_id is fixed, things fails later in Dave's procedure:
>
> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
In the current version of the code the bind is not necessary, so the
lack of error messages here means the bind succeeded.
> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
> -bash: echo: write error: No such device
This also happens when the device is already bound.
>
> (And nothing seems to have changed in /sys/devices/system/memory/*/state)
What does "cat /proc/iomem" say?
Le 13/02/2019 à 09:24, Dan Williams a écrit :
> On Wed, Feb 13, 2019 at 12:12 AM Brice Goglin <[email protected]> wrote:
>> Le 13/02/2019 à 01:30, Dan Williams a écrit :
>>> On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <[email protected]> wrote:
>>>> # ndctl disable-region all
>>>> # ndctl zero-labels all
>>>> # ndctl enable-region region0
>>>> # ndctl create-namespace -r region0 -t pmem -m devdax
>>>> {
>>>> "dev":"namespace0.0",
>>>> "mode":"devdax",
>>>> "map":"dev",
>>>> "size":"1488.37 GiB (1598.13 GB)",
>>>> "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
>>>> "daxregion":{
>>>> "id":0,
>>>> "size":"1488.37 GiB (1598.13 GB)",
>>>> "align":2097152,
>>>> "devices":[
>>>> {
>>>> "chardev":"dax0.0",
>>>> "size":"1488.37 GiB (1598.13 GB)"
>>>> }
>>>> ]
>>>> },
>>>> "align":2097152
>>>> }
>>>> # ndctl enable-namespace namespace0.0
>>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>>>> <hang>
>>>>
>>>> I tried with and without dax_pmem_compat loaded, but it doesn't help.
>>> I think this is due to:
>>>
>>> a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
>>>
>>> I missed that this path is also called in the remove_id path. Thanks
>>> for the bug report! I'll get this fixed up.
>>
>> Now that remove_id is fixed, things fails later in Dave's procedure:
>>
>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> In the current version of the code the bind is not necessary, so the
> lack of error messages here means the bind succeeded.
>
>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>> -bash: echo: write error: No such device
> This also happens when the device is already bound.
>
>> (And nothing seems to have changed in /sys/devices/system/memory/*/state)
> What does "cat /proc/iomem" say?
3060000000-1aa5fffffff : Persistent Memory
3060000000-36481fffff : namespace0.0
3680000000-1a9ffffffff : dax0.0
3680000000-1a9ffffffff : System RAM
(the last line wasn't here before attaching to kmem)
I said nothing changed in memory/*/state, I actually meant that nothing
was offline. But things are actually working!
First, node4 appeared, all memory is already attached to it without
having to write to memory/*/state
Node 4 MemTotal: 1558183936 kB
Node 4 MemFree: 1558068564 kB
Node 4 MemUsed: 115372 kB
I wasn't expecting node4 to appear because the machine has no
/sys/firmware/acpi/tables/HMAT when running in 1LM (there's one in 2LM).
I thought you said in the past that no HMAT would mean memory would be
added to the existing DDR node?
Thanks!
Brice
Le 13/02/2019 à 09:43, Brice Goglin a écrit :
> Le 13/02/2019 à 09:24, Dan Williams a écrit :
>> On Wed, Feb 13, 2019 at 12:12 AM Brice Goglin <[email protected]> wrote:
>>> Le 13/02/2019 à 01:30, Dan Williams a écrit :
>>>> On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <[email protected]> wrote:
>>>>> # ndctl disable-region all
>>>>> # ndctl zero-labels all
>>>>> # ndctl enable-region region0
>>>>> # ndctl create-namespace -r region0 -t pmem -m devdax
>>>>> {
>>>>> "dev":"namespace0.0",
>>>>> "mode":"devdax",
>>>>> "map":"dev",
>>>>> "size":"1488.37 GiB (1598.13 GB)",
>>>>> "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
>>>>> "daxregion":{
>>>>> "id":0,
>>>>> "size":"1488.37 GiB (1598.13 GB)",
>>>>> "align":2097152,
>>>>> "devices":[
>>>>> {
>>>>> "chardev":"dax0.0",
>>>>> "size":"1488.37 GiB (1598.13 GB)"
>>>>> }
>>>>> ]
>>>>> },
>>>>> "align":2097152
>>>>> }
>>>>> # ndctl enable-namespace namespace0.0
>>>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>>>>> <hang>
>>>>>
>>>>> I tried with and without dax_pmem_compat loaded, but it doesn't help.
>>>> I think this is due to:
>>>>
>>>> a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
>>>>
>>>> I missed that this path is also called in the remove_id path. Thanks
>>>> for the bug report! I'll get this fixed up.
>>> Now that remove_id is fixed, things fails later in Dave's procedure:
>>>
>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>> In the current version of the code the bind is not necessary, so the
>> lack of error messages here means the bind succeeded.
It looks like "unbind" is required to make the PMEM appear as a new
node. If I remove_id from devdax and new_id to kmem without "unbind" in
the middle, nothing appears.
Writing to "kmem/bind" didn't seem necessary.
Brice
>>
>>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>>> -bash: echo: write error: No such device
>> This also happens when the device is already bound.
>>
>>> (And nothing seems to have changed in /sys/devices/system/memory/*/state)
>> What does "cat /proc/iomem" say?
>
> 3060000000-1aa5fffffff : Persistent Memory
> 3060000000-36481fffff : namespace0.0
> 3680000000-1a9ffffffff : dax0.0
> 3680000000-1a9ffffffff : System RAM
> (the last line wasn't here before attaching to kmem)
>
> I said nothing changed in memory/*/state, I actually meant that nothing
> was offline. But things are actually working!
>
> First, node4 appeared, all memory is already attached to it without
> having to write to memory/*/state
>
> Node 4 MemTotal: 1558183936 kB
> Node 4 MemFree: 1558068564 kB
> Node 4 MemUsed: 115372 kB
>
> I wasn't expecting node4 to appear because the machine has no
> /sys/firmware/acpi/tables/HMAT when running in 1LM (there's one in 2LM).
> I thought you said in the past that no HMAT would mean memory would be
> added to the existing DDR node?
>
> Thanks!
>
> Brice
>
>
On Wed, Feb 13, 2019 at 5:07 AM Brice Goglin <[email protected]> wrote:
>
>
> Le 13/02/2019 à 09:43, Brice Goglin a écrit :
> > Le 13/02/2019 à 09:24, Dan Williams a écrit :
> >> On Wed, Feb 13, 2019 at 12:12 AM Brice Goglin <[email protected]> wrote:
> >>> Le 13/02/2019 à 01:30, Dan Williams a écrit :
> >>>> On Tue, Feb 12, 2019 at 11:59 AM Brice Goglin <[email protected]> wrote:
> >>>>> # ndctl disable-region all
> >>>>> # ndctl zero-labels all
> >>>>> # ndctl enable-region region0
> >>>>> # ndctl create-namespace -r region0 -t pmem -m devdax
> >>>>> {
> >>>>> "dev":"namespace0.0",
> >>>>> "mode":"devdax",
> >>>>> "map":"dev",
> >>>>> "size":"1488.37 GiB (1598.13 GB)",
> >>>>> "uuid":"ad0096d7-3fe7-4402-b529-ad64ed0bf789",
> >>>>> "daxregion":{
> >>>>> "id":0,
> >>>>> "size":"1488.37 GiB (1598.13 GB)",
> >>>>> "align":2097152,
> >>>>> "devices":[
> >>>>> {
> >>>>> "chardev":"dax0.0",
> >>>>> "size":"1488.37 GiB (1598.13 GB)"
> >>>>> }
> >>>>> ]
> >>>>> },
> >>>>> "align":2097152
> >>>>> }
> >>>>> # ndctl enable-namespace namespace0.0
> >>>>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> >>>>> <hang>
> >>>>>
> >>>>> I tried with and without dax_pmem_compat loaded, but it doesn't help.
> >>>> I think this is due to:
> >>>>
> >>>> a9f1ffdb6a20 device-dax: Auto-bind device after successful new_id
> >>>>
> >>>> I missed that this path is also called in the remove_id path. Thanks
> >>>> for the bug report! I'll get this fixed up.
> >>> Now that remove_id is fixed, things fails later in Dave's procedure:
> >>>
> >>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> >>> # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> >>> # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >> In the current version of the code the bind is not necessary, so the
> >> lack of error messages here means the bind succeeded.
>
>
> It looks like "unbind" is required to make the PMEM appear as a new
> node. If I remove_id from devdax and new_id to kmem without "unbind" in
> the middle, nothing appears.
>
> Writing to "kmem/bind" didn't seem necessary.
Yes, in short:
device_dax/remove_id: not required, this driver attaches to any and
all device-dax devices by default
device_dax/unbind: required, nothing else will free the device for
kmem to attach
kmem/new_id: required, it will attach if the device is currently
unbound otherwise the device must be unbound before proceeding
kmem/bind: only required if the device was busy / attached to
device_dax when new_id was written.