I would like to get this queued up to get merged. Since most of the
churn is in the nvdimm code, and it also depends on some refactoring
that only exists in the nvdimm tree, it seems like putting it in *via*
the nvdimm tree is the best path.
But, this series makes non-trivial changes to the "resource" code and
memory hotplug. I'd really like to get some acks from folks on the
first three patches which affect those areas.
Borislav and Bjorn, you seem to be the most active in the resource code.
Michal, I'd really appreciate a look at all of this from a mem hotplug
perspective.
Note: these are based on commit d2f33c19644 in:
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git libnvdimm-pending
Changes since v1:
* Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
* Use binding/unbinding from "dax bus" code
* Move over to a "dax bus" driver from being an nvdimm driver
--
Persistent memory is cool. But, currently, you have to rewrite
your applications to use it. Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory? Well... have I got the patch
series for you!
This series adds a new "driver" to which pmem devices can be
attached. Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory. On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory so
this newly-added memory can be selected by its unique NUMA
node.
Here's how I set up a system to test this thing:
1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory. Reserving a spot at 2GB
physical seems to work: memmap=512M!0x0000000080000000
This will end up looking like a pmem device at boot.
3. When booted, convert the fsdax device to "device dax":
ndctl create-namespace -fe namespace0.0 -m dax
4. See patch 4 for instructions on binding the kmem driver
to a device.
5. Now, online the new memory sections. Perhaps:
grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
echo $f: `cat $f`
echo online_movable > $f
grep ^MemTotal /proc/meminfo
done
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Yaowei Bai <[email protected]>
Cc: Takashi Iwai <[email protected]>
From: Dave Hansen <[email protected]>
In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.
However, it currently only finds memory resources which are
"top-level" iomem_resources. Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):
a0000000-bfffffff : Persistent Memory (legacy)
a0000000-afffffff : System RAM
Changing the true->false here allows children to be searched
as well. We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
b/kernel/resource.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff -puN kernel/resource.c~mm-walk_system_ram_range-search-child-resources kernel/resource.c
--- a/kernel/resource.c~mm-walk_system_ram_range-search-child-resources 2018-12-20 11:48:42.824771932 -0800
+++ b/kernel/resource.c 2018-12-20 11:48:42.827771932 -0800
@@ -445,6 +445,9 @@ int walk_mem_res(u64 start, u64 end, voi
* This function calls the @func callback against all memory ranges of type
* System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
* It is to be used only for System RAM.
+ *
+ * This will find System RAM ranges that are children of top-level resources
+ * in addition to top-level System RAM resources.
*/
int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
void *arg, int (*func)(unsigned long, unsigned long, void *))
@@ -460,7 +463,7 @@ int walk_system_ram_range(unsigned long
flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
while (start < end &&
!find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
- true, &res)) {
+ false, &res)) {
pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
end_pfn = (res.end + 1) >> PAGE_SHIFT;
if (end_pfn > pfn)
_
From: Dave Hansen <[email protected]>
The mm/resource.c code is used to manage the physical address
space. We can view the current resource configuration in
/proc/iomem. An example of this is at the bottom of this
description.
The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory". We want to use this persistent memory, but
as volatile memory, just like RAM. The best way to do this is
to leave the existing resource in place, but add a "System RAM"
resource underneath it. This clearly communicates the ownership
relationship of this memory.
The request_resource_conflict() API only deals with the
top-level resources. Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.
We also rework the old error message a bit since we do not get
the conflicting entry back: only an indication that we *had* a
conflict.
We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space. But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.
This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.
00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f0000-000fffff : Reserved
000f0000-000fffff : System ROM
00100000-9fffffff : System RAM
01000000-01e071d0 : Kernel code
01e071d1-027dfdff : Kernel data
02dc6000-0305dfff : Kernel bss
a0000000-afffffff : Persistent Memory (legacy)
a0000000-a7ffffff : System RAM
b0000000-bffdffff : System RAM
bffe0000-bfffffff : Reserved
c0000000-febfffff : PCI Bus 0000:00
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
b/mm/memory_hotplug.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 2018-12-20 11:48:42.317771933 -0800
+++ b/mm/memory_hotplug.c 2018-12-20 11:48:42.322771933 -0800
@@ -98,24 +98,21 @@ void mem_hotplug_done(void)
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
- struct resource *res, *conflict;
- res = kzalloc(sizeof(struct resource), GFP_KERNEL);
- if (!res)
- return ERR_PTR(-ENOMEM);
+ struct resource *res;
+ unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+ char *resource_name = "System RAM";
- res->name = "System RAM";
- res->start = start;
- res->end = start + size - 1;
- res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
- conflict = request_resource_conflict(&iomem_resource, res);
- if (conflict) {
- if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
- pr_debug("Device unaddressable memory block "
- "memory hotplug at %#010llx !\n",
- (unsigned long long)start);
- }
- pr_debug("System RAM resource %pR cannot be added\n", res);
- kfree(res);
+ /*
+ * Request ownership of the new memory range. This might be
+ * a child of an existing resource that was present but
+ * not marked as busy.
+ */
+ res = __request_region(&iomem_resource, start, size,
+ resource_name, flags);
+
+ if (!res) {
+ pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+ start, start + size);
return ERR_PTR(-EEXIST);
}
return res;
_
On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
<[email protected]> wrote:
>
>
> From: Dave Hansen <[email protected]>
>
> walk_system_ram_range() can return an error code either because *it*
> failed, or because the 'func' that it calls returned an error. The
> memory hotplug code does the following:
>
> ret = walk_system_ram_range(..., func);
> if (ret)
> return ret;
>
> and 'ret' makes it out to userspace, eventually. The problem is,
> walk_system_ram_range() failures that result from *it* failing (as
> opposed to 'func') return -1. That leads to a very odd -EPERM (-1)
> return code out to userspace.
>
> Make walk_system_ram_range() return -EINVAL for internal failures to
> keep userspace less confused.
>
> This return code is compatible with all the callers that I audited.
>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
>
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> b/kernel/resource.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 kernel/resource.c
> --- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 2018-12-20 11:48:41.810771934 -0800
> +++ b/kernel/resource.c 2018-12-20 11:48:41.814771934 -0800
> @@ -375,7 +375,7 @@ static int __walk_iomem_res_desc(resourc
> int (*func)(struct resource *, void *))
> {
> struct resource res;
> - int ret = -1;
> + int ret = -EINVAL;
>
> while (start < end &&
> !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
> @@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
> unsigned long flags;
> struct resource res;
> unsigned long pfn, end_pfn;
> - int ret = -1;
> + int ret = -EINVAL;
Don't you want a similar change in the powerpc version in arch/powerpc/mm/mem.c?
>
> start = (u64) start_pfn << PAGE_SHIFT;
> end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
> _
On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
<[email protected]> wrote:
>
>
> From: Dave Hansen <[email protected]>
>
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries.
Is there any documentation about exactly what persistent memory is?
In Documentation/, I see references to pstore and pmem, which sound
sort of similar, but maybe not quite the same?
> However, this limits persistent memory use to applications which
> *have* been modified. To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed ad used just like
> normal RAM would be.
s/ad/and/
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
> and then bind it to this new driver:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
>
> Note: this inherits any existing NUMA information for the newly-
> added memory from the persistent memory device that came from the
> firmware. On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node. That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
>
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches. But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: Yaowei Bai <[email protected]>
> Cc: Takashi Iwai <[email protected]>
> ---
>
> b/drivers/dax/Kconfig | 5 ++
> b/drivers/dax/Makefile | 1
> b/drivers/dax/kmem.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 99 insertions(+)
>
> diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
> --- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800
> +++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800
> @@ -32,6 +32,11 @@ config DEV_DAX_PMEM
>
> Say M if unsure
>
> +config DEV_DAX_KMEM
> + def_bool y
Is "y" the right default here? I periodically see Linus complain
about new things defaulting to "on", but I admit I haven't paid enough
attention to know whether that would apply here.
> + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure
> + depends on MEMORY_HOTPLUG # for add_memory() and friends
> +
> config DEV_DAX_PMEM_COMPAT
> tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
> depends on DEV_DAX_PMEM
> diff -puN /dev/null drivers/dax/kmem.c
> --- /dev/null 2018-12-03 08:41:47.355756491 -0800
> +++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/memory.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pfn_t.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include "dax-private.h"
> +#include "bus.h"
> +
> +int dev_dax_kmem_probe(struct device *dev)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct resource *res = &dev_dax->region->res;
> + resource_size_t kmem_start;
> + resource_size_t kmem_size;
> + struct resource *new_res;
> + int numa_node;
> + int rc;
> +
> + /* Hotplug starting at the beginning of the next block: */
> + kmem_start = ALIGN(res->start, memory_block_size_bytes());
> +
> + kmem_size = resource_size(res);
> + /* Adjust the size down to compensate for moving up kmem_start: */
> + kmem_size -= kmem_start - res->start;
> + /* Align the size down to cover only complete blocks: */
> + kmem_size &= ~(memory_block_size_bytes() - 1);
> +
> + new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> + dev_name(dev));
> +
> + if (!new_res) {
> + printk("could not reserve region %016llx -> %016llx\n",
> + kmem_start, kmem_start+kmem_size);
1) It'd be nice to have some sort of module tag in the output that
ties it to this driver.
2) It might be nice to print the range in the same format as %pR,
i.e., "[mem %#010x-%#010x]" with the end included (start + size - 1).
> + return -EBUSY;
> + }
> +
> + /*
> + * Set flags appropriate for System RAM. Leave ..._BUSY clear
> + * so that add_memory() can add a child resource.
> + */
> + new_res->flags = IORESOURCE_SYSTEM_RAM;
IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
devm_request_mem_region() path. I think you should keep at least
IORESOURCE_MEM so the iomem_resource tree stays consistent.
> + new_res->name = dev_name(dev);
> +
> + numa_node = dev_dax->target_node;
> + if (numa_node < 0) {
> + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
It'd be nice to again have a module tag and an indication of what
range is affected, e.g., %pR of new_res.
You don't save the new_res pointer anywhere, which I guess you intend
for now since there's no remove or anything else to do with this
resource? I thought maybe devm_request_mem_region() would implicitly
save it, but it doesn't; it only saves the parent (iomem_resource), the
start (kmem_start), and the size (kmem_size).
> + numa_node = 0;
> + }
> +
> + rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> + if (rc)
> + return rc;
> +
> + return 0;
Doesn't this mean "return rc" or even just "return add_memory(...)"?
> +}
> +EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);
> +
> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> + /* Assume that hot-remove will fail for now */
> + return -EBUSY;
> +}
> +
> +static struct dax_device_driver device_dax_kmem_driver = {
> + .drv = {
> + .probe = dev_dax_kmem_probe,
> + .remove = dev_dax_kmem_remove,
> + },
> +};
> +
> +static int __init dax_kmem_init(void)
> +{
> + return dax_driver_register(&device_dax_kmem_driver);
> +}
> +
> +static void __exit dax_kmem_exit(void)
> +{
> + dax_driver_unregister(&device_dax_kmem_driver);
> +}
> +
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_LICENSE("GPL v2");
> +module_init(dax_kmem_init);
> +module_exit(dax_kmem_exit);
> +MODULE_ALIAS_DAX_DEVICE(0);
> diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
> --- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-08 09:54:44.053694874 -0800
> +++ b/drivers/dax/Makefile 2019-01-08 09:54:44.056694874 -0800
> @@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_DAX) += dax.o
> obj-$(CONFIG_DEV_DAX) += device_dax.o
> +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>
> dax-y := super.o
> dax-y += bus.o
> _
On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> <[email protected]> wrote:
>> From: Dave Hansen <[email protected]>
>> Currently, a persistent memory region is "owned" by a device driver,
>> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
>> allow applications to explicitly use persistent memory, generally
>> by being modified to use special, new libraries.
>
> Is there any documentation about exactly what persistent memory is?
> In Documentation/, I see references to pstore and pmem, which sound
> sort of similar, but maybe not quite the same?
One instance of persistent memory is nonvolatile DIMMs. They're
described in great detail here: Documentation/nvdimm/nvdimm.txt
>> +config DEV_DAX_KMEM
>> + def_bool y
>
> Is "y" the right default here? I periodically see Linus complain
> about new things defaulting to "on", but I admit I haven't paid enough
> attention to know whether that would apply here.
>
>> + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure
>> + depends on MEMORY_HOTPLUG # for add_memory() and friends
Well, it doesn't default to "on for everyone". It inherits the state of
DEV_DAX_PMEM so it's only foisted on folks who have already opted in to
generic pmem support.
>> +int dev_dax_kmem_probe(struct device *dev)
>> +{
>> + struct dev_dax *dev_dax = to_dev_dax(dev);
>> + struct resource *res = &dev_dax->region->res;
>> + resource_size_t kmem_start;
>> + resource_size_t kmem_size;
>> + struct resource *new_res;
>> + int numa_node;
>> + int rc;
>> +
>> + /* Hotplug starting at the beginning of the next block: */
>> + kmem_start = ALIGN(res->start, memory_block_size_bytes());
>> +
>> + kmem_size = resource_size(res);
>> + /* Adjust the size down to compensate for moving up kmem_start: */
>> + kmem_size -= kmem_start - res->start;
>> + /* Align the size down to cover only complete blocks: */
>> + kmem_size &= ~(memory_block_size_bytes() - 1);
>> +
>> + new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
>> + dev_name(dev));
>> +
>> + if (!new_res) {
>> + printk("could not reserve region %016llx -> %016llx\n",
>> + kmem_start, kmem_start+kmem_size);
>
> 1) It'd be nice to have some sort of module tag in the output that
> ties it to this driver.
Good point. That should probably be a dev_printk().
> 2) It might be nice to print the range in the same format as %pR,
> i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ).
Sure, that sounds like a sane thing to do as well.
>> + return -EBUSY;
>> + }
>> +
>> + /*
>> + * Set flags appropriate for System RAM. Leave ..._BUSY clear
>> + * so that add_memory() can add a child resource.
>> + */
>> + new_res->flags = IORESOURCE_SYSTEM_RAM;
>
> IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> devm_request_mem_region() path. I think you should keep at least
> IORESOURCE_MEM so the iomem_resource tree stays consistent.
>
>> + new_res->name = dev_name(dev);
>> +
>> + numa_node = dev_dax->target_node;
>> + if (numa_node < 0) {
>> + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
>
> It'd be nice to again have a module tag and an indication of what
> range is affected, e.g., %pR of new_res.
>
> You don't save the new_res pointer anywhere, which I guess you intend
> for now since there's no remove or anything else to do with this
> resource? I thought maybe devm_request_mem_region() would implicitly
> save it, but it doesn't; it only saves the parent (iomem_resource, the
> start (kmem_start), and the size (kmem_size)).
Yeah, that's the intention: removal is currently not supported. I'll
add a comment to clarify.
>> + numa_node = 0;
>> + }
>> +
>> + rc = add_memory(numa_node, new_res->start, resource_size(new_res));
>> + if (rc)
>> + return rc;
>> +
>> + return 0;
>
> Doesn't this mean "return rc" or even just "return add_memory(...)"?
Yeah, all of those are equivalent. I guess I just prefer the explicit
error handling path.
On Wed, Jan 16, 2019 at 3:40 PM Dave Hansen <[email protected]> wrote:
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> > On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> > <[email protected]> wrote:
> >> From: Dave Hansen <[email protected]>
> >> Currently, a persistent memory region is "owned" by a device driver,
> >> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> >> allow applications to explicitly use persistent memory, generally
> >> by being modified to use special, new libraries.
> >
> > Is there any documentation about exactly what persistent memory is?
> > In Documentation/, I see references to pstore and pmem, which sound
> > sort of similar, but maybe not quite the same?
>
> One instance of persistent memory is nonvolatile DIMMS. They're
> described in great detail here: Documentation/nvdimm/nvdimm.txt
Thanks! Some bread crumbs in the changelog to lead there would be great.
Bjorn
On Wed, Jan 16, 2019 at 10:25 AM Dave Hansen
<[email protected]> wrote:
>
>
> From: Dave Hansen <[email protected]>
>
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries.
>
> However, this limits persistent memory use to applications which
> *have* been modified. To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed ad used just like
> normal RAM would be.
>
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
> and then bind it to this new driver:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
>
> Note: this inherits any existing NUMA information for the newly-
> added memory from the persistent memory device that came from the
> firmware. On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node. That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
>
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches. But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: Yaowei Bai <[email protected]>
> Cc: Takashi Iwai <[email protected]>
> ---
>
> b/drivers/dax/Kconfig | 5 ++
> b/drivers/dax/Makefile | 1
> b/drivers/dax/kmem.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 99 insertions(+)
>
> diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
> --- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800
> +++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800
> @@ -32,6 +32,11 @@ config DEV_DAX_PMEM
>
> Say M if unsure
>
> +config DEV_DAX_KMEM
> + def_bool y
> + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure
> + depends on MEMORY_HOTPLUG # for add_memory() and friends
> +
I think this should be:
config DEV_DAX_KMEM
tristate "<kmem title>"
depends on DEV_DAX
default DEV_DAX
depends on MEMORY_HOTPLUG # for add_memory() and friends
help
<kmem description>
...because the DEV_DAX_KMEM implementation with the device-DAX reworks
is independent of pmem. It just so happens that pmem is the only
source for device-DAX instances, but that need not always be the case
and kmem is device-DAX origin generic.
> config DEV_DAX_PMEM_COMPAT
> tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
> depends on DEV_DAX_PMEM
> diff -puN /dev/null drivers/dax/kmem.c
> --- /dev/null 2018-12-03 08:41:47.355756491 -0800
> +++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/memory.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pfn_t.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include "dax-private.h"
> +#include "bus.h"
> +
> +int dev_dax_kmem_probe(struct device *dev)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct resource *res = &dev_dax->region->res;
> + resource_size_t kmem_start;
> + resource_size_t kmem_size;
> + struct resource *new_res;
> + int numa_node;
> + int rc;
> +
> + /* Hotplug starting at the beginning of the next block: */
> + kmem_start = ALIGN(res->start, memory_block_size_bytes());
> +
> + kmem_size = resource_size(res);
> + /* Adjust the size down to compensate for moving up kmem_start: */
> + kmem_size -= kmem_start - res->start;
> + /* Align the size down to cover only complete blocks: */
> + kmem_size &= ~(memory_block_size_bytes() - 1);
> +
> + new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> + dev_name(dev));
> +
> + if (!new_res) {
> + printk("could not reserve region %016llx -> %016llx\n",
> + kmem_start, kmem_start+kmem_size);
dev_err() please.
> + return -EBUSY;
> + }
> +
> + /*
> + * Set flags appropriate for System RAM. Leave ..._BUSY clear
> + * so that add_memory() can add a child resource.
> + */
> + new_res->flags = IORESOURCE_SYSTEM_RAM;
> + new_res->name = dev_name(dev);
> +
> + numa_node = dev_dax->target_node;
> + if (numa_node < 0) {
> + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
I think this should be dev_info(dev, "no numa_node, defaulting to
0\n"), or dev_dbg():
1/ so we can backtrack which device is missing numa information
2/ NUMA_NO_NODE may be a common occurrence so it's not really a "warn"
level concern afaics.
3/ no real need for _once; I don't see this as a log spam risk.
> + numa_node = 0;
> + }
> +
> + rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> + if (rc)
> + return rc;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);
No need to export this afaics.
> +
> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> + /* Assume that hot-remove will fail for now */
> + return -EBUSY;
> +}
> +
> +static struct dax_device_driver device_dax_kmem_driver = {
> + .drv = {
> + .probe = dev_dax_kmem_probe,
> + .remove = dev_dax_kmem_remove,
> + },
> +};
> +
> +static int __init dax_kmem_init(void)
> +{
> + return dax_driver_register(&device_dax_kmem_driver);
> +}
> +
> +static void __exit dax_kmem_exit(void)
> +{
> + dax_driver_unregister(&device_dax_kmem_driver);
> +}
> +
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_LICENSE("GPL v2");
> +module_init(dax_kmem_init);
> +module_exit(dax_kmem_exit);
> +MODULE_ALIAS_DAX_DEVICE(0);
> diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
> --- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-08 09:54:44.053694874 -0800
> +++ b/drivers/dax/Makefile 2019-01-08 09:54:44.056694874 -0800
> @@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_DAX) += dax.o
> obj-$(CONFIG_DEV_DAX) += device_dax.o
> +obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>
> dax-y := super.o
> dax-y += bus.o
> _
On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
>> + /*
>> + * Set flags appropriate for System RAM. Leave ..._BUSY clear
>> + * so that add_memory() can add a child resource.
>> + */
>> + new_res->flags = IORESOURCE_SYSTEM_RAM;
> IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> devm_request_mem_region() path. I think you should keep at least
> IORESOURCE_MEM so the iomem_resource tree stays consistent.
I went to look at fixing this. It looks like "IORESOURCE_SYSTEM_RAM"
includes IORESOURCE_MEM:
> #define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM|IORESOURCE_SYSRAM)
Did you want the patch to expand this #define, or did you just want to
ensure that IORESOURCE_MEM got in there somehow?
On Wed, Jan 16, 2019 at 3:53 PM Dave Hansen <[email protected]> wrote:
>
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> >> + /*
> >> + * Set flags appropriate for System RAM. Leave ..._BUSY clear
> >> + * so that add_memory() can add a child resource.
> >> + */
> >> + new_res->flags = IORESOURCE_SYSTEM_RAM;
> > IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> > devm_request_mem_region() path. I think you should keep at least
> > IORESOURCE_MEM so the iomem_resource tree stays consistent.
>
> I went to look at fixing this. It looks like "IORESOURCE_SYSTEM_RAM"
> includes IORESOURCE_MEM:
>
> > #define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM|IORESOURCE_SYSRAM)
>
> Did you want the patch to expand this #define, or did you just want to
> ensure that IORESOURCE_MEM got in there somehow?
The latter. Since it's already included, forget I said anything :)
Although if your intent is only to clear IORESOURCE_BUSY, maybe it
would be safer to just clear that bit instead of overwriting
everything? That might also help people grepping for IORESOURCE_BUSY
usage.
On Wed, Jan 16, 2019 at 1:40 PM Dave Hansen <[email protected]> wrote:
>
> On 1/16/19 1:16 PM, Bjorn Helgaas wrote:
> > On Wed, Jan 16, 2019 at 12:25 PM Dave Hansen
> > <[email protected]> wrote:
> >> From: Dave Hansen <[email protected]>
> >> Currently, a persistent memory region is "owned" by a device driver,
> >> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> >> allow applications to explicitly use persistent memory, generally
> >> by being modified to use special, new libraries.
> >
> > Is there any documentation about exactly what persistent memory is?
> > In Documentation/, I see references to pstore and pmem, which sound
> > sort of similar, but maybe not quite the same?
>
> One instance of persistent memory is nonvolatile DIMMS. They're
> described in great detail here: Documentation/nvdimm/nvdimm.txt
>
> >> +config DEV_DAX_KMEM
> >> + def_bool y
> >
> > Is "y" the right default here? I periodically see Linus complain
> > about new things defaulting to "on", but I admit I haven't paid enough
> > attention to know whether that would apply here.
> >
> >> + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure
> >> + depends on MEMORY_HOTPLUG # for add_memory() and friends
>
> Well, it doesn't default to "on for everyone". It inherits the state of
> DEV_DAX_PMEM so it's only foisted on folks who have already opted in to
> generic pmem support.
>
> >> +int dev_dax_kmem_probe(struct device *dev)
> >> +{
> >> + struct dev_dax *dev_dax = to_dev_dax(dev);
> >> + struct resource *res = &dev_dax->region->res;
> >> + resource_size_t kmem_start;
> >> + resource_size_t kmem_size;
> >> + struct resource *new_res;
> >> + int numa_node;
> >> + int rc;
> >> +
> >> + /* Hotplug starting at the beginning of the next block: */
> >> + kmem_start = ALIGN(res->start, memory_block_size_bytes());
> >> +
> >> + kmem_size = resource_size(res);
> >> + /* Adjust the size down to compensate for moving up kmem_start: */
> >> + kmem_size -= kmem_start - res->start;
> >> + /* Align the size down to cover only complete blocks: */
> >> + kmem_size &= ~(memory_block_size_bytes() - 1);
> >> +
> >> + new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> >> + dev_name(dev));
> >> +
> >> + if (!new_res) {
> >> + printk("could not reserve region %016llx -> %016llx\n",
> >> + kmem_start, kmem_start+kmem_size);
> >
> > 1) It'd be nice to have some sort of module tag in the output that
> > ties it to this driver.
>
> Good point. That should probably be a dev_printk().
>
> > 2) It might be nice to print the range in the same format as %pR,
> > i.e., "[mem %#010x-%#010x]" with the end included (start + size -1 ).
>
> Sure, that sounds like a sane thing to do as well.
Does %pR protect against physical address disclosure to non-root by
default? At least the pmem driver is using %pR rather than manually
printing raw physical address values, but you would need to create a
local, modified version of the passed-in resource.
> >> + return -EBUSY;
> >> + }
> >> +
> >> + /*
> >> + * Set flags appropriate for System RAM. Leave ..._BUSY clear
> >> + * so that add_memory() can add a child resource.
> >> + */
> >> + new_res->flags = IORESOURCE_SYSTEM_RAM;
> >
> > IIUC, new_res->flags was set to "IORESOURCE_MEM | ..." in the
> > devm_request_mem_region() path. I think you should keep at least
> > IORESOURCE_MEM so the iomem_resource tree stays consistent.
> >
> >> + new_res->name = dev_name(dev);
> >> +
> >> + numa_node = dev_dax->target_node;
> >> + if (numa_node < 0) {
> >> + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
> >
> > It'd be nice to again have a module tag and an indication of what
> > range is affected, e.g., %pR of new_res.
> >
> > You don't save the new_res pointer anywhere, which I guess you intend
> > for now since there's no remove or anything else to do with this
> > resource? I thought maybe devm_request_mem_region() would implicitly
> > save it, but it doesn't; it only saves the parent (iomem_resource, the
> > start (kmem_start), and the size (kmem_size)).
>
> Yeah, that's the intention: removal is currently not supported. I'll
> add a comment to clarify.
I would clarify that *driver* removal is supported because there's no
Linux facility for drivers to fail removal (nothing checks the return
code from ->remove()). Instead the protection is that the resource
must remain pinned forever. In that case devm_request_mem_region() is
the wrong function to use. You want to explicitly use the non-devm
request_mem_region() and purposely leak it to keep the memory reserved
indefinitely.
On 1/16/19 11:16 AM, Jerome Glisse wrote:
>> We also rework the old error message a bit since we do not get
>> the conflicting entry back: only an indication that we *had* a
>> conflict.
> We should keep the device private check (moving it into __request_region),
> as device private can register unused physical addresses (unused at the
> time of device private registration) that can later block valid physical
> addresses; the error message you are removing reports such an event.
If a resource can't support having a child, shouldn't it just be marked
IORESOURCE_BUSY, rather than trying to somehow special-case
IORES_DESC_DEVICE_PRIVATE_MEMORY behavior?
From: Dave Hansen <[email protected]>
walk_system_ram_range() can return an error code either because *it*
failed, or because the 'func' that it calls returned an error. The
memory hotplug code does the following:
ret = walk_system_ram_range(..., func);
if (ret)
return ret;
and 'ret' makes it out to userspace, eventually. The problem is,
walk_system_ram_range() failures that result from *it* failing (as
opposed to 'func') return -1. That leads to a very odd -EPERM (-1)
return code out to userspace.
Make walk_system_ram_range() return -EINVAL for internal failures to
keep userspace less confused.
This return code is compatible with all the callers that I audited.
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
b/kernel/resource.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 2018-12-20 11:48:41.810771934 -0800
+++ b/kernel/resource.c 2018-12-20 11:48:41.814771934 -0800
@@ -375,7 +375,7 @@ static int __walk_iomem_res_desc(resourc
int (*func)(struct resource *, void *))
{
struct resource res;
- int ret = -1;
+ int ret = -EINVAL;
while (start < end &&
!find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
@@ -453,7 +453,7 @@ int walk_system_ram_range(unsigned long
unsigned long flags;
struct resource res;
unsigned long pfn, end_pfn;
- int ret = -1;
+ int ret = -EINVAL;
start = (u64) start_pfn << PAGE_SHIFT;
end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
_
From: Dave Hansen <[email protected]>
Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries.
However, this limits persistent memory use to applications which
*have* been modified. To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.
To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:
echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
and then bind it to this new driver:
echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.
Note: this inherits any existing NUMA information for the newly-
added memory from the persistent memory device that came from the
firmware. On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node. That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.
There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches. But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Yaowei Bai <[email protected]>
Cc: Takashi Iwai <[email protected]>
---
b/drivers/dax/Kconfig | 5 ++
b/drivers/dax/Makefile | 1
b/drivers/dax/kmem.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 99 insertions(+)
diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800
+++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800
@@ -32,6 +32,11 @@ config DEV_DAX_PMEM
Say M if unsure
+config DEV_DAX_KMEM
+ def_bool y
+ depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure
+ depends on MEMORY_HOTPLUG # for add_memory() and friends
+
config DEV_DAX_PMEM_COMPAT
tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
depends on DEV_DAX_PMEM
diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null 2018-12-03 08:41:47.355756491 -0800
+++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pfn_t.h>
+#include <linux/slab.h>
+#include <linux/dax.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include "dax-private.h"
+#include "bus.h"
+
+int dev_dax_kmem_probe(struct device *dev)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct resource *res = &dev_dax->region->res;
+ resource_size_t kmem_start;
+ resource_size_t kmem_size;
+ struct resource *new_res;
+ int numa_node;
+ int rc;
+
+ /* Hotplug starting at the beginning of the next block: */
+ kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
+ kmem_size = resource_size(res);
+ /* Adjust the size down to compensate for moving up kmem_start: */
+ kmem_size -= kmem_start - res->start;
+ /* Align the size down to cover only complete blocks: */
+ kmem_size &= ~(memory_block_size_bytes() - 1);
+
+ new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
+ dev_name(dev));
+
+ if (!new_res) {
+ printk("could not reserve region %016llx -> %016llx\n",
+ kmem_start, kmem_start+kmem_size);
+ return -EBUSY;
+ }
+
+ /*
+ * Set flags appropriate for System RAM. Leave ..._BUSY clear
+ * so that add_memory() can add a child resource.
+ */
+ new_res->flags = IORESOURCE_SYSTEM_RAM;
+ new_res->name = dev_name(dev);
+
+ numa_node = dev_dax->target_node;
+ if (numa_node < 0) {
+ pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
+ numa_node = 0;
+ }
+
+ rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+ if (rc)
+ return rc;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);
+
+static int dev_dax_kmem_remove(struct device *dev)
+{
+ /* Assume that hot-remove will fail for now */
+ return -EBUSY;
+}
+
+static struct dax_device_driver device_dax_kmem_driver = {
+ .drv = {
+ .probe = dev_dax_kmem_probe,
+ .remove = dev_dax_kmem_remove,
+ },
+};
+
+static int __init dax_kmem_init(void)
+{
+ return dax_driver_register(&device_dax_kmem_driver);
+}
+
+static void __exit dax_kmem_exit(void)
+{
+ dax_driver_unregister(&device_dax_kmem_driver);
+}
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
+module_init(dax_kmem_init);
+module_exit(dax_kmem_exit);
+MODULE_ALIAS_DAX_DEVICE(0);
diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
--- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-08 09:54:44.053694874 -0800
+++ b/drivers/dax/Makefile 2019-01-08 09:54:44.056694874 -0800
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_DAX) += dax.o
obj-$(CONFIG_DEV_DAX) += device_dax.o
+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
dax-y := super.o
dax-y += bus.o
_
On Wed, Jan 16, 2019 at 10:19:02AM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> The mm/resource.c code is used to manage the physical address
> space. We can view the current resource configuration in
> /proc/iomem. An example of this is at the bottom of this
> description.
>
> The nvdimm subsystem "owns" the physical address resources which
> map to persistent memory and has resources inserted for them as
> "Persistent Memory". We want to use this persistent memory, but
> as volatile memory, just like RAM. The best way to do this is
> to leave the existing resource in place, but add a "System RAM"
> resource underneath it. This clearly communicates the ownership
> relationship of this memory.
>
> The request_resource_conflict() API only deals with the
> top-level resources. Replace it with __request_region() which
> will search for !IORESOURCE_BUSY areas lower in the resource
> tree than the top level.
>
> We also rework the old error message a bit since we do not get
> the conflicting entry back: only an indication that we *had* a
> conflict.
We should keep the device private check (moving it into __request_region),
as device private can register unused physical addresses (unused at the
time of device private registration) that can later block valid physical
addresses; the error message you are removing reports such an event.
>
> We *could* also simply truncate the existing top-level
> "Persistent Memory" resource and take over the released address
> space. But, this means that if we ever decide to hot-unplug the
> "RAM" and give it back, we need to recreate the original setup,
> which may mean going back to the BIOS tables.
>
> This should have no real effect on the existing collision
> detection because the areas that truly conflict should be marked
> IORESOURCE_BUSY.
Still, I am worried that this might allow device private to register
itself as a child of some un-busy resource, as this patch obviously
changes the behavior of register_memory_resource().
What about instead explicitly providing the parent resource to
add_memory() and then to register_memory_resource(), so that if it
is provided as an argument (!NULL) you can __request_region(arg_res,
...), and otherwise keep the existing code intact?
Cheers,
Jérôme
>
> 00000000-00000fff : Reserved
> 00001000-0009fbff : System RAM
> 0009fc00-0009ffff : Reserved
> 000a0000-000bffff : PCI Bus 0000:00
> 000c0000-000c97ff : Video ROM
> 000c9800-000ca5ff : Adapter ROM
> 000f0000-000fffff : Reserved
> 000f0000-000fffff : System ROM
> 00100000-9fffffff : System RAM
> 01000000-01e071d0 : Kernel code
> 01e071d1-027dfdff : Kernel data
> 02dc6000-0305dfff : Kernel bss
> a0000000-afffffff : Persistent Memory (legacy)
> a0000000-a7ffffff : System RAM
> b0000000-bffdffff : System RAM
> bffe0000-bfffffff : Reserved
> c0000000-febfffff : PCI Bus 0000:00
>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> b/mm/memory_hotplug.c | 31 ++++++++++++++-----------------
> 1 file changed, 14 insertions(+), 17 deletions(-)
>
> diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 2018-12-20 11:48:42.317771933 -0800
> +++ b/mm/memory_hotplug.c 2018-12-20 11:48:42.322771933 -0800
> @@ -98,24 +98,21 @@ void mem_hotplug_done(void)
> /* add this memory to iomem resource */
> static struct resource *register_memory_resource(u64 start, u64 size)
> {
> - struct resource *res, *conflict;
> - res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> - if (!res)
> - return ERR_PTR(-ENOMEM);
> + struct resource *res;
> + unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> + char *resource_name = "System RAM";
>
> - res->name = "System RAM";
> - res->start = start;
> - res->end = start + size - 1;
> - res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> - conflict = request_resource_conflict(&iomem_resource, res);
> - if (conflict) {
> - if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> - pr_debug("Device unaddressable memory block "
> - "memory hotplug at %#010llx !\n",
> - (unsigned long long)start);
> - }
> - pr_debug("System RAM resource %pR cannot be added\n", res);
> - kfree(res);
> + /*
> + * Request ownership of the new memory range. This might be
> + * a child of an existing resource that was present but
> + * not marked as busy.
> + */
> + res = __request_region(&iomem_resource, start, size,
> + resource_name, flags);
> +
> + if (!res) {
> + pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
> + start, start + size);
> return ERR_PTR(-EEXIST);
> }
> return res;
> _
>
On Wed, Jan 16, 2019 at 03:01:39PM -0800, Dave Hansen wrote:
> On 1/16/19 11:16 AM, Jerome Glisse wrote:
> >> We also rework the old error message a bit since we do not get
> >> the conflicting entry back: only an indication that we *had* a
> >> conflict.
> > We should keep the device private check (moving it into __request_region),
> > as device private can register unused physical addresses (unused at the
> > time of device private registration) that can later block valid physical
> > addresses; the error message you are removing reports such an event.
>
> If a resource can't support having a child, shouldn't it just be marked
> IORESOURCE_BUSY, rather than trying to somehow special-case
> IORES_DESC_DEVICE_PRIVATE_MEMORY behavior?
So the thing about IORES_DESC_DEVICE_PRIVATE_MEMORY is that these
ranges are not necessarily linked to any real resource, ie they can
just be random ranges of physical address space that had no resource
at the time of registration.
Now you can later hotplug some memory that conflicts with this
IORES_DESC_DEVICE_PRIVATE_MEMORY, and if that happens we want to
tell the user, ie:
"Sorry, we registered some fake memory at a fake physical address
and now you have hotplugged something that conflicts with that."
Why no existing resource? Well, it depends on the platform. In some
cases memory for HMM is just not accessible by the CPU _at_ all, so
there is obviously no physical address for this kind of memory from
the CPU's point of view. The other case is PCIE and BAR size. If we
had PCIE BAR resizing working everywhere we could potentially use
the resized PCIE BAR (though I think some devices have bugs on that
front, so I need to check the device side too). So when HMM was
designed, without PCIE resize and with totally inaccessible memory,
the only option was to pick some unused physical address range, as
the memory we are hotplugging is not CPU accessible anyway.
It has been on my TODO to try to find a better way to reserve a
physical range, but this is highly platform specific. I need to
investigate whether I can tell ACPI on x86 that I want to make
sure the system never assigns some physical address range.
Checking PCIE BAR resize is also on my TODO (on the device side, as
I think some devices are just buggy there and won't accept BARs
bigger than 256MB, and freak out if you try).
So right now I would rather we keep properly reporting this hazard,
so that at least we know it failed because of that. This also
includes making sure that we can not register private memory as a
child of an un-busy resource that exists but might not yet have
been claimed by its rightful owner.
Existing code makes sure of that; with your change this is a case
that I would not be able to stop. Well, I would have to hot-unplug
and try a different physical address, I guess.
Cheers,
Jérôme
>-----Original Message-----
>From: Linux-nvdimm [mailto:[email protected]] On Behalf
>Of Dave Hansen
>Sent: Thursday, January 17, 2019 2:19 AM
>To: [email protected]
>Cc: [email protected]; [email protected];
>[email protected]; [email protected]; Dave Hansen
><[email protected]>; Huang, Ying <[email protected]>;
>[email protected]; [email protected]; [email protected];
>[email protected]; [email protected];
>[email protected]; Wu, Fengguang <[email protected]>;
>[email protected]
>Subject: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal
>RAM
>
>
>From: Dave Hansen <[email protected]>
>
>Currently, a persistent memory region is "owned" by a device driver,
>either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
>allow applications to explicitly use persistent memory, generally
>by being modified to use special, new libraries.
>
>However, this limits persistent memory use to applications which
>*have* been modified. To make it more broadly usable, this driver
>"hotplugs" memory into the kernel, to be managed and used just like
>normal RAM would be.
>
>To make this work, management software must remove the device from
>being controlled by the "Device DAX" infrastructure:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
>and then bind it to this new driver:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
Is there any plan to introduce an additional mode, e.g. "kmem", in
the userspace ndctl tool to do this configuration?
>After this, there will be a number of new memory sections visible
>in sysfs that can be onlined, or that may get onlined by existing
>udev-initiated memory hotplug rules.
>
>Note: this inherits any existing NUMA information for the newly-
>added memory from the persistent memory device that came from the
>firmware. On Intel platforms, the firmware has guarantees that
>require each socket's persistent memory to be in a separate
>memory-only NUMA node. That means that this patch is not expected
>to create NUMA nodes, but will simply hotplug memory into existing
>nodes.
>
>There is currently some metadata at the beginning of pmem regions.
>The section-size memory hotplug restrictions, plus this small
>reserved area can cause the "loss" of a section or two of capacity.
>This should be fixable in follow-on patches. But, as a first step,
>losing 256MB of memory (worst case) out of hundreds of gigabytes
>is a good tradeoff vs. the required code to fix this up precisely.
>
>Signed-off-by: Dave Hansen <[email protected]>
>Cc: Dan Williams <[email protected]>
>Cc: Dave Jiang <[email protected]>
>Cc: Ross Zwisler <[email protected]>
>Cc: Vishal Verma <[email protected]>
>Cc: Tom Lendacky <[email protected]>
>Cc: Andrew Morton <[email protected]>
>Cc: Michal Hocko <[email protected]>
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: Huang Ying <[email protected]>
>Cc: Fengguang Wu <[email protected]>
>Cc: Borislav Petkov <[email protected]>
>Cc: Bjorn Helgaas <[email protected]>
>Cc: Yaowei Bai <[email protected]>
>Cc: Takashi Iwai <[email protected]>
>---
>
> b/drivers/dax/Kconfig | 5 ++
> b/drivers/dax/Makefile | 1
> b/drivers/dax/kmem.c | 93
>+++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 99 insertions(+)
>
>diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
>--- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874
>-0800
>+++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800
>@@ -32,6 +32,11 @@ config DEV_DAX_PMEM
>
> Say M if unsure
>
>+config DEV_DAX_KMEM
>+ def_bool y
>+ depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure
>+ depends on MEMORY_HOTPLUG # for add_memory() and friends
>+
> config DEV_DAX_PMEM_COMPAT
> tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
> depends on DEV_DAX_PMEM
>diff -puN /dev/null drivers/dax/kmem.c
>--- /dev/null 2018-12-03 08:41:47.355756491 -0800
>+++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800
>@@ -0,0 +1,93 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
>+#include <linux/memremap.h>
>+#include <linux/pagemap.h>
>+#include <linux/memory.h>
>+#include <linux/module.h>
>+#include <linux/device.h>
>+#include <linux/pfn_t.h>
>+#include <linux/slab.h>
>+#include <linux/dax.h>
>+#include <linux/fs.h>
>+#include <linux/mm.h>
>+#include <linux/mman.h>
>+#include "dax-private.h"
>+#include "bus.h"
>+
>+int dev_dax_kmem_probe(struct device *dev)
>+{
>+ struct dev_dax *dev_dax = to_dev_dax(dev);
>+ struct resource *res = &dev_dax->region->res;
>+ resource_size_t kmem_start;
>+ resource_size_t kmem_size;
>+ struct resource *new_res;
>+ int numa_node;
>+ int rc;
>+
>+ /* Hotplug starting at the beginning of the next block: */
>+ kmem_start = ALIGN(res->start, memory_block_size_bytes());
>+
>+ kmem_size = resource_size(res);
>+ /* Adjust the size down to compensate for moving up kmem_start: */
>+ kmem_size -= kmem_start - res->start;
>+ /* Align the size down to cover only complete blocks: */
>+ kmem_size &= ~(memory_block_size_bytes() - 1);
>+
>+ new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
>+ dev_name(dev));
>+
>+ if (!new_res) {
>+ printk("could not reserve region %016llx -> %016llx\n",
>+ kmem_start, kmem_start+kmem_size);
>+ return -EBUSY;
>+ }
>+
>+ /*
>+ * Set flags appropriate for System RAM. Leave ..._BUSY clear
>+ * so that add_memory() can add a child resource.
>+ */
>+ new_res->flags = IORESOURCE_SYSTEM_RAM;
>+ new_res->name = dev_name(dev);
>+
>+ numa_node = dev_dax->target_node;
>+ if (numa_node < 0) {
>+ pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
>+ numa_node = 0;
>+ }
>+
>+ rc = add_memory(numa_node, new_res->start, resource_size(new_res));
>+ if (rc)
>+ return rc;
>+
>+ return 0;
>+}
>+EXPORT_SYMBOL_GPL(dev_dax_kmem_probe);
>+
>+static int dev_dax_kmem_remove(struct device *dev)
>+{
>+ /* Assume that hot-remove will fail for now */
>+ return -EBUSY;
>+}
>+
>+static struct dax_device_driver device_dax_kmem_driver = {
>+ .drv = {
>+ .probe = dev_dax_kmem_probe,
>+ .remove = dev_dax_kmem_remove,
>+ },
>+};
>+
>+static int __init dax_kmem_init(void)
>+{
>+ return dax_driver_register(&device_dax_kmem_driver);
>+}
>+
>+static void __exit dax_kmem_exit(void)
>+{
>+ dax_driver_unregister(&device_dax_kmem_driver);
>+}
>+
>+MODULE_AUTHOR("Intel Corporation");
>+MODULE_LICENSE("GPL v2");
>+module_init(dax_kmem_init);
>+module_exit(dax_kmem_exit);
>+MODULE_ALIAS_DAX_DEVICE(0);
>diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
>--- a/drivers/dax/Makefile~dax-kmem-try-4 2019-01-08 09:54:44.053694874
>-0800
>+++ b/drivers/dax/Makefile 2019-01-08 09:54:44.056694874 -0800
>@@ -1,6 +1,7 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_DAX) += dax.o
> obj-$(CONFIG_DEV_DAX) += device_dax.o
>+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>
> dax-y := super.o
> dax-y += bus.o
>_
On 1/17/19 2:19 AM, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries.
>
> However, this limits persistent memory use to applications which
> *have* been modified. To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed and used just like
> normal RAM would be.
>
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>
> and then bind it to this new driver:
>
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
>
> Note: this inherits any existing NUMA information for the newly-
> added memory from the persistent memory device that came from the
> firmware. On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node. That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
>
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches. But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: Yaowei Bai <[email protected]>
> Cc: Takashi Iwai <[email protected]>
> ---
>
> b/drivers/dax/Kconfig | 5 ++
> b/drivers/dax/Makefile | 1
> b/drivers/dax/kmem.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 99 insertions(+)
>
> diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
> --- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800
> +++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800
> @@ -32,6 +32,11 @@ config DEV_DAX_PMEM
>
> Say M if unsure
>
> +config DEV_DAX_KMEM
> + def_bool y
> + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure
> + depends on MEMORY_HOTPLUG # for add_memory() and friends
> +
> config DEV_DAX_PMEM_COMPAT
> tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
> depends on DEV_DAX_PMEM
> diff -puN /dev/null drivers/dax/kmem.c
> --- /dev/null 2018-12-03 08:41:47.355756491 -0800
> +++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/memory.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pfn_t.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include "dax-private.h"
> +#include "bus.h"
> +
> +int dev_dax_kmem_probe(struct device *dev)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + struct resource *res = &dev_dax->region->res;
> + resource_size_t kmem_start;
> + resource_size_t kmem_size;
> + struct resource *new_res;
> + int numa_node;
> + int rc;
> +
> + /* Hotplug starting at the beginning of the next block: */
> + kmem_start = ALIGN(res->start, memory_block_size_bytes());
> +
> + kmem_size = resource_size(res);
> + /* Adjust the size down to compensate for moving up kmem_start: */
> + kmem_size -= kmem_start - res->start;
> + /* Align the size down to cover only complete blocks: */
> + kmem_size &= ~(memory_block_size_bytes() - 1);
> +
> + new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
> + dev_name(dev));
> +
> + if (!new_res) {
> + printk("could not reserve region %016llx -> %016llx\n",
> + kmem_start, kmem_start+kmem_size);
> + return -EBUSY;
> + }
> +
> + /*
> + * Set flags appropriate for System RAM. Leave ..._BUSY clear
> + * so that add_memory() can add a child resource.
> + */
> + new_res->flags = IORESOURCE_SYSTEM_RAM;
> + new_res->name = dev_name(dev);
> +
> + numa_node = dev_dax->target_node;
> + if (numa_node < 0) {
> + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
> + numa_node = 0;
> + }
> +
> + rc = add_memory(numa_node, new_res->start, resource_size(new_res));
I didn't try pmem, and I am wondering whether it's slower than DRAM.
Should a flag, such as _GFP_PMEM, be added to distinguish it from
DRAM?
If it's used for DMA, perhaps it might not satisfy device DMA requests
in time?
On 1/17/19 12:19 AM, Yanmin Zhang wrote:
>>
> I didn't try pmem, and I am wondering whether it's slower than DRAM.
> Should a flag, such as _GFP_PMEM, be added to distinguish it from
> DRAM?
Absolutely not. :)
We already have performance-differentiated memory, and lots of ways to
enumerate and select it in the kernel (all of our NUMA infrastructure).
PMEM is also just the first of many "kinds" of memory that folks want to
build in systems and use as "RAM". We literally don't have space to put
a flag in for each type.
Dave Hansen <[email protected]> writes:
> Persistent memory is cool. But, currently, you have to rewrite
> your applications to use it. Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory? Well... have I got the patch
> series for you!
So, isn't that what memory mode is for?
https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
Why do we need this code in the kernel?
-Jeff
On Thu, Jan 17, 2019 at 11:29:10AM -0500, Jeff Moyer wrote:
> Dave Hansen <[email protected]> writes:
> > Persistent memory is cool. But, currently, you have to rewrite
> > your applications to use it. Wouldn't it be cool if you could
> > just have it show up in your system like normal RAM and get to
> > it like a slow blob of memory? Well... have I got the patch
> > series for you!
>
> So, isn't that what memory mode is for?
> https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
>
> Why do we need this code in the kernel?
I don't think those are the same thing. The "memory mode" in the link
refers to platforms that sequester DRAM to side cache memory access, where
this series doesn't have that platform dependency nor hides faster DRAM.
On Thu, Jan 17, 2019 at 8:29 AM Jeff Moyer <[email protected]> wrote:
>
> Dave Hansen <[email protected]> writes:
>
> > Persistent memory is cool. But, currently, you have to rewrite
> > your applications to use it. Wouldn't it be cool if you could
> > just have it show up in your system like normal RAM and get to
> > it like a slow blob of memory? Well... have I got the patch
> > series for you!
>
> So, isn't that what memory mode is for?
> https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
That's a hardware cache that privately manages DRAM in front of PMEM.
It benefits from some help from software [1].
> Why do we need this code in the kernel?
This goes further and enables software managed allocation decisions
with the full DRAM + PMEM address space.
[1]: https://lore.kernel.org/lkml/154767945660.1983228.12167020940431682725.stgit@dwillia2-desk3.amr.corp.intel.com/
On Wed, Jan 16, 2019 at 9:21 PM Du, Fan <[email protected]> wrote:
[..]
> >From: Dave Hansen <[email protected]>
> >
> >Currently, a persistent memory region is "owned" by a device driver,
> >either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
> >allow applications to explicitly use persistent memory, generally
> >by being modified to use special, new libraries.
> >
> >However, this limits persistent memory use to applications which
> >*have* been modified. To make it more broadly usable, this driver
> >"hotplugs" memory into the kernel, to be managed and used just like
> >normal RAM would be.
> >
> >To make this work, management software must remove the device from
> >being controlled by the "Device DAX" infrastructure:
> >
> > echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
> > echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> >
> >and then bind it to this new driver:
> >
> > echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> > echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
>
> Is there any plan to introduce additional mode, e.g. "kmem" in the userspace
> ndctl tool to do the configuration?
>
Yes, but not to ndctl. The daxctl tool will grow a helper for this.
The policy of what device-dax instances should be hotplugged at system
init will be managed by a persistent configuration file and udev
rules.
Keith Busch <[email protected]> writes:
> On Thu, Jan 17, 2019 at 11:29:10AM -0500, Jeff Moyer wrote:
>> Dave Hansen <[email protected]> writes:
>> > Persistent memory is cool. But, currently, you have to rewrite
>> > your applications to use it. Wouldn't it be cool if you could
>> > just have it show up in your system like normal RAM and get to
>> > it like a slow blob of memory? Well... have I got the patch
>> > series for you!
>>
>> So, isn't that what memory mode is for?
>> https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
>>
>> Why do we need this code in the kernel?
>
> I don't think those are the same thing. The "memory mode" in the link
> refers to platforms that sequester DRAM to side cache memory access, where
> this series doesn't have that platform dependency nor hides faster DRAM.
OK, so you are making two arguments, here. 1) platforms may not support
memory mode, and 2) this series allows for performance differentiated
memory (even though applications may not be modified to make use of
that...).
With this patch set, an unmodified application would either use:
1) whatever memory it happened to get
2) only the faster dram (via numactl --membind=)
3) only the slower pmem (again, via numactl --membind=)
4) preferentially one or the other (numactl --preferred=)
The other options are:
- as mentioned above, memory mode, which uses DRAM as a cache for the
slower persistent memory. Note that it isn't all or nothing--you can
configure your system with both memory mode and appdirect. The
limitation, of course, is that your platform has to support this.
This seems like the obvious solution if you want to make use of the
larger pmem capacity as regular volatile memory (and your platform
supports it). But maybe there is some other limitation that motivated
this work?
- libmemkind or pmdk. These options typically* require application
modifications, but allow those applications to actively decide which
data lives in fast versus slow media.
This seems like the obvious answer for applications that care about
access latency.
* you could override the system malloc, but some libraries/application
stacks already do that, so it isn't a universal solution.
Listing something like this in the headers of these patch series would
considerably reduce the head-scratching for reviewers.
Keith, you seem to be implying that there are platforms that won't
support memory mode. Do you also have some insight into how customers
want to use this, beyond my speculation? It's really frustrating to see
patch sets like this go by without any real use cases provided.
Cheers,
Jeff
On Thu, Jan 17, 2019 at 12:20:06PM -0500, Jeff Moyer wrote:
> Keith Busch <[email protected]> writes:
> > On Thu, Jan 17, 2019 at 11:29:10AM -0500, Jeff Moyer wrote:
> >> Dave Hansen <[email protected]> writes:
> >> > Persistent memory is cool. But, currently, you have to rewrite
> >> > your applications to use it. Wouldn't it be cool if you could
> >> > just have it show up in your system like normal RAM and get to
> >> > it like a slow blob of memory? Well... have I got the patch
> >> > series for you!
> >>
> >> So, isn't that what memory mode is for?
> >> https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
> >>
> >> Why do we need this code in the kernel?
> >
> > I don't think those are the same thing. The "memory mode" in the link
> > refers to platforms that sequester DRAM to side cache memory access, where
> > this series doesn't have that platform dependency nor hides faster DRAM.
>
> OK, so you are making two arguments, here. 1) platforms may not support
> memory mode, and 2) this series allows for performance differentiated
> memory (even though applications may not be modified to make use of
> that...).
>
> With this patch set, an unmodified application would either use:
>
> 1) whatever memory it happened to get
> 2) only the faster dram (via numactl --membind=)
> 3) only the slower pmem (again, via numactl --membind=)
> 4) preferentially one or the other (numactl --preferred=)
Yes, numactl and mbind are good ways for unmodified applications
to use these different memory types when they're available.
Tangentially related, I have another series[1] that provides supplementary
information that can be used to help make these decisions for platforms
that provide HMAT (heterogeneous memory attribute tables).
> The other options are:
> - as mentioned above, memory mode, which uses DRAM as a cache for the
> slower persistent memory. Note that it isn't all or nothing--you can
> configure your system with both memory mode and appdirect. The
> limitation, of course, is that your platform has to support this.
>
> This seems like the obvious solution if you want to make use of the
> larger pmem capacity as regular volatile memory (and your platform
> supports it). But maybe there is some other limitation that motivated
> this work?
The hardware-supported implementation is one way it may be used, and its
upside is that accessing the cached memory is transparent to the OS and
applications. They can use memory unaware that this is happening, so it
has a low barrier for applications to make use of the large available
address space.
There are some minimal things software may do that improve this mode,
as Dan mentioned in his reply [2], but it is still usable even without
such optimizations.
On the downside, a reboot would be required if you want to change the
memory configuration at a later time, say if you decide more or less
DRAM is needed as cache. This series has runtime hotplug capabilities.
It's also possible the customer may know better which applications require
more hot vs cold data, but the memory mode caching doesn't give them as
much control since the faster memory is hidden.
> - libmemkind or pmdk. These options typically* require application
> modifications, but allow those applications to actively decide which
> data lives in fast versus slow media.
>
> This seems like the obvious answer for applications that care about
> access latency.
>
> * you could override the system malloc, but some libraries/application
> stacks already do that, so it isn't a universal solution.
>
> Listing something like this in the headers of these patch series would
> considerably reduce the head-scratching for reviewers.
>
> Keith, you seem to be implying that there are platforms that won't
> support memory mode. Do you also have some insight into how customers
> want to use this, beyond my speculation? It's really frustrating to see
> patch sets like this go by without any real use cases provided.
Right, most NFIT reporting platforms today don't have memory mode, and
the kernel currently only supports the persistent DAX mode with these.
This series adds another option for those platforms.
I think numactl as you mentioned is the first consideration for how
customers may make use. Dave or Dan might have other use cases in mind.
Just thinking out loud, if we wanted an in-kernel use case, it may be
interesting to make slower memory a swap tier so the host can manage
the cache rather than the hardware.
[1]
https://lore.kernel.org/patchwork/cover/1032688/
[2]
https://lore.kernel.org/lkml/154767945660.1983228.12167020940431682725.stgit@dwillia2-desk3.amr.corp.intel.com/
Keith Busch <[email protected]> writes:
>> Keith, you seem to be implying that there are platforms that won't
>> support memory mode. Do you also have some insight into how customers
>> want to use this, beyond my speculation? It's really frustrating to see
>> patch sets like this go by without any real use cases provided.
>
> Right, most NFIT reporting platforms today don't have memory mode, and
> the kernel currently only supports the persistent DAX mode with these.
> This series adds another option for those platforms.
All NFIT reporting platforms today are shipping NVDIMM-Ns, where it
makes absolutely no sense to use them as regular DRAM. I don't think
that's a good argument to make.
> I think numactl as you mentioned is the first consideration for how
> customers may make use. Dave or Dan might have other use cases in mind.
Well, it sure looks like this took a lot of work, so I thought there
were known use cases or users asking for this functionality.
Cheers,
Jeff
On 1/17/19 8:29 AM, Jeff Moyer wrote:
>> Persistent memory is cool. But, currently, you have to rewrite
>> your applications to use it. Wouldn't it be cool if you could
>> just have it show up in your system like normal RAM and get to
>> it like a slow blob of memory? Well... have I got the patch
>> series for you!
> So, isn't that what memory mode is for?
> https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
>
> Why do we need this code in the kernel?
So, my bad for not mentioning memory mode. This patch set existed
before we could talk about it publicly, so it simply ignores its
existence. It's a pretty glaring omission at this point, sorry.
I'll add this to the patches, but here are a few reasons you might want
this instead of memory mode:
1. Memory mode is all-or-nothing. Either 100% of your persistent memory
is used for memory mode, or nothing is. With this set, you can
(theoretically) have very granular (128MB) assignment of PMEM to
either volatile or persistent uses. We have a few practical matters
to fix to get us down to that 128MB value, but we can get there.
2. The capacity of memory mode is the size of your persistent memory.
DRAM capacity is "lost" because it is used for cache. With this,
you get PMEM+DRAM capacity for memory.
3. DRAM acts as a cache with memory mode, and caches can lead to
unpredictable latencies. Since memory mode is all-or-nothing, your
entire memory space is exposed to these unpredictable latencies.
This solution lets you guarantee DRAM latencies if you need them.
4. The new "tier" of memory is exposed to software. That means that you
can build tiered applications or infrastructure. A cloud provider
could sell cheaper VMs that use more PMEM and more expensive ones
that use DRAM. That's impossible with memory mode.
Don't take this as criticism of memory mode. Memory mode is awesome,
and doesn't strictly require *any* software changes (we have software
changes proposed for optimizing it though). It has tons of other
advantages over *this* approach. Basically, they are complementary
enough that we think both can live side-by-side.
On 2019/1/17 11:17 PM, Dave Hansen wrote:
> On 1/17/19 12:19 AM, Yanmin Zhang wrote:
>>>
>> I didn't try pmem, and I am wondering whether it's slower than DRAM.
>> Should a flag, such as _GFP_PMEM, be added to distinguish it from
>> DRAM?
>
> Absolutely not. :)
Agree.
>
> We already have performance-differentiated memory, and lots of ways to
> enumerate and select it in the kernel (all of our NUMA infrastructure).
The kernel does manage memory the way you say.
My question is: with your patch, PMEM becomes normal RAM, so there is
a chance for the kernel to allocate PMEM as a DMA buffer.
Some super-speed devices, like a 10Gb NIC or USB (SSIC connecting a
modem), might not work well if the DMA buffer is in PMEM, as it's
slower than DRAM.
Should your patchset consider this?
>
> PMEM is also just the first of many "kinds" of memory that folks want to
> build in systems and use as "RAM". We literally don't have space to put
> a flag in for each type.
>
>
>With this patch set, an unmodified application would either use:
>
>1) whatever memory it happened to get
>2) only the faster dram (via numactl --membind=)
>3) only the slower pmem (again, via numactl --membind=)
>4) preferentially one or the other (numactl --preferred=)
Yet another option:
MemoryOptimizer -- hot page accounting and migration daemon
https://github.com/intel/memory-optimizer
Once PMEM NUMA nodes are available, we may run a user space daemon to
walk page tables of virtual machines (EPT) or processes, collect the
"accessed" bits to find out hot pages, and finally migrate hot pages
to DRAM and cold pages to PMEM.
In that scenario, only kernel and the migrate daemon need to be aware
of the PMEM nodes. Unmodified virtual machines and processes can enjoy
the added memory space w/o knowing whether it's using DRAM or PMEM.
Thanks,
Fengguang
On 1/17/19 11:47 PM, Yanmin Zhang wrote:
> a chance for the kernel to allocate PMEM as a DMA buffer.
> Some super-speed devices, like a 10Gb NIC or USB (SSIC connecting a
> modem), might not work well if the DMA buffer is in PMEM, as it's
> slower than DRAM.
>
> Should your patchset consider it?
No, I don't think so.
They can DMA to persistent memory whether this patch set exists or not.
So, if the hardware falls over, that's a separate problem.
If an app wants memory that performs in a particular way, then I would
suggest those app find the NUMA nodes on the system that match their
needs with these patches:
> http://lkml.kernel.org/r/[email protected]
and use the existing NUMA APIs to select that memory.
On 1/16/19 11:16 AM, Jerome Glisse wrote:
>> We *could* also simply truncate the existing top-level
>> "Persistent Memory" resource and take over the released address
>> space. But, this means that if we ever decide to hot-unplug the
>> "RAM" and give it back, we need to recreate the original setup,
>> which may mean going back to the BIOS tables.
>>
>> This should have no real effect on the existing collision
>> detection because the areas that truly conflict should be marked
>> IORESOURCE_BUSY.
>
> Still i am worrying that this might allow device private to register
> itself as a child of some un-busy resource as this patch obviously
> change the behavior of register_memory_resource()
>
> What about instead explicitly providing parent resource to add_memory()
> and then to register_memory_resource() so if it is provided as an
> argument (!NULL) then you can __request_region(arg_res, ...) otherwise
> you keep existing code intact ?
We don't have the locking to do this, do we? For instance, all the
locking is done below register_memory_resource(), so any previous
resource lookup is invalid by the time we get to register_memory_resource().
On Fri, Jan 18, 2019 at 11:58:54AM -0800, Dave Hansen wrote:
> On 1/16/19 11:16 AM, Jerome Glisse wrote:
> >> We *could* also simply truncate the existing top-level
> >> "Persistent Memory" resource and take over the released address
> >> space. But, this means that if we ever decide to hot-unplug the
> >> "RAM" and give it back, we need to recreate the original setup,
> >> which may mean going back to the BIOS tables.
> >>
> >> This should have no real effect on the existing collision
> >> detection because the areas that truly conflict should be marked
> >> IORESOURCE_BUSY.
> >
> > Still i am worrying that this might allow device private to register
> > itself as a child of some un-busy resource as this patch obviously
> > change the behavior of register_memory_resource()
> >
> > What about instead explicitly providing parent resource to add_memory()
> > and then to register_memory_resource() so if it is provided as an
> > argument (!NULL) then you can __request_region(arg_res, ...) otherwise
> > you keep existing code intact ?
>
> We don't have the locking to do this, do we? For instance, all the
> locking is done below register_memory_resource(), so any previous
> resource lookup is invalid by the time we get to register_memory_resource().
Yeah you are right, maybe just a bool then ? bool as_child
Cheers,
Jérôme
[Sorry for a late reply]
On Wed 16-01-19 10:19:02, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> The mm/resource.c code is used to manage the physical address
> space. We can view the current resource configuration in
> /proc/iomem. An example of this is at the bottom of this
> description.
>
> The nvdimm subsystem "owns" the physical address resources which
> map to persistent memory and has resources inserted for them as
> "Persistent Memory". We want to use this persistent memory, but
> as volatile memory, just like RAM. The best way to do this is
> to leave the existing resource in place, but add a "System RAM"
> resource underneath it. This clearly communicates the ownership
> relationship of this memory.
>
> The request_resource_conflict() API only deals with the
> top-level resources. Replace it with __request_region() which
> will search for !IORESOURCE_BUSY areas lower in the resource
> tree than the top level.
>
> We also rework the old error message a bit since we do not get
> the conflicting entry back: only an indication that we *had* a
> conflict.
>
> We *could* also simply truncate the existing top-level
> "Persistent Memory" resource and take over the released address
> space. But, this means that if we ever decide to hot-unplug the
> "RAM" and give it back, we need to recreate the original setup,
> which may mean going back to the BIOS tables.
>
> This should have no real effect on the existing collision
> detection because the areas that truly conflict should be marked
> IORESOURCE_BUSY.
>
> 00000000-00000fff : Reserved
> 00001000-0009fbff : System RAM
> 0009fc00-0009ffff : Reserved
> 000a0000-000bffff : PCI Bus 0000:00
> 000c0000-000c97ff : Video ROM
> 000c9800-000ca5ff : Adapter ROM
> 000f0000-000fffff : Reserved
> 000f0000-000fffff : System ROM
> 00100000-9fffffff : System RAM
> 01000000-01e071d0 : Kernel code
> 01e071d1-027dfdff : Kernel data
> 02dc6000-0305dfff : Kernel bss
> a0000000-afffffff : Persistent Memory (legacy)
> a0000000-a7ffffff : System RAM
> b0000000-bffdffff : System RAM
> bffe0000-bfffffff : Reserved
> c0000000-febfffff : PCI Bus 0000:00
This is the only memory hotplug related change in this series AFAICS.
Unfortunately I am not really familiar with guts for resources
infrastructure so I cannot judge the correctness. The change looks
sensible to me although I do not feel like acking it.
Overall design of this feature makes a lot of sense to me. It doesn't
really add any weird APIs yet it allows to use nvdimms as a memory
transparently. All future policies are to be defined by the userspace
and I like that. I was especially astonished by the sheer size of the
driver and changes it required to achieve that. Really nice!
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Huang Ying <[email protected]>
> Cc: Fengguang Wu <[email protected]>
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> b/mm/memory_hotplug.c | 31 ++++++++++++++-----------------
> 1 file changed, 14 insertions(+), 17 deletions(-)
>
> diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 2018-12-20 11:48:42.317771933 -0800
> +++ b/mm/memory_hotplug.c 2018-12-20 11:48:42.322771933 -0800
> @@ -98,24 +98,21 @@ void mem_hotplug_done(void)
> /* add this memory to iomem resource */
> static struct resource *register_memory_resource(u64 start, u64 size)
> {
> - struct resource *res, *conflict;
> - res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> - if (!res)
> - return ERR_PTR(-ENOMEM);
> + struct resource *res;
> + unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> + char *resource_name = "System RAM";
>
> - res->name = "System RAM";
> - res->start = start;
> - res->end = start + size - 1;
> - res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> - conflict = request_resource_conflict(&iomem_resource, res);
> - if (conflict) {
> - if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> - pr_debug("Device unaddressable memory block "
> - "memory hotplug at %#010llx !\n",
> - (unsigned long long)start);
> - }
> - pr_debug("System RAM resource %pR cannot be added\n", res);
> - kfree(res);
> + /*
> + * Request ownership of the new memory range. This might be
> + * a child of an existing resource that was present but
> + * not marked as busy.
> + */
> + res = __request_region(&iomem_resource, start, size,
> + resource_name, flags);
> +
> + if (!res) {
> + pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
> + start, start + size);
> return ERR_PTR(-EEXIST);
> }
> return res;
> _
--
Michal Hocko
SUSE Labs
On 1/16/19 3:38 PM, Jerome Glisse wrote:
> So right now i would rather that we keep properly reporting this
> hazard so that at least we know it failed because of that. This
> also include making sure that we can not register private memory
> as a child of an un-busy resource that does exist but might not
> have yet been claim by its rightful owner.
I can definitely keep the warning in. But, I don't think there's a
chance of HMM registering a IORES_DESC_DEVICE_PRIVATE_MEMORY region as
the child of another. The region_intersects() check *should* find that:
> for (; addr > size && addr >= iomem_resource.start; addr -= size) {
> ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
> if (ret != REGION_DISJOINT)
> continue;
On Wed, Jan 23, 2019 at 12:03:54PM -0800, Dave Hansen wrote:
> On 1/16/19 3:38 PM, Jerome Glisse wrote:
> > So right now i would rather that we keep properly reporting this
> > hazard so that at least we know it failed because of that. This
> > also include making sure that we can not register private memory
> > as a child of an un-busy resource that does exist but might not
> > have yet been claim by its rightful owner.
>
> I can definitely keep the warning in. But, I don't think there's a
> chance of HMM registering a IORES_DESC_DEVICE_PRIVATE_MEMORY region as
> the child of another. The region_intersects() check *should* find that:
Sounds fine to (just keep the warning).
Cheers,
Jérôme
>
> > for (; addr > size && addr >= iomem_resource.start; addr -= size) {
> > ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
> > if (ret != REGION_DISJOINT)
> > continue;
>