2023-12-12 19:09:00

by Vishal Verma

Subject: [PATCH v4 0/3] Add DAX ABI for memmap_on_memory

The DAX drivers were missing sysfs ABI documentation entirely. Add this
missing documentation for the sysfs ABI for DAX regions and DAX devices
in patch 1. Define guard(device) semantics for Scope Based Resource
Management for device_lock, and convert device_{lock,unlock} flows in
drivers/dax/bus.c to use this in patch 2. Add a new ABI for toggling
memmap_on_memory semantics in patch 3.
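
As a rough illustration of the scope-based locking idea behind guard(device), here is a minimal user-space sketch. It is not the kernel implementation: it assumes a compiler that supports __attribute__((cleanup)) (GCC or Clang), uses a pthread mutex in place of device_lock(), and the names fake_device, FAKE_DEVICE_GUARD and fake_device_set are hypothetical, invented for this example.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a lockable device; not the kernel's struct device. */
struct fake_device {
	pthread_mutex_t lock;
	int busy;
	int value;
};

/* Cleanup handler: called with a pointer to the guarded variable at scope exit. */
static void fake_device_unlock(struct fake_device **devp)
{
	if (*devp)
		pthread_mutex_unlock(&(*devp)->lock);
}

/*
 * Take the lock now and release it automatically when the enclosing scope
 * ends. This simplified macro allows only one guard per scope.
 */
#define FAKE_DEVICE_GUARD(dev) \
	struct fake_device *guard_dev_ \
		__attribute__((cleanup(fake_device_unlock))) = (dev); \
	pthread_mutex_lock(&guard_dev_->lock)

/* Every return path below drops the lock without an explicit unlock call. */
static int fake_device_set(struct fake_device *dev, int value)
{
	FAKE_DEVICE_GUARD(dev);

	if (dev->busy)
		return -EBUSY;	/* early return, lock still released */
	dev->value = value;
	return 0;
}
```

The point of the pattern is that error paths no longer need a matching unlock or a goto label, which is what makes the conversion of the lock/unlock flows mostly mechanical.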

The missing ABI documentation was spotted in [1]. This series splits the
work so that the new ABI addition follows the initial creation of the
documentation.

[1]: https://lore.kernel.org/linux-cxl/[email protected]/

Changes in v4:
- Hold the device lock when checking if the dax_dev is bound to kmem
(Ying, Dan)
- Remove dax region checks (and locks) as they were unnecessary.
- Introduce guard(device) for device lock/unlock (Dan)
- Convert the rest of drivers/dax/bus.c to guard(device)
- Link to v3: https://lore.kernel.org/r/[email protected]

Changes in v3:
- Fix typo in ABI docs (Zhijian Li)
- Add kernel config and module parameter dependencies to the ABI docs
entry (David Hildenbrand)
- Ensure kmem isn't active when setting the sysfs attribute (Ying
Huang)
- Simplify returning from memmap_on_memory_store()
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- Fix CC lists, patch 1/2 didn't get sent correctly in v1
- Link to v1: https://lore.kernel.org/r/[email protected]

---
Vishal Verma (3):
Documentation/ABI: Add ABI documentation for sys-bus-dax
dax/bus: Introduce guard(device) for device_{lock,unlock} flows
dax: add a sysfs knob to control memmap_on_memory behavior

include/linux/device.h | 2 +
drivers/dax/bus.c | 141 ++++++++++++++-------------
Documentation/ABI/testing/sysfs-bus-dax | 168 ++++++++++++++++++++++++++++++++
3 files changed, 244 insertions(+), 67 deletions(-)
---
base-commit: c4e1ccfad42352918810802095a8ace8d1c744c9
change-id: 20231025-vv-dax_abi-17a219c46076

Best regards,
--
Vishal Verma <[email protected]>


2023-12-12 19:09:04

by Vishal Verma

Subject: [PATCH v4 1/3] Documentation/ABI: Add ABI documentation for sys-bus-dax

Add the missing sysfs ABI documentation for the device DAX subsystem.
Various ABI attributes under this have been present since v5.1, and more
have been added over time. In preparation for adding a new attribute,
add this file with the historical details.

Cc: Dan Williams <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-dax | 151 ++++++++++++++++++++++++++++++++
1 file changed, 151 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
new file mode 100644
index 000000000000..a61a7b186017
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -0,0 +1,151 @@
+What: /sys/bus/dax/devices/daxX.Y/align
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RW) Provides a way to specify an alignment for a dax device.
+ Values allowed are constrained by the physical address ranges
+ that back the dax device, and also by arch requirements.
+
+What: /sys/bus/dax/devices/daxX.Y/mapping
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (WO) Provides a way to allocate a mapping range under a dax
+ device. Specified in the format <start>-<end>.
+
+What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RO) A dax device may have multiple constituent discontiguous
+ address ranges. These are represented by the different
+ 'mappingX' subdirectories. The 'start' attribute indicates the
+ start physical address for the given range.
+
+What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RO) A dax device may have multiple constituent discontiguous
+ address ranges. These are represented by the different
+ 'mappingX' subdirectories. The 'end' attribute indicates the
+ end physical address for the given range.
+
+What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RO) A dax device may have multiple constituent discontiguous
+ address ranges. These are represented by the different
+ 'mappingX' subdirectories. The 'page_offset' attribute indicates the
+ offset of the current range in the dax device.
+
+What: /sys/bus/dax/devices/daxX.Y/resource
+Date: June, 2019
+KernelVersion: v5.3
+Contact: [email protected]
+Description:
+ (RO) The resource attribute indicates the starting physical
+ address of a dax device. In case of a device with multiple
+ constituent ranges, it indicates the starting address of the
+ first range.
+
+What: /sys/bus/dax/devices/daxX.Y/size
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RW) The size attribute indicates the total size of a dax
+ device. For creating subdivided dax devices, or for resizing
+ an existing device, the new size can be written to this as
+ part of the reconfiguration process.
+
+What: /sys/bus/dax/devices/daxX.Y/numa_node
+Date: November, 2019
+KernelVersion: v5.5
+Contact: [email protected]
+Description:
+ (RO) If NUMA is enabled and the platform has affinitized the
+ backing device for this dax device, emit the CPU node
+ affinity for this device.
+
+What: /sys/bus/dax/devices/daxX.Y/target_node
+Date: February, 2019
+KernelVersion: v5.1
+Contact: [email protected]
+Description:
+ (RO) The target-node attribute is the Linux numa-node that a
+ device-dax instance may create when it is online. Prior to
+ being online the device's 'numa_node' property reflects the
+ closest online cpu node which is the typical expectation of a
+ device 'numa_node'. Once it is online it becomes its own
+ distinct numa node.
+
+What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/available_size
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RO) The available_size attribute tracks available dax region
+ capacity. This only applies to volatile hmem devices, not pmem
+ devices, since pmem devices are defined by nvdimm namespace
+ boundaries.
+
+What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/size
+Date: July, 2017
+KernelVersion: v5.1
+Contact: [email protected]
+Description:
+ (RO) The size attribute indicates the size of a given dax region
+ in bytes.
+
+What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/align
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RO) The align attribute indicates alignment of the dax region.
+ Changes on align may not always be valid, when say certain
+ mappings were created with 2M and then we switch to 1G. This
+ validates all ranges against the new value being attempted, post
+ resizing.
+
+What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/seed
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RO) The seed device is a concept for dynamic dax regions to be
+ able to split the region amongst multiple sub-instances. The
+ seed device, similar to libnvdimm seed devices, is a device
+ that starts with zero capacity allocated and unbound to a
+ driver.
+
+What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/create
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (RW) The create interface to the dax region provides a way to
+ create a new unconfigured dax device under the given region, which
+ can then be configured (with a size etc.) and then probed.
+
+What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/delete
+Date: October, 2020
+KernelVersion: v5.10
+Contact: [email protected]
+Description:
+ (WO) The delete interface for a dax region provides for deletion
+ of any 0-sized and idle dax devices.
+
+What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/id
+Date: July, 2017
+KernelVersion: v5.1
+Contact: [email protected]
+Description:
+ (RO) The id attribute indicates the region id of a dax region.

--
2.41.0

2023-12-12 19:09:09

by Vishal Verma

Subject: [PATCH v4 3/3] dax: add a sysfs knob to control memmap_on_memory behavior

Add a sysfs knob for dax devices to control the memmap_on_memory setting
if the dax device were to be hotplugged as system memory.

The default memmap_on_memory setting for dax devices originating via
pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
preserve legacy behavior. For dax devices via CXL, the default is on.
The sysfs control allows the administrator to override the above
defaults if needed.

Cc: David Hildenbrand <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Huang Ying <[email protected]>
Tested-by: Li Zhijian <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
drivers/dax/bus.c | 32 ++++++++++++++++++++++++++++++++
Documentation/ABI/testing/sysfs-bus-dax | 17 +++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index ce1356ac6dc2..423adee6f802 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1245,6 +1245,37 @@ static ssize_t numa_node_show(struct device *dev,
}
static DEVICE_ATTR_RO(numa_node);

+static ssize_t memmap_on_memory_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+
+ return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
+}
+
+static ssize_t memmap_on_memory_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ ssize_t rc;
+ bool val;
+
+ rc = kstrtobool(buf, &val);
+ if (rc)
+ return rc;
+
+ guard(device)(dev);
+ if (dev_dax->memmap_on_memory != val &&
+ dax_drv->type == DAXDRV_KMEM_TYPE)
+ return -EBUSY;
+ dev_dax->memmap_on_memory = val;
+
+ return len;
+}
+static DEVICE_ATTR_RW(memmap_on_memory);
+
static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
{
struct device *dev = container_of(kobj, struct device, kobj);
@@ -1271,6 +1302,7 @@ static struct attribute *dev_dax_attributes[] = {
&dev_attr_align.attr,
&dev_attr_resource.attr,
&dev_attr_numa_node.attr,
+ &dev_attr_memmap_on_memory.attr,
NULL,
};

diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
index a61a7b186017..b1fd8bf8a7de 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -149,3 +149,20 @@ KernelVersion: v5.1
Contact: [email protected]
Description:
(RO) The id attribute indicates the region id of a dax region.
+
+What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory
+Date: October, 2023
+KernelVersion: v6.8
+Contact: [email protected]
+Description:
+ (RW) Control the memmap_on_memory setting if the dax device
+ were to be hotplugged as system memory. This determines whether
+ the 'altmap' for the hotplugged memory will be placed on the
+ device being hotplugged (memmap_on_memory=1) or if it will be
+ placed on regular memory (memmap_on_memory=0). This attribute
+ must be set before the device is handed over to the 'kmem'
+ driver (i.e. hotplugged into system-ram). Additionally, this
+ depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
+ memmap_on_memory parameter for memory_hotplug. This is
+ typically set on the kernel command line -
+ memory_hotplug.memmap_on_memory set to 'true' or 'force'.

--
2.41.0

2023-12-13 01:12:30

by Huang, Ying

Subject: Re: [PATCH v4 3/3] dax: add a sysfs knob to control memmap_on_memory behavior

Vishal Verma <[email protected]> writes:

> Add a sysfs knob for dax devices to control the memmap_on_memory setting
> if the dax device were to be hotplugged as system memory.
>
> The default memmap_on_memory setting for dax devices originating via
> pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> preserve legacy behavior. For dax devices via CXL, the default is on.
> The sysfs control allows the administrator to override the above
> defaults if needed.
>
> Cc: David Hildenbrand <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Huang Ying <[email protected]>
> Tested-by: Li Zhijian <[email protected]>
> Reviewed-by: Jonathan Cameron <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> drivers/dax/bus.c | 32 ++++++++++++++++++++++++++++++++
> Documentation/ABI/testing/sysfs-bus-dax | 17 +++++++++++++++++
> 2 files changed, 49 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index ce1356ac6dc2..423adee6f802 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1245,6 +1245,37 @@ static ssize_t numa_node_show(struct device *dev,
> }
> static DEVICE_ATTR_RO(numa_node);
>
> +static ssize_t memmap_on_memory_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> +}
> +
> +static ssize_t memmap_on_memory_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + ssize_t rc;
> + bool val;
> +
> + rc = kstrtobool(buf, &val);
> + if (rc)
> + return rc;
> +
> + guard(device)(dev);
> + if (dev_dax->memmap_on_memory != val &&
> + dax_drv->type == DAXDRV_KMEM_TYPE)

Should we check "dev->driver != NULL" here, and should we move

dax_drv = to_dax_drv(dev->driver);

here with device lock held?
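
To make the question concrete, here is a small user-space model of the ordering being asked about. It is only a sketch under stated assumptions: the model_* types and model_memmap_on_memory_store() are hypothetical stand-ins (a pthread mutex models the device lock, and a plain pointer models dev->driver), not the code in drivers/dax/bus.c.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver types, modelling the kmem vs. non-kmem distinction. */
enum model_drv_type { MODEL_DRV_DEVICE, MODEL_DRV_KMEM };

struct model_driver {
	enum model_drv_type type;
};

struct model_dev_dax {
	pthread_mutex_t lock;		/* models the device lock */
	struct model_driver *driver;	/* NULL while unbound; models dev->driver */
	bool memmap_on_memory;
};

static int model_memmap_on_memory_store(struct model_dev_dax *dev, bool val)
{
	struct model_driver *drv;
	int rc = 0;

	pthread_mutex_lock(&dev->lock);
	/* Look up the driver only once the lock is held, and tolerate NULL. */
	drv = dev->driver;
	if (dev->memmap_on_memory != val && drv &&
	    drv->type == MODEL_DRV_KMEM)
		rc = -EBUSY;
	else
		dev->memmap_on_memory = val;
	pthread_mutex_unlock(&dev->lock);

	return rc;
}
```

In this model an unbound device (driver == NULL) accepts the new value, while a device bound to the kmem-like driver rejects a change with -EBUSY.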

--
Best Regards,
Huang, Ying

> + return -EBUSY;
> + dev_dax->memmap_on_memory = val;
> +
> + return len;
> +}
> +static DEVICE_ATTR_RW(memmap_on_memory);
> +
> static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
> {
> struct device *dev = container_of(kobj, struct device, kobj);
> @@ -1271,6 +1302,7 @@ static struct attribute *dev_dax_attributes[] = {
> &dev_attr_align.attr,
> &dev_attr_resource.attr,
> &dev_attr_numa_node.attr,
> + &dev_attr_memmap_on_memory.attr,
> NULL,
> };
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
> index a61a7b186017..b1fd8bf8a7de 100644
> --- a/Documentation/ABI/testing/sysfs-bus-dax
> +++ b/Documentation/ABI/testing/sysfs-bus-dax
> @@ -149,3 +149,20 @@ KernelVersion: v5.1
> Contact: [email protected]
> Description:
> (RO) The id attribute indicates the region id of a dax region.
> +
> +What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory
> +Date: October, 2023
> +KernelVersion: v6.8
> +Contact: [email protected]
> +Description:
> + (RW) Control the memmap_on_memory setting if the dax device
> + were to be hotplugged as system memory. This determines whether
> + the 'altmap' for the hotplugged memory will be placed on the
> + device being hotplugged (memmap_on_memory=1) or if it will be
> + placed on regular memory (memmap_on_memory=0). This attribute
> + must be set before the device is handed over to the 'kmem'
> + driver (i.e. hotplugged into system-ram). Additionally, this
> + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
> + memmap_on_memory parameter for memory_hotplug. This is
> + typically set on the kernel command line -
> + memory_hotplug.memmap_on_memory set to 'true' or 'force'.

2023-12-13 16:51:10

by Jonathan Cameron

Subject: Re: [PATCH v4 1/3] Documentation/ABI: Add ABI documentation for sys-bus-dax

On Tue, 12 Dec 2023 12:08:30 -0700
Vishal Verma <[email protected]> wrote:

> Add the missing sysfs ABI documentation for the device DAX subsystem.
> Various ABI attributes under this have been present since v5.1, and more
> have been added over time. In preparation for adding a new attribute,
> add this file with the historical details.
>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>

Hi Vishal, one editorial suggestion.

I don't know the interface well enough to do a good review of the content
so leaving that for Dan or others.

> +What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
> +Date: October, 2020
> +KernelVersion: v5.10
> +Contact: [email protected]
> +Description:
> + (RO) A dax device may have multiple constituent discontiguous
> + address ranges. These are represented by the different
> + 'mappingX' subdirectories. The 'start' attribute indicates the
> + start physical address for the given range.

A common option for these files is to have a single entry with multiple
What: lines. Here that would avoid duplicating most of this text across
the start, end and page_offset entries. Alternatively you could do an
entry for the mapping[0..N] directory with the shared text, then separate
entries for the 3 files under there.
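
For illustration only, the first option might look roughly like the fragment below. This is a sketch that merges the wording of the three quoted entries; it is not taken from any applied patch, and the exact phrasing is an assumption.

```text
What:		/sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
What:		/sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
What:		/sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
Date:		October, 2020
KernelVersion:	v5.10
Contact:	[email protected]
Description:
		(RO) A dax device may have multiple constituent discontiguous
		address ranges. These are represented by the different
		'mappingX' subdirectories. The 'start' attribute indicates the
		start physical address for the given range. The 'end'
		attribute indicates the end physical address for the given
		range. The 'page_offset' attribute indicates the offset of the
		current range in the dax device.
```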


> +
> +What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
> +Date: October, 2020
> +KernelVersion: v5.10
> +Contact: [email protected]
> +Description:
> + (RO) A dax device may have multiple constituent discontiguous
> + address ranges. These are represented by the different
> + 'mappingX' subdirectories. The 'end' attribute indicates the
> + end physical address for the given range.
> +
> +What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
> +Date: October, 2020
> +KernelVersion: v5.10
> +Contact: [email protected]
> +Description:
> + (RO) A dax device may have multiple constituent discontiguous
> + address ranges. These are represented by the different
> + 'mappingX' subdirectories. The 'page_offset' attribute indicates the
> + offset of the current range in the dax device.