2024-03-25 03:49:22

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

A git tree of this series can be found here:

https://github.com/weiny2/linux-kernel/tree/dcd-2024-03-24

Pre-requisite:
==============

The locking introduced by Vishal for DAX regions:
https://lore.kernel.org/all/[email protected]/T/#u

Background
==========

A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
device that allows the memory capacity to change dynamically, without
the need for resetting the device, reconfiguring HDM decoders, or
reconfiguring software DAX regions.

One of the biggest use cases for Dynamic Capacity is to allow hosts to
share memory dynamically within a data center without increasing the
per-host attached memory.

The general flow for the addition or removal of memory is to have an
orchestrator coordinate the use of the memory. Generally there are 5
actors in such a system, the Orchestrator, Fabric Manager, the Device
the host sees, the Host Kernel, and a Host User.

Typical work flows are shown below.

Orchestrator FM Device Host Kernel Host User

| | | | |
|-------------- Create region ----------------------->|
| | | | |
| | | |<-- Create ---|
| | | | Region |
|<------------- Signal done --------------------------|
| | | | |
|-- Add ----->|-- Add --->|--- Add --->| |
| Capacity | Extent | Extent | |
| | | | |
| |<- Accept -|<- Accept -| |
| | Extent | Extent | |
| | | |<- Create --->|
| | | | DAX dev |-- Use memory
| | | | | |
| | | | | |
| | | |<- Release ---| <-+
| | | | DAX dev |
| | | | |
|<------------- Signal done --------------------------|
| | | | |
|-- Remove -->|- Release->|- Release ->| |
| Capacity | Extent | Extent | |
| | | | |
| |<- Release-|<- Release -| |
| | Extent | Extent | |
| | | | |
|-- Add ----->|-- Add --->|--- Add --->| |
| Capacity | Extent | Extent | |
| | | | |
| |<- Accept -|<- Accept -| |
| | Extent | Extent | |
| | | |<- Create ----|
| | | | DAX dev |-- Use memory
| | | | | |
| | | |<- Release ---| <-+
| | | | DAX dev |
|<------------- Signal done --------------------------|
| | | | |
|-- Remove -->|- Release->|- Release ->| |
| Capacity | Extent | Extent | |
| | | | |
| |<- Release-|<- Release -| |
| | Extent | Extent | |
| | | | |
|-- Add ----->|-- Add --->|--- Add --->| |
| Capacity | Extent | Extent | |
| | | |<- Create ----|
| | | | DAX dev |-- Use memory
| | | | | |
|-- Remove -->|- Release->|- Release ->| | |
| Capacity | Extent | Extent | | |
| | | | | |
| | | (Release Ignored) | |
| | | | | |
| | | |<- Release ---| <-+
| | | | DAX dev |
|<------------- Signal done --------------------------|
| | | | |
| |- Release->|- Release ->| |
| | Extent | Extent | |
| | | | |
| |<- Release-|<- Release -| |
| | Extent | Extent | |
| | | |<- Destroy ---|
| | | | Region |
| | | | |

Previous RFCs of this series[0] resulted in significant architectural
comments. Previous versions allowed memory capacity to be accepted by
the host regardless of the existence of a software region being mapped.

With this new patch set the order of the create region and DAX device
creation must be synchronized with the Orchestrator adding/removing
capacity. The host kernel will reject an add extent event if the region
is not created yet. It will also ignore a release if the DAX device is
created and referencing an extent.

Neither of these synchronizations are anticipated to be an issue with
real applications.

In order to allow for capacity to be added and removed a new concept of
a sparse DAX region is introduced. A sparse DAX region may have 0 or
more bytes of available space. The total space depends on the number
and size of the extents which have been added.

Initially it is anticipated that users of the memory will carefully
coordinate the surfacing of additional capacity with the creation of DAX
devices which use that capacity. Therefore, the allocation of the
memory to DAX devices does not allow for specific associations between
DAX device and extent. This keeps allocations very similar to existing
DAX region behavior.

Great care was taken to greatly simplify extent tracking. Specifically,
in comparison to previous versions of the patch set, all extent tracking
xarrays have been eliminated from the code. In addition, most of the
extra software objects and associated referenced counts have been
eliminated.

In this version, extents are tracked purely as sub-devices of the
region. This ensures that the region destruction cleans up all extent
allocations properly. Device managed callbacks are wired to ensure any
additional data required for DAX device references are handled
correctly.

Due to these major changes I'm setting this new series to V1.

In summary the major functionality of this series includes:

- Getting the dynamic capacity (DC) configuration information from cxl
devices

- Configuring the DC regions reported by hardware

- Enhancing the CXL and DAX regions for dynamic capacity support
a. Maintain a logical separation between hardware extents and
software managed region extents. This provides an
abstraction between the layers and should allow for
interleaving in the future

- Get hardware extent lists for endpoint decoders upon
region creation.

- Adjust extent/region memory available on the following events.
a. Add capacity Events
b. Release capacity events

- Host response for add capacity
a. do not accept the extent if:
If the region does not exist
or an error occurs realizing the extent
B. If the region does exist
realize a DAX region extent with 1:1 mapping (no
interleave yet)

- Host response for remove capacity
a. If no DAX devices reference the extent release the extent
b. If a reference does exist, ignore the request.
(Require FM to issue release again.)

- Modify DAX device creation/resize to account for extents within a
sparse DAX region

- Trace Dynamic Capacity events for debugging

- Add cxl-test infrastructure to allow for faster unit testing
(See new ndctl branch for cxl-dcd.sh test[1])

Fan Ni's latest v5 of Qemu DCD was used for testing.[2]

Remaining work:

1) Integrate the QoS work from Dave Jiang
2) Interleave support

Possible additional work depending on requirements:

1) Allow mapping to specific extents (perhaps based on
label/tag)
2) Release extents when DAX devices are released if a release
was previously seen from the device
3) Accept a new extent which extends (but overlaps) an existing
extent(s)

[0] RFC v2: https://lore.kernel.org/r/[email protected]
[1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-03-22
[2] https://lore.kernel.org/all/[email protected]/

---
Changes for v1:
- iweiny: Largely new series
- iweiny: Remove review tags due to the series being a major rework
- iweiny: Fix authorship for Navneet patches
- iweiny: Remove extent xarrays
- iweiny: Remove kreferences, replace with 1 use count protected under dax_rwsem
- iweiny: Mark all sysfs entries for the 6.10 June 2024 kernel
- iweiny: Remove gotos
- iweiny: Fix 0day issues
- Jonathan Cameron: address comments
- Navneet Singh: address comments
- Dan Williams: address comments
- Dave Jiang: address comments
- Fan Ni: address comments
- Jørgen Hansen: address comments
- Link to RFC v2: https://lore.kernel.org/r/[email protected]

---
Ira Weiny (12):
cxl/core: Simplify cxl_dpa_set_mode()
cxl/events: Factor out event msgnum configuration
cxl/pci: Delay event buffer allocation
cxl/pci: Factor out interrupt policy check
range: Add range_overlaps()
dax/bus: Factor out dev dax resize logic
dax: Document dax dev range tuple
dax/region: Prevent range mapping allocation on sparse regions
dax/region: Support DAX device creation on sparse DAX regions
tools/testing/cxl: Make event logs dynamic
tools/testing/cxl: Add DC Regions to mock mem data
tools/testing/cxl: Add Dynamic Capacity events

Navneet Singh (14):
cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
cxl/core: Separate region mode from decoder mode
cxl/mem: Read dynamic capacity configuration from the device
cxl/region: Add dynamic capacity decoder and region modes
cxl/port: Add Dynamic Capacity mode support to endpoint decoders
cxl/port: Add dynamic capacity size support to endpoint decoders
cxl/mem: Expose device dynamic capacity capabilities
cxl/region: Add Dynamic Capacity CXL region support
cxl/mem: Configure dynamic capacity interrupts
cxl/region: Read existing extents on region creation
cxl/extent: Realize extent devices
dax/region: Create extent resources on DAX region driver load
cxl/mem: Handle DCD add & release capacity events.
cxl/mem: Trace Dynamic capacity Event Record

Documentation/ABI/testing/sysfs-bus-cxl | 60 ++-
drivers/cxl/core/Makefile | 1 +
drivers/cxl/core/core.h | 10 +
drivers/cxl/core/extent.c | 145 +++++
drivers/cxl/core/hdm.c | 254 +++++++--
drivers/cxl/core/mbox.c | 591 ++++++++++++++++++++-
drivers/cxl/core/memdev.c | 76 +++
drivers/cxl/core/port.c | 19 +
drivers/cxl/core/region.c | 334 +++++++++++-
drivers/cxl/core/trace.h | 65 +++
drivers/cxl/cxl.h | 127 ++++-
drivers/cxl/cxlmem.h | 114 ++++
drivers/cxl/mem.c | 45 ++
drivers/cxl/pci.c | 122 +++--
drivers/dax/bus.c | 353 +++++++++---
drivers/dax/bus.h | 4 +-
drivers/dax/cxl.c | 127 ++++-
drivers/dax/dax-private.h | 40 +-
drivers/dax/hmem/hmem.c | 2 +-
drivers/dax/pmem.c | 2 +-
fs/btrfs/ordered-data.c | 10 +-
include/linux/cxl-event.h | 31 ++
include/linux/range.h | 7 +
tools/testing/cxl/Kbuild | 1 +
tools/testing/cxl/test/mem.c | 914 ++++++++++++++++++++++++++++----
25 files changed, 3152 insertions(+), 302 deletions(-)
---
base-commit: dff54316795991e88a453a095a9322718a34034a
change-id: 20230604-dcd-type2-upstream-0cd15f6216fd

Best regards,
--
Ira Weiny <[email protected]>



2024-03-25 03:49:49

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 16/26] cxl/extent: Realize extent devices

From: Navneet Singh <[email protected]>

Once all extents of an interleave set are present a region must
surface an extent to the region.

Without interleaving; endpoint decoder and region extents have a 1:1
relationship. Future support for IW > 1 will maintain a N:1
relationship between the device extents and region extents.

Create a region extent device for every device extent found. Release of
the extent device triggers a response to the underlying hardware extent.

There is no strong use case to support the addition of extents which
overlap previously accepted extent ranges. Reject such new extents
until such time as a good use case emerges.

Expose the necessary details of region extents by creating the following
sysfs entries.

/sys/bus/cxl/devices/dax_regionX/extentY
/sys/bus/cxl/devices/dax_regionX/extentY/offset
/sys/bus/cxl/devices/dax_regionX/extentY/length
/sys/bus/cxl/devices/dax_regionX/extentY/label

The use of the extent devices by the DAX layer is deferred to later
patches.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[iweiny: new patch]
[iweiny: Rename 'dr_extent' to 'region_extent']
---
drivers/cxl/core/Makefile | 1 +
drivers/cxl/core/extent.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/mbox.c | 43 +++++++++++++++
drivers/cxl/core/region.c | 76 +++++++++++++++++++++++++-
drivers/cxl/cxl.h | 37 +++++++++++++
tools/testing/cxl/Kbuild | 1 +
6 files changed, 290 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 9259bcc6773c..35c5c76bfcf1 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -14,5 +14,6 @@ cxl_core-y += pci.o
cxl_core-y += hdm.o
cxl_core-y += pmu.o
cxl_core-y += cdat.o
+cxl_core-y += extent.o
cxl_core-$(CONFIG_TRACING) += trace.o
cxl_core-$(CONFIG_CXL_REGION) += region.o
diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
new file mode 100644
index 000000000000..487c220f1c3c
--- /dev/null
+++ b/drivers/cxl/core/extent.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
+
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <cxl.h>
+
+static DEFINE_IDA(cxl_extent_ida);
+
+static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *reg_ext = to_region_extent(dev);
+
+ return sysfs_emit(buf, "%pa\n", &reg_ext->hpa_range.start);
+}
+static DEVICE_ATTR_RO(offset);
+
+static ssize_t length_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *reg_ext = to_region_extent(dev);
+ u64 length = range_len(&reg_ext->hpa_range);
+
+ return sysfs_emit(buf, "%pa\n", &length);
+}
+static DEVICE_ATTR_RO(length);
+
+static ssize_t label_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct region_extent *reg_ext = to_region_extent(dev);
+
+ return sysfs_emit(buf, "%s\n", reg_ext->label);
+}
+static DEVICE_ATTR_RO(label);
+
+static struct attribute *region_extent_attrs[] = {
+ &dev_attr_offset.attr,
+ &dev_attr_length.attr,
+ &dev_attr_label.attr,
+ NULL,
+};
+
+static const struct attribute_group region_extent_attribute_group = {
+ .attrs = region_extent_attrs,
+};
+
+static const struct attribute_group *region_extent_attribute_groups[] = {
+ &region_extent_attribute_group,
+ NULL,
+};
+
+static void region_extent_release(struct device *dev)
+{
+ struct region_extent *reg_ext = to_region_extent(dev);
+
+ cxl_release_ed_extent(&reg_ext->ed_ext);
+ ida_free(&cxl_extent_ida, reg_ext->dev.id);
+ kfree(reg_ext);
+}
+
+static const struct device_type region_extent_type = {
+ .name = "extent",
+ .release = region_extent_release,
+ .groups = region_extent_attribute_groups,
+};
+
+bool is_region_extent(struct device *dev)
+{
+ return dev->type == &region_extent_type;
+}
+EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL);
+
+static void region_extent_unregister(void *ext)
+{
+ struct region_extent *reg_ext = ext;
+
+ dev_dbg(&reg_ext->dev, "DAX region rm extent HPA %#llx - %#llx\n",
+ reg_ext->hpa_range.start, reg_ext->hpa_range.end);
+ device_unregister(&reg_ext->dev);
+}
+
+int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
+ struct range *hpa_range,
+ const char *label,
+ struct range *dpa_range,
+ struct cxl_endpoint_decoder *cxled)
+{
+ struct region_extent *reg_ext;
+ struct device *dev;
+ int rc, id;
+
+ id = ida_alloc(&cxl_extent_ida, GFP_KERNEL);
+ if (id < 0)
+ return -ENOMEM;
+
+ reg_ext = kzalloc(sizeof(*reg_ext), GFP_KERNEL);
+ if (!reg_ext)
+ return -ENOMEM;
+
+ reg_ext->hpa_range = *hpa_range;
+ reg_ext->ed_ext.dpa_range = *dpa_range;
+ reg_ext->ed_ext.cxled = cxled;
+ snprintf(reg_ext->label, DAX_EXTENT_LABEL_LEN, "%s", label);
+
+ dev = &reg_ext->dev;
+ device_initialize(dev);
+ dev->id = id;
+ device_set_pm_not_required(dev);
+ dev->parent = &cxlr_dax->dev;
+ dev->type = &region_extent_type;
+ rc = dev_set_name(dev, "extent%d", dev->id);
+ if (rc)
+ goto err;
+
+ rc = device_add(dev);
+ if (rc)
+ goto err;
+
+ dev_dbg(dev, "DAX region extent HPA %#llx - %#llx\n",
+ reg_ext->hpa_range.start, reg_ext->hpa_range.end);
+
+ return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
+ reg_ext);
+
+err:
+ dev_err(&cxlr_dax->dev, "Failed to initialize DAX extent dev HPA %#llx - %#llx\n",
+ reg_ext->hpa_range.start, reg_ext->hpa_range.end);
+
+ put_device(dev);
+ return rc;
+}
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 9e33a0976828..6b00e717e42b 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
return rc;
}

+static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
+ struct range *extent, int opcode)
+{
+ struct cxl_mbox_cmd mbox_cmd;
+ size_t size;
+
+ struct cxl_mbox_dc_response *dc_res __free(kfree);
+ size = struct_size(dc_res, extent_list, 1);
+ dc_res = kzalloc(size, GFP_KERNEL);
+ if (!dc_res)
+ return -ENOMEM;
+
+ dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
+ memset(dc_res->extent_list[0].reserved, 0, 8);
+ dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
+ dc_res->extent_list_size = cpu_to_le32(1);
+
+ mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = opcode,
+ .size_in = size,
+ .payload_in = dc_res,
+ };
+
+ return cxl_internal_send_cmd(mds, &mbox_cmd);
+}
+
static struct cxl_memdev_state *
cxled_to_mds(struct cxl_endpoint_decoder *cxled)
{
@@ -1029,6 +1055,23 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
return container_of(cxlds, struct cxl_memdev_state, cxlds);
}

+void cxl_release_ed_extent(struct cxl_ed_extent *extent)
+{
+ struct cxl_endpoint_decoder *cxled = extent->cxled;
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ struct device *dev = mds->cxlds.dev;
+ int rc;
+
+ dev_dbg(dev, "Releasing DC extent DPA %#llx - %#llx\n",
+ extent->dpa_range.start, extent->dpa_range.end);
+
+ rc = cxl_send_dc_cap_response(mds, &extent->dpa_range, CXL_MBOX_OP_RELEASE_DC);
+ if (rc)
+ dev_dbg(dev, "Failed to respond releasing extent DPA %#llx - %#llx; %d\n",
+ extent->dpa_range.start, extent->dpa_range.end, rc);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_release_ed_extent, CXL);
+
static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
enum cxl_event_log_type type)
{
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 3e563ab29afe..7635ff109578 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1450,11 +1450,81 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
return 0;
}

+static int extent_check_overlap(struct device *dev, void *arg)
+{
+ struct range *new_range = arg;
+ struct region_extent *ext;
+
+ if (!is_region_extent(dev))
+ return 0;
+
+ ext = to_region_extent(dev);
+ return range_overlaps(&ext->hpa_range, new_range);
+}
+
+static int extent_overlaps(struct cxl_dax_region *cxlr_dax,
+ struct range *hpa_range)
+{
+ struct device *dev __free(put_device) =
+ device_find_child(&cxlr_dax->dev, hpa_range, extent_check_overlap);
+
+ if (dev)
+ return -EINVAL;
+ return 0;
+}
+
/* Callers are expected to ensure cxled has been attached to a region */
int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
struct cxl_dc_extent *dc_extent)
{
- return 0;
+ struct cxl_region *cxlr = cxled->cxld.region;
+ struct range ext_dpa_range, ext_hpa_range;
+ struct device *dev = &cxlr->dev;
+ resource_size_t dpa_offset, hpa;
+
+ /*
+ * Interleave ways == 1 means this coresponds to a 1:1 mapping between
+ * device extents and DAX region extents. Future implementations
+ * should hold DC region extents here until the full dax region extent
+ * can be realized.
+ */
+ if (cxlr->params.interleave_ways != 1) {
+ dev_err(dev, "Interleaving DC not supported\n");
+ return -EINVAL;
+ }
+
+ ext_dpa_range = (struct range) {
+ .start = le64_to_cpu(dc_extent->start_dpa),
+ .end = le64_to_cpu(dc_extent->start_dpa) +
+ le64_to_cpu(dc_extent->length) - 1,
+ };
+
+ dev_dbg(dev, "Adding DC extent DPA %#llx - %#llx\n",
+ ext_dpa_range.start, ext_dpa_range.end);
+
+ /*
+ * Without interleave...
+ * HPA offset == DPA offset
+ * ... but do the math anyway
+ */
+ dpa_offset = ext_dpa_range.start - cxled->dpa_res->start;
+ hpa = cxled->cxld.hpa_range.start + dpa_offset;
+
+ ext_hpa_range = (struct range) {
+ .start = hpa - cxlr->cxlr_dax->hpa_range.start,
+ .end = ext_hpa_range.start + range_len(&ext_dpa_range) - 1,
+ };
+
+ if (extent_overlaps(cxlr->cxlr_dax, &ext_hpa_range))
+ return -EINVAL;
+
+ dev_dbg(dev, "Realizing region extent at HPA %#llx - %#llx\n",
+ ext_hpa_range.start, ext_hpa_range.end);
+
+ return dax_region_create_ext(cxlr->cxlr_dax, &ext_hpa_range,
+ (char *)dc_extent->tag,
+ &ext_dpa_range,
+ cxled);
}

static int cxl_region_attach_position(struct cxl_region *cxlr,
@@ -2684,6 +2754,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)

dev = &cxlr_dax->dev;
cxlr_dax->cxlr = cxlr;
+ cxlr->cxlr_dax = cxlr_dax;
device_initialize(dev);
lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
device_set_pm_not_required(dev);
@@ -2799,7 +2870,10 @@ static int cxl_region_read_extents(struct cxl_region *cxlr)
static void cxlr_dax_unregister(void *_cxlr_dax)
{
struct cxl_dax_region *cxlr_dax = _cxlr_dax;
+ struct cxl_region *cxlr = cxlr_dax->cxlr;

+ cxlr->cxlr_dax = NULL;
+ cxlr_dax->cxlr = NULL;
device_unregister(&cxlr_dax->dev);
}

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d585f5fdd3ae..5379ad7f5852 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -564,6 +564,7 @@ struct cxl_region_params {
* @type: Endpoint decoder target type
* @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
* @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
+ * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
* @flags: Region state flags
* @params: active + config params for the region
*/
@@ -574,6 +575,7 @@ struct cxl_region {
enum cxl_decoder_type type;
struct cxl_nvdimm_bridge *cxl_nvb;
struct cxl_pmem_region *cxlr_pmem;
+ struct cxl_dax_region *cxlr_dax;
unsigned long flags;
struct cxl_region_params params;
};
@@ -617,6 +619,41 @@ struct cxl_dax_region {
struct range hpa_range;
};

+/**
+ * struct cxl_ed_extent - Extent within an endpoint decoder
+ * @dpa_range: DPA range this extent covers within the decoder
+ * @cxled: reference to the endpoint decoder
+ */
+struct cxl_ed_extent {
+ struct range dpa_range;
+ struct cxl_endpoint_decoder *cxled;
+};
+void cxl_release_ed_extent(struct cxl_ed_extent *extent);
+
+/**
+ * struct region_extent - CXL DAX region extent
+ * @dev: device representing this extent
+ * @hpa_range: HPA range of this extent
+ * @label: label of the extent
+ * @ed_ext: Endpoint decoder extent which backs this extent
+ */
+#define DAX_EXTENT_LABEL_LEN 64
+struct region_extent {
+ struct device dev;
+ struct range hpa_range;
+ char label[DAX_EXTENT_LABEL_LEN];
+ struct cxl_ed_extent ed_ext;
+};
+
+int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
+ struct range *hpa_range,
+ const char *label,
+ struct range *dpa_range,
+ struct cxl_endpoint_decoder *cxled);
+
+bool is_region_extent(struct device *dev);
+#define to_region_extent(dev) container_of(dev, struct region_extent, dev)
+
/**
* struct cxl_port - logical collection of upstream port devices and
* downstream port devices to construct a CXL memory
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 030b388800f0..dc0cc1d5e6a0 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -60,6 +60,7 @@ cxl_core-y += $(CXL_CORE_SRC)/pci.o
cxl_core-y += $(CXL_CORE_SRC)/hdm.o
cxl_core-y += $(CXL_CORE_SRC)/pmu.o
cxl_core-y += $(CXL_CORE_SRC)/cdat.o
+cxl_core-y += $(CXL_CORE_SRC)/extent.o
cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
cxl_core-y += config_check.o

--
2.44.0


2024-03-25 03:50:31

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 08/26] cxl/mem: Expose device dynamic capacity capabilities

From: Navneet Singh <[email protected]>

To properly configure CXL regions on Dynamic Capacity Devices (DCD),
user space will need to know the details of the DC Regions available on
a device.

Expose driver dynamic capacity capabilities through sysfs
attributes.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1:
[iweiny: remove review tags]
[iweiny: mark sysfs for 6.10 kernel]
---
Documentation/ABI/testing/sysfs-bus-cxl | 17 ++++++++
drivers/cxl/core/memdev.c | 76 +++++++++++++++++++++++++++++++++
2 files changed, 93 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 8b3efaf6563c..8a4f572c8498 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -54,6 +54,23 @@ Description:
identically named field in the Identify Memory Device Output
Payload in the CXL-2.0 specification.

+What: /sys/bus/cxl/devices/memX/dc/region_count
+Date: June, 2024
+KernelVersion: v6.10
+Contact: [email protected]
+Description:
+ (RO) Number of Dynamic Capacity (DC) regions supported on the
+ device. May be 0 if the device does not support Dynamic
+ Capacity.
+
+What: /sys/bus/cxl/devices/memX/dc/regionY_size
+Date: June, 2024
+KernelVersion: v6.10
+Contact: [email protected]
+Description:
+ (RO) Size of the Dynamic Capacity (DC) region Y. Only
+ available on devices which support DC and only for those
+ region indexes supported by the device.

What: /sys/bus/cxl/devices/memX/pmem/qos_class
Date: May, 2023
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index d4e259f3a7e9..a7b880e33a7e 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -101,6 +101,18 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
static struct device_attribute dev_attr_pmem_size =
__ATTR(size, 0444, pmem_size_show, NULL);

+static ssize_t region_count_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+ return sysfs_emit(buf, "%d\n", mds->nr_dc_region);
+}
+
+static struct device_attribute dev_attr_region_count =
+ __ATTR(region_count, 0444, region_count_show, NULL);
+
static ssize_t serial_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -492,6 +504,63 @@ static struct attribute *cxl_memdev_security_attributes[] = {
NULL,
};

+static ssize_t show_size_regionN(struct cxl_memdev *cxlmd, char *buf, int pos)
+{
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+ return sysfs_emit(buf, "%#llx\n", mds->dc_region[pos].decode_len);
+}
+
+#define REGION_SIZE_ATTR_RO(n) \
+static ssize_t region##n##_size_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return show_size_regionN(to_cxl_memdev(dev), buf, (n)); \
+} \
+static DEVICE_ATTR_RO(region##n##_size)
+REGION_SIZE_ATTR_RO(0);
+REGION_SIZE_ATTR_RO(1);
+REGION_SIZE_ATTR_RO(2);
+REGION_SIZE_ATTR_RO(3);
+REGION_SIZE_ATTR_RO(4);
+REGION_SIZE_ATTR_RO(5);
+REGION_SIZE_ATTR_RO(6);
+REGION_SIZE_ATTR_RO(7);
+
+static struct attribute *cxl_memdev_dc_attributes[] = {
+ &dev_attr_region0_size.attr,
+ &dev_attr_region1_size.attr,
+ &dev_attr_region2_size.attr,
+ &dev_attr_region3_size.attr,
+ &dev_attr_region4_size.attr,
+ &dev_attr_region5_size.attr,
+ &dev_attr_region6_size.attr,
+ &dev_attr_region7_size.attr,
+ &dev_attr_region_count.attr,
+ NULL,
+};
+
+static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
+
+ /* Not a memory device */
+ if (!mds)
+ return 0;
+
+ if (a == &dev_attr_region_count.attr)
+ return a->mode;
+
+ /* Show only the regions supported */
+ if (n < mds->nr_dc_region)
+ return a->mode;
+
+ return 0;
+}
+
static umode_t cxl_memdev_visible(struct kobject *kobj, struct attribute *a,
int n)
{
@@ -567,11 +636,18 @@ static struct attribute_group cxl_memdev_security_attribute_group = {
.is_visible = cxl_memdev_security_visible,
};

+static struct attribute_group cxl_memdev_dc_attribute_group = {
+ .name = "dc",
+ .attrs = cxl_memdev_dc_attributes,
+ .is_visible = cxl_dc_visible,
+};
+
static const struct attribute_group *cxl_memdev_attribute_groups[] = {
&cxl_memdev_attribute_group,
&cxl_memdev_ram_attribute_group,
&cxl_memdev_pmem_attribute_group,
&cxl_memdev_security_attribute_group,
+ &cxl_memdev_dc_attribute_group,
NULL,
};


--
2.44.0


2024-03-25 03:51:32

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)

From: Navneet Singh <[email protected]>

Per the CXL 3.1 specification software must check the Command Effects
Log (CEL) to know if a device supports dynamic capacity (DC). If the
device does support DC the specifics of the DC Regions (0-7) are read
through the mailbox.

Flag DC Device (DCD) commands in a device if they are supported.
Subsequent patches will key off these bits to configure DCD.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Changes for v1
[iweiny: update to latest master]
[iweiny: update commit message]
[iweiny: Based on the fix:
https://lore.kernel.org/all/[email protected]/
[jonathan: remove unneeded format change]
[jonathan: don't split security code in mbox.c]
---
drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
drivers/cxl/cxlmem.h | 15 +++++++++++++++
2 files changed, 48 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 9adda4795eb7..ed4131c6f50b 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -161,6 +161,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
}
}

+static bool cxl_is_dcd_command(u16 opcode)
+{
+#define CXL_MBOX_OP_DCD_CMDS 0x48
+
+ return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
+}
+
+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
+ u16 opcode)
+{
+ switch (opcode) {
+ case CXL_MBOX_OP_GET_DC_CONFIG:
+ set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+ break;
+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+ set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
+ break;
+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
+ set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
+ break;
+ case CXL_MBOX_OP_RELEASE_DC:
+ set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
+ break;
+ default:
+ break;
+ }
+}
+
static bool cxl_is_poison_command(u16 opcode)
{
#define CXL_MBOX_OP_POISON_CMDS 0x43
@@ -733,6 +761,11 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
enabled++;
}

+ if (cxl_is_dcd_command(opcode)) {
+ cxl_set_dcd_cmd_enabled(mds, opcode);
+ enabled++;
+ }
+
dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
enabled ? "enabled" : "unsupported by driver");
}
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 20fb3b35e89e..79a67cff9143 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -238,6 +238,15 @@ struct cxl_event_state {
struct mutex log_lock;
};

+/* Device enabled DCD commands */
+enum dcd_cmd_enabled_bits {
+ CXL_DCD_ENABLED_GET_CONFIG,
+ CXL_DCD_ENABLED_GET_EXTENT_LIST,
+ CXL_DCD_ENABLED_ADD_RESPONSE,
+ CXL_DCD_ENABLED_RELEASE,
+ CXL_DCD_ENABLED_MAX
+};
+
/* Device enabled poison commands */
enum poison_cmd_enabled_bits {
CXL_POISON_ENABLED_LIST,
@@ -454,6 +463,7 @@ struct cxl_dev_state {
* (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
* @mbox_mutex: Mutex to synchronize mailbox access.
* @firmware_version: Firmware version for the memory device.
+ * @dcd_cmds: List of DCD commands implemented by memory device
* @enabled_cmds: Hardware commands found enabled in CEL.
* @exclusive_cmds: Commands that are kernel-internal only
* @total_bytes: sum of all possible capacities
@@ -481,6 +491,7 @@ struct cxl_memdev_state {
size_t lsa_size;
struct mutex mbox_mutex; /* Protects device mailbox and firmware */
char firmware_version[0x10];
+ DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
u64 total_bytes;
@@ -551,6 +562,10 @@ enum cxl_opcode {
CXL_MBOX_OP_UNLOCK = 0x4503,
CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
+ CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
+ CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
+ CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
+ CXL_MBOX_OP_RELEASE_DC = 0x4803,
CXL_MBOX_OP_MAX = 0x10000
};


--
2.44.0


2024-03-25 03:51:32

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 09/26] cxl/region: Add Dynamic Capacity CXL region support

From: Navneet Singh <[email protected]>

CXL devices optionally support dynamic capacity. CXL Regions must be
configured correctly to access this capacity. Similar to ram and pmem
partitions, DC Regions, as they are called in CXL 3.1, represent
different partitions of the DPA space.

Introduce the concept of a sparse DAX region. Add the create_dc_region
sysfs entry to create sparse DC DAX regions. Special case DC capable
regions to create a 0 sized seed DAX device to maintain backwards
compatibility with older software which needs a default DAX device to
hold the region reference.

Flag sparse DAX regions to indicate 0 capacity available until such time
as DC capacity is added.

Interleaving is deferred in this series. Add an early check.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1:
[djiang: mark sysfs entries to be in 6.10 kernel including date]
[djbw: change dax region typing to be 'sparse' rather than 'dynamic']
[iweiny: rebase changes to master instead of type2 patches]
---
Documentation/ABI/testing/sysfs-bus-cxl | 22 +++++++++++-----------
drivers/cxl/core/core.h | 1 +
drivers/cxl/core/port.c | 1 +
drivers/cxl/core/region.c | 33 +++++++++++++++++++++++++++++++++
drivers/dax/bus.c | 8 ++++++++
drivers/dax/bus.h | 1 +
drivers/dax/cxl.c | 15 +++++++++++++--
7 files changed, 68 insertions(+), 13 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 8a4f572c8498..f0cf52fff9fa 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -411,20 +411,20 @@ Description:
interleave_granularity).


-What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
-Date: May, 2022, January, 2023
-KernelVersion: v6.0 (pmem), v6.3 (ram)
+What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
+Date: May, 2022, January, 2023, June 2024
+KernelVersion: v6.0 (pmem), v6.3 (ram), v6.10 (dc)
Contact: [email protected]
Description:
(RW) Write a string in the form 'regionZ' to start the process
- of defining a new persistent, or volatile memory region
- (interleave-set) within the decode range bounded by root decoder
- 'decoderX.Y'. The value written must match the current value
- returned from reading this attribute. An atomic compare exchange
- operation is done on write to assign the requested id to a
- region and allocate the region-id for the next creation attempt.
- EBUSY is returned if the region name written does not match the
- current cached value.
+ of defining a new persistent, volatile, or Dynamic Capacity
+ (DC) memory region (interleave-set) within the decode range
+ bounded by root decoder 'decoderX.Y'. The value written must
+ match the current value returned from reading this attribute.
+ An atomic compare exchange operation is done on write to assign
+ the requested id to a region and allocate the region-id for the
+ next creation attempt. EBUSY is returned if the region name
+ written does not match the current cached value.


What: /sys/bus/cxl/devices/decoderX.Y/delete_region
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 3b64fb1b9ed0..91abeffbe985 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -13,6 +13,7 @@ extern struct attribute_group cxl_base_attribute_group;
#ifdef CONFIG_CXL_REGION
extern struct device_attribute dev_attr_create_pmem_region;
extern struct device_attribute dev_attr_create_ram_region;
+extern struct device_attribute dev_attr_create_dc_region;
extern struct device_attribute dev_attr_delete_region;
extern struct device_attribute dev_attr_region;
extern const struct device_type cxl_pmem_region_type;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 036b61cb3007..661177b575f7 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -335,6 +335,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
&dev_attr_qos_class.attr,
SET_CXL_REGION_ATTR(create_pmem_region)
SET_CXL_REGION_ATTR(create_ram_region)
+ SET_CXL_REGION_ATTR(create_dc_region)
SET_CXL_REGION_ATTR(delete_region)
NULL,
};
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index ec3b8c6948e9..0d7b09a49dcf 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2205,6 +2205,7 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
switch (mode) {
case CXL_REGION_RAM:
case CXL_REGION_PMEM:
+ case CXL_REGION_DC:
break;
default:
dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
@@ -2314,6 +2315,32 @@ static ssize_t create_ram_region_store(struct device *dev,
}
DEVICE_ATTR_RW(create_ram_region);

+static ssize_t create_dc_region_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return __create_region_show(to_cxl_root_decoder(dev), buf);
+}
+
+static ssize_t create_dc_region_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
+ struct cxl_region *cxlr;
+ int rc, id;
+
+ rc = sscanf(buf, "region%d\n", &id);
+ if (rc != 1)
+ return -EINVAL;
+
+ cxlr = __create_region(cxlrd, CXL_REGION_DC, id);
+ if (IS_ERR(cxlr))
+ return PTR_ERR(cxlr);
+
+ return len;
+}
+DEVICE_ATTR_RW(create_dc_region);
+
static ssize_t region_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -2759,6 +2786,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
struct device *dev;
int rc;

+ if (cxlr->mode == CXL_REGION_DC && cxlr->params.interleave_ways != 1) {
+ dev_err(&cxlr->dev, "Interleaving DC not supported\n");
+ return -EINVAL;
+ }
+
cxlr_dax = cxl_dax_region_alloc(cxlr);
if (IS_ERR(cxlr_dax))
return PTR_ERR(cxlr_dax);
@@ -3040,6 +3072,7 @@ static int cxl_region_probe(struct device *dev)
case CXL_REGION_PMEM:
return devm_cxl_add_pmem_region(cxlr);
case CXL_REGION_RAM:
+ case CXL_REGION_DC:
/*
* The region can not be manged by CXL if any portion of
* it is already online as 'System RAM'
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index cb148f74ceda..903566aff5eb 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -181,6 +181,11 @@ static bool is_static(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
}

+static bool is_sparse(struct dax_region *dax_region)
+{
+ return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
+}
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
@@ -304,6 +309,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)

WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));

+ if (is_sparse(dax_region))
+ return 0;
+
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
return size;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index cbbf64443098..783bfeef42cc 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,6 +13,7 @@ struct dax_region;
/* dax bus specific ioresource flags */
#define IORESOURCE_DAX_STATIC BIT(0)
#define IORESOURCE_DAX_KMEM BIT(1)
+#define IORESOURCE_DAX_SPARSE_CAP BIT(2)

struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index c696837ab23c..415d03fbf9b6 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -13,19 +13,30 @@ static int cxl_dax_region_probe(struct device *dev)
struct cxl_region *cxlr = cxlr_dax->cxlr;
struct dax_region *dax_region;
struct dev_dax_data data;
+ resource_size_t dev_size;
+ unsigned long flags;

if (nid == NUMA_NO_NODE)
nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);

+ flags = IORESOURCE_DAX_KMEM;
+ if (cxlr->mode == CXL_REGION_DC)
+ flags |= IORESOURCE_DAX_SPARSE_CAP;
+
dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
- PMD_SIZE, IORESOURCE_DAX_KMEM);
+ PMD_SIZE, flags);
if (!dax_region)
return -ENOMEM;

+ dev_size = range_len(&cxlr_dax->hpa_range);
+ /* Add empty seed dax device */
+ if (cxlr->mode == CXL_REGION_DC)
+ dev_size = 0;
+
data = (struct dev_dax_data) {
.dax_region = dax_region,
.id = -1,
- .size = range_len(&cxlr_dax->hpa_range),
+ .size = dev_size,
.memmap_on_memory = true,
};


--
2.44.0


2024-03-25 03:51:56

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 10/26] cxl/events: Factor out event msgnum configuration

Dynamic Capacity Devices (DCD) require events to process extent addition
or removal. BIOS may have control over memory event processing.

Factor out cxl_event_config_msgnums() in preparation for setting up DCD
event interrupts separate from memory events.

Signed-off-by: Ira Weiny <[email protected]>
---
drivers/cxl/pci.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 216881455364..cedd9b05f129 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -698,35 +698,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
return cxl_event_get_int_policy(mds, policy);
}

-static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
+static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy)
{
struct cxl_dev_state *cxlds = &mds->cxlds;
- struct cxl_event_interrupt_policy policy;
int rc;

- rc = cxl_event_config_msgnums(mds, &policy);
- if (rc)
- return rc;
-
- rc = cxl_event_req_irq(cxlds, policy.info_settings);
+ rc = cxl_event_req_irq(cxlds, policy->info_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
return rc;
}

- rc = cxl_event_req_irq(cxlds, policy.warn_settings);
+ rc = cxl_event_req_irq(cxlds, policy->warn_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
return rc;
}

- rc = cxl_event_req_irq(cxlds, policy.failure_settings);
+ rc = cxl_event_req_irq(cxlds, policy->failure_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
return rc;
}

- rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
+ rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
if (rc) {
dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
return rc;
@@ -745,7 +741,7 @@ static bool cxl_event_int_is_fw(u8 setting)
static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
- struct cxl_event_interrupt_policy policy;
+ struct cxl_event_interrupt_policy policy = { 0 };
int rc;

/*
@@ -777,7 +773,11 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
return -EBUSY;
}

- rc = cxl_event_irqsetup(mds);
+ rc = cxl_event_config_msgnums(mds, &policy);
+ if (rc)
+ return rc;
+
+ rc = cxl_event_irqsetup(mds, &policy);
if (rc)
return rc;


--
2.44.0


2024-03-25 03:52:07

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

cxl_dpa_set_mode() checks the mode for validity two times, once outside
of the DPA RW semaphore and again within. The function is not in a
critical path. Prior to Dynamic Capacity the extra check was not much
of an issue. The addition of DC modes increases the complexity of
the check.

Simplify the mode check before adding the more complex DC modes.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1:
[iweiny: new patch]
[Jonathan: based on getting rid of the loop in cxl_dpa_set_mode]
[Jonathan: standardize on resource_size() == 0]
---
drivers/cxl/core/hdm.c | 45 ++++++++++++++++++---------------------------
1 file changed, 18 insertions(+), 27 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 7d97790b893d..66b8419fd0c3 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -411,44 +411,35 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
struct device *dev = &cxled->cxld.dev;
- int rc;

+ guard(rwsem_write)(&cxl_dpa_rwsem);
+ if (cxled->cxld.flags & CXL_DECODER_F_ENABLE)
+ return -EBUSY;
+
+ /*
+ * Check that the mode is supported by the current partition
+ * configuration
+ */
switch (mode) {
case CXL_DECODER_RAM:
+ if (!resource_size(&cxlds->ram_res)) {
+ dev_dbg(dev, "no available ram capacity\n");
+ return -ENXIO;
+ }
+ break;
case CXL_DECODER_PMEM:
+ if (!resource_size(&cxlds->pmem_res)) {
+ dev_dbg(dev, "no available pmem capacity\n");
+ return -ENXIO;
+ }
break;
default:
dev_dbg(dev, "unsupported mode: %d\n", mode);
return -EINVAL;
}

- down_write(&cxl_dpa_rwsem);
- if (cxled->cxld.flags & CXL_DECODER_F_ENABLE) {
- rc = -EBUSY;
- goto out;
- }
-
- /*
- * Only allow modes that are supported by the current partition
- * configuration
- */
- if (mode == CXL_DECODER_PMEM && !resource_size(&cxlds->pmem_res)) {
- dev_dbg(dev, "no available pmem capacity\n");
- rc = -ENXIO;
- goto out;
- }
- if (mode == CXL_DECODER_RAM && !resource_size(&cxlds->ram_res)) {
- dev_dbg(dev, "no available ram capacity\n");
- rc = -ENXIO;
- goto out;
- }
-
cxled->mode = mode;
- rc = 0;
-out:
- up_write(&cxl_dpa_rwsem);
-
- return rc;
+ return 0;
}

int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)

--
2.44.0


2024-03-25 03:53:18

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 26/26] tools/testing/cxl: Add Dynamic Capacity events

cxl_test provides a good way to ensure quick smoke and regression
testing. The complexity of DCD and the new sparse DAX regions required
to use them benefits greatly with a series of smoke tests.

The only part of the kernel stack which must be bypassed is the actual
irq of DCD events. However, the event processing itself can be tested
via cxl_test calling directly the event processing function directly.

In this way the rest of the stack; management of sparse regions, the
extent device lifetimes, and the dax device operations can be tested.

Add Dynamic Capacity Device tests for kernels which have DCD support in
cxl_test.

Add events on DCD extent injection. Directly call the event irq
callback to simulate irqs to process the test extents.

Signed-off-by: Ira Weiny <[email protected]>
---
Changes for v1
[iweiny: Adjust to new events]
---
tools/testing/cxl/test/mem.c | 58 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 7d1d897d9f2b..e7efb1d3e20f 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -2122,6 +2122,49 @@ static bool new_extent_valid(struct device *dev, size_t new_start,
return true;
}

+struct cxl_test_dcd {
+ uuid_t id;
+ struct cxl_event_dcd rec;
+} __packed;
+
+struct cxl_test_dcd dcd_event_rec_template = {
+ .id = CXL_EVENT_DC_EVENT_UUID,
+ .rec = {
+ .hdr = {
+ .length = sizeof(struct cxl_test_dcd),
+ },
+ },
+};
+
+static int log_dc_event(struct cxl_mockmem_data *mdata, enum dc_event type,
+ u64 start, u64 length, const char *tag_str)
+{
+ struct device *dev = mdata->mds->cxlds.dev;
+ struct cxl_test_dcd *dcd_event;
+
+ dev_dbg(dev, "mock device log event %d\n", type);
+
+ dcd_event = devm_kmemdup(dev, &dcd_event_rec_template,
+ sizeof(*dcd_event), GFP_KERNEL);
+ if (!dcd_event)
+ return -ENOMEM;
+
+ dcd_event->rec.event_type = type;
+ dcd_event->rec.extent.start_dpa = cpu_to_le64(start);
+ dcd_event->rec.extent.length = cpu_to_le64(length);
+ memcpy(dcd_event->rec.extent.tag, tag_str,
+ min(sizeof(dcd_event->rec.extent.tag),
+ strlen(tag_str)));
+
+ mes_add_event(mdata, CXL_EVENT_TYPE_DCD,
+ (struct cxl_event_record_raw *)dcd_event);
+
+ /* Fake the irq */
+ cxl_mem_get_event_records(mdata->mds, CXLDEV_EVENT_STATUS_DCD);
+
+ return 0;
+}
+
/*
* Format <start>:<length>:<tag>
*
@@ -2134,6 +2177,7 @@ static ssize_t dc_inject_extent_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
unsigned long long start, length;
char *len_str, *tag_str;
size_t buf_len = count;
@@ -2181,6 +2225,12 @@ static ssize_t dc_inject_extent_store(struct device *dev,
return rc;
}

+ rc = log_dc_event(mdata, DCD_ADD_CAPACITY, start, length, tag_str);
+ if (rc) {
+ dev_err(dev, "Failed to add event %d\n", rc);
+ return rc;
+ }
+
return count;
}
static DEVICE_ATTR_WO(dc_inject_extent);
@@ -2190,8 +2240,10 @@ static ssize_t __dc_del_extent_store(struct device *dev,
const char *buf, size_t count,
enum dc_event type)
{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
unsigned long long start, length;
char *len_str;
+ int rc;

char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
if (!start_str)
@@ -2221,6 +2273,12 @@ static ssize_t __dc_del_extent_store(struct device *dev,
dev_dbg(dev, "Forcing delete of extent %#llx len:%#llx\n",
start, length);

+ rc = log_dc_event(mdata, type, start, length, "");
+ if (rc) {
+ dev_err(dev, "Failed to add event %d\n", rc);
+ return rc;
+ }
+
return count;
}


--
2.44.0


2024-03-25 03:54:47

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 18/26] cxl/mem: Handle DCD add & release capacity events.

From: Navneet Singh <[email protected]>

A dynamic capacity devices (DCD) send events to signal the host about
changes in the availability of Dynamic Capacity (DC) memory. These
events contain extents, the addition or removal of which may occur at
any time.

Adding memory is straight forward. If no region exists the extent is
rejected. If a region does exist, a region extent is formed and
surfaced.

Removing memory requires checking if the memory is currently in use.
Memory use tracking is added in a subsequent patch so here the memory is
never in use and the removal occurs immediately.

Most often extents will be offered to and accepted by the host in well
defined chunks. However, part of an extent may be requested for
release. Simplify extent tracking by signaling removal of any extent
which overlaps the requested release range.

Force removal is intended as a mechanism between the FM and the device
and intended only when the host is unresponsive or otherwise broken.
Purposely ignore force removal events.

Process DCD extents.

Recall that all devices of an interleave set must offer a corresponding
extent for the region extent to be realized. This patch limits
interleave to 1. Thus the 1:1 mapping between device extent and DAX
region extent allows immediate surfacing.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[iweiny: remove all xarrays]
[iweiny: entirely new architecture]
---
drivers/cxl/core/extent.c | 4 ++
drivers/cxl/core/mbox.c | 142 +++++++++++++++++++++++++++++++++++++++++++---
drivers/cxl/core/region.c | 139 ++++++++++++++++++++++++++++++++++++++++-----
drivers/cxl/cxl.h | 34 +++++++++++
drivers/cxl/cxlmem.h | 21 +++----
drivers/cxl/mem.c | 45 +++++++++++++++
drivers/dax/cxl.c | 22 +++++++
include/linux/cxl-event.h | 31 ++++++++++
8 files changed, 405 insertions(+), 33 deletions(-)

diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index 487c220f1c3c..e98acd98ebe2 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -118,6 +118,10 @@ int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
if (rc)
goto err;

+ rc = cxl_region_notify_extent(cxled->cxld.region, DCD_ADD_CAPACITY, reg_ext);
+ if (rc)
+ goto err;
+
dev_dbg(dev, "DAX region extent HPA %#llx - %#llx\n",
reg_ext->hpa_range.start, reg_ext->hpa_range.end);

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 6b00e717e42b..7babac2d1c95 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -870,6 +870,37 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
}
EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);

+static int cxl_notify_dc_extent(struct cxl_memdev_state *mds,
+ enum dc_event event,
+ struct cxl_dc_extent *dc_extent)
+{
+ struct cxl_drv_nd nd = (struct cxl_drv_nd) {
+ .event = event,
+ .dc_extent = dc_extent
+ };
+ struct device *dev;
+ int rc = -ENXIO;
+
+ dev = &mds->cxlds.cxlmd->dev;
+ dev_dbg(dev, "Notify: type %d DPA:%#llx LEN:%#llx\n",
+ event, le64_to_cpu(dc_extent->start_dpa),
+ le64_to_cpu(dc_extent->length));
+
+ device_lock(dev);
+ if (dev->driver) {
+ struct cxl_driver *mem_drv = to_cxl_drv(dev->driver);
+
+ if (mem_drv->notify) {
+ dev_dbg(dev, "Notify driver type %d DPA:%#llx LEN:%#llx\n",
+ event, le64_to_cpu(dc_extent->start_dpa),
+ le64_to_cpu(dc_extent->length));
+ rc = mem_drv->notify(dev, &nd);
+ }
+ }
+ device_unlock(dev);
+ return rc;
+}
+
static int cxl_validate_extent(struct cxl_memdev_state *mds,
struct cxl_dc_extent *dc_extent)
{
@@ -897,8 +928,8 @@ static int cxl_validate_extent(struct cxl_memdev_state *mds,
return -EINVAL;
}

-static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
- struct cxl_dc_extent *extent)
+bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dc_extent *extent)
{
uint64_t start = le64_to_cpu(extent->start_dpa);
uint64_t length = le64_to_cpu(extent->length);
@@ -916,6 +947,7 @@ static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,

return range_contains(&ed_range, &ext_range);
}
+EXPORT_SYMBOL_NS_GPL(cxl_dc_extent_in_ed, CXL);

void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
enum cxl_event_log_type type,
@@ -1027,15 +1059,20 @@ static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
size_t size;

struct cxl_mbox_dc_response *dc_res __free(kfree);
- size = struct_size(dc_res, extent_list, 1);
+ if (!extent)
+ size = struct_size(dc_res, extent_list, 0);
+ else
+ size = struct_size(dc_res, extent_list, 1);
dc_res = kzalloc(size, GFP_KERNEL);
if (!dc_res)
return -ENOMEM;

- dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
- memset(dc_res->extent_list[0].reserved, 0, 8);
- dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
- dc_res->extent_list_size = cpu_to_le32(1);
+ if (extent) {
+ dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
+ memset(dc_res->extent_list[0].reserved, 0, 8);
+ dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
+ dc_res->extent_list_size = cpu_to_le32(1);
+ }

mbox_cmd = (struct cxl_mbox_cmd) {
.opcode = opcode,
@@ -1072,6 +1109,85 @@ void cxl_release_ed_extent(struct cxl_ed_extent *extent)
}
EXPORT_SYMBOL_NS_GPL(cxl_release_ed_extent, CXL);

+static int cxl_handle_dcd_release_event(struct cxl_memdev_state *mds,
+ struct cxl_dc_extent *dc_extent)
+{
+ return cxl_notify_dc_extent(mds, DCD_RELEASE_CAPACITY, dc_extent);
+}
+
+static int cxl_handle_dcd_add_event(struct cxl_memdev_state *mds,
+ struct cxl_dc_extent *dc_extent)
+{
+ struct range alloc_range, *resp_range;
+ struct device *dev = mds->cxlds.dev;
+ int rc;
+
+ alloc_range = (struct range){
+ .start = le64_to_cpu(dc_extent->start_dpa),
+ .end = le64_to_cpu(dc_extent->start_dpa) +
+ le64_to_cpu(dc_extent->length) - 1,
+ };
+ resp_range = &alloc_range;
+
+ rc = cxl_notify_dc_extent(mds, DCD_ADD_CAPACITY, dc_extent);
+ if (rc) {
+ dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
+ le64_to_cpu(dc_extent->start_dpa),
+ le64_to_cpu(dc_extent->length));
+ resp_range = NULL;
+ }
+
+ return cxl_send_dc_cap_response(mds, resp_range,
+ CXL_MBOX_OP_ADD_DC_RESPONSE);
+}
+
+static char *cxl_dcd_evt_type_str(u8 type)
+{
+ switch (type) {
+ case DCD_ADD_CAPACITY:
+ return "add";
+ case DCD_RELEASE_CAPACITY:
+ return "release";
+ case DCD_FORCED_CAPACITY_RELEASE:
+ return "force release";
+ default:
+ break;
+ }
+
+ return "<unknown>";
+}
+
+static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
+ struct cxl_event_record_raw *raw_rec)
+{
+ struct cxl_event_dcd *event = &raw_rec->event.dcd;
+ struct cxl_dc_extent *dc_extent = &event->extent;
+ struct device *dev = mds->cxlds.dev;
+ uuid_t *id = &raw_rec->id;
+
+ if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
+ return -EINVAL;
+
+ dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
+ cxl_dcd_evt_type_str(event->event_type),
+ le64_to_cpu(dc_extent->start_dpa),
+ le64_to_cpu(dc_extent->length));
+
+ switch (event->event_type) {
+ case DCD_ADD_CAPACITY:
+ return cxl_handle_dcd_add_event(mds, dc_extent);
+ case DCD_RELEASE_CAPACITY:
+ return cxl_handle_dcd_release_event(mds, dc_extent);
+ case DCD_FORCED_CAPACITY_RELEASE:
+ dev_err_ratelimited(dev, "Forced release event ignored.\n");
+ return 0;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
enum cxl_event_log_type type)
{
@@ -1109,9 +1225,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
if (!nr_rec)
break;

- for (i = 0; i < nr_rec; i++)
+ for (i = 0; i < nr_rec; i++) {
__cxl_event_trace_record(cxlmd, type,
&payload->records[i]);
+ if (type == CXL_EVENT_TYPE_DCD) {
+ rc = cxl_handle_dcd_event_records(mds,
+ &payload->records[i]);
+ if (rc)
+ dev_err_ratelimited(dev, "dcd event failed: %d\n",
+ rc);
+ }
+ }

if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
trace_cxl_overflow(cxlmd, type, payload);
@@ -1143,6 +1267,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
{
dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);

+ if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
+ cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
if (status & CXLDEV_EVENT_STATUS_FATAL)
cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
if (status & CXLDEV_EVENT_STATUS_FAIL)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 7635ff109578..a07d95136f0d 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1450,6 +1450,57 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
return 0;
}

+int cxl_region_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+ struct region_extent *reg_ext)
+{
+ struct cxl_dax_region *cxlr_dax;
+ struct device *dev;
+ int rc = -ENXIO;
+
+ cxlr_dax = cxlr->cxlr_dax;
+ dev = &cxlr_dax->dev;
+ dev_dbg(dev, "Trying notify: type %d HPA %#llx - %#llx\n",
+ event, reg_ext->hpa_range.start, reg_ext->hpa_range.end);
+
+ device_lock(dev);
+ if (dev->driver) {
+ struct cxl_driver *reg_drv = to_cxl_drv(dev->driver);
+ struct cxl_drv_nd nd = (struct cxl_drv_nd) {
+ .event = event,
+ .reg_ext = reg_ext,
+ };
+
+ if (reg_drv->notify) {
+ dev_dbg(dev, "Notify: type %d HPA %#llx - %#llx\n",
+ event, reg_ext->hpa_range.start,
+ reg_ext->hpa_range.end);
+ rc = reg_drv->notify(dev, &nd);
+ }
+ }
+ device_unlock(dev);
+ return rc;
+}
+
+static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dax_region *cxlr_dax,
+ struct cxl_dc_extent *dc_extent,
+ struct range *dpa_range,
+ struct range *hpa_range)
+{
+ resource_size_t dpa_offset, hpa;
+
+ /*
+ * Without interleave...
+ * HPA offset == DPA offset
+ * ... but do the math anyway
+ */
+ dpa_offset = dpa_range->start - cxled->dpa_res->start;
+ hpa = cxled->cxld.hpa_range.start + dpa_offset;
+
+ hpa_range->start = hpa - cxlr_dax->hpa_range.start;
+ hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
+}
+
static int extent_check_overlap(struct device *dev, void *arg)
{
struct range *new_range = arg;
@@ -1480,7 +1531,6 @@ int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
struct cxl_region *cxlr = cxled->cxld.region;
struct range ext_dpa_range, ext_hpa_range;
struct device *dev = &cxlr->dev;
- resource_size_t dpa_offset, hpa;

/*
* Interleave ways == 1 means this coresponds to a 1:1 mapping between
@@ -1502,18 +1552,7 @@ int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
dev_dbg(dev, "Adding DC extent DPA %#llx - %#llx\n",
ext_dpa_range.start, ext_dpa_range.end);

- /*
- * Without interleave...
- * HPA offset == DPA offset
- * ... but do the math anyway
- */
- dpa_offset = ext_dpa_range.start - cxled->dpa_res->start;
- hpa = cxled->cxld.hpa_range.start + dpa_offset;
-
- ext_hpa_range = (struct range) {
- .start = hpa - cxlr->cxlr_dax->hpa_range.start,
- .end = ext_hpa_range.start + range_len(&ext_dpa_range) - 1,
- };
+ calc_hpa_range(cxled, cxlr->cxlr_dax, dc_extent, &ext_dpa_range, &ext_hpa_range);

if (extent_overlaps(cxlr->cxlr_dax, &ext_hpa_range))
return -EINVAL;
@@ -1527,6 +1566,80 @@ int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
cxled);
}

+static void cxl_ed_rm_region_extent(struct cxl_region *cxlr,
+ struct region_extent *reg_ext)
+{
+ cxl_region_notify_extent(cxlr, DCD_RELEASE_CAPACITY, reg_ext);
+}
+
+struct rm_data {
+ struct cxl_region *cxlr;
+ struct range *range;
+};
+
+static int cxl_rm_reg_ext_by_range(struct device *dev, void *data)
+{
+ struct rm_data *rm_data = data;
+ struct region_extent *reg_ext;
+
+ if (!is_region_extent(dev))
+ return 0;
+ reg_ext = to_region_extent(dev);
+
+ /*
+ * Any extent which 'touches' the released range is notified
+ * for removal. No partials of the extent are released.
+ */
+ if (range_overlaps(rm_data->range, &reg_ext->hpa_range)) {
+ struct cxl_region *cxlr = rm_data->cxlr;
+
+ dev_dbg(dev, "Remove DAX region ext HPA %#llx - %#llx\n",
+ reg_ext->hpa_range.start, reg_ext->hpa_range.end);
+ cxl_ed_rm_region_extent(cxlr, reg_ext);
+ }
+ return 0;
+}
+
+static int cxl_ed_rm_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dc_extent *dc_extent)
+{
+ struct cxl_region *cxlr = cxled->cxld.region;
+ struct range hpa_range;
+
+ struct range rel_dpa_range = {
+ .start = le64_to_cpu(dc_extent->start_dpa),
+ .end = le64_to_cpu(dc_extent->start_dpa) +
+ le64_to_cpu(dc_extent->length) - 1,
+ };
+
+ calc_hpa_range(cxled, cxlr->cxlr_dax, dc_extent, &rel_dpa_range, &hpa_range);
+
+ struct rm_data rm_data = {
+ .cxlr = cxlr,
+ .range = &hpa_range,
+ };
+
+ return device_for_each_child(&cxlr->cxlr_dax->dev, &rm_data,
+ cxl_rm_reg_ext_by_range);
+}
+
+int cxl_ed_notify_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxl_drv_nd *nd)
+{
+ switch (nd->event) {
+ case DCD_ADD_CAPACITY:
+ return cxl_ed_add_one_extent(cxled, nd->dc_extent);
+ case DCD_RELEASE_CAPACITY:
+ return cxl_ed_rm_extent(cxled, nd->dc_extent);
+ case DCD_FORCED_CAPACITY_RELEASE:
+ default:
+ dev_err(&cxled->cxld.dev, "Unknown DC event %d\n", nd->event);
+ break;
+ }
+ return -ENXIO;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_ed_notify_extent, CXL);
+
static int cxl_region_attach_position(struct cxl_region *cxlr,
struct cxl_root_decoder *cxlrd,
struct cxl_endpoint_decoder *cxled,
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 5379ad7f5852..156d7c9a8de5 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -10,6 +10,7 @@
#include <linux/log2.h>
#include <linux/node.h>
#include <linux/io.h>
+#include <linux/cxl-event.h>

/**
* DOC: cxl objects
@@ -613,6 +614,14 @@ struct cxl_pmem_region {
struct cxl_pmem_region_mapping mapping[];
};

+/* See CXL 3.0 8.2.9.2.1.5 */
+enum dc_event {
+ DCD_ADD_CAPACITY,
+ DCD_RELEASE_CAPACITY,
+ DCD_FORCED_CAPACITY_RELEASE,
+ DCD_REGION_CONFIGURATION_UPDATED,
+};
+
struct cxl_dax_region {
struct device dev;
struct cxl_region *cxlr;
@@ -891,10 +900,18 @@ bool is_cxl_region(struct device *dev);

extern struct bus_type cxl_bus_type;

+/* Driver Notifier Data */
+struct cxl_drv_nd {
+ enum dc_event event;
+ struct cxl_dc_extent *dc_extent;
+ struct region_extent *reg_ext;
+};
+
struct cxl_driver {
const char *name;
int (*probe)(struct device *dev);
void (*remove)(struct device *dev);
+ int (*notify)(struct device *dev, struct cxl_drv_nd *nd);
struct device_driver drv;
int id;
};
@@ -933,6 +950,8 @@ bool is_cxl_nvdimm(struct device *dev);
bool is_cxl_nvdimm_bridge(struct device *dev);
int devm_cxl_add_nvdimm(struct cxl_memdev *cxlmd);
struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct cxl_memdev *cxlmd);
+bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dc_extent *extent);

#ifdef CONFIG_CXL_REGION
bool is_cxl_pmem_region(struct device *dev);
@@ -940,6 +959,10 @@ struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
int cxl_add_to_region(struct cxl_port *root,
struct cxl_endpoint_decoder *cxled);
struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
+int cxl_ed_notify_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxl_drv_nd *nd);
+int cxl_region_notify_extent(struct cxl_region *cxlr, enum dc_event event,
+ struct region_extent *reg_ext);
#else
static inline bool is_cxl_pmem_region(struct device *dev)
{
@@ -958,6 +981,17 @@ static inline struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
{
return NULL;
}
+static inline int cxl_ed_notify_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxl_drv_nd *nd)
+{
+ return 0;
+}
+static inline int cxl_region_notify_extent(struct cxl_region *cxlr,
+ enum dc_event event,
+ struct region_extent *reg_ext)
+{
+ return 0;
+}
#endif

void cxl_endpoint_parse_cdat(struct cxl_port *port);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 8f2d8944d334..eb10cae99ff0 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -619,18 +619,6 @@ struct cxl_mbox_dc_response {
} __packed extent_list[];
} __packed;

-/*
- * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
- */
-#define CXL_DC_EXTENT_TAG_LEN 0x10
-struct cxl_dc_extent {
- __le64 start_dpa;
- __le64 length;
- u8 tag[CXL_DC_EXTENT_TAG_LEN];
- __le16 shared_extn_seq;
- u8 reserved[6];
-} __packed;
-
/*
* Get Dynamic Capacity Extent List; Input Payload
* CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
@@ -714,6 +702,14 @@ struct cxl_mbox_identify {
UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
0x13, 0xb7, 0x74)

+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
+ */
+#define CXL_EVENT_DC_EVENT_UUID \
+ UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
+ 0x10, 0x1a, 0x2a)
+
/*
* Get Event Records output payload
* CXL rev 3.0 section 8.2.9.2.2; Table 8-50
@@ -739,6 +735,7 @@ enum cxl_event_log_type {
CXL_EVENT_TYPE_WARN,
CXL_EVENT_TYPE_FAIL,
CXL_EVENT_TYPE_FATAL,
+ CXL_EVENT_TYPE_DCD,
CXL_EVENT_TYPE_MAX
};

diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 0c79d9ce877c..20832f09c40c 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -103,6 +103,50 @@ static int cxl_debugfs_poison_clear(void *data, u64 dpa)
DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_clear_fops, NULL,
cxl_debugfs_poison_clear, "%llx\n");

+static int match_ep_decoder_by_range(struct device *dev, void *data)
+{
+ struct cxl_dc_extent *dc_extent = data;
+ struct cxl_endpoint_decoder *cxled;
+
+ if (!is_endpoint_decoder(dev))
+ return 0;
+
+ cxled = to_cxl_endpoint_decoder(dev);
+ if (!cxled->cxld.region)
+ return 0;
+
+ return cxl_dc_extent_in_ed(cxled, dc_extent);
+}
+
+static int cxl_mem_notify(struct device *dev, struct cxl_drv_nd *nd)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_port *endpoint = cxlmd->endpoint;
+ struct cxl_endpoint_decoder *cxled;
+ struct cxl_dc_extent *dc_extent;
+ struct device *ep_dev;
+ int rc;
+
+ dc_extent = nd->dc_extent;
+ dev_dbg(dev, "notify DC action %d DPA:%#llx LEN:%#llx\n",
+ nd->event, le64_to_cpu(dc_extent->start_dpa),
+ le64_to_cpu(dc_extent->length));
+
+ ep_dev = device_find_child(&endpoint->dev, dc_extent,
+ match_ep_decoder_by_range);
+ if (!ep_dev) {
+ dev_dbg(dev, "Extent DPA:%#llx LEN:%#llx not mapped; evt %d\n",
+ le64_to_cpu(dc_extent->start_dpa),
+ le64_to_cpu(dc_extent->length), nd->event);
+ return -ENXIO;
+ }
+
+ cxled = to_cxl_endpoint_decoder(ep_dev);
+ rc = cxl_ed_notify_extent(cxled, nd);
+ put_device(ep_dev);
+ return rc;
+}
+
static int cxl_mem_probe(struct device *dev)
{
struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
@@ -244,6 +288,7 @@ __ATTRIBUTE_GROUPS(cxl_mem);
static struct cxl_driver cxl_mem_driver = {
.name = "cxl_mem",
.probe = cxl_mem_probe,
+ .notify = cxl_mem_notify,
.id = CXL_DEVICE_MEMORY_EXPANDER,
.drv = {
.dev_groups = cxl_mem_groups,
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 70bdc7a878ab..83ee45aff69a 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -42,6 +42,27 @@ static void cxl_dax_region_add_extents(struct cxl_dax_region *cxlr_dax,
device_for_each_child(&cxlr_dax->dev, dax_region, cxl_dax_region_add_extent);
}

+static int cxl_dax_region_notify(struct device *dev,
+ struct cxl_drv_nd *nd)
+{
+ struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ struct region_extent *reg_ext = nd->reg_ext;
+
+ switch (nd->event) {
+ case DCD_ADD_CAPACITY:
+ return __cxl_dax_region_add_extent(dax_region, reg_ext);
+ case DCD_RELEASE_CAPACITY:
+ return 0;
+ case DCD_FORCED_CAPACITY_RELEASE:
+ default:
+ dev_err(&cxlr_dax->dev, "Unknown DC event %d\n", nd->event);
+ break;
+ }
+
+ return -ENXIO;
+}
+
static int cxl_dax_region_probe(struct device *dev)
{
struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
@@ -85,6 +106,7 @@ static int cxl_dax_region_probe(struct device *dev)
static struct cxl_driver cxl_dax_region_driver = {
.name = "cxl_dax_region",
.probe = cxl_dax_region_probe,
+ .notify = cxl_dax_region_notify,
.id = CXL_DEVICE_DAX_REGION,
.drv = {
.suppress_bind_attrs = true,
diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h
index 03fa6d50d46f..6b745c913f96 100644
--- a/include/linux/cxl-event.h
+++ b/include/linux/cxl-event.h
@@ -91,11 +91,42 @@ struct cxl_event_mem_module {
u8 reserved[0x3d];
} __packed;

+/*
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
+ */
+#define CXL_DC_EXTENT_TAG_LEN 0x10
+struct cxl_dc_extent {
+ __le64 start_dpa;
+ __le64 length;
+ u8 tag[CXL_DC_EXTENT_TAG_LEN];
+ __le16 shared_extn_seq;
+ u8 reserved[0x6];
+} __packed;
+
+/*
+ * Dynamic Capacity Event Record
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
+ */
+struct cxl_event_dcd {
+ struct cxl_event_record_hdr hdr;
+ u8 event_type;
+ u8 validity_flags;
+ __le16 host_id;
+ u8 region_index;
+ u8 flags;
+ u8 reserved1[0x2];
+ struct cxl_dc_extent extent;
+ u8 reserved2[0x18];
+ __le32 num_avail_extents;
+ __le32 num_avail_tags;
+} __packed;
+
union cxl_event {
struct cxl_event_generic generic;
struct cxl_event_gen_media gen_media;
struct cxl_event_dram dram;
struct cxl_event_mem_module mem_module;
+ struct cxl_event_dcd dcd;
} __packed;

/*

--
2.44.0


2024-03-25 03:55:11

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 19/26] dax/bus: Factor out dev dax resize logic

Dynamic Capacity regions must limit dev dax resources to those areas
which have extents backing real memory. Such DAX regions are dubbed
'sparse' regions. In order to manage where memory is available four
alternatives were considered:

1) Create a single region resource child on region creation which
reserves the entire region. Then as extents are added punch holes in
this reservation. This requires new resource manipulation to punch
the holes and still requires an additional iteration over the extent
areas which may already have existing dev dax resources used.

2) Maintain an ordered xarray of extents which can be queried while
processing the resize logic. The issue is that existing region->res
children may artificially limit the allocation size sent to
alloc_dev_dax_range(). IE the resource children can't be directly
used in the resize logic to find where space in the region is. This
also poses a problem of managing the available size in 2 places.

3) Maintain a separate resource tree with extents. This option is the
same as 2) but with the different data structure. Most ideally there
should be a unified representation of the resource tree not two places
to look for space.

4) Create region resource children for each extent. Manage the dax dev
resize logic in the same way as before but use a region child
(extent) resource as the parents to find space within each extent.

Option 4 can leverage the existing resize algorithm to find space within
the extents. It manages the available space in a singular resource tree
which is less complicated for finding space.

In preparation for this change, factor out the dev_dax_resize logic.
For static regions use dax_region->res as the parent to find space for
the dax ranges. Future patches will use the same algorithm with
individual extent resources as the parent.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V1
[iweiny: Rebase on new DAX region locking]
[iweiny: Reword commit message]
[iweiny: Drop reviews]
---
drivers/dax/bus.c | 129 +++++++++++++++++++++++++++++++++---------------------
1 file changed, 79 insertions(+), 50 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 4d5ed7ab6537..bab19fc578d0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -928,11 +928,9 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
return 0;
}

-static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
- resource_size_t size)
+static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
+ u64 start, resource_size_t size)
{
- struct dax_region *dax_region = dev_dax->region;
- struct resource *res = &dax_region->res;
struct device *dev = &dev_dax->dev;
struct dev_dax_range *ranges;
unsigned long pgoff = 0;
@@ -950,14 +948,14 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
return 0;
}

- alloc = __request_region(res, start, size, dev_name(dev), 0);
+ alloc = __request_region(parent, start, size, dev_name(dev), 0);
if (!alloc)
return -ENOMEM;

ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
* (dev_dax->nr_range + 1), GFP_KERNEL);
if (!ranges) {
- __release_region(res, alloc->start, resource_size(alloc));
+ __release_region(parent, alloc->start, resource_size(alloc));
return -ENOMEM;
}

@@ -1110,50 +1108,45 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
return true;
}

-static ssize_t dev_dax_resize(struct dax_region *dax_region,
- struct dev_dax *dev_dax, resource_size_t size)
+/**
+ * dev_dax_resize_static - Expand the device into the unused portion of the
+ * region. This may involve adjusting the end of an existing resource, or
+ * allocating a new resource.
+ *
+ * @parent: parent resource to allocate this range in
+ * @dev_dax: DAX device to be expanded
+ * @to_alloc: amount of space to alloc; must be <= space available in @parent
+ *
+ * Return the amount of space allocated or -ERRNO on failure
+ */
+static ssize_t dev_dax_resize_static(struct resource *parent,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
{
- resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
- resource_size_t dev_size = dev_dax_size(dev_dax);
- struct resource *region_res = &dax_region->res;
- struct device *dev = &dev_dax->dev;
struct resource *res, *first;
- resource_size_t alloc = 0;
int rc;

- if (dev->driver)
- return -EBUSY;
- if (size == dev_size)
- return 0;
- if (size > dev_size && size - dev_size > avail)
- return -ENOSPC;
- if (size < dev_size)
- return dev_dax_shrink(dev_dax, size);
-
- to_alloc = size - dev_size;
- if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
- "resize of %pa misaligned\n", &to_alloc))
- return -ENXIO;
-
- /*
- * Expand the device into the unused portion of the region. This
- * may involve adjusting the end of an existing resource, or
- * allocating a new resource.
- */
-retry:
- first = region_res->child;
- if (!first)
- return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+ first = parent->child;
+ if (!first) {
+ rc = alloc_dev_dax_range(parent, dev_dax,
+ parent->start, to_alloc);
+ if (rc)
+ return rc;
+ return to_alloc;
+ }

- rc = -ENOSPC;
for (res = first; res; res = res->sibling) {
struct resource *next = res->sibling;
+ resource_size_t alloc;

/* space at the beginning of the region */
- if (res == first && res->start > dax_region->res.start) {
- alloc = min(res->start - dax_region->res.start, to_alloc);
- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
- break;
+ if (res == first && res->start > parent->start) {
+ alloc = min(res->start - parent->start, to_alloc);
+ rc = alloc_dev_dax_range(parent, dev_dax,
+ parent->start, alloc);
+ if (rc)
+ return rc;
+ return alloc;
}

alloc = 0;
@@ -1162,21 +1155,55 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
alloc = min(next->start - (res->end + 1), to_alloc);

/* space at the end of the region */
- if (!alloc && !next && res->end < region_res->end)
- alloc = min(region_res->end - res->end, to_alloc);
+ if (!alloc && !next && res->end < parent->end)
+ alloc = min(parent->end - res->end, to_alloc);

if (!alloc)
continue;

if (adjust_ok(dev_dax, res)) {
rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
- break;
+ if (rc)
+ return rc;
+ return alloc;
}
- rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
- break;
+ rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+ if (rc)
+ return rc;
+ return alloc;
}
- if (rc)
- return rc;
+
+ /* available was already calculated and should never be an issue */
+ dev_WARN_ONCE(&dev_dax->dev, 1, "space not found?");
+ return 0;
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+ struct dev_dax *dev_dax, resource_size_t size)
+{
+ resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
+ resource_size_t dev_size = dev_dax_size(dev_dax);
+ struct device *dev = &dev_dax->dev;
+ resource_size_t alloc = 0;
+
+ if (dev->driver)
+ return -EBUSY;
+ if (size == dev_size)
+ return 0;
+ if (size > dev_size && size - dev_size > avail)
+ return -ENOSPC;
+ if (size < dev_size)
+ return dev_dax_shrink(dev_dax, size);
+
+ to_alloc = size - dev_size;
+ if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
+ "resize of %pa misaligned\n", &to_alloc))
+ return -ENXIO;
+
+retry:
+ alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+ if (alloc <= 0)
+ return alloc;
to_alloc -= alloc;
if (to_alloc)
goto retry;
@@ -1283,7 +1310,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,

to_alloc = range_len(&r);
if (alloc_is_aligned(dev_dax, to_alloc))
- rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+ rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
+ to_alloc);
up_write(&dax_dev_rwsem);
up_write(&dax_region_rwsem);

@@ -1506,7 +1534,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
device_initialize(dev);
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);

- rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
+ rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
+ data->size);
if (rc)
goto err_range;


--
2.44.0


2024-03-25 03:56:05

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 21/26] dax/region: Prevent range mapping allocation on sparse regions

Sparse regions are not fully populated with memory and this complicates
range mapping of dax devices on those regions. There is no use case for
range mapping on sparse regions.

Avoid the complication by prevent range mapping of dax devices on sparse
regions.

Signed-off-by: Ira Weiny <[email protected]>
---
drivers/dax/bus.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index bab19fc578d0..56dddaceeccb 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1452,6 +1452,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
return 0;
if (a == &dev_attr_mapping.attr && is_static(dax_region))
return 0;
+ if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
+ return 0;
if ((a == &dev_attr_align.attr ||
a == &dev_attr_size.attr) && is_static(dax_region))
return 0444;

--
2.44.0


2024-03-25 03:56:25

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 23/26] cxl/mem: Trace Dynamic capacity Event Record

From: Navneet Singh <[email protected]>

CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
Notify the host of extents being added or removed. User space has
little use for these events other than for debugging.

Add DC trace points to the trace log for debugging purposes.

Signed-off-by: Navneet Singh <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[iweiny: Adjust to new trace code]
---
drivers/cxl/core/mbox.c | 4 +++
drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 7babac2d1c95..cb4576890187 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -978,6 +978,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
ev_type = CXL_CPER_EVENT_DRAM;
else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
ev_type = CXL_CPER_EVENT_MEM_MODULE;
+ else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
+ trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
+ return;
+ }

cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
}
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index bdf117a33744..7646fdd9aee3 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -707,6 +707,71 @@ TRACE_EVENT(cxl_poison,
)
);

+/*
+ * DYNAMIC CAPACITY Event Record - DER
+ *
+ * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
+ */
+
+#define CXL_DC_ADD_CAPACITY 0x00
+#define CXL_DC_REL_CAPACITY 0x01
+#define CXL_DC_FORCED_REL_CAPACITY 0x02
+#define CXL_DC_REG_CONF_UPDATED 0x03
+#define show_dc_evt_type(type) __print_symbolic(type, \
+ { CXL_DC_ADD_CAPACITY, "Add capacity"}, \
+ { CXL_DC_REL_CAPACITY, "Release capacity"}, \
+ { CXL_DC_FORCED_REL_CAPACITY, "Forced capacity release"}, \
+ { CXL_DC_REG_CONF_UPDATED, "Region Configuration Updated" } \
+)
+
+TRACE_EVENT(cxl_dynamic_capacity,
+
+ TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
+ struct cxl_event_dcd *rec),
+
+ TP_ARGS(cxlmd, log, rec),
+
+ TP_STRUCT__entry(
+ CXL_EVT_TP_entry
+
+ /* Dynamic capacity Event */
+ __field(u8, event_type)
+ __field(u16, hostid)
+ __field(u8, region_id)
+ __field(u64, dpa_start)
+ __field(u64, length)
+ __array(u8, tag, CXL_DC_EXTENT_TAG_LEN)
+ __field(u16, sh_extent_seq)
+ ),
+
+ TP_fast_assign(
+ CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
+
+ /* Dynamic_capacity Event */
+ __entry->event_type = rec->event_type;
+
+ /* DCD event record data */
+ __entry->hostid = le16_to_cpu(rec->host_id);
+ __entry->region_id = rec->region_index;
+ __entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
+ __entry->length = le64_to_cpu(rec->extent.length);
+ memcpy(__entry->tag, &rec->extent.tag, CXL_DC_EXTENT_TAG_LEN);
+ __entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
+ ),
+
+ CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
+ "starting_dpa=%llx length=%llx tag=%s " \
+ "shared_extent_sequence=%d",
+ show_dc_evt_type(__entry->event_type),
+ __entry->hostid,
+ __entry->region_id,
+ __entry->dpa_start,
+ __entry->length,
+ __print_hex(__entry->tag, CXL_DC_EXTENT_TAG_LEN),
+ __entry->sh_extent_seq
+ )
+);
+
#endif /* _CXL_EVENTS_H */

#define TRACE_INCLUDE_FILE trace

--
2.44.0


2024-03-25 04:03:22

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 25/26] tools/testing/cxl: Add DC Regions to mock mem data

cxl_test provides a good way to ensure quick smoke and regression
testing. The complexity of Dynamic Capacity (DC) devices and the new
sparse DAX regions required to use them benefits greatly with a series
of smoke tests.

To test DC regions the mock memory devices will need mock DC information
and manage fake extent data.

Define mock_dc_region information within the mock memory data. Add
sysfs entries on the mock device to inject and delete extents.

The inject format is <start>:<length>:<tag>
The delete format is <start>:<length>

Add DC mailbox commands to the CEL and implement those commands.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[iweiny: adjust to new events]
[iweiny: remove most extent checks to allow negative testing]
---
tools/testing/cxl/test/mem.c | 575 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 574 insertions(+), 1 deletion(-)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index d8d62e6eeb18..7d1d897d9f2b 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -18,6 +18,7 @@
#define FW_SLOTS 3
#define DEV_SIZE SZ_2G
#define EFFECT(x) (1U << x)
+#define BASE_DYNAMIC_CAP_DPA DEV_SIZE

#define MOCK_INJECT_DEV_MAX 8
#define MOCK_INJECT_TEST_MAX 128
@@ -95,6 +96,22 @@ static struct cxl_cel_entry mock_cel[] = {
EFFECT(SECURITY_CHANGE_IMMEDIATE) |
EFFECT(BACKGROUND_OP)),
},
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_CONFIG),
+ .effect = CXL_CMD_EFFECT_NONE,
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_GET_DC_EXTENT_LIST),
+ .effect = CXL_CMD_EFFECT_NONE,
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_ADD_DC_RESPONSE),
+ .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+ },
+ {
+ .opcode = cpu_to_le16(CXL_MBOX_OP_RELEASE_DC),
+ .effect = cpu_to_le16(EFFECT(CONF_CHANGE_IMMEDIATE)),
+ },
};

/* See CXL 2.0 Table 181 Get Health Info Output Payload */
@@ -152,6 +169,7 @@ struct mock_event_store {
u32 ev_status;
};

+#define NUM_MOCK_DC_REGIONS 2
struct cxl_mockmem_data {
void *lsa;
void *fw;
@@ -168,6 +186,11 @@ struct cxl_mockmem_data {
u8 event_buf[SZ_4K];
u64 timestamp;
unsigned long sanitize_timeout;
+ struct cxl_dc_region_config dc_regions[NUM_MOCK_DC_REGIONS];
+ u32 dc_ext_generation;
+ struct mutex ext_lock;
+ struct xarray dc_extents;
+ struct xarray dc_accepted_exts;
};

static struct mock_event_log *event_find_log(struct device *dev, int log_type)
@@ -558,6 +581,200 @@ static void cxl_mock_event_trigger(struct device *dev)
cxl_mem_get_event_records(mdata->mds, mes->ev_status);
}

+struct cxl_dc_extent_data {
+ u64 dpa_start;
+ u64 length;
+ u8 tag[CXL_DC_EXTENT_TAG_LEN];
+};
+
+static int __devm_add_extent(struct device *dev, struct xarray *array,
+ u64 start, u64 length, const char *tag)
+{
+ struct cxl_dc_extent_data *extent;
+
+ extent = devm_kzalloc(dev, sizeof(*extent), GFP_KERNEL);
+ if (!extent)
+ return -ENOMEM;
+
+ extent->dpa_start = start;
+ extent->length = length;
+ memcpy(extent->tag, tag, min(sizeof(extent->tag), strlen(tag)));
+
+ if (xa_insert(array, start, extent, GFP_KERNEL)) {
+ devm_kfree(dev, extent);
+ dev_err(dev, "Failed xarry insert %#llx\n", start);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int devm_add_extent(struct device *dev, u64 start, u64 length,
+ const char *tag)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+
+ guard(mutex)(&mdata->ext_lock);
+ return __devm_add_extent(dev, &mdata->dc_extents, start, length, tag);
+}
+
+/* It is known that ext and the new range are not equal */
+static struct cxl_dc_extent_data *
+split_ext(struct device *dev, struct xarray *array,
+ struct cxl_dc_extent_data *ext, u64 start, u64 length)
+{
+ u64 new_start, new_length;
+
+ if (ext->dpa_start == start) {
+ new_start = start + length;
+ new_length = (ext->dpa_start + ext->length) - new_start;
+
+ if (__devm_add_extent(dev, array, new_start, new_length,
+ ext->tag))
+ return NULL;
+
+ ext = xa_erase(array, ext->dpa_start);
+ if (__devm_add_extent(dev, array, start, length, ext->tag))
+ return NULL;
+
+ return xa_load(array, start);
+ }
+
+ /* ext->dpa_start != start */
+
+ if (__devm_add_extent(dev, array, start, length, ext->tag))
+ return NULL;
+
+ new_start = ext->dpa_start;
+ new_length = start - ext->dpa_start;
+
+ ext = xa_erase(array, ext->dpa_start);
+ if (__devm_add_extent(dev, array, new_start, new_length, ext->tag))
+ return NULL;
+
+ return xa_load(array, start);
+}
+
+/*
+ * Do not handle extents which are not inside a single extent sent to
+ * the host.
+ */
+static struct cxl_dc_extent_data *
+find_create_ext(struct device *dev, struct xarray *array, u64 start, u64 length)
+{
+ struct cxl_dc_extent_data *ext;
+ unsigned long index;
+
+ xa_for_each(array, index, ext) {
+ u64 end = start + length;
+
+ /* start < [ext) <= start */
+ if (start < ext->dpa_start ||
+ (ext->dpa_start + ext->length) <= start)
+ continue;
+
+ if (end <= ext->dpa_start ||
+ (ext->dpa_start + ext->length) < end) {
+ dev_err(dev, "Invalid range %#llx-%#llx\n", start,
+ end);
+ return NULL;
+ }
+
+ break;
+ }
+
+ if (!ext)
+ return NULL;
+
+ if (start == ext->dpa_start && length == ext->length)
+ return ext;
+
+ return split_ext(dev, array, ext, start, length);
+}
+
+static int dc_accept_extent(struct device *dev, u64 start, u64 length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_dc_extent_data *ext;
+
+ dev_dbg(dev, "Host accepting extent %#llx\n", start);
+ mdata->dc_ext_generation++;
+
+ guard(mutex)(&mdata->ext_lock);
+ ext = find_create_ext(dev, &mdata->dc_extents, start, length);
+ if (!ext) {
+ dev_err(dev, "Extent %#llx-%#llx not found\n",
+ start, start + length);
+ return -ENOMEM;
+ }
+ ext = xa_erase(&mdata->dc_extents, ext->dpa_start);
+ return xa_insert(&mdata->dc_accepted_exts, start, ext, GFP_KERNEL);
+}
+
+static void release_dc_ext(void *md)
+{
+ struct cxl_mockmem_data *mdata = md;
+
+ xa_destroy(&mdata->dc_extents);
+ xa_destroy(&mdata->dc_accepted_exts);
+}
+
+static int cxl_mock_dc_region_setup(struct device *dev)
+{
+#define DUMMY_EXT_OFFSET SZ_256M
+#define DUMMY_EXT_LENGTH SZ_256M
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ u64 base_dpa = BASE_DYNAMIC_CAP_DPA;
+ u32 dsmad_handle = 0xFADE;
+ u64 decode_length = SZ_1G;
+ u64 block_size = SZ_512;
+ /* For testing make this smaller than decode length */
+ u64 length = SZ_1G;
+ int rc;
+
+ mutex_init(&mdata->ext_lock);
+ xa_init(&mdata->dc_extents);
+ xa_init(&mdata->dc_accepted_exts);
+
+ rc = devm_add_action_or_reset(dev, release_dc_ext, mdata);
+ if (rc)
+ return rc;
+
+ for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+ struct cxl_dc_region_config *conf = &mdata->dc_regions[i];
+
+ dev_dbg(dev, "Creating DC region DC%d DPA:%#llx LEN:%#llx\n",
+ i, base_dpa, length);
+
+ conf->region_base = cpu_to_le64(base_dpa);
+ conf->region_decode_length = cpu_to_le64(decode_length /
+ CXL_CAPACITY_MULTIPLIER);
+ conf->region_length = cpu_to_le64(length);
+ conf->region_block_size = cpu_to_le64(block_size);
+ conf->region_dsmad_handle = cpu_to_le32(dsmad_handle);
+ dsmad_handle++;
+
+ /* Pretend to have some previous accepted extents */
+ rc = devm_add_extent(dev, base_dpa + DUMMY_EXT_OFFSET,
+ DUMMY_EXT_LENGTH, "CXL-TEST");
+ if (rc) {
+ dev_err(dev, "Failed to add extent DC%d DPA:%#llx LEN:%#x; %d\n",
+ i, base_dpa + DUMMY_EXT_OFFSET,
+ DUMMY_EXT_LENGTH, rc);
+ return rc;
+ }
+
+ rc = dc_accept_extent(dev, base_dpa + DUMMY_EXT_OFFSET,
+ DUMMY_EXT_LENGTH);
+ if (rc)
+ return rc;
+
+ base_dpa += decode_length;
+ }
+
+ return 0;
+}
+
static int mock_gsl(struct cxl_mbox_cmd *cmd)
{
if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1371,6 +1588,177 @@ static int mock_activate_fw(struct cxl_mockmem_data *mdata,
return -EINVAL;
}

+static int mock_get_dc_config(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_get_dc_config_in *dc_config = cmd->payload_in;
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ u8 region_requested, region_start_idx, region_ret_cnt;
+ struct cxl_mbox_get_dc_config_out *resp;
+
+ region_requested = dc_config->region_count;
+ if (region_requested > NUM_MOCK_DC_REGIONS)
+ region_requested = NUM_MOCK_DC_REGIONS;
+
+ if (cmd->size_out < struct_size(resp, region, region_requested))
+ return -EINVAL;
+
+ memset(cmd->payload_out, 0, cmd->size_out);
+ resp = cmd->payload_out;
+
+ region_start_idx = dc_config->start_region_index;
+ region_ret_cnt = 0;
+ for (int i = 0; i < NUM_MOCK_DC_REGIONS; i++) {
+ if (i >= region_start_idx) {
+ memcpy(&resp->region[region_ret_cnt],
+ &mdata->dc_regions[i],
+ sizeof(resp->region[region_ret_cnt]));
+ region_ret_cnt++;
+ }
+ }
+ resp->avail_region_count = region_ret_cnt;
+
+ dev_dbg(dev, "Returning %d dc regions\n", region_ret_cnt);
+ return 0;
+}
+
+static int mock_get_dc_extent_list(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_mbox_get_dc_extent_in *get = cmd->payload_in;
+ struct cxl_mbox_get_dc_extent_out *resp = cmd->payload_out;
+ u32 total_avail = 0, total_ret = 0;
+ struct cxl_dc_extent_data *ext;
+ u32 ext_count, start_idx;
+ unsigned long i;
+
+ ext_count = le32_to_cpu(get->extent_cnt);
+ start_idx = le32_to_cpu(get->start_extent_index);
+
+ memset(resp, 0, sizeof(*resp));
+
+ guard(mutex)(&mdata->ext_lock);
+ /*
+ * Total available needs to be calculated and returned regardless of
+ * how many can actually be returned.
+ */
+ xa_for_each(&mdata->dc_accepted_exts, i, ext)
+ total_avail++;
+
+ if (start_idx > total_avail)
+ return -EINVAL;
+
+ xa_for_each(&mdata->dc_accepted_exts, i, ext) {
+ if (total_ret >= ext_count)
+ break;
+
+ if (total_ret >= start_idx) {
+ resp->extent[total_ret].start_dpa =
+ cpu_to_le64(ext->dpa_start);
+ resp->extent[total_ret].length =
+ cpu_to_le64(ext->length);
+ memcpy(&resp->extent[total_ret].tag, ext->tag,
+ sizeof(resp->extent[total_ret]));
+ total_ret++;
+ }
+ }
+
+ resp->ret_extent_cnt = cpu_to_le32(total_ret);
+ resp->total_extent_cnt = cpu_to_le32(total_avail);
+ resp->extent_list_num = cpu_to_le32(mdata->dc_ext_generation);
+
+ dev_dbg(dev, "Returning %d extents of %d total\n",
+ total_ret, total_avail);
+
+ return 0;
+}
+
+static int mock_add_dc_response(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_dc_response *req = cmd->payload_in;
+ u32 list_size = le32_to_cpu(req->extent_list_size);
+
+ for (int i = 0; i < list_size; i++) {
+ u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+ u64 length = le64_to_cpu(req->extent_list[i].length);
+ int rc;
+
+ rc = dc_accept_extent(dev, start, length);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+
+static void dc_delete_extent(struct device *dev, unsigned long long start,
+ unsigned long long length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ unsigned long long end = start + length;
+ struct cxl_dc_extent_data *ext;
+ unsigned long index;
+
+ dev_dbg(dev, "Deleting extent at %#llx len:%#llx\n", start, length);
+
+ guard(mutex)(&mdata->ext_lock);
+ xa_for_each(&mdata->dc_extents, index, ext) {
+ u64 extent_end = ext->dpa_start + ext->length;
+
+ /*
+ * Any extent which 'touches' the released delete range will be
+ * removed.
+ */
+ if ((start <= ext->dpa_start && ext->dpa_start < end) ||
+ (start <= extent_end && extent_end < end)) {
+ xa_erase(&mdata->dc_extents, ext->dpa_start);
+ }
+ }
+
+ /*
+ * If the extent was accepted let it be for the host to drop
+ * later.
+ */
+}
+
+static int release_accepted_extent(struct device *dev,
+ unsigned long long start,
+ unsigned long long length)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_dc_extent_data *ext;
+
+ guard(mutex)(&mdata->ext_lock);
+ ext = find_create_ext(dev, &mdata->dc_accepted_exts, start, length);
+ if (!ext) {
+ dev_err(dev, "Extent %#llx not in accepted state\n", start);
+ return -EINVAL;
+ }
+ xa_erase(&mdata->dc_accepted_exts, ext->dpa_start);
+ mdata->dc_ext_generation++;
+
+ return 0;
+}
+
+static int mock_dc_release(struct device *dev,
+ struct cxl_mbox_cmd *cmd)
+{
+ struct cxl_mbox_dc_response *req = cmd->payload_in;
+ u32 list_size = le32_to_cpu(req->extent_list_size);
+
+ for (int i = 0; i < list_size; i++) {
+ u64 start = le64_to_cpu(req->extent_list[i].dpa_start);
+ u64 length = le64_to_cpu(req->extent_list[i].length);
+
+ dev_dbg(dev, "Extent %#llx released by host\n", start);
+ release_accepted_extent(dev, start, length);
+ }
+
+ return 0;
+}
+
static int cxl_mock_mbox_send(struct cxl_memdev_state *mds,
struct cxl_mbox_cmd *cmd)
{
@@ -1455,6 +1843,18 @@ static int cxl_mock_mbox_send(struct cxl_memdev_state *mds,
case CXL_MBOX_OP_ACTIVATE_FW:
rc = mock_activate_fw(mdata, cmd);
break;
+ case CXL_MBOX_OP_GET_DC_CONFIG:
+ rc = mock_get_dc_config(dev, cmd);
+ break;
+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
+ rc = mock_get_dc_extent_list(dev, cmd);
+ break;
+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
+ rc = mock_add_dc_response(dev, cmd);
+ break;
+ case CXL_MBOX_OP_RELEASE_DC:
+ rc = mock_dc_release(dev, cmd);
+ break;
default:
break;
}
@@ -1499,6 +1899,14 @@ static void init_event_log(struct mock_event_log *log)
log->next_handle = 1;
}

+static void cxl_mock_mem_remove(struct platform_device *pdev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(&pdev->dev);
+ struct cxl_memdev_state *mds = mdata->mds;
+
+ dev_dbg(mds->cxlds.dev, "Removing extents\n");
+}
+
static int cxl_mock_mem_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
@@ -1513,6 +1921,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
return -ENOMEM;
dev_set_drvdata(dev, mdata);

+ rc = cxl_mock_dc_region_setup(dev);
+ if (rc)
+ return rc;
+
mdata->lsa = vmalloc(LSA_SIZE);
if (!mdata->lsa)
return -ENOMEM;
@@ -1561,6 +1973,10 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
return rc;

+ rc = cxl_dev_dynamic_capacity_identify(mds);
+ if (rc)
+ return rc;
+
rc = cxl_mem_create_range_info(mds);
if (rc)
return rc;
@@ -1673,14 +2089,170 @@ static ssize_t sanitize_timeout_store(struct device *dev,

return count;
}
-
static DEVICE_ATTR_RW(sanitize_timeout);

+/* Return if the proposed extent would break the test code */
+static bool new_extent_valid(struct device *dev, size_t new_start,
+ size_t new_len)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct cxl_dc_extent_data *extent;
+ size_t new_end, i;
+
+ if (!new_len)
+ return false;
+
+ new_end = new_start + new_len;
+
+ dev_dbg(dev, "New extent %zx-%zx\n", new_start, new_end);
+
+ guard(mutex)(&mdata->ext_lock);
+ dev_dbg(dev, "Checking extents starts...\n");
+ xa_for_each(&mdata->dc_extents, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ dev_dbg(dev, "Checking accepted extents starts...\n");
+ xa_for_each(&mdata->dc_accepted_exts, i, extent) {
+ if (extent->dpa_start == new_start)
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Format <start>:<length>:<tag>
+ *
+ * start and length must be a multiple of the configured region block size.
+ * Tag can be any string up to 16 bytes.
+ *
+ * Extents must be exclusive of other extents
+ */
+static ssize_t dc_inject_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long long start, length;
+ char *len_str, *tag_str;
+ size_t buf_len = count;
+ int rc;
+
+ char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+ if (!start_str)
+ return -ENOMEM;
+
+ len_str = strnchr(start_str, buf_len, ':');
+ if (!len_str) {
+ dev_err(dev, "Extent failed to find len_str: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ *len_str = '\0';
+ len_str += 1;
+ buf_len -= strlen(start_str);
+
+ tag_str = strnchr(len_str, buf_len, ':');
+ if (!tag_str) {
+ dev_err(dev, "Extent failed to find tag_str: %s\n", len_str);
+ return -EINVAL;
+ }
+ *tag_str = '\0';
+ tag_str += 1;
+
+ if (kstrtoull(start_str, 0, &start)) {
+ dev_err(dev, "Extent failed to parse start: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(len_str, 0, &length)) {
+ dev_err(dev, "Extent failed to parse length: %s\n", len_str);
+ return -EINVAL;
+ }
+
+ if (!new_extent_valid(dev, start, length))
+ return -EINVAL;
+
+ rc = devm_add_extent(dev, start, length, tag_str);
+ if (rc) {
+ dev_err(dev, "Failed to add extent DPA:%#llx LEN:%#llx; %d\n",
+ start, length, rc);
+ return rc;
+ }
+
+ return count;
+}
+static DEVICE_ATTR_WO(dc_inject_extent);
+
+static ssize_t __dc_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count,
+ enum dc_event type)
+{
+ unsigned long long start, length;
+ char *len_str;
+
+ char *start_str __free(kfree) = kstrdup(buf, GFP_KERNEL);
+ if (!start_str)
+ return -ENOMEM;
+
+ len_str = strnchr(start_str, count, ':');
+ if (!len_str) {
+ dev_err(dev, "Failed to find len_str: %s\n", start_str);
+ return -EINVAL;
+ }
+ *len_str = '\0';
+ len_str += 1;
+
+ if (kstrtoull(start_str, 0, &start)) {
+ dev_err(dev, "Failed to parse start: %s\n", start_str);
+ return -EINVAL;
+ }
+
+ if (kstrtoull(len_str, 0, &length)) {
+ dev_err(dev, "Failed to parse length: %s\n", len_str);
+ return -EINVAL;
+ }
+
+ dc_delete_extent(dev, start, length);
+
+ if (type == DCD_FORCED_CAPACITY_RELEASE)
+ dev_dbg(dev, "Forcing delete of extent %#llx len:%#llx\n",
+ start, length);
+
+ return count;
+}
+
+/*
+ * Format <start>:<length>
+ */
+static ssize_t dc_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_del_extent_store(dev, attr, buf, count,
+ DCD_RELEASE_CAPACITY);
+}
+static DEVICE_ATTR_WO(dc_del_extent);
+
+static ssize_t dc_force_del_extent_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ return __dc_del_extent_store(dev, attr, buf, count,
+ DCD_FORCED_CAPACITY_RELEASE);
+}
+static DEVICE_ATTR_WO(dc_force_del_extent);
+
static struct attribute *cxl_mock_mem_attrs[] = {
&dev_attr_security_lock.attr,
&dev_attr_event_trigger.attr,
&dev_attr_fw_buf_checksum.attr,
&dev_attr_sanitize_timeout.attr,
+ &dev_attr_dc_inject_extent.attr,
+ &dev_attr_dc_del_extent.attr,
+ &dev_attr_dc_force_del_extent.attr,
NULL
};
ATTRIBUTE_GROUPS(cxl_mock_mem);
@@ -1694,6 +2266,7 @@ MODULE_DEVICE_TABLE(platform, cxl_mock_mem_ids);

static struct platform_driver cxl_mock_mem_driver = {
.probe = cxl_mock_mem_probe,
+ .remove_new = cxl_mock_mem_remove,
.id_table = cxl_mock_mem_ids,
.driver = {
.name = KBUILD_MODNAME,

--
2.44.0


2024-03-25 04:04:04

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts

From: Navneet Singh <[email protected]>

Dynamic Capacity Devices (DCD) support extent change notifications
through the event log mechanism. The interrupt mailbox commands were
extended in CXL 3.1 to support these notifications.

Firmware can't configure DCD events to be FW controlled but can retain
control of memory events. Split irq configuration of memory events and
DCD events to allow for FW control of memory events while DCD is host
controlled.

Configure DCD event log interrupts on devices supporting dynamic
capacity. Disable DCD if interrupts are not supported.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[iweiny: rebase to upstream irq code]
[iweiny: disable DCD if irqs not supported]
---
drivers/cxl/core/mbox.c | 9 ++++++-
drivers/cxl/cxl.h | 4 ++-
drivers/cxl/cxlmem.h | 4 +++
drivers/cxl/pci.c | 71 ++++++++++++++++++++++++++++++++++++++++---------
4 files changed, 74 insertions(+), 14 deletions(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 14e8a7528a8b..58b31fa47b93 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1323,10 +1323,17 @@ static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
return rc;
}

-static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
+bool cxl_dcd_supported(struct cxl_memdev_state *mds)
{
return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
}
+EXPORT_SYMBOL_NS_GPL(cxl_dcd_supported, CXL);
+
+void cxl_disable_dcd(struct cxl_memdev_state *mds)
+{
+ clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_disable_dcd, CXL);

/**
* cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 15d418b3bc9b..d585f5fdd3ae 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -164,11 +164,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
#define CXLDEV_EVENT_STATUS_WARN BIT(1)
#define CXLDEV_EVENT_STATUS_FAIL BIT(2)
#define CXLDEV_EVENT_STATUS_FATAL BIT(3)
+#define CXLDEV_EVENT_STATUS_DCD BIT(4)

#define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
CXLDEV_EVENT_STATUS_WARN | \
CXLDEV_EVENT_STATUS_FAIL | \
- CXLDEV_EVENT_STATUS_FATAL)
+ CXLDEV_EVENT_STATUS_FATAL| \
+ CXLDEV_EVENT_STATUS_DCD)

/* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
#define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0)
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 4624cf612c1e..01bee6eedff3 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -225,7 +225,9 @@ struct cxl_event_interrupt_policy {
u8 warn_settings;
u8 failure_settings;
u8 fatal_settings;
+ u8 dcd_settings;
} __packed;
+#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */

/**
* struct cxl_event_state - Event log driver state
@@ -890,6 +892,8 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
enum cxl_event_log_type type,
enum cxl_event_type event_type,
const uuid_t *uuid, union cxl_event *evt);
+bool cxl_dcd_supported(struct cxl_memdev_state *mds);
+void cxl_disable_dcd(struct cxl_memdev_state *mds);
int cxl_set_timestamp(struct cxl_memdev_state *mds);
int cxl_poison_state_init(struct cxl_memdev_state *mds);
int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 12cd5d399230..ef482eae09e9 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -669,22 +669,33 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
}

static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
- struct cxl_event_interrupt_policy *policy)
+ struct cxl_event_interrupt_policy *policy,
+ bool native_cxl)
{
struct cxl_mbox_cmd mbox_cmd;
+ size_t size_in;
int rc;

- *policy = (struct cxl_event_interrupt_policy) {
- .info_settings = CXL_INT_MSI_MSIX,
- .warn_settings = CXL_INT_MSI_MSIX,
- .failure_settings = CXL_INT_MSI_MSIX,
- .fatal_settings = CXL_INT_MSI_MSIX,
- };
+ if (native_cxl) {
+ *policy = (struct cxl_event_interrupt_policy) {
+ .info_settings = CXL_INT_MSI_MSIX,
+ .warn_settings = CXL_INT_MSI_MSIX,
+ .failure_settings = CXL_INT_MSI_MSIX,
+ .fatal_settings = CXL_INT_MSI_MSIX,
+ .dcd_settings = 0,
+ };
+ }
+ size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
+
+ if (cxl_dcd_supported(mds)) {
+ policy->dcd_settings = CXL_INT_MSI_MSIX;
+ size_in += sizeof(policy->dcd_settings);
+ }

mbox_cmd = (struct cxl_mbox_cmd) {
.opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
.payload_in = policy,
- .size_in = sizeof(*policy),
+ .size_in = size_in,
};

rc = cxl_internal_send_cmd(mds, &mbox_cmd);
@@ -731,6 +742,31 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
return 0;
}

+static int cxl_irqsetup(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy,
+ bool native_cxl)
+{
+ struct cxl_dev_state *cxlds = &mds->cxlds;
+ int rc;
+
+ if (native_cxl) {
+ rc = cxl_event_irqsetup(mds, policy);
+ if (rc)
+ return rc;
+ }
+
+ if (cxl_dcd_supported(mds)) {
+ rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
+ if (rc) {
+ dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
+ cxl_disable_dcd(mds);
+ return rc;
+ }
+ }
+
+ return 0;
+}
+
static bool cxl_event_int_is_fw(u8 setting)
{
u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
@@ -757,17 +793,25 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
struct cxl_event_interrupt_policy policy = { 0 };
+ bool native_cxl = host_bridge->native_cxl_error;
int rc;

/*
* When BIOS maintains CXL error reporting control, it will process
* event records. Only one agent can do so.
+ *
+ * If BIOS has control of events and DCD is not supported skip event
+ * configuration.
*/
- if (!host_bridge->native_cxl_error)
+ if (!native_cxl && !cxl_dcd_supported(mds))
return 0;

if (!irq_avail) {
dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
+ if (cxl_dcd_supported(mds)) {
+ dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
+ cxl_disable_dcd(mds);
+ }
return 0;
}

@@ -775,10 +819,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;

- if (!cxl_event_validate_mem_policy(mds, &policy))
+ if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
return -EBUSY;

- rc = cxl_event_config_msgnums(mds, &policy);
+ rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
if (rc)
return rc;

@@ -786,12 +830,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;

- rc = cxl_event_irqsetup(mds, &policy);
+ rc = cxl_irqsetup(mds, &policy, native_cxl);
if (rc)
return rc;

cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);

+ dev_dbg(mds->cxlds.dev, "Event config : %d %d\n",
+ native_cxl, cxl_dcd_supported(mds));
+
return 0;
}


--
2.44.0


2024-03-25 04:14:38

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 15/26] range: Add range_overlaps()

Code to support CXL Dynamic Capacity devices will have extent ranges
which need to be compared for intersection not a subset as is being
checked in range_contains().

range_overlaps() is defined in btrfs with a different meaning from what
is required in the standard range code. Dan Williams pointed this out
in [1]. Adjust the btrfs call according to his suggestion there.

Then add a generic range_overlaps().

Cc: Dan Williams <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: David Sterba <[email protected]>
Cc: [email protected]
Signed-off-by: Ira Weiny <[email protected]>

[1] https://lore.kernel.org/all/[email protected]/
---
fs/btrfs/ordered-data.c | 10 +++++-----
include/linux/range.h | 7 +++++++
2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 59850dc17b22..032d30a49edc 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -111,8 +111,8 @@ static struct rb_node *__tree_search(struct rb_root *root, u64 file_offset,
return NULL;
}

-static int range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
- u64 len)
+static int btrfs_range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
+ u64 len)
{
if (file_offset + len <= entry->file_offset ||
entry->file_offset + entry->num_bytes <= file_offset)
@@ -914,7 +914,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(

while (1) {
entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
- if (range_overlaps(entry, file_offset, len))
+ if (btrfs_range_overlaps(entry, file_offset, len))
break;

if (entry->file_offset >= file_offset + len) {
@@ -1043,12 +1043,12 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
}
if (prev) {
entry = rb_entry(prev, struct btrfs_ordered_extent, rb_node);
- if (range_overlaps(entry, file_offset, len))
+ if (btrfs_range_overlaps(entry, file_offset, len))
goto out;
}
if (next) {
entry = rb_entry(next, struct btrfs_ordered_extent, rb_node);
- if (range_overlaps(entry, file_offset, len))
+ if (btrfs_range_overlaps(entry, file_offset, len))
goto out;
}
/* No ordered extent in the range */
diff --git a/include/linux/range.h b/include/linux/range.h
index 6ad0b73cb7ad..9a46f3212965 100644
--- a/include/linux/range.h
+++ b/include/linux/range.h
@@ -13,11 +13,18 @@ static inline u64 range_len(const struct range *range)
return range->end - range->start + 1;
}

+/* True if r1 completely contains r2 */
static inline bool range_contains(struct range *r1, struct range *r2)
{
return r1->start <= r2->start && r1->end >= r2->end;
}

+/* True if any part of r1 overlaps r2 */
+static inline bool range_overlaps(struct range *r1, struct range *r2)
+{
+ return r1->start <= r2->end && r1->end >= r2->start;
+}
+
int add_range(struct range *range, int az, int nr_range,
u64 start, u64 end);


--
2.44.0


2024-03-25 04:23:16

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 04/26] cxl/region: Add dynamic capacity decoder and region modes

From: Navneet Singh <[email protected]>

Region mode must reflect a general dynamic capacity type which is
associated with a specific Dynamic Capacity (DC) partitions in each
device decoder within the region. DC partitions are also know as DC
regions per CXL 3.1.

Decoder mode reflects a specific DC partition.

Define the new modes to use in subsequent patches and the helper
functions required to make the association between these new modes.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Changes for v1
[iweiny: split out from: Add dynamic capacity cxl region support.]
---
drivers/cxl/core/region.c | 4 ++++
drivers/cxl/cxl.h | 23 +++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 1723d17f121e..ec3b8c6948e9 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1690,6 +1690,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
return true;
if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
return true;
+ if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
+ return true;

return false;
}
@@ -2824,6 +2826,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
return CXL_REGION_RAM;
case CXL_DECODER_PMEM:
return CXL_REGION_PMEM;
+ case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
+ return CXL_REGION_DC;
case CXL_DECODER_MIXED:
default:
return CXL_REGION_MIXED;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 9a0cce1e6fca..3b8935089c0c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -365,6 +365,14 @@ enum cxl_decoder_mode {
CXL_DECODER_NONE,
CXL_DECODER_RAM,
CXL_DECODER_PMEM,
+ CXL_DECODER_DC0,
+ CXL_DECODER_DC1,
+ CXL_DECODER_DC2,
+ CXL_DECODER_DC3,
+ CXL_DECODER_DC4,
+ CXL_DECODER_DC5,
+ CXL_DECODER_DC6,
+ CXL_DECODER_DC7,
CXL_DECODER_MIXED,
CXL_DECODER_DEAD,
};
@@ -375,6 +383,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
[CXL_DECODER_NONE] = "none",
[CXL_DECODER_RAM] = "ram",
[CXL_DECODER_PMEM] = "pmem",
+ [CXL_DECODER_DC0] = "dc0",
+ [CXL_DECODER_DC1] = "dc1",
+ [CXL_DECODER_DC2] = "dc2",
+ [CXL_DECODER_DC3] = "dc3",
+ [CXL_DECODER_DC4] = "dc4",
+ [CXL_DECODER_DC5] = "dc5",
+ [CXL_DECODER_DC6] = "dc6",
+ [CXL_DECODER_DC7] = "dc7",
[CXL_DECODER_MIXED] = "mixed",
};

@@ -383,10 +399,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
return "mixed";
}

+static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
+{
+ return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
+}
+
enum cxl_region_mode {
CXL_REGION_NONE,
CXL_REGION_RAM,
CXL_REGION_PMEM,
+ CXL_REGION_DC,
CXL_REGION_MIXED,
};

@@ -396,6 +418,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
[CXL_REGION_NONE] = "none",
[CXL_REGION_RAM] = "ram",
[CXL_REGION_PMEM] = "pmem",
+ [CXL_REGION_DC] = "dc",
[CXL_REGION_MIXED] = "mixed",
};


--
2.44.0


2024-03-25 04:32:45

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 22/26] dax/region: Support DAX device creation on sparse DAX regions

Previous patches introduced a new sparse DAX region type. This region
type may have 0 or more bytes of backing memory.

DAX devices already have the ability to reference sparse ranges of a DAX
region. Leverage the range support of DAX devices to track memory
across a sparse set of region extents.

Requests for extent removal can be received from the device at any time.
But the host is not obliged to release that memory until it is finished
with it. Introduce a use count to track how many DAX devices are using
an extent. If that extent is in use reject the removal of the extent.

Leverage the region RW semaphore to protect the extent data as any
changes to the use of the extent require DAX device, DAX region, and
extent stability during those operations.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v3
[iweiny: simplify the extent objects]
[iweiny: refactor based on the new extent objects created]
[iweiny: remove xarray]
[iweiny: use lock/invalidate/cnt rather than kref]
---
drivers/cxl/core/extent.c | 8 ++
drivers/cxl/core/region.c | 6 +-
drivers/cxl/cxl.h | 1 +
drivers/dax/bus.c | 191 +++++++++++++++++++++++++++++++++++++++-------
drivers/dax/bus.h | 3 +-
drivers/dax/cxl.c | 55 ++++++++++++-
drivers/dax/dax-private.h | 23 ++++++
drivers/dax/hmem/hmem.c | 2 +-
drivers/dax/pmem.c | 2 +-
9 files changed, 258 insertions(+), 33 deletions(-)

diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
index e98acd98ebe2..633397d62836 100644
--- a/drivers/cxl/core/extent.c
+++ b/drivers/cxl/core/extent.c
@@ -81,6 +81,14 @@ static void region_extent_unregister(void *ext)
device_unregister(&reg_ext->dev);
}

+void dax_reg_ext_release(struct region_extent *reg_ext)
+{
+ struct device *region_dev = reg_ext->dev.parent;
+
+ devm_release_action(region_dev, region_extent_unregister, reg_ext);
+}
+EXPORT_SYMBOL_NS_GPL(dax_reg_ext_release, CXL);
+
int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
struct range *hpa_range,
const char *label,
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index a07d95136f0d..7d75512a16bc 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1569,7 +1569,11 @@ int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
static void cxl_ed_rm_region_extent(struct cxl_region *cxlr,
struct region_extent *reg_ext)
{
- cxl_region_notify_extent(cxlr, DCD_RELEASE_CAPACITY, reg_ext);
+ if (cxl_region_notify_extent(cxlr, DCD_RELEASE_CAPACITY, reg_ext))
+ return;
+
+ /* Extent not in use, release it */
+ dax_reg_ext_release(reg_ext);
}

struct rm_data {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 156d7c9a8de5..e002c0ea3c2b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -660,6 +660,7 @@ int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
struct range *dpa_range,
struct cxl_endpoint_decoder *cxled);

+void dax_reg_ext_release(struct region_extent *dr_ext);
bool is_region_extent(struct device *dev);
#define to_region_extent(dev) container_of(dev, struct region_extent, dev)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 56dddaceeccb..70a559763e8c 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -236,11 +236,32 @@ int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
if (rc)
return rc;

- return devm_add_action_or_reset(ext_dev, dax_region_release_extent,
+ /* Assume the devm action will be configured without error */
+ dev_set_drvdata(ext_dev, dax_ext);
+ rc = devm_add_action_or_reset(ext_dev, dax_region_release_extent,
no_free_ptr(dax_ext));
+ if (rc)
+ dev_set_drvdata(ext_dev, NULL);
+ return rc;
}
EXPORT_SYMBOL_GPL(dax_region_add_extent);

+int dax_region_rm_extent(struct dax_region *dax_region,
+ struct device *ext_dev)
+{
+ struct dax_extent *dax_ext;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+
+ dax_ext = dev_get_drvdata(ext_dev);
+ if (!dax_ext || dax_ext->use_cnt == 0)
+ return 0; /* extent not in use */
+
+ dax_ext->invalid = true;
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(dax_region_rm_extent);
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
@@ -354,19 +375,44 @@ static ssize_t region_align_show(struct device *dev,
static struct device_attribute dev_attr_region_align =
__ATTR(align, 0400, region_align_show, NULL);

+#define for_each_extent_resource(extent, res) \
+ for (res = (extent)->child; res; res = res->sibling)
+
+unsigned long long
+dax_extent_avail_size(struct resource *ext_res)
+{
+ unsigned long long rc;
+ struct resource *used_res;
+
+ rc = resource_size(ext_res);
+ for_each_extent_resource(ext_res, used_res)
+ rc -= resource_size(used_res);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(dax_extent_avail_size);
+
#define for_each_dax_region_resource(dax_region, res) \
for (res = (dax_region)->res.child; res; res = res->sibling)

static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
{
- resource_size_t size = resource_size(&dax_region->res);
+ resource_size_t size;
struct resource *res;

WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));

- if (is_sparse(dax_region))
- return 0;
+ if (is_sparse(dax_region)) {
+ /*
+ * Children of a sparse region represent available space not
+ * used space.
+ */
+ size = 0;
+ for_each_dax_region_resource(dax_region, res)
+ size += dax_extent_avail_size(res);
+ return size;
+ }

+ size = resource_size(&dax_region->res);
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
return size;
@@ -507,15 +553,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
static void trim_dev_dax_range(struct dev_dax *dev_dax)
{
int i = dev_dax->nr_range - 1;
- struct range *range = &dev_dax->ranges[i].range;
+ struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+ struct range *range = &dev_range->range;
struct dax_region *dax_region = dev_dax->region;
+ struct resource *res = &dax_region->res;

WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
(unsigned long long)range->start,
(unsigned long long)range->end);

- __release_region(&dax_region->res, range->start, range_len(range));
+ if (dev_range->dax_ext) {
+ res = dev_range->dax_ext->res;
+ dev_dbg(&dev_dax->dev, "Trim sparse extent %pr\n", res);
+ }
+
+ __release_region(res, range->start, range_len(range));
+
+ if (dev_range->dax_ext)
+ dev_range->dax_ext->use_cnt--;
+
if (--dev_dax->nr_range == 0) {
kfree(dev_dax->ranges);
dev_dax->ranges = NULL;
@@ -711,7 +768,7 @@ static void dax_region_unregister(void *region)

struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
- unsigned long flags)
+ unsigned long flags, struct dax_reg_sparse_ops *sparse_ops)
{
struct dax_region *dax_region;

@@ -729,12 +786,16 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
|| !IS_ALIGNED(range_len(range), align))
return NULL;

+ if (!sparse_ops && (flags & IORESOURCE_DAX_SPARSE_CAP))
+ return NULL;
+
dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
if (!dax_region)
return NULL;

dev_set_drvdata(parent, dax_region);
kref_init(&dax_region->kref);
+ dax_region->sparse_ops = sparse_ops;
dax_region->id = region_id;
dax_region->align = align;
dax_region->dev = parent;
@@ -929,7 +990,8 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
}

static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
- u64 start, resource_size_t size)
+ u64 start, resource_size_t size,
+ struct dax_extent *dax_ext)
{
struct device *dev = &dev_dax->dev;
struct dev_dax_range *ranges;
@@ -968,6 +1030,7 @@ static int alloc_dev_dax_range(struct resource *parent, struct dev_dax *dev_dax,
.start = alloc->start,
.end = alloc->end,
},
+ .dax_ext = dax_ext,
};

dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
@@ -1050,7 +1113,8 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
int i;

for (i = dev_dax->nr_range - 1; i >= 0; i--) {
- struct range *range = &dev_dax->ranges[i].range;
+ struct dev_dax_range *dev_range = &dev_dax->ranges[i];
+ struct range *range = &dev_range->range;
struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
struct resource *adjust = NULL, *res;
resource_size_t shrink;
@@ -1066,12 +1130,21 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
continue;
}

- for_each_dax_region_resource(dax_region, res)
- if (strcmp(res->name, dev_name(dev)) == 0
- && res->start == range->start) {
- adjust = res;
- break;
- }
+ if (dev_range->dax_ext) {
+ for_each_extent_resource(dev_range->dax_ext->res, res)
+ if (strcmp(res->name, dev_name(dev)) == 0
+ && res->start == range->start) {
+ adjust = res;
+ break;
+ }
+ } else {
+ for_each_dax_region_resource(dax_region, res)
+ if (strcmp(res->name, dev_name(dev)) == 0
+ && res->start == range->start) {
+ adjust = res;
+ break;
+ }
+ }

if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
"failed to find matching resource\n"))
@@ -1109,19 +1182,21 @@ static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
}

/**
- * dev_dax_resize_static - Expand the device into the unused portion of the
- * region. This may involve adjusting the end of an existing resource, or
- * allocating a new resource.
+ * __dev_dax_resize - Expand the device into the unused portion of the region.
+ * This may involve adjusting the end of an existing resource, or allocating a
+ * new resource.
*
* @parent: parent resource to allocate this range in
* @dev_dax: DAX device to be expanded
* @to_alloc: amount of space to alloc; must be <= space available in @parent
+ * @dax_ext: if sparse; the extent containing parent
*
* Return the amount of space allocated or -ERRNO on failure
*/
-static ssize_t dev_dax_resize_static(struct resource *parent,
- struct dev_dax *dev_dax,
- resource_size_t to_alloc)
+static ssize_t __dev_dax_resize(struct resource *parent,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc,
+ struct dax_extent *dax_ext)
{
struct resource *res, *first;
int rc;
@@ -1129,7 +1204,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
first = parent->child;
if (!first) {
rc = alloc_dev_dax_range(parent, dev_dax,
- parent->start, to_alloc);
+ parent->start, to_alloc,
+ dax_ext);
if (rc)
return rc;
return to_alloc;
@@ -1143,7 +1219,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
if (res == first && res->start > parent->start) {
alloc = min(res->start - parent->start, to_alloc);
rc = alloc_dev_dax_range(parent, dev_dax,
- parent->start, alloc);
+ parent->start, alloc,
+ dax_ext);
if (rc)
return rc;
return alloc;
@@ -1167,7 +1244,8 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
return rc;
return alloc;
}
- rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc);
+ rc = alloc_dev_dax_range(parent, dev_dax, res->end + 1, alloc,
+ dax_ext);
if (rc)
return rc;
return alloc;
@@ -1178,6 +1256,56 @@ static ssize_t dev_dax_resize_static(struct resource *parent,
return 0;
}

+static ssize_t dev_dax_resize_static(struct dax_region *dax_region,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
+{
+ return __dev_dax_resize(&dax_region->res, dev_dax, to_alloc, NULL);
+}
+
+static int dax_ext_match_avail_size(struct device *dev, resource_size_t *size_avail)
+{
+ resource_size_t extent_max;
+ struct dax_extent *dax_ext;
+
+ dax_ext = dev_get_drvdata(dev);
+ if (!dax_ext || dax_ext->invalid)
+ return 0;
+
+ extent_max = dax_extent_avail_size(dax_ext->res);
+ if (!extent_max)
+ return 0;
+
+ *size_avail = extent_max;
+ dax_ext->use_cnt++;
+ return 1;
+}
+
+static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
+ struct dev_dax *dev_dax,
+ resource_size_t to_alloc)
+{
+ struct dax_extent *dax_ext;
+ resource_size_t extent_max;
+ struct device *ext_dev;
+ ssize_t alloc;
+
+ ext_dev = dax_region->sparse_ops->find_ext(dax_region, &extent_max,
+ dax_ext_match_avail_size);
+ if (!ext_dev)
+ return -ENOSPC;
+
+ dax_ext = dev_get_drvdata(ext_dev);
+ if (!dax_ext)
+ return -ENOSPC;
+
+ to_alloc = min(extent_max, to_alloc);
+ alloc = __dev_dax_resize(dax_ext->res, dev_dax, to_alloc, dax_ext);
+ if (alloc < 0)
+ dax_ext->use_cnt--;
+ return alloc;
+}
+
static ssize_t dev_dax_resize(struct dax_region *dax_region,
struct dev_dax *dev_dax, resource_size_t size)
{
@@ -1201,7 +1329,10 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
return -ENXIO;

retry:
- alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
+ if (is_sparse(dax_region))
+ alloc = dev_dax_resize_sparse(dax_region, dev_dax, to_alloc);
+ else
+ alloc = dev_dax_resize_static(dax_region, dev_dax, to_alloc);
if (alloc <= 0)
return alloc;
to_alloc -= alloc;
@@ -1311,7 +1442,7 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
to_alloc = range_len(&r);
if (alloc_is_aligned(dev_dax, to_alloc))
rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
- to_alloc);
+ to_alloc, NULL);
up_write(&dax_dev_rwsem);
up_write(&dax_region_rwsem);

@@ -1536,8 +1667,14 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
device_initialize(dev);
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);

+ if (is_sparse(dax_region) && data->size) {
+ dev_err(parent, "Sparse DAX region devices are created initially with 0 size");
+ rc = -EINVAL;
+ goto err_id;
+ }
+
rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
- data->size);
+ data->size, NULL);
if (rc)
goto err_range;

diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 783bfeef42cc..4127eee1bd6d 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -9,6 +9,7 @@ struct dev_dax;
struct resource;
struct dax_device;
struct dax_region;
+struct dax_reg_sparse_ops;

/* dax bus specific ioresource flags */
#define IORESOURCE_DAX_STATIC BIT(0)
@@ -17,7 +18,7 @@ struct dax_region;

struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct range *range, int target_node, unsigned int align,
- unsigned long flags);
+ unsigned long flags, struct dax_reg_sparse_ops *sparse_ops);

struct dev_dax_data {
struct dax_region *dax_region;
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 83ee45aff69a..3cb95e5988ae 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -53,7 +53,7 @@ static int cxl_dax_region_notify(struct device *dev,
case DCD_ADD_CAPACITY:
return __cxl_dax_region_add_extent(dax_region, reg_ext);
case DCD_RELEASE_CAPACITY:
- return 0;
+ return dax_region_rm_extent(dax_region, &reg_ext->dev);
case DCD_FORCED_CAPACITY_RELEASE:
default:
dev_err(&cxlr_dax->dev, "Unknown DC event %d\n", nd->event);
@@ -63,6 +63,57 @@ static int cxl_dax_region_notify(struct device *dev,
return -ENXIO;
}

+struct match_data {
+ match_cb match_fn;
+ resource_size_t *size_avail;
+};
+
+static int cxl_dax_match_ext(struct device *dev, void *data)
+{
+ struct match_data *md = data;
+
+ if (!is_region_extent(dev))
+ return 0;
+
+ return md->match_fn(dev, md->size_avail);
+}
+
+/**
+ * find_ext - Match Extent callback
+ * @dax_region: region to search
+ * @size_avail: the available size if an extent is found
+ * @match_fn: match function
+ *
+ * Callback to itterate through the child devices of the DAX region calling
+ * match_fn only on those devices which are extents.
+ *
+ * If a match is found match_fn is responsible for locking or reference
+ * counting dax_ext as needed.
+ */
+static struct device *find_ext(struct dax_region *dax_region,
+ resource_size_t *size_avail,
+ match_cb match_fn)
+{
+ struct match_data md = {
+ .match_fn = match_fn,
+ .size_avail = size_avail,
+ };
+ struct device *ext_dev;
+
+ ext_dev = device_find_child(dax_region->dev, &md, cxl_dax_match_ext);
+
+ if (!ext_dev)
+ return NULL;
+
+ /* caller must hold a count on extent data */
+ put_device(ext_dev);
+ return ext_dev;
+}
+
+struct dax_reg_sparse_ops sparse_ops = {
+ .find_ext = find_ext,
+};
+
static int cxl_dax_region_probe(struct device *dev)
{
struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
@@ -81,7 +132,7 @@ static int cxl_dax_region_probe(struct device *dev)
flags |= IORESOURCE_DAX_SPARSE_CAP;

dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
- PMD_SIZE, flags);
+ PMD_SIZE, flags, &sparse_ops);
if (!dax_region)
return -ENOMEM;

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index ac1ccf158650..fe3b271e721c 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -20,13 +20,32 @@ void dax_bus_exit(void);
* struct dax_extent - For sparse regions; an active extent
* @region: dax_region this resources is in
* @res: resource this extent covers
+ * @invalid: extent is invalid and going away
+ * @use_cnt: count the number of uses of this extent
*/
struct dax_extent {
struct dax_region *region;
struct resource *res;
+ bool invalid;
+ unsigned int use_cnt;
};
int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
resource_size_t start, resource_size_t length);
+int dax_region_rm_extent(struct dax_region *dax_region,
+ struct device *ext_dev);
+unsigned long long dax_extent_avail_size(struct resource *ext_res);
+
+typedef int (*match_cb)(struct device *dev, resource_size_t *size_avail);
+
+/**
+ * struct dax_reg_sparse_ops - Operations for sparse regions
+ * @find_ext: Find the extent matched with match_fn
+ */
+struct dax_reg_sparse_ops {
+ struct device *(*find_ext)(struct dax_region *dax_region,
+ resource_size_t *size_avail,
+ match_cb match_fn);
+};

/**
* struct dax_region - mapping infrastructure for dax devices
@@ -39,6 +58,7 @@ int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
* @res: resource tree to track instance allocations
* @seed: allow userspace to find the first unbound seed device
* @youngest: allow userspace to find the most recently created device
+ * @sparse_ops: operations required for sparce regions
*/
struct dax_region {
int id;
@@ -50,6 +70,7 @@ struct dax_region {
struct resource res;
struct device *seed;
struct device *youngest;
+ struct dax_reg_sparse_ops *sparse_ops;
};

struct dax_mapping {
@@ -74,6 +95,7 @@ struct dax_mapping {
* @pgoff: page offset
* @range: resource-span
* @mapping: device to assist in interrogating the range layout
+ * @dax_ext: if not NULL; dax region extent referenced by this range
*/
struct dev_dax {
struct dax_region *region;
@@ -91,6 +113,7 @@ struct dev_dax {
unsigned long pgoff;
struct range range;
struct dax_mapping *mapping;
+ struct dax_extent *dax_ext;
} *ranges;
};

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index b9da69f92697..c5ddbcef532f 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -28,7 +28,7 @@ static int dax_hmem_probe(struct platform_device *pdev)

mri = dev->platform_data;
dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
- mri->target_node, PMD_SIZE, flags);
+ mri->target_node, PMD_SIZE, flags, NULL);
if (!dax_region)
return -ENOMEM;

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index f3c6c67b8412..acb311539272 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -54,7 +54,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
range.start += offset;
dax_region = alloc_dax_region(dev, region_id, &range,
nd_region->target_node, le32_to_cpu(pfn_sb->align),
- IORESOURCE_DAX_STATIC);
+ IORESOURCE_DAX_STATIC, NULL);
if (!dax_region)
return ERR_PTR(-ENOMEM);


--
2.44.0


2024-03-25 04:42:30

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 12/26] cxl/pci: Factor out interrupt policy check

Dynamic capacity devices (DCD) require interrupts to notify the host of
events in the DCD log. The interrupts for DCD may be supported despite
FW control of memory event logs.

Prepare to support DCD event interrupts separate from other event
interrupts by factoring out the check for event interrupt settings.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V3:
[iweiny: new patch]
---
drivers/cxl/pci.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index ccaf4ad26a4f..12cd5d399230 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -738,6 +738,21 @@ static bool cxl_event_int_is_fw(u8 setting)
return mode == CXL_INT_FW;
}

+static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
+ struct cxl_event_interrupt_policy *policy)
+{
+ if (cxl_event_int_is_fw(policy->info_settings) ||
+ cxl_event_int_is_fw(policy->warn_settings) ||
+ cxl_event_int_is_fw(policy->failure_settings) ||
+ cxl_event_int_is_fw(policy->fatal_settings)) {
+ dev_err(mds->cxlds.dev,
+ "FW still in control of Event Logs despite _OSC settings\n");
+ return false;
+ }
+
+ return true;
+}
+
static int cxl_event_config(struct pci_host_bridge *host_bridge,
struct cxl_memdev_state *mds, bool irq_avail)
{
@@ -760,14 +775,8 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;

- if (cxl_event_int_is_fw(policy.info_settings) ||
- cxl_event_int_is_fw(policy.warn_settings) ||
- cxl_event_int_is_fw(policy.failure_settings) ||
- cxl_event_int_is_fw(policy.fatal_settings)) {
- dev_err(mds->cxlds.dev,
- "FW still in control of Event Logs despite _OSC settings\n");
+ if (!cxl_event_validate_mem_policy(mds, &policy))
return -EBUSY;
- }

rc = cxl_event_config_msgnums(mds, &policy);
if (rc)

--
2.44.0


2024-03-25 06:22:26

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

From: Navneet Singh <[email protected]>

Devices can optionally support Dynamic Capacity (DC). These devices are
known as Dynamic Capacity Devices (DCD).

Implement the DC mailbox commands as specified in CXL 3.1 section
8.2.9.9.9 (opcodes 48XXh). Read the DC configuration and store the DC
region information in the device state.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[Jørgen: ensure CXL 2.0 device support by removing dc_event_log_size]
[iweiny/Jørgen: use get DC config command to signal DCD support]
[djiang: fix subject]
[Fan: add additional region configuration checks]
[Jonathan/djiang: split out region mode changes]
[Jonathan: fix up comments/kdoc]
[Jonathan: s/cxl_get_dc_id/cxl_get_dc_config/]
[Jonathan: use __free() in identify call]
[Jonathan: remove unneeded formatting changes]
[Jonathan: s/cxl_mbox_dynamic_capacity/cxl_mbox_get_dc_config_out/]
[Jonathan: s/cxl_mbox_get_dc_config/cxl_mbox_get_dc_config_in/]
[iweiny: remove type2 work dependancy/rebase on master]
[iweiny: fix 0day build issues]
---
drivers/cxl/core/mbox.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++-
drivers/cxl/cxlmem.h | 49 +++++++++++++
drivers/cxl/pci.c | 4 ++
3 files changed, 236 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index ed4131c6f50b..14e8a7528a8b 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -1123,7 +1123,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
if (rc < 0)
return rc;

- mds->total_bytes =
+ mds->static_cap =
le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
mds->volatile_only_bytes =
le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
@@ -1230,6 +1230,175 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
return rc;
}

+static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
+ struct cxl_dc_region_config *region_config)
+{
+ struct cxl_dc_region_info *dcr = &mds->dc_region[index];
+ struct device *dev = mds->cxlds.dev;
+
+ dcr->base = le64_to_cpu(region_config->region_base);
+ dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
+ dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
+ dcr->len = le64_to_cpu(region_config->region_length);
+ dcr->blk_size = le64_to_cpu(region_config->region_block_size);
+ dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
+ dcr->flags = region_config->flags;
+ snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
+
+ /* Check regions are in increasing DPA order */
+ if (index > 0) {
+ struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
+
+ if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
+ dev_err(dev,
+ "DPA ordering violation for DC region %d and %d\n",
+ index - 1, index);
+ return -EINVAL;
+ }
+ }
+
+ if (!IS_ALIGNED(dcr->base, SZ_256M) ||
+ !IS_ALIGNED(dcr->base, dcr->blk_size)) {
+ dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n", index,
+ dcr->base, dcr->blk_size);
+ return -EINVAL;
+ }
+
+ if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
+ !IS_ALIGNED(dcr->len, dcr->blk_size)) {
+ dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
+ index, dcr->decode_len, dcr->len, dcr->blk_size);
+ return -EINVAL;
+ }
+
+ if (dcr->blk_size == 0 || dcr->blk_size % 0x40 ||
+ !is_power_of_2(dcr->blk_size)) {
+ dev_err(dev, "DC region %d invalid block size; %#llx\n",
+ index, dcr->blk_size);
+ return -EINVAL;
+ }
+
+ dev_dbg(dev,
+ "DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
+ dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
+
+ return 0;
+}
+
+/* Returns the number of regions in dc_resp or -ERRNO */
+static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
+ struct cxl_mbox_get_dc_config_out *dc_resp,
+ size_t dc_resp_size)
+{
+ struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
+ .region_count = CXL_MAX_DC_REGION,
+ .start_region_index = start_region,
+ };
+ struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
+ .payload_in = &get_dc,
+ .size_in = sizeof(get_dc),
+ .size_out = dc_resp_size,
+ .payload_out = dc_resp,
+ .min_out = 1,
+ };
+ struct device *dev = mds->cxlds.dev;
+ int rc;
+
+ rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ rc = dc_resp->avail_region_count - start_region;
+
+ /*
+ * The number of regions in the payload may have been truncated due to
+ * payload_size limits; if so adjust the returned count to match.
+ */
+ if (mbox_cmd.size_out < sizeof(*dc_resp))
+ rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
+
+ dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
+
+ return rc;
+}
+
+static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
+{
+ return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
+}
+
+/**
+ * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
+ * information from the device.
+ * @mds: The memory device state
+ *
+ * Read Dynamic Capacity information from the device and populate the state
+ * structures for later use.
+ *
+ * Return: 0 if identify was executed successfully, -ERRNO on error.
+ */
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
+{
+ size_t dc_resp_size = mds->payload_size;
+ struct device *dev = mds->cxlds.dev;
+ u8 start_region, i;
+ int rc = 0;
+
+ for (i = 0; i < CXL_MAX_DC_REGION; i++)
+ snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
+
+ /* Check GET_DC_CONFIG is supported by device */
+ if (!cxl_dcd_supported(mds)) {
+ dev_dbg(dev, "DCD not supported\n");
+ return 0;
+ }
+
+ struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
+ kvmalloc(dc_resp_size, GFP_KERNEL);
+ if (!dc_resp)
+ return -ENOMEM;
+
+ start_region = 0;
+ do {
+ int j;
+
+ rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
+ if (rc < 0) {
+ dev_dbg(dev, "Failed to get DC config: %d\n", rc);
+ return rc;
+ }
+
+ mds->nr_dc_region += rc;
+
+ if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
+ dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
+ mds->nr_dc_region);
+ return -EINVAL;
+ }
+
+ for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
+ rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
+ if (rc) {
+ dev_dbg(dev, "Failed to save region info: %d\n", rc);
+ return rc;
+ }
+ }
+
+ start_region = mds->nr_dc_region;
+
+ } while (mds->nr_dc_region < dc_resp->avail_region_count);
+
+ mds->dynamic_cap =
+ mds->dc_region[mds->nr_dc_region - 1].base +
+ mds->dc_region[mds->nr_dc_region - 1].decode_len -
+ mds->dc_region[0].base;
+ dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
+
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
+
static int add_dpa_res(struct device *dev, struct resource *parent,
struct resource *res, resource_size_t start,
resource_size_t size, const char *type)
@@ -1260,8 +1429,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
{
struct cxl_dev_state *cxlds = &mds->cxlds;
struct device *dev = cxlds->dev;
+ size_t untenanted_mem;
int rc;

+ untenanted_mem = mds->dc_region[0].base - mds->static_cap;
+ mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
+
if (!cxlds->media_ready) {
cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
cxlds->ram_res = DEFINE_RES_MEM(0, 0);
@@ -1271,6 +1444,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)

cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);

+ for (int i = 0; i < mds->nr_dc_region; i++) {
+ struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+
+ rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
+ dcr->base, dcr->decode_len, dcr->name);
+ if (rc)
+ return rc;
+ }
+
if (mds->partition_align_bytes == 0) {
rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
mds->volatile_only_bytes, "ram");
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 79a67cff9143..4624cf612c1e 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -402,6 +402,7 @@ enum cxl_devtype {
CXL_DEVTYPE_CLASSMEM,
};

+#define CXL_MAX_DC_REGION 8
/**
* struct cxl_dpa_perf - DPA performance property entry
* @dpa_range - range for DPA address
@@ -431,6 +432,8 @@ struct cxl_dpa_perf {
* @dpa_res: Overall DPA resource tree for the device
* @pmem_res: Active Persistent memory capacity configuration
* @ram_res: Active Volatile memory capacity configuration
+ * @dc_res: Active Dynamic Capacity memory configuration for each possible
+ * region
* @serial: PCIe Device Serial Number
* @type: Generic Memory Class device or Vendor Specific Memory device
*/
@@ -445,10 +448,22 @@ struct cxl_dev_state {
struct resource dpa_res;
struct resource pmem_res;
struct resource ram_res;
+ struct resource dc_res[CXL_MAX_DC_REGION];
u64 serial;
enum cxl_devtype type;
};

+#define CXL_DC_REGION_STRLEN 8
+struct cxl_dc_region_info {
+ u64 base;
+ u64 decode_len;
+ u64 len;
+ u64 blk_size;
+ u32 dsmad_handle;
+ u8 flags;
+ u8 name[CXL_DC_REGION_STRLEN];
+};
+
/**
* struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
*
@@ -467,6 +482,8 @@ struct cxl_dev_state {
* @enabled_cmds: Hardware commands found enabled in CEL.
* @exclusive_cmds: Commands that are kernel-internal only
* @total_bytes: sum of all possible capacities
+ * @static_cap: Sum of static RAM and PMEM capacities
+ * @dynamic_cap: Complete DPA range occupied by DC regions
* @volatile_only_bytes: hard volatile capacity
* @persistent_only_bytes: hard persistent capacity
* @partition_align_bytes: alignment size for partition-able capacity
@@ -474,6 +491,8 @@ struct cxl_dev_state {
* @active_persistent_bytes: sum of hard + soft persistent
* @next_volatile_bytes: volatile capacity change pending device reset
* @next_persistent_bytes: persistent capacity change pending device reset
+ * @nr_dc_region: number of DC regions implemented in the memory device
+ * @dc_region: array containing info about the DC regions
* @event: event log driver state
* @poison: poison driver state info
* @security: security driver state info
@@ -494,7 +513,10 @@ struct cxl_memdev_state {
DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
+
u64 total_bytes;
+ u64 static_cap;
+ u64 dynamic_cap;
u64 volatile_only_bytes;
u64 persistent_only_bytes;
u64 partition_align_bytes;
@@ -506,6 +528,9 @@ struct cxl_memdev_state {
struct cxl_dpa_perf ram_perf;
struct cxl_dpa_perf pmem_perf;

+ u8 nr_dc_region;
+ struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
+
struct cxl_event_state event;
struct cxl_poison_state poison;
struct cxl_security_state security;
@@ -705,6 +730,29 @@ struct cxl_mbox_set_partition_info {

#define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)

+struct cxl_mbox_get_dc_config_in {
+ u8 region_count;
+ u8 start_region_index;
+} __packed;
+
+/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
+struct cxl_mbox_get_dc_config_out {
+ u8 avail_region_count;
+ u8 rsvd[7];
+ struct cxl_dc_region_config {
+ __le64 region_base;
+ __le64 region_decode_length;
+ __le64 region_length;
+ __le64 region_block_size;
+ __le32 region_dsmad_handle;
+ u8 flags;
+ u8 rsvd[3];
+ } __packed region[];
+} __packed;
+#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
+#define CXL_REGIONS_RETURNED(size_out) \
+ ((size_out - 8) / sizeof(struct cxl_dc_region_config))
+
/* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
struct cxl_mbox_set_timestamp_in {
__le64 timestamp;
@@ -828,6 +876,7 @@ enum {
int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
struct cxl_mbox_cmd *cmd);
int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
int cxl_await_media_ready(struct cxl_dev_state *cxlds);
int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 2ff361e756d6..216881455364 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -874,6 +874,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (rc)
return rc;

+ rc = cxl_dev_dynamic_capacity_identify(mds);
+ if (rc)
+ return rc;
+
rc = cxl_mem_create_range_info(mds);
if (rc)
return rc;

--
2.44.0


2024-03-25 06:22:43

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 07/26] cxl/port: Add dynamic capacity size support to endpoint decoders

From: Navneet Singh <[email protected]>

To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
map DC partitions (regions). In addition to assigning the size of the
DC partition, the decoder must assign any skip value from the previous
decoder. This must be done within a contiguous DPA space.

Two complications arise with Dynamic Capacity regions which did not
exist with Ram and PMEM partitions. First, gaps in the DPA space can
exist between and around the DC Regions. Second, the Linux resource
tree does not allow a resource to be marked across existing nodes within
a tree.

For clarity, below is an example of an 60GB device with 10GB of RAM,
10GB of PMEM and 10GB for each of 2 DC Regions. The desired CXL mapping
is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.

DPA RANGE
(dpa_res)
0GB 10GB 20GB 30GB 40GB 50GB 60GB
|----------|----------|----------|----------|----------|----------|

RAM PMEM DC0 DC1
(ram_res) (pmem_res) (dc_res[0]) (dc_res[1])
|----------|----------| <gap> |----------| <gap> |----------|

RAM PMEM DC1
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
0GB 5GB 10GB 15GB 20GB 30GB 40GB 50GB 60GB

The previous skip resource between RAM and PMEM was always a child of
the RAM resource and fit nicely [see (S) below]. Because of this
simplicity this skip resource reference was not stored in any CXL state.
On release the skip range could be calculated based on the endpoint
decoders stored values.

Now when DC1 is being mapped 4 skip resources must be created as
children. One for the PMEM resource (A), two of the parent DPA resource
(B,D), and one more child of the DC0 resource (C).

0GB 10GB 20GB 30GB 40GB 50GB 60GB
|----------|----------|----------|----------|----------|----------|
| |
|----------|----------| | |----------| | |----------|
| | | | |
(S) (A) (B) (C) (D)
v v v v v
|XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
skip skip skip skip skip

Expand the calculation of DPA freespace and enhance the logic to support
mapping/unmapping DC DPA space. To track the potential of multiple skip
resources an xarray is attached to the endpoint decoder. The existing
algorithm between RAM and PMEM is consolidated within the new one to
streamline the code even though the result is the storage of a single
skip resource in the xarray.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1:
[iweiny: Update cover letter]
---
drivers/cxl/core/hdm.c | 192 +++++++++++++++++++++++++++++++++++++++++++-----
drivers/cxl/core/port.c | 2 +
drivers/cxl/cxl.h | 2 +
3 files changed, 179 insertions(+), 17 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index e22b6f4f7145..da7d58184490 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -210,6 +210,25 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
}
EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);

+static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+ struct cxl_port *port = cxled_to_port(cxled);
+ struct device *dev = &port->dev;
+ unsigned long index;
+ void *entry;
+
+ xa_for_each(&cxled->skip_res, index, entry) {
+ struct resource *res = entry;
+
+ dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
+ port->id, cxled->cxld.id, res);
+ __release_region(&cxlds->dpa_res, res->start,
+ resource_size(res));
+ xa_erase(&cxled->skip_res, index);
+ }
+}
+
/*
* Must be called in a context that synchronizes against this decoder's
* port ->remove() callback (like an endpoint decoder sysfs attribute)
@@ -220,15 +239,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
struct cxl_port *port = cxled_to_port(cxled);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
struct resource *res = cxled->dpa_res;
- resource_size_t skip_start;

lockdep_assert_held_write(&cxl_dpa_rwsem);

- /* save @skip_start, before @res is released */
- skip_start = res->start - cxled->skip;
__release_region(&cxlds->dpa_res, res->start, resource_size(res));
- if (cxled->skip)
- __release_region(&cxlds->dpa_res, skip_start, cxled->skip);
+ cxl_skip_release(cxled);
cxled->skip = 0;
cxled->dpa_res = NULL;
put_device(&cxled->cxld.dev);
@@ -263,6 +278,100 @@ static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
return mode - CXL_DECODER_DC0;
}

+static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
+ resource_size_t skip_base, resource_size_t skip_len)
+{
+ struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
+ const char *name = dev_name(&cxled->cxld.dev);
+ struct cxl_port *port = cxled_to_port(cxled);
+ struct resource *dpa_res = &cxlds->dpa_res;
+ struct device *dev = &port->dev;
+ struct resource *res;
+ int rc;
+
+ res = __request_region(dpa_res, skip_base, skip_len, name, 0);
+ if (!res)
+ return -EBUSY;
+
+ rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
+ if (rc) {
+ __release_region(dpa_res, skip_base, skip_len);
+ return rc;
+ }
+
+ dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
+ port->id, cxled->cxld.id, res);
+ return 0;
+}
+
+static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
+ resource_size_t base, resource_size_t skipped)
+{
+ struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+ struct cxl_port *port = cxled_to_port(cxled);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+ resource_size_t skip_base = base - skipped;
+ struct device *dev = &port->dev;
+ resource_size_t skip_len = 0;
+ int rc, index;
+
+ if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
+ skip_len = cxlds->ram_res.end - skip_base + 1;
+ rc = cxl_request_skip(cxled, skip_base, skip_len);
+ if (rc)
+ return rc;
+ skip_base += skip_len;
+ }
+
+ if (skip_base == base) {
+ dev_dbg(dev, "skip done ram!\n");
+ return 0;
+ }
+
+ if (resource_size(&cxlds->pmem_res) &&
+ skip_base <= cxlds->pmem_res.end) {
+ skip_len = cxlds->pmem_res.end - skip_base + 1;
+ rc = cxl_request_skip(cxled, skip_base, skip_len);
+ if (rc)
+ return rc;
+ skip_base += skip_len;
+ }
+
+ index = dc_mode_to_region_index(cxled->mode);
+ for (int i = 0; i <= index; i++) {
+ struct resource *dcr = &cxlds->dc_res[i];
+
+ if (skip_base < dcr->start) {
+ skip_len = dcr->start - skip_base;
+ rc = cxl_request_skip(cxled, skip_base, skip_len);
+ if (rc)
+ return rc;
+ skip_base += skip_len;
+ }
+
+ if (skip_base == base) {
+ dev_dbg(dev, "skip done DC region %d!\n", i);
+ break;
+ }
+
+ if (resource_size(dcr) && skip_base <= dcr->end) {
+ if (skip_base > base) {
+ dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
+ i, &skip_base, &base);
+ return -ENXIO;
+ }
+
+ skip_len = dcr->end - skip_base + 1;
+ rc = cxl_request_skip(cxled, skip_base, skip_len);
+ if (rc)
+ return rc;
+ skip_base += skip_len;
+ }
+ }
+
+ return 0;
+}
+
static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
resource_size_t base, resource_size_t len,
resource_size_t skipped)
@@ -300,13 +409,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
}

if (skipped) {
- res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
- dev_name(&cxled->cxld.dev), 0);
- if (!res) {
- dev_dbg(dev,
- "decoder%d.%d: failed to reserve skipped space\n",
- port->id, cxled->cxld.id);
- return -EBUSY;
+ int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
+
+ if (rc) {
+ dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
+ port->id, cxled->cxld.id, &base, &skipped);
+ return rc;
}
}
res = __request_region(&cxlds->dpa_res, base, len,
@@ -314,14 +422,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
if (!res) {
dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
port->id, cxled->cxld.id);
- if (skipped)
- __release_region(&cxlds->dpa_res, base - skipped,
- skipped);
+ cxl_skip_release(cxled);
return -EBUSY;
}
cxled->dpa_res = res;
cxled->skip = skipped;

+ for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
+ int index = dc_mode_to_region_index(mode);
+
+ if (resource_contains(&cxlds->dc_res[index], res)) {
+ cxled->mode = mode;
+ goto success;
+ }
+ }
if (resource_contains(&cxlds->pmem_res, res))
cxled->mode = CXL_DECODER_PMEM;
else if (resource_contains(&cxlds->ram_res, res))
@@ -332,6 +446,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
cxled->mode = CXL_DECODER_MIXED;
}

+success:
+ dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
+ cxled->dpa_res, cxled->mode);
port->hdm_end++;
get_device(&cxled->cxld.dev);
return 0;
@@ -463,14 +580,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,

int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
{
+ resource_size_t free_ram_start, free_pmem_start, free_dc_start;
struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
- resource_size_t free_ram_start, free_pmem_start;
struct cxl_port *port = cxled_to_port(cxled);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
struct device *dev = &cxled->cxld.dev;
resource_size_t start, avail, skip;
struct resource *p, *last;
- int rc;
+ int rc, dc_index;

down_write(&cxl_dpa_rwsem);
if (cxled->cxld.region) {
@@ -500,6 +617,21 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
else
free_pmem_start = cxlds->pmem_res.start;

+ /*
+ * Limit each decoder to a single DC region to map memory with
+ * different DSMAS entry.
+ */
+ dc_index = dc_mode_to_region_index(cxled->mode);
+ if (dc_index >= 0) {
+ if (cxlds->dc_res[dc_index].child) {
+ dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
+ dc_index);
+ rc = -EINVAL;
+ goto out;
+ }
+ free_dc_start = cxlds->dc_res[dc_index].start;
+ }
+
if (cxled->mode == CXL_DECODER_RAM) {
start = free_ram_start;
avail = cxlds->ram_res.end - start + 1;
@@ -521,12 +653,38 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
else
skip_end = start - 1;
skip = skip_end - skip_start + 1;
+ } else if (cxl_decoder_mode_is_dc(cxled->mode)) {
+ resource_size_t skip_start, skip_end;
+
+ start = free_dc_start;
+ avail = cxlds->dc_res[dc_index].end - start + 1;
+ if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
+ skip_start = free_ram_start;
+ else
+ skip_start = free_pmem_start;
+ /*
+ * If any dc region is already mapped, then that allocation
+ * already handled the RAM and PMEM skip. Check for DC region
+ * skip.
+ */
+ for (int i = dc_index - 1; i >= 0 ; i--) {
+ if (cxlds->dc_res[i].child) {
+ skip_start = cxlds->dc_res[i].child->end + 1;
+ break;
+ }
+ }
+
+ skip_end = start - 1;
+ skip = skip_end - skip_start + 1;
} else {
dev_dbg(dev, "mode not set\n");
rc = -EINVAL;
goto out;
}

+ dev_dbg(dev, "DPA Allocation start: %pa len: %#llx Skip: %pa\n",
+ &start, size, &skip);
+
if (size > avail) {
dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 80c0651794eb..036b61cb3007 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -434,6 +434,7 @@ static void cxl_endpoint_decoder_release(struct device *dev)
struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(dev);

__cxl_decoder_release(&cxled->cxld);
+ xa_destroy(&cxled->skip_res);
kfree(cxled);
}

@@ -1896,6 +1897,7 @@ struct cxl_endpoint_decoder *cxl_endpoint_decoder_alloc(struct cxl_port *port)
return ERR_PTR(-ENOMEM);

cxled->pos = -1;
+ xa_init(&cxled->skip_res);
cxld = &cxled->cxld;
rc = cxl_decoder_init(port, cxld);
if (rc) {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 3b8935089c0c..15d418b3bc9b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -441,6 +441,7 @@ enum cxl_decoder_state {
* @cxld: base cxl_decoder_object
* @dpa_res: actively claimed DPA span of this decoder
* @skip: offset into @dpa_res where @cxld.hpa_range maps
+ * @skip_res: array of skipped resources from the previous decoder end
* @mode: which memory type / access-mode-partition this decoder targets
* @state: autodiscovery state
* @pos: interleave position in @cxld.region
@@ -449,6 +450,7 @@ struct cxl_endpoint_decoder {
struct cxl_decoder cxld;
struct resource *dpa_res;
resource_size_t skip;
+ struct xarray skip_res;
enum cxl_decoder_mode mode;
enum cxl_decoder_state state;
int pos;

--
2.44.0


2024-03-25 06:23:07

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 11/26] cxl/pci: Delay event buffer allocation

The event buffer does not need to be allocated if something has failed
in setting up event irq's.

In prep for adjusting event configuration for DCD events move the buffer
allocation to the end of the event configuration.

Signed-off-by: Ira Weiny <[email protected]>
---
drivers/cxl/pci.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index cedd9b05f129..ccaf4ad26a4f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -756,10 +756,6 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
return 0;
}

- rc = cxl_mem_alloc_event_buf(mds);
- if (rc)
- return rc;
-
rc = cxl_event_get_int_policy(mds, &policy);
if (rc)
return rc;
@@ -777,6 +773,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
if (rc)
return rc;

+ rc = cxl_mem_alloc_event_buf(mds);
+ if (rc)
+ return rc;
+
rc = cxl_event_irqsetup(mds, &policy);
if (rc)
return rc;

--
2.44.0


2024-03-25 06:24:21

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 24/26] tools/testing/cxl: Make event logs dynamic

The test event logs were created as static arrays as an easy way to mock
events. Dynamic Capacity Device (DCD) test support requires events be
generated dynamically when extents are created or destroyed.

Modify the event log storage to be dynamically allocated. Reuse the
static event data to create the dynamic events in the new logs without
inventing complex event injection for the previous tests. Simplify the
processing of the logs by using the event log array index as the handle.
Add a lock to manage concurrency required when user space is allowed to
control DCD extents

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[iweiny: Adjust for new event code]
---
tools/testing/cxl/test/mem.c | 281 ++++++++++++++++++++++++++-----------------
1 file changed, 172 insertions(+), 109 deletions(-)

diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c
index 35ee41e435ab..d8d62e6eeb18 100644
--- a/tools/testing/cxl/test/mem.c
+++ b/tools/testing/cxl/test/mem.c
@@ -124,18 +124,27 @@ static struct {

#define PASS_TRY_LIMIT 3

-#define CXL_TEST_EVENT_CNT_MAX 15
+#define CXL_TEST_EVENT_CNT_MAX 17

/* Set a number of events to return at a time for simulation. */
#define CXL_TEST_EVENT_CNT 3

+/*
+ * @next_handle: next handle (index) to be stored to
+ * @cur_handle: current handle (index) to be returned to the user on get_event
+ * @nr_events: total events in this log
+ * @nr_overflow: number of events added past the log size
+ * @lock: protect these state variables
+ * @events: array of pending events to be returned.
+ */
struct mock_event_log {
- u16 clear_idx;
- u16 cur_idx;
+ u16 next_handle;
+ u16 cur_handle;
u16 nr_events;
u16 nr_overflow;
- u16 overflow_reset;
- struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX];
+ rwlock_t lock;
+ /* 1 extra slot to accommodate that handles can't be 0 */
+ struct cxl_event_record_raw *events[CXL_TEST_EVENT_CNT_MAX + 1];
};

struct mock_event_store {
@@ -170,64 +179,76 @@ static struct mock_event_log *event_find_log(struct device *dev, int log_type)
return &mdata->mes.mock_logs[log_type];
}

-static struct cxl_event_record_raw *event_get_current(struct mock_event_log *log)
-{
- return log->events[log->cur_idx];
-}
-
-static void event_reset_log(struct mock_event_log *log)
-{
- log->cur_idx = 0;
- log->clear_idx = 0;
- log->nr_overflow = log->overflow_reset;
-}
-
/* Handle can never be 0 use 1 based indexing for handle */
-static u16 event_get_clear_handle(struct mock_event_log *log)
+static void event_inc_handle(u16 *handle)
{
- return log->clear_idx + 1;
+ *handle = (*handle + 1) % CXL_TEST_EVENT_CNT_MAX;
+ if (!*handle)
+ *handle = *handle + 1;
}

-/* Handle can never be 0 use 1 based indexing for handle */
-static __le16 event_get_cur_event_handle(struct mock_event_log *log)
-{
- u16 cur_handle = log->cur_idx + 1;
-
- return cpu_to_le16(cur_handle);
-}
-
-static bool event_log_empty(struct mock_event_log *log)
-{
- return log->cur_idx == log->nr_events;
-}
-
-static void mes_add_event(struct mock_event_store *mes,
+/* Add the event or free it on 'overflow' */
+static void mes_add_event(struct cxl_mockmem_data *mdata,
enum cxl_event_log_type log_type,
struct cxl_event_record_raw *event)
{
+ struct device *dev = mdata->mds->cxlds.dev;
struct mock_event_log *log;
+ u16 handle;

if (WARN_ON(log_type >= CXL_EVENT_TYPE_MAX))
return;

- log = &mes->mock_logs[log_type];
+ log = &mdata->mes.mock_logs[log_type];

- if ((log->nr_events + 1) > CXL_TEST_EVENT_CNT_MAX) {
+ write_lock(&log->lock);
+
+ handle = log->next_handle;
+ if ((handle + 1) == log->cur_handle) {
log->nr_overflow++;
- log->overflow_reset = log->nr_overflow;
- return;
+ dev_dbg(dev, "Overflowing %d\n", log_type);
+ devm_kfree(dev, event);
+ goto unlock;
}

- log->events[log->nr_events] = event;
+ dev_dbg(dev, "Log %d; handle %u\n", log_type, handle);
+ event->event.generic.hdr.handle = cpu_to_le16(handle);
+ log->events[handle] = event;
+ event_inc_handle(&log->next_handle);
log->nr_events++;
+
+unlock:
+ write_unlock(&log->lock);
+}
+
+static void mes_del_event(struct device *dev,
+ struct mock_event_log *log,
+ u16 handle)
+{
+ struct cxl_event_record_raw *cur;
+
+ lockdep_assert(lockdep_is_held(&log->lock));
+
+ dev_dbg(dev, "Clearing event %u; cur %u\n", handle, log->cur_handle);
+ cur = log->events[handle];
+ if (!cur) {
+ dev_err(dev, "Mock event index %u empty? nr_events %u",
+ handle, log->nr_events);
+ return;
+ }
+ log->events[handle] = NULL;
+
+ event_inc_handle(&log->cur_handle);
+ log->nr_events--;
+ devm_kfree(dev, cur);
}

static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
{
struct cxl_get_event_payload *pl;
struct mock_event_log *log;
- u16 nr_overflow;
u8 log_type;
+ u16 handle;
int i;

if (cmd->size_in != sizeof(log_type))
@@ -240,31 +261,40 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
if (log_type >= CXL_EVENT_TYPE_MAX)
return -EINVAL;

- memset(cmd->payload_out, 0, cmd->size_out);
-
log = event_find_log(dev, log_type);
- if (!log || event_log_empty(log))
+ if (!log)
return 0;

+ memset(cmd->payload_out, 0, cmd->size_out);
pl = cmd->payload_out;

- for (i = 0; i < CXL_TEST_EVENT_CNT && !event_log_empty(log); i++) {
- memcpy(&pl->records[i], event_get_current(log),
- sizeof(pl->records[i]));
- pl->records[i].event.generic.hdr.handle =
- event_get_cur_event_handle(log);
- log->cur_idx++;
+ read_lock(&log->lock);
+
+ handle = log->cur_handle;
+ dev_dbg(dev, "Get log %d handle %u next %u\n",
+ log_type, handle, log->next_handle);
+ for (i = 0;
+ i < CXL_TEST_EVENT_CNT && handle != log->next_handle;
+ i++, event_inc_handle(&handle)) {
+ struct cxl_event_record_raw *cur;
+
+ cur = log->events[handle];
+ dev_dbg(dev, "Sending event log %d handle %d idx %u\n",
+ log_type, le16_to_cpu(cur->event.generic.hdr.handle),
+ handle);
+ memcpy(&pl->records[i], cur, sizeof(pl->records[i]));
+ pl->records[i].event.generic.hdr.handle = cpu_to_le16(handle);
}

pl->record_count = cpu_to_le16(i);
- if (!event_log_empty(log))
+ if (log->nr_events > i)
pl->flags |= CXL_GET_EVENT_FLAG_MORE_RECORDS;

if (log->nr_overflow) {
u64 ns;

pl->flags |= CXL_GET_EVENT_FLAG_OVERFLOW;
- pl->overflow_err_count = cpu_to_le16(nr_overflow);
+ pl->overflow_err_count = cpu_to_le16(log->nr_overflow);
ns = ktime_get_real_ns();
ns -= 5000000000; /* 5s ago */
pl->first_overflow_timestamp = cpu_to_le64(ns);
@@ -273,16 +303,17 @@ static int mock_get_event(struct device *dev, struct cxl_mbox_cmd *cmd)
pl->last_overflow_timestamp = cpu_to_le64(ns);
}

+ read_unlock(&log->lock);
return 0;
}

static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
{
struct cxl_mbox_clear_event_payload *pl = cmd->payload_in;
- struct mock_event_log *log;
u8 log_type = pl->event_log;
+ struct mock_event_log *log;
+ int nr, rc = 0;
u16 handle;
- int nr;

if (log_type >= CXL_EVENT_TYPE_MAX)
return -EINVAL;
@@ -291,24 +322,23 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
if (!log)
return 0; /* No mock data in this log */

- /*
- * This check is technically not invalid per the specification AFAICS.
- * (The host could 'guess' handles and clear them in order).
- * However, this is not good behavior for the host so test it.
- */
- if (log->clear_idx + pl->nr_recs > log->cur_idx) {
- dev_err(dev,
- "Attempting to clear more events than returned!\n");
- return -EINVAL;
- }
+ write_lock(&log->lock);

/* Check handle order prior to clearing events */
- for (nr = 0, handle = event_get_clear_handle(log);
- nr < pl->nr_recs;
- nr++, handle++) {
+ handle = log->cur_handle;
+ for (nr = 0;
+ nr < pl->nr_recs && handle != log->next_handle;
+ nr++, event_inc_handle(&handle)) {
+
+ dev_dbg(dev, "Checking clear of %d handle %u plhandle %u\n",
+ log_type, handle,
+ le16_to_cpu(pl->handles[nr]));
+
if (handle != le16_to_cpu(pl->handles[nr])) {
- dev_err(dev, "Clearing events out of order\n");
- return -EINVAL;
+ dev_err(dev, "Clearing events out of order %u %u\n",
+ handle, le16_to_cpu(pl->handles[nr]));
+ rc = -EINVAL;
+ goto unlock;
}
}

@@ -316,25 +346,12 @@ static int mock_clear_event(struct device *dev, struct cxl_mbox_cmd *cmd)
log->nr_overflow = 0;

/* Clear events */
- log->clear_idx += pl->nr_recs;
- return 0;
-}
+ for (nr = 0; nr < pl->nr_recs; nr++)
+ mes_del_event(dev, log, le16_to_cpu(pl->handles[nr]));

-static void cxl_mock_event_trigger(struct device *dev)
-{
- struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
- struct mock_event_store *mes = &mdata->mes;
- int i;
-
- for (i = CXL_EVENT_TYPE_INFO; i < CXL_EVENT_TYPE_MAX; i++) {
- struct mock_event_log *log;
-
- log = event_find_log(dev, i);
- if (log)
- event_reset_log(log);
- }
-
- cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+unlock:
+ write_unlock(&log->lock);
+ return rc;
}

struct cxl_event_record_raw maint_needed = {
@@ -459,8 +476,27 @@ static int mock_set_timestamp(struct cxl_dev_state *cxlds,
return 0;
}

-static void cxl_mock_add_event_logs(struct mock_event_store *mes)
+/* Create a dynamically allocated event out of a statically defined event. */
+static void add_event_from_static(struct cxl_mockmem_data *mdata,
+ enum cxl_event_log_type log_type,
+ struct cxl_event_record_raw *raw)
{
+ struct device *dev = mdata->mds->cxlds.dev;
+ struct cxl_event_record_raw *rec;
+
+ rec = devm_kmemdup(dev, raw, sizeof(*rec), GFP_KERNEL);
+ if (!rec) {
+ dev_err(dev, "Failed to alloc event for log\n");
+ return;
+ }
+ mes_add_event(mdata, log_type, rec);
+}
+
+static void cxl_mock_add_event_logs(struct cxl_mockmem_data *mdata)
+{
+ struct mock_event_store *mes = &mdata->mes;
+ struct device *dev = mdata->mds->cxlds.dev;
+
put_unaligned_le16(CXL_GMER_VALID_CHANNEL | CXL_GMER_VALID_RANK,
&gen_media.rec.validity_flags);

@@ -468,43 +504,60 @@ static void cxl_mock_add_event_logs(struct mock_event_store *mes)
CXL_DER_VALID_BANK | CXL_DER_VALID_COLUMN,
&dram.rec.validity_flags);

- mes_add_event(mes, CXL_EVENT_TYPE_INFO, &maint_needed);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_INFO);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO, &maint_needed);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
(struct cxl_event_record_raw *)&gen_media);
- mes_add_event(mes, CXL_EVENT_TYPE_INFO,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_INFO,
(struct cxl_event_record_raw *)&mem_module);
mes->ev_status |= CXLDEV_EVENT_STATUS_INFO;

- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &maint_needed);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_FAIL);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &maint_needed);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
+ (struct cxl_event_record_raw *)&mem_module);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&dram);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&gen_media);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&mem_module);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL,
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL,
(struct cxl_event_record_raw *)&dram);
/* Overflow this log */
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FAIL, &hardware_replace);
mes->ev_status |= CXLDEV_EVENT_STATUS_FAIL;

- mes_add_event(mes, CXL_EVENT_TYPE_FATAL, &hardware_replace);
- mes_add_event(mes, CXL_EVENT_TYPE_FATAL,
+ dev_dbg(dev, "Generating fake event logs %d\n",
+ CXL_EVENT_TYPE_FATAL);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL, &hardware_replace);
+ add_event_from_static(mdata, CXL_EVENT_TYPE_FATAL,
(struct cxl_event_record_raw *)&dram);
mes->ev_status |= CXLDEV_EVENT_STATUS_FATAL;
}

+static void cxl_mock_event_trigger(struct device *dev)
+{
+ struct cxl_mockmem_data *mdata = dev_get_drvdata(dev);
+ struct mock_event_store *mes = &mdata->mes;
+
+ cxl_mock_add_event_logs(mdata);
+ cxl_mem_get_event_records(mdata->mds, mes->ev_status);
+}
+
static int mock_gsl(struct cxl_mbox_cmd *cmd)
{
if (cmd->size_out < sizeof(mock_gsl_payload))
@@ -1438,6 +1491,14 @@ static ssize_t event_trigger_store(struct device *dev,
}
static DEVICE_ATTR_WO(event_trigger);

+static void init_event_log(struct mock_event_log *log)
+{
+ rwlock_init(&log->lock);
+ /* Handle can never be 0 use 1 based indexing for handle */
+ log->cur_handle = 1;
+ log->next_handle = 1;
+}
+
static int cxl_mock_mem_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
@@ -1504,7 +1565,9 @@ static int cxl_mock_mem_probe(struct platform_device *pdev)
if (rc)
return rc;

- cxl_mock_add_event_logs(&mdata->mes);
+ for (int i = 0; i < CXL_EVENT_TYPE_MAX; i++)
+ init_event_log(&mdata->mes.mock_logs[i]);
+ cxl_mock_add_event_logs(mdata);

cxlmd = devm_cxl_add_memdev(&pdev->dev, cxlds);
if (IS_ERR(cxlmd))

--
2.44.0


2024-03-25 08:12:39

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

From: Navneet Singh <[email protected]>

Until now region modes and decoder modes were equivalent in that they
were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
regions (which will represent an array of device regions [better named
partitions] the index of which could be different on different
interleaved devices), the mode of an endpoint decoder and a region will
no longer be equivalent.

Define a new region mode enumeration and adjust the code for it.

Suggested-by: Jonathan Cameron <[email protected]>
Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
<none>
---
drivers/cxl/core/region.c | 77 +++++++++++++++++++++++++++++++++++------------
drivers/cxl/cxl.h | 26 ++++++++++++++--
2 files changed, 81 insertions(+), 22 deletions(-)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 4c7fd2d5cccb..1723d17f121e 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -40,7 +40,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
rc = down_read_interruptible(&cxl_region_rwsem);
if (rc)
return rc;
- if (cxlr->mode != CXL_DECODER_PMEM)
+ if (cxlr->mode != CXL_REGION_PMEM)
rc = sysfs_emit(buf, "\n");
else
rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
@@ -353,7 +353,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
* Support tooling that expects to find a 'uuid' attribute for all
* regions regardless of mode.
*/
- if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
+ if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
return 0444;
return a->mode;
}
@@ -516,7 +516,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
{
struct cxl_region *cxlr = to_cxl_region(dev);

- return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
+ return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
}
static DEVICE_ATTR_RO(mode);

@@ -542,7 +542,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)

/* ways, granularity and uuid (if PMEM) need to be set before HPA */
if (!p->interleave_ways || !p->interleave_granularity ||
- (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
+ (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
return -ENXIO;

div64_u64_rem(size, (u64)SZ_256M * p->interleave_ways, &remainder);
@@ -1683,6 +1683,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
return rc;
}

+static bool cxl_modes_compatible(enum cxl_region_mode rmode,
+ enum cxl_decoder_mode dmode)
+{
+ if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
+ return true;
+ if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
+ return true;
+
+ return false;
+}
+
static int cxl_region_attach(struct cxl_region *cxlr,
struct cxl_endpoint_decoder *cxled, int pos)
{
@@ -1693,9 +1704,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
struct cxl_dport *dport;
int rc = -ENXIO;

- if (cxled->mode != cxlr->mode) {
- dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
- dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
+ if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
+ dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
+ dev_name(&cxled->cxld.dev),
+ cxl_region_mode_name(cxlr->mode),
+ cxl_decoder_mode_name(cxled->mode));
return -EINVAL;
}

@@ -2168,7 +2181,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
* devm_cxl_add_region - Adds a region to a decoder
* @cxlrd: root decoder
* @id: memregion id to create, or memregion_free() on failure
- * @mode: mode for the endpoint decoders of this region
+ * @mode: mode of this region
* @type: select whether this is an expander or accelerator (type-2 or type-3)
*
* This is the second step of region initialization. Regions exist within an
@@ -2179,7 +2192,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
*/
static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
int id,
- enum cxl_decoder_mode mode,
+ enum cxl_region_mode mode,
enum cxl_decoder_type type)
{
struct cxl_port *port = to_cxl_port(cxlrd->cxlsd.cxld.dev.parent);
@@ -2188,11 +2201,12 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
int rc;

switch (mode) {
- case CXL_DECODER_RAM:
- case CXL_DECODER_PMEM:
+ case CXL_REGION_RAM:
+ case CXL_REGION_PMEM:
break;
default:
- dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
+ dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
+ cxl_region_mode_name(mode));
return ERR_PTR(-EINVAL);
}

@@ -2242,7 +2256,7 @@ static ssize_t create_ram_region_show(struct device *dev,
}

static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
- enum cxl_decoder_mode mode, int id)
+ enum cxl_region_mode mode, int id)
{
int rc;

@@ -2270,7 +2284,7 @@ static ssize_t create_pmem_region_store(struct device *dev,
if (rc != 1)
return -EINVAL;

- cxlr = __create_region(cxlrd, CXL_DECODER_PMEM, id);
+ cxlr = __create_region(cxlrd, CXL_REGION_PMEM, id);
if (IS_ERR(cxlr))
return PTR_ERR(cxlr);

@@ -2290,7 +2304,7 @@ static ssize_t create_ram_region_store(struct device *dev,
if (rc != 1)
return -EINVAL;

- cxlr = __create_region(cxlrd, CXL_DECODER_RAM, id);
+ cxlr = __create_region(cxlrd, CXL_REGION_RAM, id);
if (IS_ERR(cxlr))
return PTR_ERR(cxlr);

@@ -2800,6 +2814,24 @@ static int match_region_by_range(struct device *dev, void *data)
return rc;
}

+static enum cxl_region_mode
+cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
+{
+ switch (mode) {
+ case CXL_DECODER_NONE:
+ return CXL_REGION_NONE;
+ case CXL_DECODER_RAM:
+ return CXL_REGION_RAM;
+ case CXL_DECODER_PMEM:
+ return CXL_REGION_PMEM;
+ case CXL_DECODER_MIXED:
+ default:
+ return CXL_REGION_MIXED;
+ }
+
+ return CXL_REGION_MIXED;
+}
+
/* Establish an empty region covering the given HPA range */
static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
struct cxl_endpoint_decoder *cxled)
@@ -2808,12 +2840,17 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
struct cxl_port *port = cxlrd_to_port(cxlrd);
struct range *hpa = &cxled->cxld.hpa_range;
struct cxl_region_params *p;
+ enum cxl_region_mode mode;
struct cxl_region *cxlr;
struct resource *res;
int rc;

+ if (cxled->mode == CXL_DECODER_DEAD)
+ return ERR_PTR(-EINVAL);
+
+ mode = cxl_decoder_to_region_mode(cxled->mode);
do {
- cxlr = __create_region(cxlrd, cxled->mode,
+ cxlr = __create_region(cxlrd, mode,
atomic_read(&cxlrd->region_id));
} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);

@@ -2996,9 +3033,9 @@ static int cxl_region_probe(struct device *dev)
return rc;

switch (cxlr->mode) {
- case CXL_DECODER_PMEM:
+ case CXL_REGION_PMEM:
return devm_cxl_add_pmem_region(cxlr);
- case CXL_DECODER_RAM:
+ case CXL_REGION_RAM:
/*
* The region can not be manged by CXL if any portion of
* it is already online as 'System RAM'
@@ -3010,8 +3047,8 @@ static int cxl_region_probe(struct device *dev)
return 0;
return devm_cxl_add_dax_region(cxlr);
default:
- dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
- cxlr->mode);
+ dev_dbg(&cxlr->dev, "unsupported region mode: %s\n",
+ cxl_region_mode_name(cxlr->mode));
return -ENXIO;
}
}
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 003feebab79b..9a0cce1e6fca 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -383,6 +383,27 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
return "mixed";
}

+enum cxl_region_mode {
+ CXL_REGION_NONE,
+ CXL_REGION_RAM,
+ CXL_REGION_PMEM,
+ CXL_REGION_MIXED,
+};
+
+static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
+{
+ static const char * const names[] = {
+ [CXL_REGION_NONE] = "none",
+ [CXL_REGION_RAM] = "ram",
+ [CXL_REGION_PMEM] = "pmem",
+ [CXL_REGION_MIXED] = "mixed",
+ };
+
+ if (mode >= CXL_REGION_NONE && mode <= CXL_REGION_MIXED)
+ return names[mode];
+ return "mixed";
+}
+
/*
* Track whether this decoder is reserved for region autodiscovery, or
* free for userspace provisioning.
@@ -511,7 +532,8 @@ struct cxl_region_params {
* struct cxl_region - CXL region
* @dev: This region's device
* @id: This region's id. Id is globally unique across all regions
- * @mode: Endpoint decoder allocation / access mode
+ * @mode: Region mode which defines which endpoint decoder mode the region is
+ * compatible with
* @type: Endpoint decoder target type
* @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
* @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
@@ -521,7 +543,7 @@ struct cxl_region_params {
struct cxl_region {
struct device dev;
int id;
- enum cxl_decoder_mode mode;
+ enum cxl_region_mode mode;
enum cxl_decoder_type type;
struct cxl_nvdimm_bridge *cxl_nvb;
struct cxl_pmem_region *cxlr_pmem;

--
2.44.0


2024-03-25 08:12:49

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 20/26] dax: Document dax dev range tuple

The device DAX structure is being enhanced to track additional DCD
information.

The current range tuple was not fully documented. Document it prior to
adding information for DC.

Suggested-by: Jonathan Cameron <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[iweiny: new patch]
---
drivers/dax/dax-private.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index c6319c6567fb..ac1ccf158650 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -70,7 +70,10 @@ struct dax_mapping {
* @dev - device core
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
* @nr_range: size of @ranges
- * @ranges: resource-span + pgoff tuples for the instance
+ * @ranges: range tuples of memory used
+ * @pgoff: page offset
+ * @range: resource-span
+ * @mapping: device to assist in interrogating the range layout
*/
struct dev_dax {
struct dax_region *region;

--
2.44.0


2024-03-25 08:12:49

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 17/26] dax/region: Create extent resources on DAX region driver load

From: Navneet Singh <[email protected]>

DAX regions mapping dynamic capacity partitions introduce a requirement
for the memory backing the region to come and go as required. This
results in a DAX region with sparse areas of memory backing. To track
the sparseness of the region, DAX extent objects need to track
sub-resource information as a new layer between the DAX region resource
and DAX device range resources.

Recall that DCD extents may be accepted when a region is first created.
Extend this support on region driver load. Scan existing extents and
create DAX extent resources as a first step to DAX extent realization.

The lifetime of a DAX extent is tricky to manage because the extent life
may end in one of two ways. First, the device may request the extent be
released. Second, the region may release the extent when it is
destroyed without hardware involvement. Support extent release without
hardware involvement first. Subsequent patches will provide for
hardware to request extent removal.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1
[iweiny: remove xarrays]
[iweiny: remove as much of extra reference stuff as possible]
[iweiny: Move extent resource handling to core DAX code]
---
drivers/dax/bus.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/dax/cxl.c | 43 ++++++++++++++++++++++++++++++++++--
drivers/dax/dax-private.h | 12 +++++++++++
3 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 903566aff5eb..4d5ed7ab6537 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -186,6 +186,61 @@ static bool is_sparse(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
}

+static int dax_region_add_resource(struct dax_region *dax_region,
+ struct dax_extent *dax_ext,
+ resource_size_t start,
+ resource_size_t length)
+{
+ struct resource *ext_res;
+
+ dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
+ ext_res = __request_region(&dax_region->res, start, length, "extent", 0);
+ if (!ext_res) {
+ dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
+ &start, &length);
+ return -ENOSPC;
+ }
+
+ dax_ext->region = dax_region;
+ dax_ext->res = ext_res;
+ dev_dbg(dax_region->dev, "Extent add resource %pr\n", ext_res);
+
+ return 0;
+}
+
+static void dax_region_release_extent(void *ext)
+{
+ struct dax_extent *dax_ext = ext;
+ struct dax_region *dax_region = dax_ext->region;
+
+ dev_dbg(dax_region->dev, "Extent release resource %pr\n", dax_ext->res);
+ if (dax_ext->res)
+ __release_region(&dax_region->res, dax_ext->res->start,
+ resource_size(dax_ext->res));
+
+ kfree(dax_ext);
+}
+
+int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
+ resource_size_t start, resource_size_t length)
+{
+ int rc;
+
+ struct dax_extent *dax_ext __free(kfree) = kzalloc(sizeof(*dax_ext),
+ GFP_KERNEL);
+ if (!dax_ext)
+ return -ENOMEM;
+
+ guard(rwsem_write)(&dax_region_rwsem);
+ rc = dax_region_add_resource(dax_region, dax_ext, start, length);
+ if (rc)
+ return rc;
+
+ return devm_add_action_or_reset(ext_dev, dax_region_release_extent,
+ no_free_ptr(dax_ext));
+}
+EXPORT_SYMBOL_GPL(dax_region_add_extent);
+
bool static_dev_dax(struct dev_dax *dev_dax)
{
return is_static(dev_dax->region);
diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
index 415d03fbf9b6..70bdc7a878ab 100644
--- a/drivers/dax/cxl.c
+++ b/drivers/dax/cxl.c
@@ -5,6 +5,42 @@

#include "../cxl/cxl.h"
#include "bus.h"
+#include "dax-private.h"
+
+static int __cxl_dax_region_add_extent(struct dax_region *dax_region,
+ struct region_extent *reg_ext)
+{
+ struct device *ext_dev = &reg_ext->dev;
+ resource_size_t start, length;
+
+ dev_dbg(dax_region->dev, "Adding extent HPA %#llx - %#llx\n",
+ reg_ext->hpa_range.start, reg_ext->hpa_range.end);
+
+ start = dax_region->res.start + reg_ext->hpa_range.start;
+ length = reg_ext->hpa_range.end - reg_ext->hpa_range.start + 1;
+
+ return dax_region_add_extent(dax_region, ext_dev, start, length);
+}
+
+static int cxl_dax_region_add_extent(struct device *dev, void *data)
+{
+ struct dax_region *dax_region = data;
+ struct region_extent *reg_ext;
+
+ if (!is_region_extent(dev))
+ return 0;
+
+ reg_ext = to_region_extent(dev);
+
+ return __cxl_dax_region_add_extent(dax_region, reg_ext);
+}
+
+static void cxl_dax_region_add_extents(struct cxl_dax_region *cxlr_dax,
+ struct dax_region *dax_region)
+{
+ dev_dbg(&cxlr_dax->dev, "Adding extents\n");
+ device_for_each_child(&cxlr_dax->dev, dax_region, cxl_dax_region_add_extent);
+}

static int cxl_dax_region_probe(struct device *dev)
{
@@ -29,9 +65,12 @@ static int cxl_dax_region_probe(struct device *dev)
return -ENOMEM;

dev_size = range_len(&cxlr_dax->hpa_range);
- /* Add empty seed dax device */
- if (cxlr->mode == CXL_REGION_DC)
+ if (cxlr->mode == CXL_REGION_DC) {
+ /* NOTE: Depends on dax_region being set in driver data */
+ cxl_dax_region_add_extents(cxlr_dax, dax_region);
+ /* Add empty seed dax device */
dev_size = 0;
+ }

data = (struct dev_dax_data) {
.dax_region = dax_region,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 446617b73aea..c6319c6567fb 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -16,6 +16,18 @@ struct inode *dax_inode(struct dax_device *dax_dev);
int dax_bus_init(void);
void dax_bus_exit(void);

+/**
+ * struct dax_extent - For sparse regions; an active extent
+ * @region: dax_region this resources is in
+ * @res: resource this extent covers
+ */
+struct dax_extent {
+ struct dax_region *region;
+ struct resource *res;
+};
+int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
+ resource_size_t start, resource_size_t length);
+
/**
* struct dax_region - mapping infrastructure for dax devices
* @id: kernel-wide unique region for a memory range

--
2.44.0


2024-03-25 10:07:04

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 14/26] cxl/region: Read existing extents on region creation

From: Navneet Singh <[email protected]>

Dynamic capacity device extents may be left in an accepted state on a
device due to an unexpected host crash. In this case creation of a new
region on top of the DC partition (region) is expected to expose those
extents for continued use.

Once all endpoint decoders are part of a region and the region is being
realized read the device extent list. For ease of review, this patch
stops after reading the extent list and leaves realization of the region
extents to a future patch.

Signed-off-by: Navneet Singh <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for v1:
[iweiny: remove extent list xarray]
[iweiny: Update spec references to 3.1]
[iweiny: use struct range in extents]
[iweiny: remove all reference tracking and let regions track extents
through the extent devices.]
[djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
---
drivers/cxl/core/core.h | 9 +++
drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/region.c | 29 +++++++
drivers/cxl/cxlmem.h | 49 ++++++++++++
4 files changed, 279 insertions(+)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 91abeffbe985..119b12362977 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -4,6 +4,8 @@
#ifndef __CXL_CORE_H__
#define __CXL_CORE_H__

+#include <cxlmem.h>
+
extern const struct device_type cxl_nvdimm_bridge_type;
extern const struct device_type cxl_nvdimm_type;
extern const struct device_type cxl_pmu_type;
@@ -28,6 +30,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
int cxl_region_init(void);
void cxl_region_exit(void);
int cxl_get_poison_by_endpoint(struct cxl_port *port);
+int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dc_extent *dc_extent);
#else
static inline int cxl_get_poison_by_endpoint(struct cxl_port *port)
{
@@ -43,6 +47,11 @@ static inline int cxl_region_init(void)
static inline void cxl_region_exit(void)
{
}
+static inline int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dc_extent *dc_extent)
+{
+ return 0;
+}
#define CXL_REGION_ATTR(x) NULL
#define CXL_REGION_TYPE(x) NULL
#define SET_CXL_REGION_ATTR(x)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index 58b31fa47b93..9e33a0976828 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
}
EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);

+static int cxl_validate_extent(struct cxl_memdev_state *mds,
+ struct cxl_dc_extent *dc_extent)
+{
+ struct device *dev = mds->cxlds.dev;
+ uint64_t start, len;
+
+ start = le64_to_cpu(dc_extent->start_dpa);
+ len = le64_to_cpu(dc_extent->length);
+
+ /* Extents must not cross region boundary's */
+ for (int i = 0; i < mds->nr_dc_region; i++) {
+ struct cxl_dc_region_info *dcr = &mds->dc_region[i];
+
+ if (dcr->base <= start &&
+ (start + len) <= (dcr->base + dcr->decode_len)) {
+ dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
+ start, start + len - 1, i, start - dcr->base);
+ return 0;
+ }
+ }
+
+ dev_err_ratelimited(dev,
+ "DC extent DPA %#llx - %#llx is not in any DC region\n",
+ start, start + len - 1);
+ return -EINVAL;
+}
+
+static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dc_extent *extent)
+{
+ uint64_t start = le64_to_cpu(extent->start_dpa);
+ uint64_t length = le64_to_cpu(extent->length);
+ struct range ext_range = (struct range){
+ .start = start,
+ .end = start + length - 1,
+ };
+ struct range ed_range = (struct range) {
+ .start = cxled->dpa_res->start,
+ .end = cxled->dpa_res->end,
+ };
+
+ dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
+ cxled->dpa_res, start, length);
+
+ return range_contains(&ed_range, &ext_range);
+}
+
void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
enum cxl_event_log_type type,
enum cxl_event_type event_type,
@@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
return rc;
}

+static struct cxl_memdev_state *
+cxled_to_mds(struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+ return container_of(cxlds, struct cxl_memdev_state, cxlds);
+}
+
static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
enum cxl_event_log_type type)
{
@@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
}
EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);

+static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
+ unsigned int *extent_gen_num)
+{
+ struct cxl_mbox_get_dc_extent_in get_dc_extent;
+ struct cxl_mbox_get_dc_extent_out dc_extents;
+ struct cxl_mbox_cmd mbox_cmd;
+ unsigned int count;
+ int rc;
+
+ get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
+ .extent_cnt = cpu_to_le32(0),
+ .start_extent_index = cpu_to_le32(0),
+ };
+
+ mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+ .payload_in = &get_dc_extent,
+ .size_in = sizeof(get_dc_extent),
+ .size_out = sizeof(dc_extents),
+ .payload_out = &dc_extents,
+ .min_out = 1,
+ };
+
+ rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ count = le32_to_cpu(dc_extents.total_extent_cnt);
+ *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
+
+ return count;
+}
+
+static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
+ unsigned int start_gen_num,
+ unsigned int exp_cnt)
+{
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ unsigned int start_index, total_read;
+ struct device *dev = mds->cxlds.dev;
+ struct cxl_mbox_cmd mbox_cmd;
+
+ struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
+ kvmalloc(mds->payload_size, GFP_KERNEL);
+ if (!dc_extents)
+ return -ENOMEM;
+
+ total_read = 0;
+ start_index = 0;
+ do {
+ unsigned int nr_ext, total_extent_cnt, gen_num;
+ struct cxl_mbox_get_dc_extent_in get_dc_extent;
+ int rc;
+
+ get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
+ .extent_cnt = cpu_to_le32(exp_cnt - start_index),
+ .start_extent_index = cpu_to_le32(start_index),
+ };
+
+ mbox_cmd = (struct cxl_mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
+ .payload_in = &get_dc_extent,
+ .size_in = sizeof(get_dc_extent),
+ .size_out = mds->payload_size,
+ .payload_out = dc_extents,
+ .min_out = 1,
+ };
+
+ rc = cxl_internal_send_cmd(mds, &mbox_cmd);
+ if (rc < 0)
+ return rc;
+
+ nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
+ total_read += nr_ext;
+ total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
+ gen_num = le32_to_cpu(dc_extents->extent_list_num);
+
+ dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
+ total_extent_cnt, gen_num);
+
+ if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
+ dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
+ gen_num, start_gen_num, exp_cnt, total_extent_cnt);
+ return -EIO;
+ }
+
+ for (int i = 0; i < nr_ext ; i++) {
+ dev_dbg(dev, "Processing extent %d/%d\n",
+ start_index + i, exp_cnt);
+ rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
+ if (rc)
+ continue;
+ if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
+ continue;
+ rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
+ if (rc)
+ return rc;
+ }
+
+ start_index += nr_ext;
+ } while (exp_cnt > total_read);
+
+ return 0;
+}
+
+/**
+ * cxl_read_dc_extents() - Read any existing extents
+ * @cxled: Endpoint decoder which is part of a region
+ *
+ * Issue the Get Dynamic Capacity Extent List command to the device
+ * and add any existing extents found which belong to this decoder.
+ *
+ * Return: 0 if command was executed successfully, -ERRNO on error.
+ */
+int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
+{
+ struct cxl_memdev_state *mds = cxled_to_mds(cxled);
+ struct device *dev = mds->cxlds.dev;
+ unsigned int extent_gen_num;
+ int rc;
+
+ if (!cxl_dcd_supported(mds)) {
+ dev_dbg(dev, "DCD unsupported\n");
+ return 0;
+ }
+
+ rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
+ dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",
+ rc, extent_gen_num);
+ if (rc <= 0) /* 0 == no records found */
+ return rc;
+
+ return cxl_dev_get_dc_extents(cxled, extent_gen_num, rc);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_read_dc_extents, CXL);
+
static int add_dpa_res(struct device *dev, struct resource *parent,
struct resource *res, resource_size_t start,
resource_size_t size, const char *type)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 0d7b09a49dcf..3e563ab29afe 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -1450,6 +1450,13 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
return 0;
}

+/* Callers are expected to ensure cxled has been attached to a region */
+int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
+ struct cxl_dc_extent *dc_extent)
+{
+ return 0;
+}
+
static int cxl_region_attach_position(struct cxl_region *cxlr,
struct cxl_root_decoder *cxlrd,
struct cxl_endpoint_decoder *cxled,
@@ -2773,6 +2780,22 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
return rc;
}

+static int cxl_region_read_extents(struct cxl_region *cxlr)
+{
+ struct cxl_region_params *p = &cxlr->params;
+ int i;
+
+ for (i = 0; i < p->nr_targets; i++) {
+ int rc;
+
+ rc = cxl_read_dc_extents(p->targets[i]);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+
static void cxlr_dax_unregister(void *_cxlr_dax)
{
struct cxl_dax_region *cxlr_dax = _cxlr_dax;
@@ -2807,6 +2830,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
dev_name(dev));

+ if (cxlr->mode == CXL_REGION_DC) {
+ rc = cxl_region_read_extents(cxlr);
+ if (rc)
+ goto err;
+ }
+
return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
cxlr_dax);
err:
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 01bee6eedff3..8f2d8944d334 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -604,6 +604,54 @@ enum cxl_opcode {
UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
0x40, 0x3d, 0x86)

+/*
+ * Add Dynamic Capacity Response
+ * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
+ */
+struct cxl_mbox_dc_response {
+ __le32 extent_list_size;
+ u8 flags;
+ u8 reserved[3];
+ struct updated_extent_list {
+ __le64 dpa_start;
+ __le64 length;
+ u8 reserved[8];
+ } __packed extent_list[];
+} __packed;
+
+/*
+ * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
+ */
+#define CXL_DC_EXTENT_TAG_LEN 0x10
+struct cxl_dc_extent {
+ __le64 start_dpa;
+ __le64 length;
+ u8 tag[CXL_DC_EXTENT_TAG_LEN];
+ __le16 shared_extn_seq;
+ u8 reserved[6];
+} __packed;
+
+/*
+ * Get Dynamic Capacity Extent List; Input Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
+ */
+struct cxl_mbox_get_dc_extent_in {
+ __le32 extent_cnt;
+ __le32 start_extent_index;
+} __packed;
+
+/*
+ * Get Dynamic Capacity Extent List; Output Payload
+ * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
+ */
+struct cxl_mbox_get_dc_extent_out {
+ __le32 ret_extent_cnt;
+ __le32 total_extent_cnt;
+ __le32 extent_list_num;
+ u8 rsvd[4];
+ struct cxl_dc_extent extent[];
+} __packed;
+
struct cxl_mbox_get_supported_logs {
__le16 entries;
u8 rsvd[6];
@@ -879,6 +927,7 @@ int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
struct cxl_mbox_cmd *cmd);
int cxl_dev_state_identify(struct cxl_memdev_state *mds);
int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
+int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled);
int cxl_await_media_ready(struct cxl_dev_state *cxlds);
int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
int cxl_mem_create_range_info(struct cxl_memdev_state *mds);

--
2.44.0


2024-03-25 17:23:36

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)

On Sun, 24 Mar 2024 16:18:04 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> Per the CXL 3.1 specification software must check the Command Effects
> Log (CEL) to know if a device supports dynamic capacity (DC). If the
> device does support DC the specifics of the DC Regions (0-7) are read
> through the mailbox.
>
> Flag DC Device (DCD) commands in a device if they are supported.
> Subsequent patches will key off these bits to configure DCD.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
This aligns with similar existing code etc and the opcodes are right so
There is a vague argument that maybe we shouldn't claim to support the
command in the debug message at that point in the patch set, but I don't
think we really care either way as expectation is this set should go
in all together.

Reviewed-by: Jonathan Cameron <[email protected]>


2024-03-25 18:13:14

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 04/26] cxl/region: Add dynamic capacity decoder and region modes

On Sun, 24 Mar 2024 16:18:07 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> Region mode must reflect a general dynamic capacity type which is
> associated with a specific Dynamic Capacity (DC) partitions in each
> device decoder within the region. DC partitions are also know as DC
> regions per CXL 3.1.
>
> Decoder mode reflects a specific DC partition.
>
> Define the new modes to use in subsequent patches and the helper
> functions required to make the association between these new modes.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Jonathan Cameron <[email protected]>


2024-03-25 18:17:15

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

On Sun, 24 Mar 2024 16:18:08 -0700
Ira Weiny <[email protected]> wrote:

> cxl_dpa_set_mode() checks the mode for validity two times, once outside
> of the DPA RW semaphore and again within. The function is not in a
> critical path. Prior to Dynamic Capacity the extra check was not much
> of an issue. The addition of DC modes increases the complexity of
> the check.
>
> Simplify the mode check before adding the more complex DC modes.
>
> Signed-off-by: Ira Weiny <[email protected]>

Nice. Maybe drag this earlier in series so it could potentially be
picked up as a cleanup? Same with patch 2 potentially.
If Dave is fine with doing that sort of precursor patches going
earlier, it will save carrying quite so many in this series for
future versions (and make it look less terrifying :)

Reviewed-by: Jonathan Cameron <[email protected]>


2024-03-25 19:00:02

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 15/26] range: Add range_overlaps()

On Sun, Mar 24, 2024 at 04:18:18PM -0700, Ira Weiny wrote:
> Code to support CXL Dynamic Capacity devices will have extent ranges
> which need to be compared for intersection not a subset as is being
> checked in range_contains().
>
> range_overlaps() is defined in btrfs with a different meaning from what
> is required in the standard range code. Dan Williams pointed this out
> in [1]. Adjust the btrfs call according to his suggestion there.
>
> Then add a generic range_overlaps().
>
> Cc: Dan Williams <[email protected]>
> Cc: Chris Mason <[email protected]>
> Cc: Josef Bacik <[email protected]>
> Cc: David Sterba <[email protected]>
> Cc: [email protected]
> Signed-off-by: Ira Weiny <[email protected]>
>
> [1] https://lore.kernel.org/all/[email protected]/
> ---

> fs/btrfs/ordered-data.c | 10 +++++-----

Acked-by: David Sterba <[email protected]>

2024-03-25 20:09:00

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

On Sun, 24 Mar 2024 16:18:05 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> Until now region modes and decoder modes were equivalent in that they
> were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
> regions (which will represent an array of device regions [better named
> partitions] the index of which could be different on different
> interleaved devices), the mode of an endpoint decoder and a region will
> no longer be equivalent.
>
> Define a new region mode enumeration and adjust the code for it.
>
> Suggested-by: Jonathan Cameron <[email protected]>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

I can't really remember the reasoning behind this split, but from a fresh
read it seems reasonable. Some trivial comments inline.

Jonathan

>
> ---
> Changes for v1
> <none>
> ---
> drivers/cxl/core/region.c | 77 +++++++++++++++++++++++++++++++++++------------
> drivers/cxl/cxl.h | 26 ++++++++++++++--
> 2 files changed, 81 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 4c7fd2d5cccb..1723d17f121e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c


> @@ -2800,6 +2814,24 @@ static int match_region_by_range(struct device *dev, void *data)
> return rc;
> }
>
> +static enum cxl_region_mode
> +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> +{
> + switch (mode) {
> + case CXL_DECODER_NONE:
> + return CXL_REGION_NONE;
> + case CXL_DECODER_RAM:
> + return CXL_REGION_RAM;
> + case CXL_DECODER_PMEM:
> + return CXL_REGION_PMEM;
> + case CXL_DECODER_MIXED:
> + default:
> + return CXL_REGION_MIXED;
> + }
> +

Dead code.

> + return CXL_REGION_MIXED;
> +}
> +
> /* Establish an empty region covering the given HPA range */
> static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder *cxled)
> @@ -2808,12 +2840,17 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> struct cxl_port *port = cxlrd_to_port(cxlrd);
> struct range *hpa = &cxled->cxld.hpa_range;
> struct cxl_region_params *p;
> + enum cxl_region_mode mode;
> struct cxl_region *cxlr;
> struct resource *res;
> int rc;
>
> + if (cxled->mode == CXL_DECODER_DEAD)
> + return ERR_PTR(-EINVAL);

Not a bad thing necessarily, but why do we now need this and didn't before?

> +
> + mode = cxl_decoder_to_region_mode(cxled->mode);
> do {
> - cxlr = __create_region(cxlrd, cxled->mode,
> + cxlr = __create_region(cxlrd, mode,
> atomic_read(&cxlrd->region_id));
> } while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);


> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 003feebab79b..9a0cce1e6fca 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h


> /*
> * Track whether this decoder is reserved for region autodiscovery, or
> * free for userspace provisioning.
> @@ -511,7 +532,8 @@ struct cxl_region_params {
> * struct cxl_region - CXL region
> * @dev: This region's device
> * @id: This region's id. Id is globally unique across all regions
> - * @mode: Endpoint decoder allocation / access mode
> + * @mode: Region mode which defines which endpoint decoder mode the region is
mode or potentially modes?

If region is mixed, I guess that means endpoint could be pmem or ram in theory?
Don't think anyone has implemented anything yet, but is the potential there?


> + * compatible with


2024-03-25 21:24:43

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 15/26] range: Add range_overlaps()

On Sun, 24 Mar 2024, Ira Weiny wrote:

>Code to support CXL Dynamic Capacity devices will have extent ranges
>which need to be compared for intersection not a subset as is being
>checked in range_contains().
>
>range_overlaps() is defined in btrfs with a different meaning from what
>is required in the standard range code. Dan Williams pointed this out
>in [1]. Adjust the btrfs call according to his suggestion there.
>
>Then add a generic range_overlaps().

Reviewed-by: Davidlohr Bueso <[email protected]>


2024-03-25 21:39:02

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

On Sun, 24 Mar 2024, Ira Weiny wrote:

>cxl_dpa_set_mode() checks the mode for validity two times, once outside
>of the DPA RW semaphore and again within. The function is not in a
>critical path. Prior to Dynamic Capacity the extra check was not much
>of an issue. The addition of DC modes increases the complexity of
>the check.

I agree (also to pick this up regardless of dcd work).

>
>Simplify the mode check before adding the more complex DC modes.

Reviewed-by: Davidlohr Bueso <[email protected]>

>Signed-off-by: Ira Weiny <[email protected]>
>
>---
>Changes for v1:
>[iweiny: new patch]
>[Jonathan: based on getting rid of the loop in cxl_dpa_set_mode]
>[Jonathan: standardize on resource_size() == 0]
>---
> drivers/cxl/core/hdm.c | 45 ++++++++++++++++++---------------------------
> 1 file changed, 18 insertions(+), 27 deletions(-)
>
>diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
>index 7d97790b893d..66b8419fd0c3 100644
>--- a/drivers/cxl/core/hdm.c
>+++ b/drivers/cxl/core/hdm.c
>@@ -411,44 +411,35 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &cxled->cxld.dev;
>- int rc;
>
>+ guard(rwsem_write)(&cxl_dpa_rwsem);
>+ if (cxled->cxld.flags & CXL_DECODER_F_ENABLE)
>+ return -EBUSY;
>+
>+ /*
>+ * Check that the mode is supported by the current partition
>+ * configuration
>+ */
> switch (mode) {
> case CXL_DECODER_RAM:
>+ if (!resource_size(&cxlds->ram_res)) {
>+ dev_dbg(dev, "no available ram capacity\n");
>+ return -ENXIO;
>+ }
>+ break;
> case CXL_DECODER_PMEM:
>+ if (!resource_size(&cxlds->pmem_res)) {
>+ dev_dbg(dev, "no available pmem capacity\n");
>+ return -ENXIO;
>+ }
> break;
> default:
> dev_dbg(dev, "unsupported mode: %d\n", mode);
> return -EINVAL;
> }
>
>- down_write(&cxl_dpa_rwsem);
>- if (cxled->cxld.flags & CXL_DECODER_F_ENABLE) {
>- rc = -EBUSY;
>- goto out;
>- }
>-
>- /*
>- * Only allow modes that are supported by the current partition
>- * configuration
>- */
>- if (mode == CXL_DECODER_PMEM && !resource_size(&cxlds->pmem_res)) {
>- dev_dbg(dev, "no available pmem capacity\n");
>- rc = -ENXIO;
>- goto out;
>- }
>- if (mode == CXL_DECODER_RAM && !resource_size(&cxlds->ram_res)) {
>- dev_dbg(dev, "no available ram capacity\n");
>- rc = -ENXIO;
>- goto out;
>- }
>-
> cxled->mode = mode;
>- rc = 0;
>-out:
>- up_write(&cxl_dpa_rwsem);
>-
>- return rc;
>+ return 0;
> }
>
> int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>
>--
>2.44.0
>

2024-03-25 22:17:00

by fan

[permalink] [raw]
Subject: Re: [PATCH 01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)

On Sun, Mar 24, 2024 at 04:18:04PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Per the CXL 3.1 specification software must check the Command Effects
> Log (CEL) to know if a device supports dynamic capacity (DC). If the
> device does support DC the specifics of the DC Regions (0-7) are read
> through the mailbox.
>
> Flag DC Device (DCD) commands in a device if they are supported.
> Subsequent patches will key off these bits to configure DCD.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
> ---

Reviewed-by: Fan Ni <[email protected]>

> Changes for v1
> [iweiny: update to latest master]
> [iweiny: update commit message]
> [iweiny: Based on the fix:
> https://lore.kernel.org/all/[email protected]/
> [jonathan: remove unneeded format change]
> [jonathan: don't split security code in mbox.c]
> ---
> drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
> drivers/cxl/cxlmem.h | 15 +++++++++++++++
> 2 files changed, 48 insertions(+)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9adda4795eb7..ed4131c6f50b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -161,6 +161,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
> }
> }
>
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> + return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> + u16 opcode)
> +{
> + switch (opcode) {
> + case CXL_MBOX_OP_GET_DC_CONFIG:
> + set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> + set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_ADD_DC_RESPONSE:
> + set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_RELEASE_DC:
> + set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> + break;
> + default:
> + break;
> + }
> +}
> +
> static bool cxl_is_poison_command(u16 opcode)
> {
> #define CXL_MBOX_OP_POISON_CMDS 0x43
> @@ -733,6 +761,11 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> enabled++;
> }
>
> + if (cxl_is_dcd_command(opcode)) {
> + cxl_set_dcd_cmd_enabled(mds, opcode);
> + enabled++;
> + }
> +
> dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
> enabled ? "enabled" : "unsupported by driver");
> }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 20fb3b35e89e..79a67cff9143 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -238,6 +238,15 @@ struct cxl_event_state {
> struct mutex log_lock;
> };
>
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> + CXL_DCD_ENABLED_GET_CONFIG,
> + CXL_DCD_ENABLED_GET_EXTENT_LIST,
> + CXL_DCD_ENABLED_ADD_RESPONSE,
> + CXL_DCD_ENABLED_RELEASE,
> + CXL_DCD_ENABLED_MAX
> +};
> +
> /* Device enabled poison commands */
> enum poison_cmd_enabled_bits {
> CXL_POISON_ENABLED_LIST,
> @@ -454,6 +463,7 @@ struct cxl_dev_state {
> * (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> * @mbox_mutex: Mutex to synchronize mailbox access.
> * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> * @total_bytes: sum of all possible capacities
> @@ -481,6 +491,7 @@ struct cxl_memdev_state {
> size_t lsa_size;
> struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> char firmware_version[0x10];
> + DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> u64 total_bytes;
> @@ -551,6 +562,10 @@ enum cxl_opcode {
> CXL_MBOX_OP_UNLOCK = 0x4503,
> CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
> CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
> + CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
> + CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
> + CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
> + CXL_MBOX_OP_RELEASE_DC = 0x4803,
> CXL_MBOX_OP_MAX = 0x10000
> };
>
>
> --
> 2.44.0
>

2024-03-25 22:36:11

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

On Sun, 24 Mar 2024 16:18:06 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> Devices can optionally support Dynamic Capacity (DC). These devices are
> known as Dynamic Capacity Devices (DCD).
>
> Implement the DC mailbox commands as specified in CXL 3.1 section
> 8.2.9.9.9 (opcodes 48XXh). Read the DC configuration and store the DC
> region information in the device state.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

A few minor things inline,

Jonathan

>
> ---
> Changes for v1
> [J?rgen: ensure CXL 2.0 device support by removing dc_event_log_size]
> [iweiny/J?rgen: use get DC config command to signal DCD support]
> [djiang: fix subject]
> [Fan: add additional region configuration checks]
> [Jonathan/djiang: split out region mode changes]
> [Jonathan: fix up comments/kdoc]
> [Jonathan: s/cxl_get_dc_id/cxl_get_dc_config/]
> [Jonathan: use __free() in identify call]
> [Jonathan: remove unneeded formatting changes]
> [Jonathan: s/cxl_mbox_dynamic_capacity/cxl_mbox_get_dc_config_out/]
> [Jonathan: s/cxl_mbox_get_dc_config/cxl_mbox_get_dc_config_in/]
> [iweiny: remove type2 work dependancy/rebase on master]
> [iweiny: fix 0day build issues]
> ---
> drivers/cxl/core/mbox.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxlmem.h | 49 +++++++++++++
> drivers/cxl/pci.c | 4 ++
> 3 files changed, 236 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index ed4131c6f50b..14e8a7528a8b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1123,7 +1123,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
> if (rc < 0)
> return rc;
>
> - mds->total_bytes =
> + mds->static_cap =
> le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
> mds->volatile_only_bytes =
> le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1230,6 +1230,175 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> return rc;
> }
>
> +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
> + struct cxl_dc_region_config *region_config)
> +{
> + struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> + struct device *dev = mds->cxlds.dev;
> +
> + dcr->base = le64_to_cpu(region_config->region_base);
> + dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> + dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> + dcr->len = le64_to_cpu(region_config->region_length);
> + dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> + dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> + dcr->flags = region_config->flags;
> + snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> +
> + /* Check regions are in increasing DPA order */
> + if (index > 0) {
> + struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> +
> + if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> + dev_err(dev,
> + "DPA ordering violation for DC region %d and %d\n",
> + index - 1, index);
> + return -EINVAL;
> + }
> + }
> +
> + if (!IS_ALIGNED(dcr->base, SZ_256M) ||
> + !IS_ALIGNED(dcr->base, dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n", index,

Odd choice of line wrap. I'd drag index onto the line below.

> + dcr->base, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
> + !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
> + index, dcr->decode_len, dcr->len, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + if (dcr->blk_size == 0 || dcr->blk_size % 0x40 ||

Hmm. I thought we had a define for CXL 'cacheline' size, but can't find it now.
If not we should add one (and find a better name than that).

> + !is_power_of_2(dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid block size; %#llx\n",
> + index, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + dev_dbg(dev,
> + "DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> + dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +
> + return 0;
> +}
> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> + struct cxl_mbox_get_dc_config_out *dc_resp,
> + size_t dc_resp_size)
> +{
> + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> + .region_count = CXL_MAX_DC_REGION,
> + .start_region_index = start_region,
> + };
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> + .payload_in = &get_dc,
> + .size_in = sizeof(get_dc),
> + .size_out = dc_resp_size,
> + .payload_out = dc_resp,
> + .min_out = 1,
> + };
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + rc = dc_resp->avail_region_count - start_region;
> +
> + /*
> + * The number of regions in the payload may have been truncated due to
> + * payload_size limits; if so adjust the returned count to match.
> + */
> + if (mbox_cmd.size_out < sizeof(*dc_resp))
> + rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);

Why not always return this? If there was space, doesn't it equal
the value set above anyway?

> +
> + dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> +
> + return rc;
> +}

> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + *
> + * Read Dynamic Capacity information from the device and populate the state
> + * structures for later use.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> + size_t dc_resp_size = mds->payload_size;
> + struct device *dev = mds->cxlds.dev;
> + u8 start_region, i;
> + int rc = 0;

Is this used before being set?

> +
> + for (i = 0; i < CXL_MAX_DC_REGION; i++)
> + snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> + /* Check GET_DC_CONFIG is supported by device */
> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD not supported\n");
> + return 0;
> + }
> +
> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> + kvmalloc(dc_resp_size, GFP_KERNEL);
> + if (!dc_resp)
> + return -ENOMEM;
> +
> + start_region = 0;
> + do {
> + int j;
> +
> + rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> + if (rc < 0) {
> + dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> + return rc;
> + }
> +
> + mds->nr_dc_region += rc;
> +
> + if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> + dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> + mds->nr_dc_region);
> + return -EINVAL;
> + }
> +
> + for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> + rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> + if (rc) {
> + dev_dbg(dev, "Failed to save region info: %d\n", rc);
> + return rc;
> + }
> + }
> +
> + start_region = mds->nr_dc_region;
> +
> + } while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> + mds->dynamic_cap =
> + mds->dc_region[mds->nr_dc_region - 1].base +
> + mds->dc_region[mds->nr_dc_region - 1].decode_len -
> + mds->dc_region[0].base;
> + dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);



> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 79a67cff9143..4624cf612c1e 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h

> /**
> * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> *
> @@ -467,6 +482,8 @@ struct cxl_dev_state {
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of static RAM and PMEM capacities
> + * @dynamic_cap: Complete DPA range occupied by DC regions
> * @volatile_only_bytes: hard volatile capacity
> * @persistent_only_bytes: hard persistent capacity
> * @partition_align_bytes: alignment size for partition-able capacity
> @@ -474,6 +491,8 @@ struct cxl_dev_state {
> * @active_persistent_bytes: sum of hard + soft persistent
> * @next_volatile_bytes: volatile capacity change pending device reset
> * @next_persistent_bytes: persistent capacity change pending device reset

Looks like we have some ordering issues ram_perf and pmem_perf (at least)
that we should fix up as a precursor. I sent a reply to the QoS patch
that added these.

> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> * @event: event log driver state
> * @poison: poison driver state info
> * @security: security driver state info
> @@ -494,7 +513,10 @@ struct cxl_memdev_state {
> DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
Trivial but this is an unrelated change and shouldn't be in this patch.

> u64 total_bytes;
> + u64 static_cap;
> + u64 dynamic_cap;
> u64 volatile_only_bytes;
> u64 persistent_only_bytes;
> u64 partition_align_bytes;
> @@ -506,6 +528,9 @@ struct cxl_memdev_state {
> struct cxl_dpa_perf ram_perf;
> struct cxl_dpa_perf pmem_perf;
>
> + u8 nr_dc_region;
> + struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +
> struct cxl_event_state event;
> struct cxl_poison_state poison;
> struct cxl_security_state security;

> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_get_dc_config_out {
> + u8 avail_region_count;
> + u8 rsvd[7];
> + struct cxl_dc_region_config {
> + __le64 region_base;
> + __le64 region_decode_length;
> + __le64 region_length;
> + __le64 region_block_size;
> + __le32 region_dsmad_handle;
> + u8 flags;
> + u8 rsvd[3];
> + } __packed region[];
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_REGIONS_RETURNED(size_out) \
> + ((size_out - 8) / sizeof(struct cxl_dc_region_config))

Can we make that 8 self documenting?
offsetof(struct cxl_dc_region_config, region) perhaps?

> +


2024-03-25 23:28:59

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)

On Sun, 24 Mar 2024, [email protected] wrote:

>From: Navneet Singh <[email protected]>
>
>Per the CXL 3.1 specification software must check the Command Effects
>Log (CEL) to know if a device supports dynamic capacity (DC). If the
>device does support DC the specifics of the DC Regions (0-7) are read
>through the mailbox.

I vote to fold this into patch 3, favoring reduced patch count in the
series to trvially enlarging that particular patch.

>Flag DC Device (DCD) commands in a device if they are supported.
>Subsequent patches will key off these bits to configure DCD.

It would be good to mention these here explicitly (if this patch will
live on). For example, that config will be the driver's way of telling
if dcd is enabled or disabled - we could have cases of that zeroed bit
but the rest enabled.

lgtm otherwise.

Reviewed-by: Davidlohr Bueso <[email protected]>

>Signed-off-by: Navneet Singh <[email protected]>
>Co-developed-by: Ira Weiny <[email protected]>
>Signed-off-by: Ira Weiny <[email protected]>
>---
>Changes for v1
>[iweiny: update to latest master]
>[iweiny: update commit message]
>[iweiny: Based on the fix:
> https://lore.kernel.org/all/[email protected]/
>[jonathan: remove unneeded format change]
>[jonathan: don't split security code in mbox.c]
>---
> drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
> drivers/cxl/cxlmem.h | 15 +++++++++++++++
> 2 files changed, 48 insertions(+)
>
>diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
>index 9adda4795eb7..ed4131c6f50b 100644
>--- a/drivers/cxl/core/mbox.c
>+++ b/drivers/cxl/core/mbox.c
>@@ -161,6 +161,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
> }
> }
>
>+static bool cxl_is_dcd_command(u16 opcode)
>+{
>+#define CXL_MBOX_OP_DCD_CMDS 0x48
>+
>+ return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
>+}
>+
>+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
>+ u16 opcode)
>+{
>+ switch (opcode) {
>+ case CXL_MBOX_OP_GET_DC_CONFIG:
>+ set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
>+ break;
>+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
>+ set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
>+ break;
>+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
>+ set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
>+ break;
>+ case CXL_MBOX_OP_RELEASE_DC:
>+ set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
>+ break;
>+ default:
>+ break;
>+ }
>+}
>+
> static bool cxl_is_poison_command(u16 opcode)
> {
> #define CXL_MBOX_OP_POISON_CMDS 0x43
>@@ -733,6 +761,11 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> enabled++;
> }
>
>+ if (cxl_is_dcd_command(opcode)) {
>+ cxl_set_dcd_cmd_enabled(mds, opcode);
>+ enabled++;
>+ }
>+
> dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
> enabled ? "enabled" : "unsupported by driver");
> }
>diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>index 20fb3b35e89e..79a67cff9143 100644
>--- a/drivers/cxl/cxlmem.h
>+++ b/drivers/cxl/cxlmem.h
>@@ -238,6 +238,15 @@ struct cxl_event_state {
> struct mutex log_lock;
> };
>
>+/* Device enabled DCD commands */
>+enum dcd_cmd_enabled_bits {
>+ CXL_DCD_ENABLED_GET_CONFIG,
>+ CXL_DCD_ENABLED_GET_EXTENT_LIST,
>+ CXL_DCD_ENABLED_ADD_RESPONSE,
>+ CXL_DCD_ENABLED_RELEASE,
>+ CXL_DCD_ENABLED_MAX
>+};
>+
> /* Device enabled poison commands */
> enum poison_cmd_enabled_bits {
> CXL_POISON_ENABLED_LIST,
>@@ -454,6 +463,7 @@ struct cxl_dev_state {
> * (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> * @mbox_mutex: Mutex to synchronize mailbox access.
> * @firmware_version: Firmware version for the memory device.
>+ * @dcd_cmds: List of DCD commands implemented by memory device
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> * @total_bytes: sum of all possible capacities
>@@ -481,6 +491,7 @@ struct cxl_memdev_state {
> size_t lsa_size;
> struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> char firmware_version[0x10];
>+ DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> u64 total_bytes;
>@@ -551,6 +562,10 @@ enum cxl_opcode {
> CXL_MBOX_OP_UNLOCK = 0x4503,
> CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
> CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
>+ CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
>+ CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
>+ CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
>+ CXL_MBOX_OP_RELEASE_DC = 0x4803,
> CXL_MBOX_OP_MAX = 0x10000
> };
>
>
>--
>2.44.0
>

2024-03-25 23:34:40

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

On Sun, 24 Mar 2024, [email protected] wrote:

>From: Navneet Singh <[email protected]>
>
>Until now region modes and decoder modes were equivalent in that they
>were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
>regions (which will represent an array of device regions [better named
>partitions] the index of which could be different on different
>interleaved devices), the mode of an endpoint decoder and a region will
>no longer be equivalent.
>
>Define a new region mode enumeration and adjust the code for it.

Could this could also be picked up regardless of dcd?

Thanks,
Davidlohr

2024-03-25 23:36:22

by fan

[permalink] [raw]
Subject: Re: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

On Sun, Mar 24, 2024 at 04:18:06PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Devices can optionally support Dynamic Capacity (DC). These devices are
> known as Dynamic Capacity Devices (DCD).
>
> Implement the DC mailbox commands as specified in CXL 3.1 section
> 8.2.9.9.9 (opcodes 48XXh). Read the DC configuration and store the DC
> region information in the device state.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [J?rgen: ensure CXL 2.0 device support by removing dc_event_log_size]
> [iweiny/J?rgen: use get DC config command to signal DCD support]
> [djiang: fix subject]
> [Fan: add additional region configuration checks]
> [Jonathan/djiang: split out region mode changes]
> [Jonathan: fix up comments/kdoc]
> [Jonathan: s/cxl_get_dc_id/cxl_get_dc_config/]
> [Jonathan: use __free() in identify call]
> [Jonathan: remove unneeded formatting changes]
> [Jonathan: s/cxl_mbox_dynamic_capacity/cxl_mbox_get_dc_config_out/]
> [Jonathan: s/cxl_mbox_get_dc_config/cxl_mbox_get_dc_config_in/]
> [iweiny: remove type2 work dependancy/rebase on master]
> [iweiny: fix 0day build issues]
> ---
> drivers/cxl/core/mbox.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxlmem.h | 49 +++++++++++++
> drivers/cxl/pci.c | 4 ++
> 3 files changed, 236 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index ed4131c6f50b..14e8a7528a8b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1123,7 +1123,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
> if (rc < 0)
> return rc;
>
> - mds->total_bytes =
> + mds->static_cap =
> le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
> mds->volatile_only_bytes =
> le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1230,6 +1230,175 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> return rc;
> }
>
> +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
> + struct cxl_dc_region_config *region_config)
> +{
> + struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> + struct device *dev = mds->cxlds.dev;
> +
> + dcr->base = le64_to_cpu(region_config->region_base);
> + dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> + dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> + dcr->len = le64_to_cpu(region_config->region_length);
> + dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> + dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> + dcr->flags = region_config->flags;
> + snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> +
> + /* Check regions are in increasing DPA order */
> + if (index > 0) {
> + struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> +
> + if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> + dev_err(dev,
> + "DPA ordering violation for DC region %d and %d\n",
> + index - 1, index);
> + return -EINVAL;
> + }
> + }
> +
> + if (!IS_ALIGNED(dcr->base, SZ_256M) ||
> + !IS_ALIGNED(dcr->base, dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n", index,
> + dcr->base, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
> + !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
> + index, dcr->decode_len, dcr->len, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + if (dcr->blk_size == 0 || dcr->blk_size % 0x40 ||
> + !is_power_of_2(dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid block size; %#llx\n",
> + index, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + dev_dbg(dev,
> + "DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> + dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +
> + return 0;
> +}
> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> + struct cxl_mbox_get_dc_config_out *dc_resp,
> + size_t dc_resp_size)
> +{
> + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> + .region_count = CXL_MAX_DC_REGION,
> + .start_region_index = start_region,
> + };
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> + .payload_in = &get_dc,
> + .size_in = sizeof(get_dc),
> + .size_out = dc_resp_size,
> + .payload_out = dc_resp,
> + .min_out = 1,
> + };
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + rc = dc_resp->avail_region_count - start_region;
> +
> + /*
> + * The number of regions in the payload may have been truncated due to
> + * payload_size limits; if so adjust the returned count to match.
> + */
> + if (mbox_cmd.size_out < sizeof(*dc_resp))
> + rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
> +
> + dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> +
> + return rc;
> +}
> +
> +static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> + return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + *
> + * Read Dynamic Capacity information from the device and populate the state
> + * structures for later use.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> + size_t dc_resp_size = mds->payload_size;
> + struct device *dev = mds->cxlds.dev;
> + u8 start_region, i;
> + int rc = 0;
> +
> + for (i = 0; i < CXL_MAX_DC_REGION; i++)
> + snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> + /* Check GET_DC_CONFIG is supported by device */
> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD not supported\n");
> + return 0;
> + }
> +
> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> + kvmalloc(dc_resp_size, GFP_KERNEL);
> + if (!dc_resp)
> + return -ENOMEM;
> +
> + start_region = 0;
> + do {
> + int j;
> +
> + rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> + if (rc < 0) {
> + dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> + return rc;
> + }
> +
> + mds->nr_dc_region += rc;
> +
> + if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> + dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> + mds->nr_dc_region);
> + return -EINVAL;
> + }
> +
> + for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> + rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> + if (rc) {
> + dev_dbg(dev, "Failed to save region info: %d\n", rc);
> + return rc;
> + }
> + }
> +
> + start_region = mds->nr_dc_region;
> +
> + } while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> + mds->dynamic_cap =
> + mds->dc_region[mds->nr_dc_region - 1].base +
> + mds->dc_region[mds->nr_dc_region - 1].decode_len -
> + mds->dc_region[0].base;
> + dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
> static int add_dpa_res(struct device *dev, struct resource *parent,
> struct resource *res, resource_size_t start,
> resource_size_t size, const char *type)
> @@ -1260,8 +1429,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> {
> struct cxl_dev_state *cxlds = &mds->cxlds;
> struct device *dev = cxlds->dev;
> + size_t untenanted_mem;
> int rc;
>
> + untenanted_mem = mds->dc_region[0].base - mds->static_cap;
> + mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
> +
> if (!cxlds->media_ready) {
> cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> @@ -1271,6 +1444,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>
> cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
>
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> + dcr->base, dcr->decode_len, dcr->name);
> + if (rc)
> + return rc;
> + }
> +
> if (mds->partition_align_bytes == 0) {
> rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> mds->volatile_only_bytes, "ram");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 79a67cff9143..4624cf612c1e 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -402,6 +402,7 @@ enum cxl_devtype {
> CXL_DEVTYPE_CLASSMEM,
> };
>
> +#define CXL_MAX_DC_REGION 8
> /**
> * struct cxl_dpa_perf - DPA performance property entry
> * @dpa_range - range for DPA address
> @@ -431,6 +432,8 @@ struct cxl_dpa_perf {
> * @dpa_res: Overall DPA resource tree for the device
> * @pmem_res: Active Persistent memory capacity configuration
> * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + * region
> * @serial: PCIe Device Serial Number
> * @type: Generic Memory Class device or Vendor Specific Memory device
> */
> @@ -445,10 +448,22 @@ struct cxl_dev_state {
> struct resource dpa_res;
> struct resource pmem_res;
> struct resource ram_res;
> + struct resource dc_res[CXL_MAX_DC_REGION];
> u64 serial;
> enum cxl_devtype type;
> };
>
> +#define CXL_DC_REGION_STRLEN 8
> +struct cxl_dc_region_info {
> + u64 base;
> + u64 decode_len;
> + u64 len;
> + u64 blk_size;
> + u32 dsmad_handle;
> + u8 flags;
> + u8 name[CXL_DC_REGION_STRLEN];
> +};
> +
> /**
> * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> *
> @@ -467,6 +482,8 @@ struct cxl_dev_state {
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of static RAM and PMEM capacities
> + * @dynamic_cap: Complete DPA range occupied by DC regions
> * @volatile_only_bytes: hard volatile capacity
> * @persistent_only_bytes: hard persistent capacity
> * @partition_align_bytes: alignment size for partition-able capacity
> @@ -474,6 +491,8 @@ struct cxl_dev_state {
> * @active_persistent_bytes: sum of hard + soft persistent
> * @next_volatile_bytes: volatile capacity change pending device reset
> * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> * @event: event log driver state
> * @poison: poison driver state info
> * @security: security driver state info
> @@ -494,7 +513,10 @@ struct cxl_memdev_state {
> DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
> u64 total_bytes;
> + u64 static_cap;
> + u64 dynamic_cap;
> u64 volatile_only_bytes;
> u64 persistent_only_bytes;
> u64 partition_align_bytes;
> @@ -506,6 +528,9 @@ struct cxl_memdev_state {
> struct cxl_dpa_perf ram_perf;
> struct cxl_dpa_perf pmem_perf;
>
> + u8 nr_dc_region;
> + struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +
> struct cxl_event_state event;
> struct cxl_poison_state poison;
> struct cxl_security_state security;
> @@ -705,6 +730,29 @@ struct cxl_mbox_set_partition_info {
>
> #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
>
> +struct cxl_mbox_get_dc_config_in {
> + u8 region_count;
> + u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_get_dc_config_out {
> + u8 avail_region_count;
> + u8 rsvd[7];
> + struct cxl_dc_region_config {
> + __le64 region_base;
> + __le64 region_decode_length;
> + __le64 region_length;
> + __le64 region_block_size;
> + __le32 region_dsmad_handle;
> + u8 flags;
> + u8 rsvd[3];
> + } __packed region[];
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_REGIONS_RETURNED(size_out) \
> + ((size_out - 8) / sizeof(struct cxl_dc_region_config))

Although the result may be unchanged, but in cxl spec r3.1, there are four
fields after the region configuration structure.

Fan

> +
> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> struct cxl_mbox_set_timestamp_in {
> __le64 timestamp;
> @@ -828,6 +876,7 @@ enum {
> int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 2ff361e756d6..216881455364 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -874,6 +874,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> + rc = cxl_dev_dynamic_capacity_identify(mds);
> + if (rc)
> + return rc;
> +
> rc = cxl_mem_create_range_info(mds);
> if (rc)
> return rc;
>
> --
> 2.44.0
>

2024-03-26 00:45:30

by fan

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

On Sun, Mar 24, 2024 at 04:18:03PM -0700, [email protected] wrote:
> A git tree of this series can be found here:
>
> https://github.com/weiny2/linux-kernel/tree/dcd-2024-03-24
>
> Pre-requisite:
> ==============
>
> The locking introduced by Vishal for DAX regions:
> https://lore.kernel.org/all/[email protected]/T/#u
>
> Background
> ==========
>
> A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> device that allows the memory capacity to change dynamically, without
> the need for resetting the device, reconfiguring HDM decoders, or
> reconfiguring software DAX regions.
>
> One of the biggest use cases for Dynamic Capacity is to allow hosts to
> share memory dynamically within a data center without increasing the
> per-host attached memory.
>
> The general flow for the addition or removal of memory is to have an
> orchestrator coordinate the use of the memory. Generally there are 5
> actors in such a system, the Orchestrator, Fabric Manager, the Device
> the host sees, the Host Kernel, and a Host User.
>
> Typical work flows are shown below.
>
> Orchestrator FM Device Host Kernel Host User
>
> | | | | |
> |-------------- Create region ----------------------->|
> | | | | |
> | | | |<-- Create ---|
> | | | | Region |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create --->|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> | | | | |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create ----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | |<- Create ----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> |-- Remove -->|- Release->|- Release ->| | |
> | Capacity | Extent | Extent | | |
> | | | | | |
> | | | (Release Ignored) | |
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> |<------------- Signal done --------------------------|
> | | | | |
> | |- Release->|- Release ->| |
> | | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | |<- Destroy ---|
> | | | | Region |
> | | | | |
>
> Previous RFCs of this series[0] resulted in significant architectural
> comments. Previous versions allowed memory capacity to be accepted by
> the host regardless of the existence of a software region being mapped.
>
> With this new patch set the order of the create region and DAX device
> creation must be synchronized with the Orchestrator adding/removing
> capacity. The host kernel will reject an add extent event if the region
> is not created yet. It will also ignore a release if the DAX device is
> created and referencing an extent.
>
> Neither of these synchronizations are anticipated to be an issue with
> real applications.
>
> In order to allow for capacity to be added and removed a new concept of
> a sparse DAX region is introduced. A sparse DAX region may have 0 or
> more bytes of available space. The total space depends on the number
> and size of the extents which have been added.
>
> Initially it is anticipated that users of the memory will carefully
> coordinate the surfacing of additional capacity with the creation of DAX
> devices which use that capacity. Therefore, the allocation of the
> memory to DAX devices does not allow for specific associations between
> DAX device and extent. This keeps allocations very similar to existing
> DAX region behavior.
>
> Great care was taken to greatly simplify extent tracking. Specifically,
> in comparison to previous versions of the patch set, all extent tracking
> xarrays have been eliminated from the code. In addition, most of the
> extra software objects and associated referenced counts have been
> eliminated.
>
> In this version, extents are tracked purely as sub-devices of the
> region. This ensures that the region destruction cleans up all extent
> allocations properly. Device managed callbacks are wired to ensure any
> additional data required for DAX device references are handled
> correctly.
>
> Due to these major changes I'm setting this new series to V1.
>
> In summary the major functionality of this series includes:
>
> - Getting the dynamic capacity (DC) configuration information from cxl
> devices
>
> - Configuring the DC regions reported by hardware
>
> - Enhancing the CXL and DAX regions for dynamic capacity support
> a. Maintain a logical separation between hardware extents and
> software managed region extents. This provides an
> abstraction between the layers and should allow for
> interleaving in the future
>
> - Get hardware extent lists for endpoint decoders upon
> region creation.
>
> - Adjust extent/region memory available on the following events.
> a. Add capacity Events
> b. Release capacity events
>
> - Host response for add capacity
> a. do not accept the extent if:
> If the region does not exist
> or an error occurs realizing the extent
> B. If the region does exist
> realize a DAX region extent with 1:1 mapping (no
> interleave yet)
>
> - Host response for remove capacity
> a. If no DAX devices reference the extent release the extent
> b. If a reference does exist, ignore the request.
> (Require FM to issue release again.)
>
> - Modify DAX device creation/resize to account for extents within a
> sparse DAX region
>
> - Trace Dynamic Capacity events for debugging
>
> - Add cxl-test infrastructure to allow for faster unit testing
> (See new ndctl branch for cxl-dcd.sh test[1])
>
> Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
>
> Remaining work:
>
> 1) Integrate the QoS work from Dave Jiang
> 2) Interleave support
>
> Possible additional work depending on requirements:
>
> 1) Allow mapping to specific extents (perhaps based on
> label/tag)
> 2) Release extents when DAX devices are released if a release
> was previously seen from the device
> 3) Accept a new extent which extends (but overlaps) an existing
> extent(s)
>
> [0] RFC v2: https://lore.kernel.org/r/[email protected]
> [1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-03-22
> [2] https://lore.kernel.org/all/[email protected]/
>
> ---
> Changes for v1:
> - iweiny: Largely new series
> - iweiny: Remove review tags due to the series being a major rework
> - iweiny: Fix authorship for Navneet patches
> - iweiny: Remove extent xarrays
> - iweiny: Remove kreferences, replace with 1 use count protected under dax_rwsem
> - iweiny: Mark all sysfs entries for the 6.10 June 2024 kernel
> - iweiny: Remove gotos
> - iweiny: Fix 0day issues
> - Jonathan Cameron: address comments
> - Navneet Singh: address comments
> - Dan Williams: address comments
> - Dave Jiang: address comments
> - Fan Ni: address comments
> - J?rgen Hansen: address comments
> - Link to RFC v2: https://lore.kernel.org/r/[email protected]
>

Hi Ira,
Have not got a chance to check the code yet, but I noticed one thing
when testing with my DCD emulation code.
Currently, if we do partial release, it seems the whole extent will be
removed. Is it designed intentionally?

Fan

> ---
> Ira Weiny (12):
> cxl/core: Simplify cxl_dpa_set_mode()
> cxl/events: Factor out event msgnum configuration
> cxl/pci: Delay event buffer allocation
> cxl/pci: Factor out interrupt policy check
> range: Add range_overlaps()
> dax/bus: Factor out dev dax resize logic
> dax: Document dax dev range tuple
> dax/region: Prevent range mapping allocation on sparse regions
> dax/region: Support DAX device creation on sparse DAX regions
> tools/testing/cxl: Make event logs dynamic
> tools/testing/cxl: Add DC Regions to mock mem data
> tools/testing/cxl: Add Dynamic Capacity events
>
> Navneet Singh (14):
> cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
> cxl/core: Separate region mode from decoder mode
> cxl/mem: Read dynamic capacity configuration from the device
> cxl/region: Add dynamic capacity decoder and region modes
> cxl/port: Add Dynamic Capacity mode support to endpoint decoders
> cxl/port: Add dynamic capacity size support to endpoint decoders
> cxl/mem: Expose device dynamic capacity capabilities
> cxl/region: Add Dynamic Capacity CXL region support
> cxl/mem: Configure dynamic capacity interrupts
> cxl/region: Read existing extents on region creation
> cxl/extent: Realize extent devices
> dax/region: Create extent resources on DAX region driver load
> cxl/mem: Handle DCD add & release capacity events.
> cxl/mem: Trace Dynamic capacity Event Record
>
> Documentation/ABI/testing/sysfs-bus-cxl | 60 ++-
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/core.h | 10 +
> drivers/cxl/core/extent.c | 145 +++++
> drivers/cxl/core/hdm.c | 254 +++++++--
> drivers/cxl/core/mbox.c | 591 ++++++++++++++++++++-
> drivers/cxl/core/memdev.c | 76 +++
> drivers/cxl/core/port.c | 19 +
> drivers/cxl/core/region.c | 334 +++++++++++-
> drivers/cxl/core/trace.h | 65 +++
> drivers/cxl/cxl.h | 127 ++++-
> drivers/cxl/cxlmem.h | 114 ++++
> drivers/cxl/mem.c | 45 ++
> drivers/cxl/pci.c | 122 +++--
> drivers/dax/bus.c | 353 +++++++++---
> drivers/dax/bus.h | 4 +-
> drivers/dax/cxl.c | 127 ++++-
> drivers/dax/dax-private.h | 40 +-
> drivers/dax/hmem/hmem.c | 2 +-
> drivers/dax/pmem.c | 2 +-
> fs/btrfs/ordered-data.c | 10 +-
> include/linux/cxl-event.h | 31 ++
> include/linux/range.h | 7 +
> tools/testing/cxl/Kbuild | 1 +
> tools/testing/cxl/test/mem.c | 914 ++++++++++++++++++++++++++++----
> 25 files changed, 3152 insertions(+), 302 deletions(-)
> ---
> base-commit: dff54316795991e88a453a095a9322718a34034a
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
>
> Best regards,
> --
> Ira Weiny <[email protected]>
>

2024-03-26 01:32:08

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 11/26] cxl/pci: Delay event buffer allocation

On Sun, 24 Mar 2024, Ira Weiny wrote:

>The event buffer does not need to be allocated if something has failed
>in setting up event irq's.
>
>In prep for adjusting event configuration for DCD events move the buffer
>allocation to the end of the event configuration.

The above could be removed and just picked up independet of dcd.

Reviewed-by: Davidlohr Bueso <[email protected]>

>
>Signed-off-by: Ira Weiny <[email protected]>
>---
> drivers/cxl/pci.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
>diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>index cedd9b05f129..ccaf4ad26a4f 100644
>--- a/drivers/cxl/pci.c
>+++ b/drivers/cxl/pci.c
>@@ -756,10 +756,6 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> return 0;
> }
>
>- rc = cxl_mem_alloc_event_buf(mds);
>- if (rc)
>- return rc;
>-
> rc = cxl_event_get_int_policy(mds, &policy);
> if (rc)
> return rc;
>@@ -777,6 +773,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
>+ rc = cxl_mem_alloc_event_buf(mds);
>+ if (rc)
>+ return rc;
>+
> rc = cxl_event_irqsetup(mds, &policy);
> if (rc)
> return rc;
>
>--
>2.44.0
>

2024-03-26 01:35:58

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 08/26] cxl/mem: Expose device dynamic capacity capabilities

On Sun, 24 Mar 2024, [email protected] wrote:

>+What: /sys/bus/cxl/devices/memX/dc/region_count
>+Date: June, 2024
>+KernelVersion: v6.10
>+Contact: [email protected]
>+Description:
>+ (RO) Number of Dynamic Capacity (DC) regions supported on the
>+ device. May be 0 if the device does not support Dynamic
>+ Capacity.

If dcd is not supported then we should not have the dc/ directory
altogether.

Thanks,
Davidlohr

2024-03-26 12:51:38

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH 15/26] range: Add range_overlaps()

On 25.03.24 04:51, Ira Weiny wrote:
> Code to support CXL Dynamic Capacity devices will have extent ranges
> which need to be compared for intersection not a subset as is being
> checked in range_contains().
>
> range_overlaps() is defined in btrfs with a different meaning from what
> is required in the standard range code. Dan Williams pointed this out
> in [1]. Adjust the btrfs call according to his suggestion there.
>
> Then add a generic range_overlaps().
>
> Cc: Dan Williams <[email protected]>
> Cc: Chris Mason <[email protected]>
> Cc: Josef Bacik <[email protected]>
> Cc: David Sterba <[email protected]>
> Cc: [email protected]
> Signed-off-by: Ira Weiny <[email protected]>
>
> [1] https://lore.kernel.org/all/[email protected]/
> ---
> fs/btrfs/ordered-data.c | 10 +++++-----
> include/linux/range.h | 7 +++++++
> 2 files changed, 12 insertions(+), 5 deletions(-)

For fs/btrfs/ordered-data.c:
Reviewed-by: Johannes Thumshirn <[email protected]>

2024-03-26 16:17:42

by fan

[permalink] [raw]
Subject: Re: [PATCH 04/26] cxl/region: Add dynamic capacity decoder and region modes

On Sun, Mar 24, 2024 at 04:18:07PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Region mode must reflect a general dynamic capacity type which is
> associated with a specific Dynamic Capacity (DC) partitions in each
s/partitions/partition/

Otherwise,

Reviewed-by: Fan Ni <[email protected]>

> device decoder within the region. DC partitions are also know as DC
> regions per CXL 3.1.
>
> Decoder mode reflects a specific DC partition.
>
> Define the new modes to use in subsequent patches and the helper
> functions required to make the association between these new modes.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
> ---
> Changes for v1
> [iweiny: split out from: Add dynamic capacity cxl region support.]
> ---
> drivers/cxl/core/region.c | 4 ++++
> drivers/cxl/cxl.h | 23 +++++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 1723d17f121e..ec3b8c6948e9 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1690,6 +1690,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> return true;
> if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> return true;
> + if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
> + return true;
>
> return false;
> }
> @@ -2824,6 +2826,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> return CXL_REGION_RAM;
> case CXL_DECODER_PMEM:
> return CXL_REGION_PMEM;
> + case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> + return CXL_REGION_DC;
> case CXL_DECODER_MIXED:
> default:
> return CXL_REGION_MIXED;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 9a0cce1e6fca..3b8935089c0c 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -365,6 +365,14 @@ enum cxl_decoder_mode {
> CXL_DECODER_NONE,
> CXL_DECODER_RAM,
> CXL_DECODER_PMEM,
> + CXL_DECODER_DC0,
> + CXL_DECODER_DC1,
> + CXL_DECODER_DC2,
> + CXL_DECODER_DC3,
> + CXL_DECODER_DC4,
> + CXL_DECODER_DC5,
> + CXL_DECODER_DC6,
> + CXL_DECODER_DC7,
> CXL_DECODER_MIXED,
> CXL_DECODER_DEAD,
> };
> @@ -375,6 +383,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> [CXL_DECODER_NONE] = "none",
> [CXL_DECODER_RAM] = "ram",
> [CXL_DECODER_PMEM] = "pmem",
> + [CXL_DECODER_DC0] = "dc0",
> + [CXL_DECODER_DC1] = "dc1",
> + [CXL_DECODER_DC2] = "dc2",
> + [CXL_DECODER_DC3] = "dc3",
> + [CXL_DECODER_DC4] = "dc4",
> + [CXL_DECODER_DC5] = "dc5",
> + [CXL_DECODER_DC6] = "dc6",
> + [CXL_DECODER_DC7] = "dc7",
> [CXL_DECODER_MIXED] = "mixed",
> };
>
> @@ -383,10 +399,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> return "mixed";
> }
>
> +static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> + return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
> enum cxl_region_mode {
> CXL_REGION_NONE,
> CXL_REGION_RAM,
> CXL_REGION_PMEM,
> + CXL_REGION_DC,
> CXL_REGION_MIXED,
> };
>
> @@ -396,6 +418,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> [CXL_REGION_NONE] = "none",
> [CXL_REGION_RAM] = "ram",
> [CXL_REGION_PMEM] = "pmem",
> + [CXL_REGION_DC] = "dc",
> [CXL_REGION_MIXED] = "mixed",
> };
>
>
> --
> 2.44.0
>

2024-03-26 16:26:19

by fan

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

On Sun, Mar 24, 2024 at 04:18:08PM -0700, Ira Weiny wrote:
> cxl_dpa_set_mode() checks the mode for validity two times, once outside
> of the DPA RW semaphore and again within. The function is not in a
> critical path. Prior to Dynamic Capacity the extra check was not much
> of an issue. The addition of DC modes increases the complexity of
> the check.
>
> Simplify the mode check before adding the more complex DC modes.
>
> Signed-off-by: Ira Weiny <[email protected]>
>

Reviewed-by: Fan Ni <[email protected]>

> ---
> Changes for v1:
> [iweiny: new patch]
> [Jonathan: based on getting rid of the loop in cxl_dpa_set_mode]
> [Jonathan: standardize on resource_size() == 0]
> ---
> drivers/cxl/core/hdm.c | 45 ++++++++++++++++++---------------------------
> 1 file changed, 18 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7d97790b893d..66b8419fd0c3 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -411,44 +411,35 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &cxled->cxld.dev;
> - int rc;
>
> + guard(rwsem_write)(&cxl_dpa_rwsem);
> + if (cxled->cxld.flags & CXL_DECODER_F_ENABLE)
> + return -EBUSY;
> +
> + /*
> + * Check that the mode is supported by the current partition
> + * configuration
> + */
> switch (mode) {
> case CXL_DECODER_RAM:
> + if (!resource_size(&cxlds->ram_res)) {
> + dev_dbg(dev, "no available ram capacity\n");
> + return -ENXIO;
> + }
> + break;
> case CXL_DECODER_PMEM:
> + if (!resource_size(&cxlds->pmem_res)) {
> + dev_dbg(dev, "no available pmem capacity\n");
> + return -ENXIO;
> + }
> break;
> default:
> dev_dbg(dev, "unsupported mode: %d\n", mode);
> return -EINVAL;
> }
>
> - down_write(&cxl_dpa_rwsem);
> - if (cxled->cxld.flags & CXL_DECODER_F_ENABLE) {
> - rc = -EBUSY;
> - goto out;
> - }
> -
> - /*
> - * Only allow modes that are supported by the current partition
> - * configuration
> - */
> - if (mode == CXL_DECODER_PMEM && !resource_size(&cxlds->pmem_res)) {
> - dev_dbg(dev, "no available pmem capacity\n");
> - rc = -ENXIO;
> - goto out;
> - }
> - if (mode == CXL_DECODER_RAM && !resource_size(&cxlds->ram_res)) {
> - dev_dbg(dev, "no available ram capacity\n");
> - rc = -ENXIO;
> - goto out;
> - }
> -
> cxled->mode = mode;
> - rc = 0;
> -out:
> - up_write(&cxl_dpa_rwsem);
> -
> - return rc;
> + return 0;
> }
>
> int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>
> --
> 2.44.0
>

2024-03-26 16:37:14

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)



On 3/24/24 4:18 PM, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Per the CXL 3.1 specification software must check the Command Effects
> Log (CEL) to know if a device supports dynamic capacity (DC). If the
> device does support DC the specifics of the DC Regions (0-7) are read
> through the mailbox.
>
> Flag DC Device (DCD) commands in a device if they are supported.
> Subsequent patches will key off these bits to configure DCD.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Dave Jiang <[email protected]>

small formatting nit below

> ---
> Changes for v1
> [iweiny: update to latest master]
> [iweiny: update commit message]
> [iweiny: Based on the fix:
> https://lore.kernel.org/all/[email protected]/
> [jonathan: remove unneeded format change]
> [jonathan: don't split security code in mbox.c]
> ---
> drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
> drivers/cxl/cxlmem.h | 15 +++++++++++++++
> 2 files changed, 48 insertions(+)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9adda4795eb7..ed4131c6f50b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -161,6 +161,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
> }
> }
>
> +static bool cxl_is_dcd_command(u16 opcode)
> +{
> +#define CXL_MBOX_OP_DCD_CMDS 0x48
> +
> + return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> +}
> +
> +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> + u16 opcode)

This seems misaligned.

DJ

> +{
> + switch (opcode) {
> + case CXL_MBOX_OP_GET_DC_CONFIG:
> + set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> + set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_ADD_DC_RESPONSE:
> + set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> + break;
> + case CXL_MBOX_OP_RELEASE_DC:
> + set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> + break;
> + default:
> + break;
> + }
> +}
> +
> static bool cxl_is_poison_command(u16 opcode)
> {
> #define CXL_MBOX_OP_POISON_CMDS 0x43
> @@ -733,6 +761,11 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> enabled++;
> }
>
> + if (cxl_is_dcd_command(opcode)) {
> + cxl_set_dcd_cmd_enabled(mds, opcode);
> + enabled++;
> + }
> +
> dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
> enabled ? "enabled" : "unsupported by driver");
> }
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 20fb3b35e89e..79a67cff9143 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -238,6 +238,15 @@ struct cxl_event_state {
> struct mutex log_lock;
> };
>
> +/* Device enabled DCD commands */
> +enum dcd_cmd_enabled_bits {
> + CXL_DCD_ENABLED_GET_CONFIG,
> + CXL_DCD_ENABLED_GET_EXTENT_LIST,
> + CXL_DCD_ENABLED_ADD_RESPONSE,
> + CXL_DCD_ENABLED_RELEASE,
> + CXL_DCD_ENABLED_MAX
> +};
> +
> /* Device enabled poison commands */
> enum poison_cmd_enabled_bits {
> CXL_POISON_ENABLED_LIST,
> @@ -454,6 +463,7 @@ struct cxl_dev_state {
> * (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> * @mbox_mutex: Mutex to synchronize mailbox access.
> * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> * @total_bytes: sum of all possible capacities
> @@ -481,6 +491,7 @@ struct cxl_memdev_state {
> size_t lsa_size;
> struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> char firmware_version[0x10];
> + DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> u64 total_bytes;
> @@ -551,6 +562,10 @@ enum cxl_opcode {
> CXL_MBOX_OP_UNLOCK = 0x4503,
> CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
> CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
> + CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
> + CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
> + CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
> + CXL_MBOX_OP_RELEASE_DC = 0x4803,
> CXL_MBOX_OP_MAX = 0x10000
> };
>
>

2024-03-26 18:09:40

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()



On 3/24/24 4:18 PM, Ira Weiny wrote:
> cxl_dpa_set_mode() checks the mode for validity two times, once outside
> of the DPA RW semaphore and again within. The function is not in a
> critical path. Prior to Dynamic Capacity the extra check was not much
> of an issue. The addition of DC modes increases the complexity of
> the check.
>
> Simplify the mode check before adding the more complex DC modes.

I would augment this by saying simplify "by using scope-based resource menagement".
>
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Dave Jiang <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: new patch]
> [Jonathan: based on getting rid of the loop in cxl_dpa_set_mode]
> [Jonathan: standardize on resource_size() == 0]
> ---
> drivers/cxl/core/hdm.c | 45 ++++++++++++++++++---------------------------
> 1 file changed, 18 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7d97790b893d..66b8419fd0c3 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -411,44 +411,35 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &cxled->cxld.dev;
> - int rc;
>
> + guard(rwsem_write)(&cxl_dpa_rwsem);
> + if (cxled->cxld.flags & CXL_DECODER_F_ENABLE)
> + return -EBUSY;
> +
> + /*
> + * Check that the mode is supported by the current partition
> + * configuration
> + */
> switch (mode) {
> case CXL_DECODER_RAM:
> + if (!resource_size(&cxlds->ram_res)) {
> + dev_dbg(dev, "no available ram capacity\n");
> + return -ENXIO;
> + }
> + break;
> case CXL_DECODER_PMEM:
> + if (!resource_size(&cxlds->pmem_res)) {
> + dev_dbg(dev, "no available pmem capacity\n");
> + return -ENXIO;
> + }
> break;
> default:
> dev_dbg(dev, "unsupported mode: %d\n", mode);
> return -EINVAL;
> }
>
> - down_write(&cxl_dpa_rwsem);
> - if (cxled->cxld.flags & CXL_DECODER_F_ENABLE) {
> - rc = -EBUSY;
> - goto out;
> - }
> -
> - /*
> - * Only allow modes that are supported by the current partition
> - * configuration
> - */
> - if (mode == CXL_DECODER_PMEM && !resource_size(&cxlds->pmem_res)) {
> - dev_dbg(dev, "no available pmem capacity\n");
> - rc = -ENXIO;
> - goto out;
> - }
> - if (mode == CXL_DECODER_RAM && !resource_size(&cxlds->ram_res)) {
> - dev_dbg(dev, "no available ram capacity\n");
> - rc = -ENXIO;
> - goto out;
> - }
> -
> cxled->mode = mode;
> - rc = 0;
> -out:
> - up_write(&cxl_dpa_rwsem);
> -
> - return rc;
> + return 0;
> }
>
> int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>

2024-03-26 18:31:11

by fan

[permalink] [raw]
Subject: Re: [PATCH 08/26] cxl/mem: Expose device dynamic capacity capabilities

On Mon, Mar 25, 2024 at 04:40:16PM -0700, Davidlohr Bueso wrote:
> On Sun, 24 Mar 2024, [email protected] wrote:
>
> > +What: /sys/bus/cxl/devices/memX/dc/region_count
> > +Date: June, 2024
> > +KernelVersion: v6.10
> > +Contact: [email protected]
> > +Description:
> > + (RO) Number of Dynamic Capacity (DC) regions supported on the
> > + device. May be 0 if the device does not support Dynamic
> > + Capacity.
>
> If dcd is not supported then we should not have the dc/ directory
> altogether.
>
> Thanks,
> Davidlohr

I also think so. However, I also noticed one thing (not DCD related).
Even for a PMEM device, for example, we have a ram directory under the
device directory.

===================
root@DT:~# cxl list
[
{
"memdev":"mem0",
"pmem_size":536870912,
"serial":0,
"host":"0000:0d:00.0"
}
]
root@DT:~# ls /sys/bus/cxl/devices/mem0/
dc dev driver firmware firmware_version label_storage_size numa_node payload_max pmem pmem0 ram security serial subsystem trigger_poison_list uevent
root@DT:~#
===================

Fan


2024-03-26 22:34:53

by fan

[permalink] [raw]
Subject: Re: [PATCH 09/26] cxl/region: Add Dynamic Capacity CXL region support

On Sun, Mar 24, 2024 at 04:18:12PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> CXL devices optionally support dynamic capacity. CXL Regions must be
> configured correctly to access this capacity. Similar to ram and pmem
> partitions, DC Regions, as they are called in CXL 3.1, represent
> different partitions of the DPA space.
>
> Introduce the concept of a sparse DAX region. Add the create_dc_region
> sysfs entry to create sparse DC DAX regions. Special case DC capable
> regions to create a 0 sized seed DAX device to maintain backwards
> compatibility with older software which needs a default DAX device to
> hold the region reference.
>
> Flag sparse DAX regions to indicate 0 capacity available until such time
> as DC capacity is added.
>
> Interleaving is deferred in this series. Add an early check.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [djiang: mark sysfs entries to be in 6.10 kernel including date]
> [djbw: change dax region typing to be 'sparse' rather than 'dynamic']
> [iweiny: rebase changes to master instead of type2 patches]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 22 +++++++++++-----------
> drivers/cxl/core/core.h | 1 +
> drivers/cxl/core/port.c | 1 +
> drivers/cxl/core/region.c | 33 +++++++++++++++++++++++++++++++++
> drivers/dax/bus.c | 8 ++++++++
> drivers/dax/bus.h | 1 +
> drivers/dax/cxl.c | 15 +++++++++++++--
> 7 files changed, 68 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 8a4f572c8498..f0cf52fff9fa 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -411,20 +411,20 @@ Description:
> interleave_granularity).
>
>
> -What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> -Date: May, 2022, January, 2023
> -KernelVersion: v6.0 (pmem), v6.3 (ram)
> +What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
> +Date: May, 2022, January, 2023, June 2024
> +KernelVersion: v6.0 (pmem), v6.3 (ram), v6.10 (dc)
> Contact: [email protected]
> Description:
> (RW) Write a string in the form 'regionZ' to start the process
> - of defining a new persistent, or volatile memory region
> - (interleave-set) within the decode range bounded by root decoder
> - 'decoderX.Y'. The value written must match the current value
> - returned from reading this attribute. An atomic compare exchange
> - operation is done on write to assign the requested id to a
> - region and allocate the region-id for the next creation attempt.
> - EBUSY is returned if the region name written does not match the
> - current cached value.
> + of defining a new persistent, volatile, or Dynamic Capacity
> + (DC) memory region (interleave-set) within the decode range
> + bounded by root decoder 'decoderX.Y'. The value written must
> + match the current value returned from reading this attribute.
> + An atomic compare exchange operation is done on write to assign
> + the requested id to a region and allocate the region-id for the
> + next creation attempt. EBUSY is returned if the region name
> + written does not match the current cached value.
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 3b64fb1b9ed0..91abeffbe985 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -13,6 +13,7 @@ extern struct attribute_group cxl_base_attribute_group;
> #ifdef CONFIG_CXL_REGION
> extern struct device_attribute dev_attr_create_pmem_region;
> extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dc_region;
> extern struct device_attribute dev_attr_delete_region;
> extern struct device_attribute dev_attr_region;
> extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 036b61cb3007..661177b575f7 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -335,6 +335,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
> &dev_attr_qos_class.attr,
> SET_CXL_REGION_ATTR(create_pmem_region)
> SET_CXL_REGION_ATTR(create_ram_region)
> + SET_CXL_REGION_ATTR(create_dc_region)
> SET_CXL_REGION_ATTR(delete_region)
> NULL,
> };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index ec3b8c6948e9..0d7b09a49dcf 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2205,6 +2205,7 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
> switch (mode) {
> case CXL_REGION_RAM:
> case CXL_REGION_PMEM:
> + case CXL_REGION_DC:
> break;
> default:
> dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> @@ -2314,6 +2315,32 @@ static ssize_t create_ram_region_store(struct device *dev,
> }
> DEVICE_ATTR_RW(create_ram_region);
>
> +static ssize_t create_dc_region_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> + struct cxl_region *cxlr;
> + int rc, id;
> +
> + rc = sscanf(buf, "region%d\n", &id);
> + if (rc != 1)
> + return -EINVAL;
> +
> + cxlr = __create_region(cxlrd, CXL_REGION_DC, id);
> + if (IS_ERR(cxlr))
> + return PTR_ERR(cxlr);
> +
> + return len;
> +}
> +DEVICE_ATTR_RW(create_dc_region);

create_ram_region_store, create_pmem_region_store and
create_dc_region_store have mostly duplicate code, should we consider
extracting out as a helper function and pass region type for ram/pmem/dc region
store?

Fan

> +
> static ssize_t region_show(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -2759,6 +2786,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> struct device *dev;
> int rc;
>
> + if (cxlr->mode == CXL_REGION_DC && cxlr->params.interleave_ways != 1) {
> + dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> cxlr_dax = cxl_dax_region_alloc(cxlr);
> if (IS_ERR(cxlr_dax))
> return PTR_ERR(cxlr_dax);
> @@ -3040,6 +3072,7 @@ static int cxl_region_probe(struct device *dev)
> case CXL_REGION_PMEM:
> return devm_cxl_add_pmem_region(cxlr);
> case CXL_REGION_RAM:
> + case CXL_REGION_DC:
> /*
> * The region can not be manged by CXL if any portion of
> * it is already online as 'System RAM'
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index cb148f74ceda..903566aff5eb 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -181,6 +181,11 @@ static bool is_static(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
> }
>
> +static bool is_sparse(struct dax_region *dax_region)
> +{
> + return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> +}
> +
> bool static_dev_dax(struct dev_dax *dev_dax)
> {
> return is_static(dev_dax->region);
> @@ -304,6 +309,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>
> WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
>
> + if (is_sparse(dax_region))
> + return 0;
> +
> for_each_dax_region_resource(dax_region, res)
> size -= resource_size(res);
> return size;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index cbbf64443098..783bfeef42cc 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
> /* dax bus specific ioresource flags */
> #define IORESOURCE_DAX_STATIC BIT(0)
> #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
>
> struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index c696837ab23c..415d03fbf9b6 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,30 @@ static int cxl_dax_region_probe(struct device *dev)
> struct cxl_region *cxlr = cxlr_dax->cxlr;
> struct dax_region *dax_region;
> struct dev_dax_data data;
> + resource_size_t dev_size;
> + unsigned long flags;
>
> if (nid == NUMA_NO_NODE)
> nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>
> + flags = IORESOURCE_DAX_KMEM;
> + if (cxlr->mode == CXL_REGION_DC)
> + flags |= IORESOURCE_DAX_SPARSE_CAP;
> +
> dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> - PMD_SIZE, IORESOURCE_DAX_KMEM);
> + PMD_SIZE, flags);
> if (!dax_region)
> return -ENOMEM;
>
> + dev_size = range_len(&cxlr_dax->hpa_range);
> + /* Add empty seed dax device */
> + if (cxlr->mode == CXL_REGION_DC)
> + dev_size = 0;
> +
> data = (struct dev_dax_data) {
> .dax_region = dax_region,
> .id = -1,
> - .size = range_len(&cxlr_dax->hpa_range),
> + .size = dev_size,
> .memmap_on_memory = true,
> };
>
>
> --
> 2.44.0
>

2024-03-26 23:27:24

by fan

[permalink] [raw]
Subject: Re: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts

On Sun, Mar 24, 2024 at 04:18:16PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism. The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications.
>
> Firmware can't configure DCD events to be FW controlled but can retain
> control of memory events. Split irq configuration of memory events and
> DCD events to allow for FW control of memory events while DCD is host
> controlled.
>
> Configure DCD event log interrupts on devices supporting dynamic
> capacity. Disable DCD if interrupts are not supported.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [iweiny: rebase to upstream irq code]
> [iweiny: disable DCD if irqs not supported]
> ---
> drivers/cxl/core/mbox.c | 9 ++++++-
> drivers/cxl/cxl.h | 4 ++-
> drivers/cxl/cxlmem.h | 4 +++
> drivers/cxl/pci.c | 71 ++++++++++++++++++++++++++++++++++++++++---------
> 4 files changed, 74 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 14e8a7528a8b..58b31fa47b93 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1323,10 +1323,17 @@ static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> return rc;
> }
>
> -static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> {
> return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> }
> +EXPORT_SYMBOL_NS_GPL(cxl_dcd_supported, CXL);
> +
> +void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> + clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_disable_dcd, CXL);
>
> /**
> * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 15d418b3bc9b..d585f5fdd3ae 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -164,11 +164,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
> #define CXLDEV_EVENT_STATUS_WARN BIT(1)
> #define CXLDEV_EVENT_STATUS_FAIL BIT(2)
> #define CXLDEV_EVENT_STATUS_FATAL BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD BIT(4)
>
> #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
> CXLDEV_EVENT_STATUS_WARN | \
> CXLDEV_EVENT_STATUS_FAIL | \
> - CXLDEV_EVENT_STATUS_FATAL)
> + CXLDEV_EVENT_STATUS_FATAL| \
> + CXLDEV_EVENT_STATUS_DCD)
>
> /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
> #define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 4624cf612c1e..01bee6eedff3 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -225,7 +225,9 @@ struct cxl_event_interrupt_policy {
> u8 warn_settings;
> u8 failure_settings;
> u8 fatal_settings;
> + u8 dcd_settings;
> } __packed;
> +#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
>
> /**
> * struct cxl_event_state - Event log driver state
> @@ -890,6 +892,8 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> const uuid_t *uuid, union cxl_event *evt);
> +bool cxl_dcd_supported(struct cxl_memdev_state *mds);
> +void cxl_disable_dcd(struct cxl_memdev_state *mds);
> int cxl_set_timestamp(struct cxl_memdev_state *mds);
> int cxl_poison_state_init(struct cxl_memdev_state *mds);
> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 12cd5d399230..ef482eae09e9 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -669,22 +669,33 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> }
>
> static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
> - struct cxl_event_interrupt_policy *policy)
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> {
> struct cxl_mbox_cmd mbox_cmd;
> + size_t size_in;
> int rc;
>
> - *policy = (struct cxl_event_interrupt_policy) {
> - .info_settings = CXL_INT_MSI_MSIX,
> - .warn_settings = CXL_INT_MSI_MSIX,
> - .failure_settings = CXL_INT_MSI_MSIX,
> - .fatal_settings = CXL_INT_MSI_MSIX,
> - };
> + if (native_cxl) {
> + *policy = (struct cxl_event_interrupt_policy) {
> + .info_settings = CXL_INT_MSI_MSIX,
> + .warn_settings = CXL_INT_MSI_MSIX,
> + .failure_settings = CXL_INT_MSI_MSIX,
> + .fatal_settings = CXL_INT_MSI_MSIX,
> + .dcd_settings = 0,
> + };
> + }
> + size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
> +
> + if (cxl_dcd_supported(mds)) {
> + policy->dcd_settings = CXL_INT_MSI_MSIX;
> + size_in += sizeof(policy->dcd_settings);
> + }
>
> mbox_cmd = (struct cxl_mbox_cmd) {
> .opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
> .payload_in = policy,
> - .size_in = sizeof(*policy),
> + .size_in = size_in,
> };
>
> rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> @@ -731,6 +742,31 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
> return 0;
> }
>
> +static int cxl_irqsetup(struct cxl_memdev_state *mds,
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> +{
> + struct cxl_dev_state *cxlds = &mds->cxlds;
> + int rc;
> +
> + if (native_cxl) {
> + rc = cxl_event_irqsetup(mds, policy);
> + if (rc)
> + return rc;
> + }
> +
> + if (cxl_dcd_supported(mds)) {
> + rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
> + if (rc) {
> + dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
> + cxl_disable_dcd(mds);
> + return rc;
> + }
> + }
> +
> + return 0;
> +}
> +
> static bool cxl_event_int_is_fw(u8 setting)
> {
> u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
> @@ -757,17 +793,25 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> struct cxl_memdev_state *mds, bool irq_avail)
> {
> struct cxl_event_interrupt_policy policy = { 0 };
> + bool native_cxl = host_bridge->native_cxl_error;
> int rc;
>
> /*
> * When BIOS maintains CXL error reporting control, it will process
> * event records. Only one agent can do so.
> + *
> + * If BIOS has control of events and DCD is not supported skip event
> + * configuration.
> */
> - if (!host_bridge->native_cxl_error)
> + if (!native_cxl && !cxl_dcd_supported(mds))
> return 0;
>
> if (!irq_avail) {
> dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
> + if (cxl_dcd_supported(mds)) {
> + dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
> + cxl_disable_dcd(mds);
> + }
> return 0;
> }
>
> @@ -775,10 +819,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
> - if (!cxl_event_validate_mem_policy(mds, &policy))
> + if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
> return -EBUSY;
>
> - rc = cxl_event_config_msgnums(mds, &policy);
> + rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
> if (rc)
> return rc;
>
> @@ -786,12 +830,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
> - rc = cxl_event_irqsetup(mds, &policy);
> + rc = cxl_irqsetup(mds, &policy, native_cxl);
> if (rc)
> return rc;
>
> cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
>
> + dev_dbg(mds->cxlds.dev, "Event config : %d %d\n",
> + native_cxl, cxl_dcd_supported(mds));

The message will print out two numbers, seems not very clear. Should we
translate to more straightforward message, like
native_cxl? "OS...":""
cxl_dcd_supported(msd)? "DCD supported": "DCD not supported"?

Fan

> +
> return 0;
> }
>
>
> --
> 2.44.0
>

2024-03-26 23:27:40

by fan

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

On Sun, Mar 24, 2024 at 04:18:17PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case creation of a new
> region on top of the DC partition (region) is expected to expose those
> extents for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized read the device extent list. For ease of review, this patch
> stops after reading the extent list and leaves realization of the region
> extents to a future patch.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: remove extent list xarray]
> [iweiny: Update spec references to 3.1]
> [iweiny: use struct range in extents]
> [iweiny: remove all reference tracking and let regions track extents
> through the extent devices.]
> [djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
> ---
> drivers/cxl/core/core.h | 9 +++
> drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 29 +++++++
> drivers/cxl/cxlmem.h | 49 ++++++++++++
> 4 files changed, 279 insertions(+)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 91abeffbe985..119b12362977 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,6 +4,8 @@
> #ifndef __CXL_CORE_H__
> #define __CXL_CORE_H__
>
> +#include <cxlmem.h>
> +
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> extern const struct device_type cxl_pmu_type;
> @@ -28,6 +30,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
> int cxl_region_init(void);
> void cxl_region_exit(void);
> int cxl_get_poison_by_endpoint(struct cxl_port *port);
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent);
> #else
> static inline int cxl_get_poison_by_endpoint(struct cxl_port *port)
> {
> @@ -43,6 +47,11 @@ static inline int cxl_region_init(void)
> static inline void cxl_region_exit(void)
> {
> }
> +static inline int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> #define CXL_REGION_ATTR(x) NULL
> #define CXL_REGION_TYPE(x) NULL
> #define SET_CXL_REGION_ATTR(x)
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 58b31fa47b93..9e33a0976828 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct device *dev = mds->cxlds.dev;
> + uint64_t start, len;
> +
> + start = le64_to_cpu(dc_extent->start_dpa);
> + len = le64_to_cpu(dc_extent->length);
> +
> + /* Extents must not cross region boundary's */
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + if (dcr->base <= start &&
> + (start + len) <= (dcr->base + dcr->decode_len)) {

Why not use range_contains here as below?

Fan
> + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> + start, start + len - 1, i, start - dcr->base);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> + start, start + len - 1);
> + return -EINVAL;
> +}
> +
> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *extent)
> +{
> + uint64_t start = le64_to_cpu(extent->start_dpa);
> + uint64_t length = le64_to_cpu(extent->length);
> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
> + struct range ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> +
> + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
> + cxled->dpa_res, start, length);
> +
> + return range_contains(&ed_range, &ext_range);
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> @@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> + unsigned int *extent_gen_num)
> +{
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + struct cxl_mbox_get_dc_extent_out dc_extents;
> + struct cxl_mbox_cmd mbox_cmd;
> + unsigned int count;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(0),
> + .start_extent_index = cpu_to_le32(0),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = sizeof(dc_extents),
> + .payload_out = &dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + count = le32_to_cpu(dc_extents.total_extent_cnt);
> + *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> +
> + return count;
> +}
> +
> +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
> + unsigned int start_gen_num,
> + unsigned int exp_cnt)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + unsigned int start_index, total_read;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> +
> + struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
> + kvmalloc(mds->payload_size, GFP_KERNEL);
> + if (!dc_extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + start_index = 0;
> + do {
> + unsigned int nr_ext, total_extent_cnt, gen_num;
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(exp_cnt - start_index),
> + .start_extent_index = cpu_to_le32(start_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = mds->payload_size,
> + .payload_out = dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
> + total_read += nr_ext;
> + total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> + gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +
> + dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> + total_extent_cnt, gen_num);
> +
> + if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> + dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
> + gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> + return -EIO;
> + }
> +
> + for (int i = 0; i < nr_ext ; i++) {
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + start_index + i, exp_cnt);
> + rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
> + if (rc)
> + continue;
> + if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
> + continue;
> + rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
> + if (rc)
> + return rc;
> + }
> +
> + start_index += nr_ext;
> + } while (exp_cnt > total_read);
> +
> + return 0;
> +}
> +
> +/**
> + * cxl_read_dc_extents() - Read any existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add any existing extents found which belong to this decoder.
> + *
> + * Return: 0 if command was executed successfully, -ERRNO on error.
> + */
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + unsigned int extent_gen_num;
> + int rc;
> +
> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD unsupported\n");
> + return 0;
> + }
> +
> + rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> + dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",
> + rc, extent_gen_num);
> + if (rc <= 0) /* 0 == no records found */
> + return rc;
> +
> + return cxl_dev_get_dc_extents(cxled, extent_gen_num, rc);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_read_dc_extents, CXL);
> +
> static int add_dpa_res(struct device *dev, struct resource *parent,
> struct resource *res, resource_size_t start,
> resource_size_t size, const char *type)
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 0d7b09a49dcf..3e563ab29afe 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,6 +1450,13 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> +
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder *cxled,
> @@ -2773,6 +2780,22 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
> return rc;
> }
>
> +static int cxl_region_read_extents(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int i;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + int rc;
> +
> + rc = cxl_read_dc_extents(p->targets[i]);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> @@ -2807,6 +2830,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> dev_name(dev));
>
> + if (cxlr->mode == CXL_REGION_DC) {
> + rc = cxl_region_read_extents(cxlr);
> + if (rc)
> + goto err;
> + }
> +
> return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> cxlr_dax);
> err:
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 01bee6eedff3..8f2d8944d334 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -604,6 +604,54 @@ enum cxl_opcode {
> UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
> 0x40, 0x3d, 0x86)
>
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> + __le32 extent_list_size;
> + u8 flags;
> + u8 reserved[3];
> + struct updated_extent_list {
> + __le64 dpa_start;
> + __le64 length;
> + u8 reserved[8];
> + } __packed extent_list[];
> +} __packed;
> +
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +#define CXL_DC_EXTENT_TAG_LEN 0x10
> +struct cxl_dc_extent {
> + __le64 start_dpa;
> + __le64 length;
> + u8 tag[CXL_DC_EXTENT_TAG_LEN];
> + __le16 shared_extn_seq;
> + u8 reserved[6];
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Input Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
> + */
> +struct cxl_mbox_get_dc_extent_in {
> + __le32 extent_cnt;
> + __le32 start_extent_index;
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_dc_extent_out {
> + __le32 ret_extent_cnt;
> + __le32 total_extent_cnt;
> + __le32 extent_list_num;
> + u8 rsvd[4];
> + struct cxl_dc_extent extent[];
> +} __packed;
> +
> struct cxl_mbox_get_supported_logs {
> __le16 entries;
> u8 rsvd[6];
> @@ -879,6 +927,7 @@ int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
>
> --
> 2.44.0
>

2024-03-27 15:45:40

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 04/26] cxl/region: Add dynamic capacity decoder and region modes



On 3/24/24 4:18 PM, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Region mode must reflect a general dynamic capacity type which is
> associated with a specific Dynamic Capacity (DC) partitions in each
> device decoder within the region. DC partitions are also know as DC
> regions per CXL 3.1.

This section reads somewhat awkward to me. Does this read any better?

One or more Dynamic Capacity (DC) partitions (and decoders) form a CXL software region. The region mode reflects composition of that entire software region. Decoder mode reflects a specific DC partition. DC partitions are also known as DC regions per CXL specification r3.1 but is not the same entity as CXL software regions.

DJ

>
> Decoder mode reflects a specific DC partition.
>
> Define the new modes to use in subsequent patches and the helper
> functions required to make the association between these new modes.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
> ---
> Changes for v1
> [iweiny: split out from: Add dynamic capacity cxl region support.]
> ---
> drivers/cxl/core/region.c | 4 ++++
> drivers/cxl/cxl.h | 23 +++++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 1723d17f121e..ec3b8c6948e9 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1690,6 +1690,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> return true;
> if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> return true;
> + if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
> + return true;
>
> return false;
> }
> @@ -2824,6 +2826,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> return CXL_REGION_RAM;
> case CXL_DECODER_PMEM:
> return CXL_REGION_PMEM;
> + case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> + return CXL_REGION_DC;
> case CXL_DECODER_MIXED:
> default:
> return CXL_REGION_MIXED;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 9a0cce1e6fca..3b8935089c0c 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -365,6 +365,14 @@ enum cxl_decoder_mode {
> CXL_DECODER_NONE,
> CXL_DECODER_RAM,
> CXL_DECODER_PMEM,
> + CXL_DECODER_DC0,
> + CXL_DECODER_DC1,
> + CXL_DECODER_DC2,
> + CXL_DECODER_DC3,
> + CXL_DECODER_DC4,
> + CXL_DECODER_DC5,
> + CXL_DECODER_DC6,
> + CXL_DECODER_DC7,
> CXL_DECODER_MIXED,
> CXL_DECODER_DEAD,
> };
> @@ -375,6 +383,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> [CXL_DECODER_NONE] = "none",
> [CXL_DECODER_RAM] = "ram",
> [CXL_DECODER_PMEM] = "pmem",
> + [CXL_DECODER_DC0] = "dc0",
> + [CXL_DECODER_DC1] = "dc1",
> + [CXL_DECODER_DC2] = "dc2",
> + [CXL_DECODER_DC3] = "dc3",
> + [CXL_DECODER_DC4] = "dc4",
> + [CXL_DECODER_DC5] = "dc5",
> + [CXL_DECODER_DC6] = "dc6",
> + [CXL_DECODER_DC7] = "dc7",
> [CXL_DECODER_MIXED] = "mixed",
> };
>
> @@ -383,10 +399,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> return "mixed";
> }
>
> +static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> + return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
> enum cxl_region_mode {
> CXL_REGION_NONE,
> CXL_REGION_RAM,
> CXL_REGION_PMEM,
> + CXL_REGION_DC,
> CXL_REGION_MIXED,
> };
>
> @@ -396,6 +418,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> [CXL_REGION_NONE] = "none",
> [CXL_REGION_RAM] = "ram",
> [CXL_REGION_PMEM] = "pmem",
> + [CXL_REGION_DC] = "dc",
> [CXL_REGION_MIXED] = "mixed",
> };
>
>

2024-03-27 17:27:58

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 09/26] cxl/region: Add Dynamic Capacity CXL region support



On 3/24/24 4:18 PM, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> CXL devices optionally support dynamic capacity. CXL Regions must be
> configured correctly to access this capacity. Similar to ram and pmem
> partitions, DC Regions, as they are called in CXL 3.1, represent
> different partitions of the DPA space.
>
> Introduce the concept of a sparse DAX region. Add the create_dc_region
> sysfs entry to create sparse DC DAX regions. Special case DC capable
> regions to create a 0 sized seed DAX device to maintain backwards
> compatibility with older software which needs a default DAX device to
> hold the region reference.
>
> Flag sparse DAX regions to indicate 0 capacity available until such time
> as DC capacity is added.
>
> Interleaving is deferred in this series. Add an early check.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [djiang: mark sysfs entries to be in 6.10 kernel including date]
> [djbw: change dax region typing to be 'sparse' rather than 'dynamic']
> [iweiny: rebase changes to master instead of type2 patches]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 22 +++++++++++-----------
> drivers/cxl/core/core.h | 1 +
> drivers/cxl/core/port.c | 1 +
> drivers/cxl/core/region.c | 33 +++++++++++++++++++++++++++++++++
> drivers/dax/bus.c | 8 ++++++++
> drivers/dax/bus.h | 1 +
> drivers/dax/cxl.c | 15 +++++++++++++--
> 7 files changed, 68 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 8a4f572c8498..f0cf52fff9fa 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -411,20 +411,20 @@ Description:
> interleave_granularity).
>
>
> -What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> -Date: May, 2022, January, 2023
> -KernelVersion: v6.0 (pmem), v6.3 (ram)
> +What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
> +Date: May, 2022, January, 2023, June 2024
> +KernelVersion: v6.0 (pmem), v6.3 (ram), v6.10 (dc)
> Contact: [email protected]
> Description:
> (RW) Write a string in the form 'regionZ' to start the process
> - of defining a new persistent, or volatile memory region
> - (interleave-set) within the decode range bounded by root decoder
> - 'decoderX.Y'. The value written must match the current value
> - returned from reading this attribute. An atomic compare exchange
> - operation is done on write to assign the requested id to a
> - region and allocate the region-id for the next creation attempt.
> - EBUSY is returned if the region name written does not match the
> - current cached value.
> + of defining a new persistent, volatile, or Dynamic Capacity
> + (DC) memory region (interleave-set) within the decode range
> + bounded by root decoder 'decoderX.Y'. The value written must
> + match the current value returned from reading this attribute.
> + An atomic compare exchange operation is done on write to assign
> + the requested id to a region and allocate the region-id for the
> + next creation attempt. EBUSY is returned if the region name

-EBUSY?

> + written does not match the current cached value.
>
>
> What: /sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 3b64fb1b9ed0..91abeffbe985 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -13,6 +13,7 @@ extern struct attribute_group cxl_base_attribute_group;
> #ifdef CONFIG_CXL_REGION
> extern struct device_attribute dev_attr_create_pmem_region;
> extern struct device_attribute dev_attr_create_ram_region;
> +extern struct device_attribute dev_attr_create_dc_region;
> extern struct device_attribute dev_attr_delete_region;
> extern struct device_attribute dev_attr_region;
> extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 036b61cb3007..661177b575f7 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -335,6 +335,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
> &dev_attr_qos_class.attr,
> SET_CXL_REGION_ATTR(create_pmem_region)
> SET_CXL_REGION_ATTR(create_ram_region)
> + SET_CXL_REGION_ATTR(create_dc_region)
> SET_CXL_REGION_ATTR(delete_region)
> NULL,
> };
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index ec3b8c6948e9..0d7b09a49dcf 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2205,6 +2205,7 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
> switch (mode) {
> case CXL_REGION_RAM:
> case CXL_REGION_PMEM:
> + case CXL_REGION_DC:
> break;
> default:
> dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> @@ -2314,6 +2315,32 @@ static ssize_t create_ram_region_store(struct device *dev,
> }
> DEVICE_ATTR_RW(create_ram_region);
>
> +static ssize_t create_dc_region_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static ssize_t create_dc_region_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> + struct cxl_region *cxlr;
> + int rc, id;
> +
> + rc = sscanf(buf, "region%d\n", &id);
> + if (rc != 1)
> + return -EINVAL;
> +
> + cxlr = __create_region(cxlrd, CXL_REGION_DC, id);
> + if (IS_ERR(cxlr))
> + return PTR_ERR(cxlr);
> +
> + return len;
> +}
> +DEVICE_ATTR_RW(create_dc_region);
> +
> static ssize_t region_show(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -2759,6 +2786,11 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> struct device *dev;
> int rc;
>
> + if (cxlr->mode == CXL_REGION_DC && cxlr->params.interleave_ways != 1) {
> + dev_err(&cxlr->dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> cxlr_dax = cxl_dax_region_alloc(cxlr);
> if (IS_ERR(cxlr_dax))
> return PTR_ERR(cxlr_dax);
> @@ -3040,6 +3072,7 @@ static int cxl_region_probe(struct device *dev)
> case CXL_REGION_PMEM:
> return devm_cxl_add_pmem_region(cxlr);
> case CXL_REGION_RAM:
> + case CXL_REGION_DC:
> /*
> * The region can not be manged by CXL if any portion of
> * it is already online as 'System RAM'
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index cb148f74ceda..903566aff5eb 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -181,6 +181,11 @@ static bool is_static(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
> }
>
> +static bool is_sparse(struct dax_region *dax_region)
> +{
> + return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> +}
> +
> bool static_dev_dax(struct dev_dax *dev_dax)
> {
> return is_static(dev_dax->region);
> @@ -304,6 +309,9 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
>
> WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
>
> + if (is_sparse(dax_region))
> + return 0;
> +
> for_each_dax_region_resource(dax_region, res)
> size -= resource_size(res);
> return size;
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index cbbf64443098..783bfeef42cc 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -13,6 +13,7 @@ struct dax_region;
> /* dax bus specific ioresource flags */
> #define IORESOURCE_DAX_STATIC BIT(0)
> #define IORESOURCE_DAX_KMEM BIT(1)
> +#define IORESOURCE_DAX_SPARSE_CAP BIT(2)
>
> struct dax_region *alloc_dax_region(struct device *parent, int region_id,
> struct range *range, int target_node, unsigned int align,
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index c696837ab23c..415d03fbf9b6 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -13,19 +13,30 @@ static int cxl_dax_region_probe(struct device *dev)
> struct cxl_region *cxlr = cxlr_dax->cxlr;
> struct dax_region *dax_region;
> struct dev_dax_data data;
> + resource_size_t dev_size;
> + unsigned long flags;
>
> if (nid == NUMA_NO_NODE)
> nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
>
> + flags = IORESOURCE_DAX_KMEM;
> + if (cxlr->mode == CXL_REGION_DC)
> + flags |= IORESOURCE_DAX_SPARSE_CAP;
> +
> dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> - PMD_SIZE, IORESOURCE_DAX_KMEM);
> + PMD_SIZE, flags);
> if (!dax_region)
> return -ENOMEM;
>
> + dev_size = range_len(&cxlr_dax->hpa_range);
> + /* Add empty seed dax device */
> + if (cxlr->mode == CXL_REGION_DC)
> + dev_size = 0;

Nit. Just do if/else so dev_size isn't set twice if mode is DC.

> +
> data = (struct dev_dax_data) {
> .dax_region = dax_region,
> .id = -1,
> - .size = range_len(&cxlr_dax->hpa_range),
> + .size = dev_size,
> .memmap_on_memory = true,
> };
>
>

2024-03-27 17:36:45

by fan

[permalink] [raw]
Subject: Re: [PATCH 15/26] range: Add range_overlaps()

On Sun, Mar 24, 2024 at 04:18:18PM -0700, Ira Weiny wrote:
> Code to support CXL Dynamic Capacity devices will have extent ranges
> which need to be compared for intersection not a subset as is being
> checked in range_contains().
>
> range_overlaps() is defined in btrfs with a different meaning from what
> is required in the standard range code. Dan Williams pointed this out
> in [1]. Adjust the btrfs call according to his suggestion there.
>
> Then add a generic range_overlaps().
>
> Cc: Dan Williams <[email protected]>
> Cc: Chris Mason <[email protected]>
> Cc: Josef Bacik <[email protected]>
> Cc: David Sterba <[email protected]>
> Cc: [email protected]
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Fan Ni <[email protected]>

>
> [1] https://lore.kernel.org/all/[email protected]/
> ---
> fs/btrfs/ordered-data.c | 10 +++++-----
> include/linux/range.h | 7 +++++++
> 2 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 59850dc17b22..032d30a49edc 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -111,8 +111,8 @@ static struct rb_node *__tree_search(struct rb_root *root, u64 file_offset,
> return NULL;
> }
>
> -static int range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
> - u64 len)
> +static int btrfs_range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
> + u64 len)
> {
> if (file_offset + len <= entry->file_offset ||
> entry->file_offset + entry->num_bytes <= file_offset)
> @@ -914,7 +914,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
>
> while (1) {
> entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> break;
>
> if (entry->file_offset >= file_offset + len) {
> @@ -1043,12 +1043,12 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
> }
> if (prev) {
> entry = rb_entry(prev, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> goto out;
> }
> if (next) {
> entry = rb_entry(next, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> goto out;
> }
> /* No ordered extent in the range */
> diff --git a/include/linux/range.h b/include/linux/range.h
> index 6ad0b73cb7ad..9a46f3212965 100644
> --- a/include/linux/range.h
> +++ b/include/linux/range.h
> @@ -13,11 +13,18 @@ static inline u64 range_len(const struct range *range)
> return range->end - range->start + 1;
> }
>
> +/* True if r1 completely contains r2 */
> static inline bool range_contains(struct range *r1, struct range *r2)
> {
> return r1->start <= r2->start && r1->end >= r2->end;
> }
>
> +/* True if any part of r1 overlaps r2 */
> +static inline bool range_overlaps(struct range *r1, struct range *r2)
> +{
> + return r1->start <= r2->end && r1->end >= r2->start;
> +}
> +
> int add_range(struct range *range, int az, int nr_range,
> u64 start, u64 end);
>
>
> --
> 2.44.0
>

2024-03-27 17:39:12

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 11/26] cxl/pci: Delay event buffer allocation



On 3/24/24 4:18 PM, Ira Weiny wrote:
> The event buffer does not need to be allocated if something has failed
> in setting up event irq's.
>
> In prep for adjusting event configuration for DCD events move the buffer
> allocation to the end of the event configuration.
>
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Dave Jiang <[email protected]>
> ---
> drivers/cxl/pci.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index cedd9b05f129..ccaf4ad26a4f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -756,10 +756,6 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> return 0;
> }
>
> - rc = cxl_mem_alloc_event_buf(mds);
> - if (rc)
> - return rc;
> -
> rc = cxl_event_get_int_policy(mds, &policy);
> if (rc)
> return rc;
> @@ -777,6 +773,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
> + rc = cxl_mem_alloc_event_buf(mds);
> + if (rc)
> + return rc;
> +
> rc = cxl_event_irqsetup(mds, &policy);
> if (rc)
> return rc;
>

2024-03-27 17:41:29

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 12/26] cxl/pci: Factor out interrupt policy check



On 3/24/24 4:18 PM, Ira Weiny wrote:
> Dynamic capacity devices (DCD) require interrupts to notify the host of
> events in the DCD log. The interrupts for DCD may be supported despite
> FW control of memory event logs.
>
> Prepare to support DCD event interrupts separate from other event
> interrupts by factoring out the check for event interrupt settings.
>
> Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
>
> ---
> Changes for V3:
> [iweiny: new patch]
> ---
> drivers/cxl/pci.c | 23 ++++++++++++++++-------
> 1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index ccaf4ad26a4f..12cd5d399230 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -738,6 +738,21 @@ static bool cxl_event_int_is_fw(u8 setting)
> return mode == CXL_INT_FW;
> }
>
> +static bool cxl_event_validate_mem_policy(struct cxl_memdev_state *mds,
> + struct cxl_event_interrupt_policy *policy)
> +{
> + if (cxl_event_int_is_fw(policy->info_settings) ||
> + cxl_event_int_is_fw(policy->warn_settings) ||
> + cxl_event_int_is_fw(policy->failure_settings) ||
> + cxl_event_int_is_fw(policy->fatal_settings)) {
> + dev_err(mds->cxlds.dev,
> + "FW still in control of Event Logs despite _OSC settings\n");
> + return false;
> + }
> +
> + return true;
> +}
> +
> static int cxl_event_config(struct pci_host_bridge *host_bridge,
> struct cxl_memdev_state *mds, bool irq_avail)
> {
> @@ -760,14 +775,8 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
> - if (cxl_event_int_is_fw(policy.info_settings) ||
> - cxl_event_int_is_fw(policy.warn_settings) ||
> - cxl_event_int_is_fw(policy.failure_settings) ||
> - cxl_event_int_is_fw(policy.fatal_settings)) {
> - dev_err(mds->cxlds.dev,
> - "FW still in control of Event Logs despite _OSC settings\n");
> + if (!cxl_event_validate_mem_policy(mds, &policy))
> return -EBUSY;
> - }
>
> rc = cxl_event_config_msgnums(mds, &policy);
> if (rc)
>

2024-03-27 17:49:05

by fan

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

On Sun, Mar 24, 2024 at 04:18:17PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case creation of a new
> region on top of the DC partition (region) is expected to expose those
> extents for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized read the device extent list. For ease of review, this patch
> stops after reading the extent list and leaves realization of the region
> extents to a future patch.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: remove extent list xarray]
> [iweiny: Update spec references to 3.1]
> [iweiny: use struct range in extents]
> [iweiny: remove all reference tracking and let regions track extents
> through the extent devices.]
> [djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
> ---
> drivers/cxl/core/core.h | 9 +++
> drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 29 +++++++
> drivers/cxl/cxlmem.h | 49 ++++++++++++
> 4 files changed, 279 insertions(+)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 91abeffbe985..119b12362977 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,6 +4,8 @@
> #ifndef __CXL_CORE_H__
> #define __CXL_CORE_H__
>
> +#include <cxlmem.h>
> +
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> extern const struct device_type cxl_pmu_type;
> @@ -28,6 +30,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
> int cxl_region_init(void);
> void cxl_region_exit(void);
> int cxl_get_poison_by_endpoint(struct cxl_port *port);
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent);
> #else
> static inline int cxl_get_poison_by_endpoint(struct cxl_port *port)
> {
> @@ -43,6 +47,11 @@ static inline int cxl_region_init(void)
> static inline void cxl_region_exit(void)
> {
> }
> +static inline int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> #define CXL_REGION_ATTR(x) NULL
> #define CXL_REGION_TYPE(x) NULL
> #define SET_CXL_REGION_ATTR(x)
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 58b31fa47b93..9e33a0976828 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct device *dev = mds->cxlds.dev;
> + uint64_t start, len;
> +
> + start = le64_to_cpu(dc_extent->start_dpa);
> + len = le64_to_cpu(dc_extent->length);
> +
> + /* Extents must not cross region boundary's */
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + if (dcr->base <= start &&
> + (start + len) <= (dcr->base + dcr->decode_len)) {
> + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> + start, start + len - 1, i, start - dcr->base);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> + start, start + len - 1);
> + return -EINVAL;
> +}
> +
> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *extent)
> +{
> + uint64_t start = le64_to_cpu(extent->start_dpa);
> + uint64_t length = le64_to_cpu(extent->length);
> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
> + struct range ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> +
> + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
> + cxled->dpa_res, start, length);
> +
> + return range_contains(&ed_range, &ext_range);
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> @@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> + unsigned int *extent_gen_num)
> +{
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + struct cxl_mbox_get_dc_extent_out dc_extents;
> + struct cxl_mbox_cmd mbox_cmd;
> + unsigned int count;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(0),
> + .start_extent_index = cpu_to_le32(0),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = sizeof(dc_extents),
> + .payload_out = &dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + count = le32_to_cpu(dc_extents.total_extent_cnt);
> + *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> +
> + return count;
> +}
> +
> +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
> + unsigned int start_gen_num,
> + unsigned int exp_cnt)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + unsigned int start_index, total_read;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> +
> + struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
> + kvmalloc(mds->payload_size, GFP_KERNEL);
> + if (!dc_extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + start_index = 0;
> + do {
> + unsigned int nr_ext, total_extent_cnt, gen_num;
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(exp_cnt - start_index),
> + .start_extent_index = cpu_to_le32(start_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = mds->payload_size,
> + .payload_out = dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
> + total_read += nr_ext;
> + total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> + gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +
> + dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> + total_extent_cnt, gen_num);
> +
> + if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> + dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
> + gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> + return -EIO;
> + }
> +
> + for (int i = 0; i < nr_ext ; i++) {
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + start_index + i, exp_cnt);
> + rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
> + if (rc)
> + continue;
> + if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
> + continue;
> + rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
> + if (rc)
> + return rc;
> + }
> +
> + start_index += nr_ext;
> + } while (exp_cnt > total_read);
> +
> + return 0;
> +}
> +
> +/**
> + * cxl_read_dc_extents() - Read any existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add any existing extents found which belong to this decoder.
> + *
> + * Return: 0 if command was executed successfully, -ERRNO on error.
> + */
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + unsigned int extent_gen_num;
> + int rc;
> +
> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD unsupported\n");
> + return 0;
> + }
> +
> + rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> + dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",
> + rc, extent_gen_num);
> + if (rc <= 0) /* 0 == no records found */
> + return rc;
> +
> + return cxl_dev_get_dc_extents(cxled, extent_gen_num, rc);

Not sure about the behaviour here. From the cxl_dev_get_dc_extents
implementation below, if gen_num changed or the expected extent count
changed, it will return error.
If I understand it correctly, if the above two values change, it means
the extent list has been updated due to extent add/release since last
time we read the extent list info (cxl_dev_get_dc_extent_cnt), do we
need to fail the operation or try again?

Fan

> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_read_dc_extents, CXL);
> +
> static int add_dpa_res(struct device *dev, struct resource *parent,
> struct resource *res, resource_size_t start,
> resource_size_t size, const char *type)
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 0d7b09a49dcf..3e563ab29afe 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,6 +1450,13 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> +
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder *cxled,
> @@ -2773,6 +2780,22 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
> return rc;
> }
>
> +static int cxl_region_read_extents(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int i;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + int rc;
> +
> + rc = cxl_read_dc_extents(p->targets[i]);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> @@ -2807,6 +2830,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> dev_name(dev));
>
> + if (cxlr->mode == CXL_REGION_DC) {
> + rc = cxl_region_read_extents(cxlr);
> + if (rc)
> + goto err;
> + }
> +
> return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> cxlr_dax);
> err:
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 01bee6eedff3..8f2d8944d334 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -604,6 +604,54 @@ enum cxl_opcode {
> UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
> 0x40, 0x3d, 0x86)
>
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> + __le32 extent_list_size;
> + u8 flags;
> + u8 reserved[3];
> + struct updated_extent_list {
> + __le64 dpa_start;
> + __le64 length;
> + u8 reserved[8];
> + } __packed extent_list[];
> +} __packed;
> +
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +#define CXL_DC_EXTENT_TAG_LEN 0x10
> +struct cxl_dc_extent {
> + __le64 start_dpa;
> + __le64 length;
> + u8 tag[CXL_DC_EXTENT_TAG_LEN];
> + __le16 shared_extn_seq;
> + u8 reserved[6];
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Input Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
> + */
> +struct cxl_mbox_get_dc_extent_in {
> + __le32 extent_cnt;
> + __le32 start_extent_index;
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_dc_extent_out {
> + __le32 ret_extent_cnt;
> + __le32 total_extent_cnt;
> + __le32 extent_list_num;
> + u8 rsvd[4];
> + struct cxl_dc_extent extent[];
> +} __packed;
> +
> struct cxl_mbox_get_supported_logs {
> __le16 entries;
> u8 rsvd[6];
> @@ -879,6 +927,7 @@ int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
>
> --
> 2.44.0
>

2024-03-27 18:05:12

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 10/26] cxl/events: Factor out event msgnum configuration



On 3/24/24 4:18 PM, Ira Weiny wrote:
> Dynamic Capacity Devices (DCD) require events to process extent addition
> or removal. BIOS may have control over memory event processing.
>
> Factor out cxl_event_config_msgnums() in preparation for setting up DCD
> event interrupts separate from memory events.
>
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Dave Jiang <[email protected]>
> ---
> drivers/cxl/pci.c | 24 ++++++++++++------------
> 1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 216881455364..cedd9b05f129 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -698,35 +698,31 @@ static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
> return cxl_event_get_int_policy(mds, policy);
> }
>
> -static int cxl_event_irqsetup(struct cxl_memdev_state *mds)
> +static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
> + struct cxl_event_interrupt_policy *policy)
> {
> struct cxl_dev_state *cxlds = &mds->cxlds;
> - struct cxl_event_interrupt_policy policy;
> int rc;
>
> - rc = cxl_event_config_msgnums(mds, &policy);
> - if (rc)
> - return rc;
> -
> - rc = cxl_event_req_irq(cxlds, policy.info_settings);
> + rc = cxl_event_req_irq(cxlds, policy->info_settings);
> if (rc) {
> dev_err(cxlds->dev, "Failed to get interrupt for event Info log\n");
> return rc;
> }
>
> - rc = cxl_event_req_irq(cxlds, policy.warn_settings);
> + rc = cxl_event_req_irq(cxlds, policy->warn_settings);
> if (rc) {
> dev_err(cxlds->dev, "Failed to get interrupt for event Warn log\n");
> return rc;
> }
>
> - rc = cxl_event_req_irq(cxlds, policy.failure_settings);
> + rc = cxl_event_req_irq(cxlds, policy->failure_settings);
> if (rc) {
> dev_err(cxlds->dev, "Failed to get interrupt for event Failure log\n");
> return rc;
> }
>
> - rc = cxl_event_req_irq(cxlds, policy.fatal_settings);
> + rc = cxl_event_req_irq(cxlds, policy->fatal_settings);
> if (rc) {
> dev_err(cxlds->dev, "Failed to get interrupt for event Fatal log\n");
> return rc;
> @@ -745,7 +741,7 @@ static bool cxl_event_int_is_fw(u8 setting)
> static int cxl_event_config(struct pci_host_bridge *host_bridge,
> struct cxl_memdev_state *mds, bool irq_avail)
> {
> - struct cxl_event_interrupt_policy policy;
> + struct cxl_event_interrupt_policy policy = { 0 };
> int rc;
>
> /*
> @@ -777,7 +773,11 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> return -EBUSY;
> }
>
> - rc = cxl_event_irqsetup(mds);
> + rc = cxl_event_config_msgnums(mds, &policy);
> + if (rc)
> + return rc;
> +
> + rc = cxl_event_irqsetup(mds, &policy);
> if (rc)
> return rc;
>
>

2024-03-27 18:29:51

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts



On 3/24/24 4:18 PM, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism. The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications.
>
> Firmware can't configure DCD events to be FW controlled but can retain
> control of memory events. Split irq configuration of memory events and
> DCD events to allow for FW control of memory events while DCD is host
> controlled.
>
> Configure DCD event log interrupts on devices supporting dynamic
> capacity. Disable DCD if interrupts are not supported.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

A few minor comments. The rest LGTM.
>
> ---
> Changes for v1
> [iweiny: rebase to upstream irq code]
> [iweiny: disable DCD if irqs not supported]
> ---
> drivers/cxl/core/mbox.c | 9 ++++++-
> drivers/cxl/cxl.h | 4 ++-
> drivers/cxl/cxlmem.h | 4 +++
> drivers/cxl/pci.c | 71 ++++++++++++++++++++++++++++++++++++++++---------
> 4 files changed, 74 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 14e8a7528a8b..58b31fa47b93 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1323,10 +1323,17 @@ static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> return rc;
> }
>
> -static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> {
> return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> }
> +EXPORT_SYMBOL_NS_GPL(cxl_dcd_supported, CXL);
> +
> +void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> + clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_disable_dcd, CXL);

Should these one-liners just go into a header file?

>
> /**
> * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 15d418b3bc9b..d585f5fdd3ae 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -164,11 +164,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
> #define CXLDEV_EVENT_STATUS_WARN BIT(1)
> #define CXLDEV_EVENT_STATUS_FAIL BIT(2)
> #define CXLDEV_EVENT_STATUS_FATAL BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD BIT(4)

extra tab?

DJ

>
> #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
> CXLDEV_EVENT_STATUS_WARN | \
> CXLDEV_EVENT_STATUS_FAIL | \
> - CXLDEV_EVENT_STATUS_FATAL)
> + CXLDEV_EVENT_STATUS_FATAL| \
> + CXLDEV_EVENT_STATUS_DCD)
>
> /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
> #define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 4624cf612c1e..01bee6eedff3 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -225,7 +225,9 @@ struct cxl_event_interrupt_policy {
> u8 warn_settings;
> u8 failure_settings;
> u8 fatal_settings;
> + u8 dcd_settings;
> } __packed;
> +#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
>
> /**
> * struct cxl_event_state - Event log driver state
> @@ -890,6 +892,8 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> const uuid_t *uuid, union cxl_event *evt);
> +bool cxl_dcd_supported(struct cxl_memdev_state *mds);
> +void cxl_disable_dcd(struct cxl_memdev_state *mds);
> int cxl_set_timestamp(struct cxl_memdev_state *mds);
> int cxl_poison_state_init(struct cxl_memdev_state *mds);
> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 12cd5d399230..ef482eae09e9 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -669,22 +669,33 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> }
>
> static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
> - struct cxl_event_interrupt_policy *policy)
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> {
> struct cxl_mbox_cmd mbox_cmd;
> + size_t size_in;
> int rc;
>
> - *policy = (struct cxl_event_interrupt_policy) {
> - .info_settings = CXL_INT_MSI_MSIX,
> - .warn_settings = CXL_INT_MSI_MSIX,
> - .failure_settings = CXL_INT_MSI_MSIX,
> - .fatal_settings = CXL_INT_MSI_MSIX,
> - };
> + if (native_cxl) {
> + *policy = (struct cxl_event_interrupt_policy) {
> + .info_settings = CXL_INT_MSI_MSIX,
> + .warn_settings = CXL_INT_MSI_MSIX,
> + .failure_settings = CXL_INT_MSI_MSIX,
> + .fatal_settings = CXL_INT_MSI_MSIX,
> + .dcd_settings = 0,
> + };
> + }
> + size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;
> +
> + if (cxl_dcd_supported(mds)) {
> + policy->dcd_settings = CXL_INT_MSI_MSIX;
> + size_in += sizeof(policy->dcd_settings);
> + }
>
> mbox_cmd = (struct cxl_mbox_cmd) {
> .opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
> .payload_in = policy,
> - .size_in = sizeof(*policy),
> + .size_in = size_in,
> };
>
> rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> @@ -731,6 +742,31 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
> return 0;
> }
>
> +static int cxl_irqsetup(struct cxl_memdev_state *mds,
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> +{
> + struct cxl_dev_state *cxlds = &mds->cxlds;
> + int rc;
> +
> + if (native_cxl) {
> + rc = cxl_event_irqsetup(mds, policy);
> + if (rc)
> + return rc;
> + }
> +
> + if (cxl_dcd_supported(mds)) {
> + rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
> + if (rc) {
> + dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
> + cxl_disable_dcd(mds);
> + return rc;
> + }
> + }
> +
> + return 0;
> +}
> +
> static bool cxl_event_int_is_fw(u8 setting)
> {
> u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
> @@ -757,17 +793,25 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> struct cxl_memdev_state *mds, bool irq_avail)
> {
> struct cxl_event_interrupt_policy policy = { 0 };
> + bool native_cxl = host_bridge->native_cxl_error;
> int rc;
>
> /*
> * When BIOS maintains CXL error reporting control, it will process
> * event records. Only one agent can do so.
> + *
> + * If BIOS has control of events and DCD is not supported skip event
> + * configuration.
> */
> - if (!host_bridge->native_cxl_error)
> + if (!native_cxl && !cxl_dcd_supported(mds))
> return 0;
>
> if (!irq_avail) {
> dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
> + if (cxl_dcd_supported(mds)) {
> + dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");
> + cxl_disable_dcd(mds);
> + }
> return 0;
> }
>
> @@ -775,10 +819,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
> - if (!cxl_event_validate_mem_policy(mds, &policy))
> + if (native_cxl && !cxl_event_validate_mem_policy(mds, &policy))
> return -EBUSY;
>
> - rc = cxl_event_config_msgnums(mds, &policy);
> + rc = cxl_event_config_msgnums(mds, &policy, native_cxl);
> if (rc)
> return rc;
>
> @@ -786,12 +830,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> if (rc)
> return rc;
>
> - rc = cxl_event_irqsetup(mds, &policy);
> + rc = cxl_irqsetup(mds, &policy, native_cxl);
> if (rc)
> return rc;
>
> cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
>
> + dev_dbg(mds->cxlds.dev, "Event config : %d %d\n",
> + native_cxl, cxl_dcd_supported(mds));
> +
> return 0;
> }
>
>

2024-03-27 18:37:08

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation



On 3/24/24 4:18 PM, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case creation of a new
> region on top of the DC partition (region) is expected to expose those
> extents for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized read the device extent list. For ease of review, this patch
> stops after reading the extent list and leaves realization of the region
> extents to a future patch.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: remove extent list xarray]
> [iweiny: Update spec references to 3.1]
> [iweiny: use struct range in extents]
> [iweiny: remove all reference tracking and let regions track extents
> through the extent devices.]
> [djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
> ---
> drivers/cxl/core/core.h | 9 +++
> drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 29 +++++++
> drivers/cxl/cxlmem.h | 49 ++++++++++++
> 4 files changed, 279 insertions(+)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 91abeffbe985..119b12362977 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,6 +4,8 @@
> #ifndef __CXL_CORE_H__
> #define __CXL_CORE_H__
>
> +#include <cxlmem.h>
> +
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> extern const struct device_type cxl_pmu_type;
> @@ -28,6 +30,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
> int cxl_region_init(void);
> void cxl_region_exit(void);
> int cxl_get_poison_by_endpoint(struct cxl_port *port);
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent);
> #else
> static inline int cxl_get_poison_by_endpoint(struct cxl_port *port)
> {
> @@ -43,6 +47,11 @@ static inline int cxl_region_init(void)
> static inline void cxl_region_exit(void)
> {
> }
> +static inline int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> #define CXL_REGION_ATTR(x) NULL
> #define CXL_REGION_TYPE(x) NULL
> #define SET_CXL_REGION_ATTR(x)
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 58b31fa47b93..9e33a0976828 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct device *dev = mds->cxlds.dev;
> + uint64_t start, len;
u64


> +
> + start = le64_to_cpu(dc_extent->start_dpa);
> + len = le64_to_cpu(dc_extent->length);
> +
> + /* Extents must not cross region boundary's */
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + if (dcr->base <= start &&
> + (start + len) <= (dcr->base + dcr->decode_len)) {

Can range_contains() be used here as well?

> + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> + start, start + len - 1, i, start - dcr->base);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> + start, start + len - 1);
> + return -EINVAL;
> +}
> +
> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,

cxl_dc_extent_in_endpoint_decoder() is more readable

> + struct cxl_dc_extent *extent)
> +{
> + uint64_t start = le64_to_cpu(extent->start_dpa);
> + uint64_t length = le64_to_cpu(extent->length);
u64


> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
> + struct range ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> +
> + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
> + cxled->dpa_res, start, length);
> +
> + return range_contains(&ed_range, &ext_range);
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> @@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,

cxl_dev_get_dc_extent_generation()? or spell out count

DJ

> + unsigned int *extent_gen_num)
> +{
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + struct cxl_mbox_get_dc_extent_out dc_extents;
> + struct cxl_mbox_cmd mbox_cmd;
> + unsigned int count;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(0),
> + .start_extent_index = cpu_to_le32(0),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = sizeof(dc_extents),
> + .payload_out = &dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + count = le32_to_cpu(dc_extents.total_extent_cnt);
> + *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> +
> + return count;
> +}
> +
> +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
> + unsigned int start_gen_num,
> + unsigned int exp_cnt)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + unsigned int start_index, total_read;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> +
> + struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
> + kvmalloc(mds->payload_size, GFP_KERNEL);
> + if (!dc_extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + start_index = 0;
> + do {
> + unsigned int nr_ext, total_extent_cnt, gen_num;
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(exp_cnt - start_index),
> + .start_extent_index = cpu_to_le32(start_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = mds->payload_size,
> + .payload_out = dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
> + total_read += nr_ext;
> + total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> + gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +
> + dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> + total_extent_cnt, gen_num);
> +
> + if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> + dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
> + gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> + return -EIO;
> + }
> +
> + for (int i = 0; i < nr_ext ; i++) {
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + start_index + i, exp_cnt);
> + rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
> + if (rc)
> + continue;
> + if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
> + continue;
> + rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
> + if (rc)
> + return rc;
> + }
> +
> + start_index += nr_ext;
> + } while (exp_cnt > total_read);
> +
> + return 0;
> +}
> +
> +/**
> + * cxl_read_dc_extents() - Read any existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add any existing extents found which belong to this decoder.
> + *
> + * Return: 0 if command was executed successfully, -ERRNO on error.
> + */
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + unsigned int extent_gen_num;
> + int rc;
> +
> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD unsupported\n");
> + return 0;
> + }
> +
> + rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> + dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",
> + rc, extent_gen_num);
> + if (rc <= 0) /* 0 == no records found */
> + return rc;
> +
> + return cxl_dev_get_dc_extents(cxled, extent_gen_num, rc);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_read_dc_extents, CXL);
> +
> static int add_dpa_res(struct device *dev, struct resource *parent,
> struct resource *res, resource_size_t start,
> resource_size_t size, const char *type)
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 0d7b09a49dcf..3e563ab29afe 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,6 +1450,13 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> +
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder *cxled,
> @@ -2773,6 +2780,22 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
> return rc;
> }
>
> +static int cxl_region_read_extents(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int i;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + int rc;
> +
> + rc = cxl_read_dc_extents(p->targets[i]);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> @@ -2807,6 +2830,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> dev_name(dev));
>
> + if (cxlr->mode == CXL_REGION_DC) {
> + rc = cxl_region_read_extents(cxlr);
> + if (rc)
> + goto err;
> + }
> +
> return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> cxlr_dax);
> err:
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 01bee6eedff3..8f2d8944d334 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -604,6 +604,54 @@ enum cxl_opcode {
> UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
> 0x40, 0x3d, 0x86)
>
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> + __le32 extent_list_size;
> + u8 flags;
> + u8 reserved[3];
> + struct updated_extent_list {
> + __le64 dpa_start;
> + __le64 length;
> + u8 reserved[8];
> + } __packed extent_list[];
> +} __packed;
> +
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +#define CXL_DC_EXTENT_TAG_LEN 0x10
> +struct cxl_dc_extent {
> + __le64 start_dpa;
> + __le64 length;
> + u8 tag[CXL_DC_EXTENT_TAG_LEN];
> + __le16 shared_extn_seq;
> + u8 reserved[6];
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Input Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
> + */
> +struct cxl_mbox_get_dc_extent_in {
> + __le32 extent_cnt;
> + __le32 start_extent_index;
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_dc_extent_out {
> + __le32 ret_extent_cnt;
> + __le32 total_extent_cnt;
> + __le32 extent_list_num;
> + u8 rsvd[4];
> + struct cxl_dc_extent extent[];
> +} __packed;
> +
> struct cxl_mbox_get_supported_logs {
> __le16 entries;
> u8 rsvd[6];
> @@ -879,6 +927,7 @@ int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
>

2024-03-27 22:34:52

by fan

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices

On Sun, Mar 24, 2024 at 04:18:19PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Once all extents of an interleave set are present a region must
> surface an extent to the region.
>
> Without interleaving; endpoint decoder and region extents have a 1:1
> relationship. Future support for IW > 1 will maintain a N:1
> relationship between the device extents and region extents.
>
> Create a region extent device for every device extent found. Release of
> the extent device triggers a response to the underlying hardware extent.
>
> There is no strong use case to support the addition of extents which
> overlap previously accepted extent ranges. Reject such new extents
> until such time as a good use case emerges.
>
> Expose the necessary details of region extents by creating the following
> sysfs entries.
>
> /sys/bus/cxl/devices/dax_regionX/extentY
> /sys/bus/cxl/devices/dax_regionX/extentY/offset
> /sys/bus/cxl/devices/dax_regionX/extentY/length
> /sys/bus/cxl/devices/dax_regionX/extentY/label
>
> The use of the extent devices by the DAX layer is deferred to later
> patches.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>

Minor comments inline,

> ---
> Changes for v1
> [iweiny: new patch]
> [iweiny: Rename 'dr_extent' to 'region_extent']
> ---
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/extent.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 43 +++++++++++++++
> drivers/cxl/core/region.c | 76 +++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 37 +++++++++++++
> tools/testing/cxl/Kbuild | 1 +
> 6 files changed, 290 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..35c5c76bfcf1 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -14,5 +14,6 @@ cxl_core-y += pci.o
> cxl_core-y += hdm.o
> cxl_core-y += pmu.o
> cxl_core-y += cdat.o
> +cxl_core-y += extent.o
> cxl_core-$(CONFIG_TRACING) += trace.o
> cxl_core-$(CONFIG_CXL_REGION) += region.o
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..487c220f1c3c
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <cxl.h>
> +
> +static DEFINE_IDA(cxl_extent_ida);
> +
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + return sysfs_emit(buf, "%pa\n", &reg_ext->hpa_range.start);
> +}
> +static DEVICE_ATTR_RO(offset);
> +
> +static ssize_t length_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> + u64 length = range_len(&reg_ext->hpa_range);
> +
> + return sysfs_emit(buf, "%pa\n", &length);
> +}
> +static DEVICE_ATTR_RO(length);
> +
> +static ssize_t label_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + return sysfs_emit(buf, "%s\n", reg_ext->label);
> +}
> +static DEVICE_ATTR_RO(label);
> +
> +static struct attribute *region_extent_attrs[] = {
> + &dev_attr_offset.attr,
> + &dev_attr_length.attr,
> + &dev_attr_label.attr,
> + NULL,
> +};
> +
> +static const struct attribute_group region_extent_attribute_group = {
> + .attrs = region_extent_attrs,
> +};
> +
> +static const struct attribute_group *region_extent_attribute_groups[] = {
> + &region_extent_attribute_group,
> + NULL,
> +};
> +
> +static void region_extent_release(struct device *dev)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + cxl_release_ed_extent(&reg_ext->ed_ext);
> + ida_free(&cxl_extent_ida, reg_ext->dev.id);
> + kfree(reg_ext);
> +}
> +
> +static const struct device_type region_extent_type = {
> + .name = "extent",
> + .release = region_extent_release,
> + .groups = region_extent_attribute_groups,
> +};
> +
> +bool is_region_extent(struct device *dev)
> +{
> + return dev->type == &region_extent_type;
> +}
> +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL);
> +
> +static void region_extent_unregister(void *ext)
> +{
> + struct region_extent *reg_ext = ext;
> +
> + dev_dbg(&reg_ext->dev, "DAX region rm extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> + device_unregister(&reg_ext->dev);
> +}
> +
> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct region_extent *reg_ext;
> + struct device *dev;
> + int rc, id;
> +
> + id = ida_alloc(&cxl_extent_ida, GFP_KERNEL);
> + if (id < 0)
> + return -ENOMEM;
> +
> + reg_ext = kzalloc(sizeof(*reg_ext), GFP_KERNEL);
> + if (!reg_ext)
> + return -ENOMEM;
> +
> + reg_ext->hpa_range = *hpa_range;
> + reg_ext->ed_ext.dpa_range = *dpa_range;
> + reg_ext->ed_ext.cxled = cxled;
> + snprintf(reg_ext->label, DAX_EXTENT_LABEL_LEN, "%s", label);
> +
> + dev = &reg_ext->dev;
> + device_initialize(dev);
> + dev->id = id;
> + device_set_pm_not_required(dev);
> + dev->parent = &cxlr_dax->dev;
> + dev->type = &region_extent_type;
> + rc = dev_set_name(dev, "extent%d", dev->id);
> + if (rc)
> + goto err;
> +
> + rc = device_add(dev);
> + if (rc)
> + goto err;
> +
> + dev_dbg(dev, "DAX region extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
> + reg_ext);
> +
> +err:
> + dev_err(&cxlr_dax->dev, "Failed to initialize DAX extent dev HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + put_device(dev);
> + return rc;
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9e33a0976828..6b00e717e42b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> + struct range *extent, int opcode)
> +{
> + struct cxl_mbox_cmd mbox_cmd;
> + size_t size;
> +
> + struct cxl_mbox_dc_response *dc_res __free(kfree);
> + size = struct_size(dc_res, extent_list, 1);
> + dc_res = kzalloc(size, GFP_KERNEL);
> + if (!dc_res)
> + return -ENOMEM;
> +
> + dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
> + memset(dc_res->extent_list[0].reserved, 0, 8);

The space has already been zeroed with kzalloc.

Fan

> + dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
> + dc_res->extent_list_size = cpu_to_le32(1);
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = size,
> + .payload_in = dc_res,
> + };
> +
> + return cxl_internal_send_cmd(mds, &mbox_cmd);
> +}
> +
> static struct cxl_memdev_state *
> cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> {
> @@ -1029,6 +1055,23 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> return container_of(cxlds, struct cxl_memdev_state, cxlds);
> }
>
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent)
> +{
> + struct cxl_endpoint_decoder *cxled = extent->cxled;
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + dev_dbg(dev, "Releasing DC extent DPA %#llx - %#llx\n",
> + extent->dpa_range.start, extent->dpa_range.end);
> +
> + rc = cxl_send_dc_cap_response(mds, &extent->dpa_range, CXL_MBOX_OP_RELEASE_DC);
> + if (rc)
> + dev_dbg(dev, "Failed to respond releasing extent DPA %#llx - %#llx; %d\n",
> + extent->dpa_range.start, extent->dpa_range.end, rc);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_release_ed_extent, CXL);
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3e563ab29afe..7635ff109578 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,11 +1450,81 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +static int extent_check_overlap(struct device *dev, void *arg)
> +{
> + struct range *new_range = arg;
> + struct region_extent *ext;
> +
> + if (!is_region_extent(dev))
> + return 0;
> +
> + ext = to_region_extent(dev);
> + return range_overlaps(&ext->hpa_range, new_range);
> +}
> +
> +static int extent_overlaps(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range)
> +{
> + struct device *dev __free(put_device) =
> + device_find_child(&cxlr_dax->dev, hpa_range, extent_check_overlap);
> +
> + if (dev)
> + return -EINVAL;
> + return 0;
> +}
> +
> /* Callers are expected to ensure cxled has been attached to a region */
> int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> struct cxl_dc_extent *dc_extent)
> {
> - return 0;
> + struct cxl_region *cxlr = cxled->cxld.region;
> + struct range ext_dpa_range, ext_hpa_range;
> + struct device *dev = &cxlr->dev;
> + resource_size_t dpa_offset, hpa;
> +
> + /*
> + * Interleave ways == 1 means this coresponds to a 1:1 mapping between
> + * device extents and DAX region extents. Future implementations
> + * should hold DC region extents here until the full dax region extent
> + * can be realized.
> + */
> + if (cxlr->params.interleave_ways != 1) {
> + dev_err(dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> + ext_dpa_range = (struct range) {
> + .start = le64_to_cpu(dc_extent->start_dpa),
> + .end = le64_to_cpu(dc_extent->start_dpa) +
> + le64_to_cpu(dc_extent->length) - 1,
> + };
> +
> + dev_dbg(dev, "Adding DC extent DPA %#llx - %#llx\n",
> + ext_dpa_range.start, ext_dpa_range.end);
> +
> + /*
> + * Without interleave...
> + * HPA offset == DPA offset
> + * ... but do the math anyway
> + */
> + dpa_offset = ext_dpa_range.start - cxled->dpa_res->start;
> + hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> + ext_hpa_range = (struct range) {
> + .start = hpa - cxlr->cxlr_dax->hpa_range.start,
> + .end = ext_hpa_range.start + range_len(&ext_dpa_range) - 1,
> + };
> +
> + if (extent_overlaps(cxlr->cxlr_dax, &ext_hpa_range))
> + return -EINVAL;
> +
> + dev_dbg(dev, "Realizing region extent at HPA %#llx - %#llx\n",
> + ext_hpa_range.start, ext_hpa_range.end);
> +
> + return dax_region_create_ext(cxlr->cxlr_dax, &ext_hpa_range,
> + (char *)dc_extent->tag,
> + &ext_dpa_range,
> + cxled);
> }
>
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> @@ -2684,6 +2754,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>
> dev = &cxlr_dax->dev;
> cxlr_dax->cxlr = cxlr;
> + cxlr->cxlr_dax = cxlr_dax;
> device_initialize(dev);
> lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> device_set_pm_not_required(dev);
> @@ -2799,7 +2870,10 @@ static int cxl_region_read_extents(struct cxl_region *cxlr)
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> + struct cxl_region *cxlr = cxlr_dax->cxlr;
>
> + cxlr->cxlr_dax = NULL;
> + cxlr_dax->cxlr = NULL;
> device_unregister(&cxlr_dax->dev);
> }
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d585f5fdd3ae..5379ad7f5852 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -564,6 +564,7 @@ struct cxl_region_params {
> * @type: Endpoint decoder target type
> * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
> * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
> * @flags: Region state flags
> * @params: active + config params for the region
> */
> @@ -574,6 +575,7 @@ struct cxl_region {
> enum cxl_decoder_type type;
> struct cxl_nvdimm_bridge *cxl_nvb;
> struct cxl_pmem_region *cxlr_pmem;
> + struct cxl_dax_region *cxlr_dax;
> unsigned long flags;
> struct cxl_region_params params;
> };
> @@ -617,6 +619,41 @@ struct cxl_dax_region {
> struct range hpa_range;
> };
>
> +/**
> + * struct cxl_ed_extent - Extent within an endpoint decoder
> + * @dpa_range: DPA range this extent covers within the decoder
> + * @cxled: reference to the endpoint decoder
> + */
> +struct cxl_ed_extent {
> + struct range dpa_range;
> + struct cxl_endpoint_decoder *cxled;
> +};
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent);
> +
> +/**
> + * struct region_extent - CXL DAX region extent
> + * @dev: device representing this extent
> + * @hpa_range: HPA range of this extent
> + * @label: label of the extent
> + * @ed_ext: Endpoint decoder extent which backs this extent
> + */
> +#define DAX_EXTENT_LABEL_LEN 64
> +struct region_extent {
> + struct device dev;
> + struct range hpa_range;
> + char label[DAX_EXTENT_LABEL_LEN];
> + struct cxl_ed_extent ed_ext;
> +};
> +
> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled);
> +
> +bool is_region_extent(struct device *dev);
> +#define to_region_extent(dev) container_of(dev, struct region_extent, dev)
> +
> /**
> * struct cxl_port - logical collection of upstream port devices and
> * downstream port devices to construct a CXL memory
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 030b388800f0..dc0cc1d5e6a0 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -60,6 +60,7 @@ cxl_core-y += $(CXL_CORE_SRC)/pci.o
> cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> cxl_core-y += $(CXL_CORE_SRC)/pmu.o
> cxl_core-y += $(CXL_CORE_SRC)/cdat.o
> +cxl_core-y += $(CXL_CORE_SRC)/extent.o
> cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
> cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
> cxl_core-y += config_check.o
>
> --
> 2.44.0
>

2024-03-28 05:21:08

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

fan wrote:
> On Sun, Mar 24, 2024 at 04:18:03PM -0700, [email protected] wrote:
> > A git tree of this series can be found here:

[snip]

> >
>
> Hi Ira,
> Have not got a chance to check the code yet, but I noticed one thing
> when testing with my DCD emulation code.
> Currently, if we do partial release, it seems the whole extent will be
> removed. Is it designed intentionally?
>

Yes that is my intent. I specifically called that out in patch 18.

https://lore.kernel.org/all/[email protected]/

I thought we discussed this in one of the collaboration calls. Mainly
this is to simplify by not attempting any split of the extents the host is
tracking. It really is expected that the FM/device is going to keep those
extents offered and release them in their entirety. I understand this may
complicate the device because it may see a release of memory prior to the
request of that release. And perhaps this complicates the device. But in
that case it (or the FM really) should not attempt to release partial
extents.

Ira

[snip]

2024-03-28 05:23:09

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

Davidlohr Bueso wrote:
> On Sun, 24 Mar 2024, [email protected] wrote:
>
> >From: Navneet Singh <[email protected]>
> >
> >Until now region modes and decoder modes were equivalent in that they
> >were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
> >regions (which will represent an array of device regions [better named
> >partitions] the index of which could be different on different
> >interleaved devices), the mode of an endpoint decoder and a region will
> >no longer be equivalent.
> >
> >Define a new region mode enumeration and adjust the code for it.
>
> Could this could also be picked up regardless of dcd?

It could but there is no need for it without DCD.

I will work on re-ordering the cleanups if Dave will agree to take them
early.

Ira

2024-03-28 20:09:50

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode



On 3/27/24 10:22 PM, Ira Weiny wrote:
> Davidlohr Bueso wrote:
>> On Sun, 24 Mar 2024, [email protected] wrote:
>>
>>> From: Navneet Singh <[email protected]>
>>>
>>> Until now region modes and decoder modes were equivalent in that they
>>> were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
>>> regions (which will represent an array of device regions [better named
>>> partitions] the index of which could be different on different
>>> interleaved devices), the mode of an endpoint decoder and a region will
>>> no longer be equivalent.
>>>
>>> Define a new region mode enumeration and adjust the code for it.
>>
>> Could this could also be picked up regardless of dcd?
>
> It could but there is no need for it without DCD.
>
> I will work on re-ordering the cleanups if Dave will agree to take them
> early.

There's no reason for the change unless it comes with DCD right? And probably no urgent need to taking it ahead then?
>
> Ira

2024-03-28 20:10:38

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 15/26] range: Add range_overlaps()



On 3/24/24 4:18 PM, Ira Weiny wrote:
> Code to support CXL Dynamic Capacity devices will have extent ranges
> which need to be compared for intersection not a subset as is being
> checked in range_contains().
>
> range_overlaps() is defined in btrfs with a different meaning from what
> is required in the standard range code. Dan Williams pointed this out
> in [1]. Adjust the btrfs call according to his suggestion there.
>
> Then add a generic range_overlaps().
>
> Cc: Dan Williams <[email protected]>
> Cc: Chris Mason <[email protected]>
> Cc: Josef Bacik <[email protected]>
> Cc: David Sterba <[email protected]>
> Cc: [email protected]
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Dave Jiang <[email protected]>
>
> [1] https://lore.kernel.org/all/[email protected]/
> ---
> fs/btrfs/ordered-data.c | 10 +++++-----
> include/linux/range.h | 7 +++++++
> 2 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 59850dc17b22..032d30a49edc 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -111,8 +111,8 @@ static struct rb_node *__tree_search(struct rb_root *root, u64 file_offset,
> return NULL;
> }
>
> -static int range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
> - u64 len)
> +static int btrfs_range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
> + u64 len)
> {
> if (file_offset + len <= entry->file_offset ||
> entry->file_offset + entry->num_bytes <= file_offset)
> @@ -914,7 +914,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
>
> while (1) {
> entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> break;
>
> if (entry->file_offset >= file_offset + len) {
> @@ -1043,12 +1043,12 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
> }
> if (prev) {
> entry = rb_entry(prev, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> goto out;
> }
> if (next) {
> entry = rb_entry(next, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> goto out;
> }
> /* No ordered extent in the range */
> diff --git a/include/linux/range.h b/include/linux/range.h
> index 6ad0b73cb7ad..9a46f3212965 100644
> --- a/include/linux/range.h
> +++ b/include/linux/range.h
> @@ -13,11 +13,18 @@ static inline u64 range_len(const struct range *range)
> return range->end - range->start + 1;
> }
>
> +/* True if r1 completely contains r2 */
> static inline bool range_contains(struct range *r1, struct range *r2)
> {
> return r1->start <= r2->start && r1->end >= r2->end;
> }
>
> +/* True if any part of r1 overlaps r2 */
> +static inline bool range_overlaps(struct range *r1, struct range *r2)
> +{
> + return r1->start <= r2->end && r1->end >= r2->start;
> +}
> +
> int add_range(struct range *range, int az, int nr_range,
> u64 start, u64 end);
>
>

2024-03-28 21:11:39

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices



On 3/24/24 4:18 PM, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Once all extents of an interleave set are present a region must
> surface an extent to the region.
>
> Without interleaving; endpoint decoder and region extents have a 1:1
> relationship. Future support for IW > 1 will maintain a N:1
> relationship between the device extents and region extents.
>
> Create a region extent device for every device extent found. Release of
> the extent device triggers a response to the underlying hardware extent.
>
> There is no strong use case to support the addition of extents which
> overlap previously accepted extent ranges. Reject such new extents
> until such time as a good use case emerges.
>
> Expose the necessary details of region extents by creating the following
> sysfs entries.
>
> /sys/bus/cxl/devices/dax_regionX/extentY
> /sys/bus/cxl/devices/dax_regionX/extentY/offset
> /sys/bus/cxl/devices/dax_regionX/extentY/length
> /sys/bus/cxl/devices/dax_regionX/extentY/label
>
> The use of the extent devices by the DAX layer is deferred to later
> patches.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [iweiny: new patch]
> [iweiny: Rename 'dr_extent' to 'region_extent']
> ---
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/extent.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 43 +++++++++++++++
> drivers/cxl/core/region.c | 76 +++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 37 +++++++++++++
> tools/testing/cxl/Kbuild | 1 +
> 6 files changed, 290 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..35c5c76bfcf1 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -14,5 +14,6 @@ cxl_core-y += pci.o
> cxl_core-y += hdm.o
> cxl_core-y += pmu.o
> cxl_core-y += cdat.o
> +cxl_core-y += extent.o
> cxl_core-$(CONFIG_TRACING) += trace.o
> cxl_core-$(CONFIG_CXL_REGION) += region.o
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..487c220f1c3c
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <cxl.h>
> +
> +static DEFINE_IDA(cxl_extent_ida);

According to Documentation/core-api/idr.rst, IDR interface is deprecated and xarray usage is preferred.

> +
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> + char *buf)

Parameter alignment a bit off here? and some of the other functions as well.

> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + return sysfs_emit(buf, "%pa\n", &reg_ext->hpa_range.start);
> +}
> +static DEVICE_ATTR_RO(offset);
> +
> +static ssize_t length_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> + u64 length = range_len(&reg_ext->hpa_range);
> +
> + return sysfs_emit(buf, "%pa\n", &length);
> +}
> +static DEVICE_ATTR_RO(length);
> +
> +static ssize_t label_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + return sysfs_emit(buf, "%s\n", reg_ext->label);
> +}
> +static DEVICE_ATTR_RO(label);
> +
> +static struct attribute *region_extent_attrs[] = {
> + &dev_attr_offset.attr,
> + &dev_attr_length.attr,
> + &dev_attr_label.attr,
> + NULL,
> +};
> +
> +static const struct attribute_group region_extent_attribute_group = {
> + .attrs = region_extent_attrs,
> +};
> +
> +static const struct attribute_group *region_extent_attribute_groups[] = {
> + &region_extent_attribute_group,
> + NULL,
> +};
> +
> +static void region_extent_release(struct device *dev)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + cxl_release_ed_extent(&reg_ext->ed_ext);
> + ida_free(&cxl_extent_ida, reg_ext->dev.id);
> + kfree(reg_ext);
> +}
> +
> +static const struct device_type region_extent_type = {
> + .name = "extent",
> + .release = region_extent_release,
> + .groups = region_extent_attribute_groups,
> +};
> +
> +bool is_region_extent(struct device *dev)
> +{
> + return dev->type == &region_extent_type;
> +}
> +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL);
> +
> +static void region_extent_unregister(void *ext)
> +{
> + struct region_extent *reg_ext = ext;
> +
> + dev_dbg(&reg_ext->dev, "DAX region rm extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> + device_unregister(&reg_ext->dev);
> +}
> +
> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct region_extent *reg_ext;
> + struct device *dev;
> + int rc, id;
> +
> + id = ida_alloc(&cxl_extent_ida, GFP_KERNEL);
> + if (id < 0)
> + return -ENOMEM;
> +
> + reg_ext = kzalloc(sizeof(*reg_ext), GFP_KERNEL);
> + if (!reg_ext)
> + return -ENOMEM;
> +
> + reg_ext->hpa_range = *hpa_range;
> + reg_ext->ed_ext.dpa_range = *dpa_range;
> + reg_ext->ed_ext.cxled = cxled;
> + snprintf(reg_ext->label, DAX_EXTENT_LABEL_LEN, "%s", label);
> +
> + dev = &reg_ext->dev;
> + device_initialize(dev);
> + dev->id = id;
> + device_set_pm_not_required(dev);
> + dev->parent = &cxlr_dax->dev;
> + dev->type = &region_extent_type;
> + rc = dev_set_name(dev, "extent%d", dev->id);
> + if (rc)
> + goto err;
> +
> + rc = device_add(dev);
> + if (rc)
> + goto err;
> +
> + dev_dbg(dev, "DAX region extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
> + reg_ext);
> +
> +err:
> + dev_err(&cxlr_dax->dev, "Failed to initialize DAX extent dev HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + put_device(dev);
> + return rc;
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9e33a0976828..6b00e717e42b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> + struct range *extent, int opcode)
> +{
> + struct cxl_mbox_cmd mbox_cmd;
> + size_t size;
> +
> + struct cxl_mbox_dc_response *dc_res __free(kfree);
> + size = struct_size(dc_res, extent_list, 1);
> + dc_res = kzalloc(size, GFP_KERNEL);
> + if (!dc_res)
> + return -ENOMEM;
> +
> + dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
> + memset(dc_res->extent_list[0].reserved, 0, 8);

Not needed. kzalloc already zeroed.

DJ

> + dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
> + dc_res->extent_list_size = cpu_to_le32(1);
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = size,
> + .payload_in = dc_res,
> + };
> +
> + return cxl_internal_send_cmd(mds, &mbox_cmd);
> +}
> +
> static struct cxl_memdev_state *
> cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> {
> @@ -1029,6 +1055,23 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> return container_of(cxlds, struct cxl_memdev_state, cxlds);
> }
>
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent)
> +{
> + struct cxl_endpoint_decoder *cxled = extent->cxled;
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + dev_dbg(dev, "Releasing DC extent DPA %#llx - %#llx\n",
> + extent->dpa_range.start, extent->dpa_range.end);
> +
> + rc = cxl_send_dc_cap_response(mds, &extent->dpa_range, CXL_MBOX_OP_RELEASE_DC);
> + if (rc)
> + dev_dbg(dev, "Failed to respond releasing extent DPA %#llx - %#llx; %d\n",
> + extent->dpa_range.start, extent->dpa_range.end, rc);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_release_ed_extent, CXL);
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3e563ab29afe..7635ff109578 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,11 +1450,81 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +static int extent_check_overlap(struct device *dev, void *arg)
> +{
> + struct range *new_range = arg;
> + struct region_extent *ext;
> +
> + if (!is_region_extent(dev))
> + return 0;
> +
> + ext = to_region_extent(dev);
> + return range_overlaps(&ext->hpa_range, new_range);
> +}
> +
> +static int extent_overlaps(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range)
> +{
> + struct device *dev __free(put_device) =
> + device_find_child(&cxlr_dax->dev, hpa_range, extent_check_overlap);
> +
> + if (dev)
> + return -EINVAL;
> + return 0;
> +}
> +
> /* Callers are expected to ensure cxled has been attached to a region */
> int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> struct cxl_dc_extent *dc_extent)
> {
> - return 0;
> + struct cxl_region *cxlr = cxled->cxld.region;
> + struct range ext_dpa_range, ext_hpa_range;
> + struct device *dev = &cxlr->dev;
> + resource_size_t dpa_offset, hpa;
> +
> + /*
> + * Interleave ways == 1 means this coresponds to a 1:1 mapping between
> + * device extents and DAX region extents. Future implementations
> + * should hold DC region extents here until the full dax region extent
> + * can be realized.
> + */
> + if (cxlr->params.interleave_ways != 1) {
> + dev_err(dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> + ext_dpa_range = (struct range) {
> + .start = le64_to_cpu(dc_extent->start_dpa),
> + .end = le64_to_cpu(dc_extent->start_dpa) +
> + le64_to_cpu(dc_extent->length) - 1,
> + };
> +
> + dev_dbg(dev, "Adding DC extent DPA %#llx - %#llx\n",
> + ext_dpa_range.start, ext_dpa_range.end);
> +
> + /*
> + * Without interleave...
> + * HPA offset == DPA offset
> + * ... but do the math anyway
> + */
> + dpa_offset = ext_dpa_range.start - cxled->dpa_res->start;
> + hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> + ext_hpa_range = (struct range) {
> + .start = hpa - cxlr->cxlr_dax->hpa_range.start,
> + .end = ext_hpa_range.start + range_len(&ext_dpa_range) - 1,
> + };
> +
> + if (extent_overlaps(cxlr->cxlr_dax, &ext_hpa_range))
> + return -EINVAL;
> +
> + dev_dbg(dev, "Realizing region extent at HPA %#llx - %#llx\n",
> + ext_hpa_range.start, ext_hpa_range.end);
> +
> + return dax_region_create_ext(cxlr->cxlr_dax, &ext_hpa_range,
> + (char *)dc_extent->tag,
> + &ext_dpa_range,
> + cxled);
> }
>
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> @@ -2684,6 +2754,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>
> dev = &cxlr_dax->dev;
> cxlr_dax->cxlr = cxlr;
> + cxlr->cxlr_dax = cxlr_dax;
> device_initialize(dev);
> lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> device_set_pm_not_required(dev);
> @@ -2799,7 +2870,10 @@ static int cxl_region_read_extents(struct cxl_region *cxlr)
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> + struct cxl_region *cxlr = cxlr_dax->cxlr;
>
> + cxlr->cxlr_dax = NULL;
> + cxlr_dax->cxlr = NULL;
> device_unregister(&cxlr_dax->dev);
> }
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d585f5fdd3ae..5379ad7f5852 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -564,6 +564,7 @@ struct cxl_region_params {
> * @type: Endpoint decoder target type
> * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
> * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
> * @flags: Region state flags
> * @params: active + config params for the region
> */
> @@ -574,6 +575,7 @@ struct cxl_region {
> enum cxl_decoder_type type;
> struct cxl_nvdimm_bridge *cxl_nvb;
> struct cxl_pmem_region *cxlr_pmem;
> + struct cxl_dax_region *cxlr_dax;
> unsigned long flags;
> struct cxl_region_params params;
> };
> @@ -617,6 +619,41 @@ struct cxl_dax_region {
> struct range hpa_range;
> };
>
> +/**
> + * struct cxl_ed_extent - Extent within an endpoint decoder
> + * @dpa_range: DPA range this extent covers within the decoder
> + * @cxled: reference to the endpoint decoder
> + */
> +struct cxl_ed_extent {
> + struct range dpa_range;
> + struct cxl_endpoint_decoder *cxled;
> +};
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent);
> +
> +/**
> + * struct region_extent - CXL DAX region extent
> + * @dev: device representing this extent
> + * @hpa_range: HPA range of this extent
> + * @label: label of the extent
> + * @ed_ext: Endpoint decoder extent which backs this extent
> + */
> +#define DAX_EXTENT_LABEL_LEN 64
> +struct region_extent {
> + struct device dev;
> + struct range hpa_range;
> + char label[DAX_EXTENT_LABEL_LEN];
> + struct cxl_ed_extent ed_ext;
> +};
> +
> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled);
> +
> +bool is_region_extent(struct device *dev);
> +#define to_region_extent(dev) container_of(dev, struct region_extent, dev)
> +
> /**
> * struct cxl_port - logical collection of upstream port devices and
> * downstream port devices to construct a CXL memory
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 030b388800f0..dc0cc1d5e6a0 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -60,6 +60,7 @@ cxl_core-y += $(CXL_CORE_SRC)/pci.o
> cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> cxl_core-y += $(CXL_CORE_SRC)/pmu.o
> cxl_core-y += $(CXL_CORE_SRC)/cdat.o
> +cxl_core-y += $(CXL_CORE_SRC)/extent.o
> cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
> cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
> cxl_core-y += config_check.o
>

2024-04-01 17:06:33

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 20/26] dax: Document dax dev range tuple



On 3/24/24 4:18 PM, Ira Weiny wrote:
> The device DAX structure is being enhanced to track additional DCD
> information.
>
> The current range tuple was not fully documented. Document it prior to
> adding information for DC.
>
> Suggested-by: Jonathan Cameron <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
>
> ---
> Changes for v1
> [iweiny: new patch]
> ---
> drivers/dax/dax-private.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index c6319c6567fb..ac1ccf158650 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -70,7 +70,10 @@ struct dax_mapping {
> * @dev - device core
> * @pgmap - pgmap for memmap setup / lifetime (driver owned)
> * @nr_range: size of @ranges
> - * @ranges: resource-span + pgoff tuples for the instance
> + * @ranges: range tuples of memory used
> + * @pgoff: page offset
> + * @range: resource-span
> + * @mapping: device to assist in interrogating the range layout
> */
> struct dev_dax {
> struct dax_region *region;
>

2024-04-01 17:19:00

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 21/26] dax/region: Prevent range mapping allocation on sparse regions



On 3/24/24 4:18 PM, Ira Weiny wrote:
> Sparse regions are not fully populated with memory and this complicates
> range mapping of dax devices on those regions. There is no use case for
> range mapping on sparse regions.
>
> Avoid the complication by prevent range mapping of dax devices on sparse
> regions.
>
> Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>

> ---
> drivers/dax/bus.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index bab19fc578d0..56dddaceeccb 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1452,6 +1452,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
> return 0;
> if (a == &dev_attr_mapping.attr && is_static(dax_region))
> return 0;
> + if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
> + return 0;
> if ((a == &dev_attr_align.attr ||
> a == &dev_attr_size.attr) && is_static(dax_region))
> return 0444;
>

2024-04-01 17:57:30

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 23/26] cxl/mem: Trace Dynamic capacity Event Record



On 3/24/24 4:18 PM, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
> Notify the host of extents being added or removed. User space has
> little use for these events other than for debugging.
>
> Add DC trace points to the trace log for debugging purposes.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

Reviewed-by: Dave Jiang <[email protected]>
>
> ---
> Changes for v1
> [iweiny: Adjust to new trace code]
> ---
> drivers/cxl/core/mbox.c | 4 +++
> drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 69 insertions(+)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 7babac2d1c95..cb4576890187 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -978,6 +978,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> ev_type = CXL_CPER_EVENT_DRAM;
> else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
> ev_type = CXL_CPER_EVENT_MEM_MODULE;
> + else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
> + trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
> + return;
> + }
>
> cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
> }
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index bdf117a33744..7646fdd9aee3 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -707,6 +707,71 @@ TRACE_EVENT(cxl_poison,
> )
> );
>
> +/*
> + * DYNAMIC CAPACITY Event Record - DER
> + *
> + * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
> + */
> +
> +#define CXL_DC_ADD_CAPACITY 0x00
> +#define CXL_DC_REL_CAPACITY 0x01
> +#define CXL_DC_FORCED_REL_CAPACITY 0x02
> +#define CXL_DC_REG_CONF_UPDATED 0x03
> +#define show_dc_evt_type(type) __print_symbolic(type, \
> + { CXL_DC_ADD_CAPACITY, "Add capacity"}, \
> + { CXL_DC_REL_CAPACITY, "Release capacity"}, \
> + { CXL_DC_FORCED_REL_CAPACITY, "Forced capacity release"}, \
> + { CXL_DC_REG_CONF_UPDATED, "Region Configuration Updated" } \
> +)
> +
> +TRACE_EVENT(cxl_dynamic_capacity,
> +
> + TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> + struct cxl_event_dcd *rec),
> +
> + TP_ARGS(cxlmd, log, rec),
> +
> + TP_STRUCT__entry(
> + CXL_EVT_TP_entry
> +
> + /* Dynamic capacity Event */
> + __field(u8, event_type)
> + __field(u16, hostid)
> + __field(u8, region_id)
> + __field(u64, dpa_start)
> + __field(u64, length)
> + __array(u8, tag, CXL_DC_EXTENT_TAG_LEN)
> + __field(u16, sh_extent_seq)
> + ),
> +
> + TP_fast_assign(
> + CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +
> + /* Dynamic_capacity Event */
> + __entry->event_type = rec->event_type;
> +
> + /* DCD event record data */
> + __entry->hostid = le16_to_cpu(rec->host_id);
> + __entry->region_id = rec->region_index;
> + __entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
> + __entry->length = le64_to_cpu(rec->extent.length);
> + memcpy(__entry->tag, &rec->extent.tag, CXL_DC_EXTENT_TAG_LEN);
> + __entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
> + ),
> +
> + CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
> + "starting_dpa=%llx length=%llx tag=%s " \
> + "shared_extent_sequence=%d",
> + show_dc_evt_type(__entry->event_type),
> + __entry->hostid,
> + __entry->region_id,
> + __entry->dpa_start,
> + __entry->length,
> + __print_hex(__entry->tag, CXL_DC_EXTENT_TAG_LEN),
> + __entry->sh_extent_seq
> + )
> +);
> +
> #endif /* _CXL_EVENTS_H */
>
> #define TRACE_INCLUDE_FILE trace
>

2024-04-02 11:41:33

by Jørgen Hansen

[permalink] [raw]
Subject: Re: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

On 3/25/24 00:18, [email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> Devices can optionally support Dynamic Capacity (DC). These devices are
> known as Dynamic Capacity Devices (DCD).
>
> Implement the DC mailbox commands as specified in CXL 3.1 section
> 8.2.9.9.9 (opcodes 48XXh). Read the DC configuration and store the DC
> region information in the device state.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [Jørgen: ensure CXL 2.0 device support by removing dc_event_log_size]
> [iweiny/Jørgen: use get DC config command to signal DCD support]
> [djiang: fix subject]
> [Fan: add additional region configuration checks]
> [Jonathan/djiang: split out region mode changes]
> [Jonathan: fix up comments/kdoc]
> [Jonathan: s/cxl_get_dc_id/cxl_get_dc_config/]
> [Jonathan: use __free() in identify call]
> [Jonathan: remove unneeded formatting changes]
> [Jonathan: s/cxl_mbox_dynamic_capacity/cxl_mbox_get_dc_config_out/]
> [Jonathan: s/cxl_mbox_get_dc_config/cxl_mbox_get_dc_config_in/]
> [iweiny: remove type2 work dependancy/rebase on master]
> [iweiny: fix 0day build issues]
> ---
> drivers/cxl/core/mbox.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxlmem.h | 49 +++++++++++++
> drivers/cxl/pci.c | 4 ++
> 3 files changed, 236 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index ed4131c6f50b..14e8a7528a8b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1123,7 +1123,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
> if (rc < 0)
> return rc;
>
> - mds->total_bytes =
> + mds->static_cap =
> le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
> mds->volatile_only_bytes =
> le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1230,6 +1230,175 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> return rc;
> }
>
> +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
> + struct cxl_dc_region_config *region_config)
> +{
> + struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> + struct device *dev = mds->cxlds.dev;
> +
> + dcr->base = le64_to_cpu(region_config->region_base);
> + dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> + dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> + dcr->len = le64_to_cpu(region_config->region_length);
> + dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> + dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> + dcr->flags = region_config->flags;
> + snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> +
> + /* Check regions are in increasing DPA order */
> + if (index > 0) {
> + struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> +
> + if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> + dev_err(dev,
> + "DPA ordering violation for DC region %d and %d\n",
> + index - 1, index);
> + return -EINVAL;
> + }
> + }
> +
> + if (!IS_ALIGNED(dcr->base, SZ_256M) ||
> + !IS_ALIGNED(dcr->base, dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n", index,
> + dcr->base, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
> + !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
> + index, dcr->decode_len, dcr->len, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + if (dcr->blk_size == 0 || dcr->blk_size % 0x40 ||
> + !is_power_of_2(dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid block size; %#llx\n",
> + index, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + dev_dbg(dev,
> + "DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> + dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +
> + return 0;
> +}
> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> + struct cxl_mbox_get_dc_config_out *dc_resp,
> + size_t dc_resp_size)
> +{
> + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> + .region_count = CXL_MAX_DC_REGION,
> + .start_region_index = start_region,
> + };
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> + .payload_in = &get_dc,
> + .size_in = sizeof(get_dc),
> + .size_out = dc_resp_size,
> + .payload_out = dc_resp,
> + .min_out = 1,
> + };
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + rc = dc_resp->avail_region_count - start_region;
> +
> + /*
> + * The number of regions in the payload may have been truncated due to
> + * payload_size limits; if so adjust the returned count to match.
> + */
> + if (mbox_cmd.size_out < sizeof(*dc_resp))
> + rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
> +
> + dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> +
> + return rc;
> +}
> +
> +static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> + return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + *
> + * Read Dynamic Capacity information from the device and populate the state
> + * structures for later use.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> + size_t dc_resp_size = mds->payload_size;
> + struct device *dev = mds->cxlds.dev;
> + u8 start_region, i;
> + int rc = 0;
> +
> + for (i = 0; i < CXL_MAX_DC_REGION; i++)
> + snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> + /* Check GET_DC_CONFIG is supported by device */
> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD not supported\n");
> + return 0;
> + }
> +
> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> + kvmalloc(dc_resp_size, GFP_KERNEL);
> + if (!dc_resp)
> + return -ENOMEM;
> +
> + start_region = 0;
> + do {
> + int j;
> +
> + rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> + if (rc < 0) {
> + dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> + return rc;
> + }
> +
> + mds->nr_dc_region += rc;
> +
> + if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> + dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> + mds->nr_dc_region);
> + return -EINVAL;
> + }
> +
> + for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> + rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> + if (rc) {
> + dev_dbg(dev, "Failed to save region info: %d\n", rc);
> + return rc;
> + }
> + }
> +
> + start_region = mds->nr_dc_region;
> +
> + } while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> + mds->dynamic_cap =
> + mds->dc_region[mds->nr_dc_region - 1].base +
> + mds->dc_region[mds->nr_dc_region - 1].decode_len -
> + mds->dc_region[0].base;
> + dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
> static int add_dpa_res(struct device *dev, struct resource *parent,
> struct resource *res, resource_size_t start,
> resource_size_t size, const char *type)
> @@ -1260,8 +1429,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> {
> struct cxl_dev_state *cxlds = &mds->cxlds;
> struct device *dev = cxlds->dev;
> + size_t untenanted_mem;
> int rc;
>
> + untenanted_mem = mds->dc_region[0].base - mds->static_cap;
> + mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
> +
> if (!cxlds->media_ready) {
> cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> @@ -1271,6 +1444,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>
> cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
>
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> + dcr->base, dcr->decode_len, dcr->name);
> + if (rc)
> + return rc;
> + }
> +
> if (mds->partition_align_bytes == 0) {
> rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> mds->volatile_only_bytes, "ram");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 79a67cff9143..4624cf612c1e 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -402,6 +402,7 @@ enum cxl_devtype {
> CXL_DEVTYPE_CLASSMEM,
> };
>
> +#define CXL_MAX_DC_REGION 8
> /**
> * struct cxl_dpa_perf - DPA performance property entry
> * @dpa_range - range for DPA address
> @@ -431,6 +432,8 @@ struct cxl_dpa_perf {
> * @dpa_res: Overall DPA resource tree for the device
> * @pmem_res: Active Persistent memory capacity configuration
> * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + * region
> * @serial: PCIe Device Serial Number
> * @type: Generic Memory Class device or Vendor Specific Memory device
> */
> @@ -445,10 +448,22 @@ struct cxl_dev_state {
> struct resource dpa_res;
> struct resource pmem_res;
> struct resource ram_res;
> + struct resource dc_res[CXL_MAX_DC_REGION];
> u64 serial;
> enum cxl_devtype type;
> };
>
> +#define CXL_DC_REGION_STRLEN 8
> +struct cxl_dc_region_info {
> + u64 base;
> + u64 decode_len;
> + u64 len;
> + u64 blk_size;
> + u32 dsmad_handle;
> + u8 flags;
> + u8 name[CXL_DC_REGION_STRLEN];
> +};
> +
> /**
> * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> *
> @@ -467,6 +482,8 @@ struct cxl_dev_state {
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of static RAM and PMEM capacities
> + * @dynamic_cap: Complete DPA range occupied by DC regions

How about naming these total_range, static_cap and dynamic_range to make
it clear that the DPA range occupied by DC regions isn't necessarily
usable capacity (as opposed to the static_cap where the spec defines it
as usable capacity).

Thanks,
Jørgen

> * @volatile_only_bytes: hard volatile capacity
> * @persistent_only_bytes: hard persistent capacity
> * @partition_align_bytes: alignment size for partition-able capacity
> @@ -474,6 +491,8 @@ struct cxl_dev_state {
> * @active_persistent_bytes: sum of hard + soft persistent
> * @next_volatile_bytes: volatile capacity change pending device reset
> * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> * @event: event log driver state
> * @poison: poison driver state info
> * @security: security driver state info
> @@ -494,7 +513,10 @@ struct cxl_memdev_state {
> DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
> u64 total_bytes;
> + u64 static_cap;
> + u64 dynamic_cap;
> u64 volatile_only_bytes;
> u64 persistent_only_bytes;
> u64 partition_align_bytes;
> @@ -506,6 +528,9 @@ struct cxl_memdev_state {
> struct cxl_dpa_perf ram_perf;
> struct cxl_dpa_perf pmem_perf;
>
> + u8 nr_dc_region;
> + struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +
> struct cxl_event_state event;
> struct cxl_poison_state poison;
> struct cxl_security_state security;
> @@ -705,6 +730,29 @@ struct cxl_mbox_set_partition_info {
>
> #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
>
> +struct cxl_mbox_get_dc_config_in {
> + u8 region_count;
> + u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_get_dc_config_out {
> + u8 avail_region_count;
> + u8 rsvd[7];
> + struct cxl_dc_region_config {
> + __le64 region_base;
> + __le64 region_decode_length;
> + __le64 region_length;
> + __le64 region_block_size;
> + __le32 region_dsmad_handle;
> + u8 flags;
> + u8 rsvd[3];
> + } __packed region[];
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_REGIONS_RETURNED(size_out) \
> + ((size_out - 8) / sizeof(struct cxl_dc_region_config))
> +
> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> struct cxl_mbox_set_timestamp_in {
> __le64 timestamp;
> @@ -828,6 +876,7 @@ enum {
> int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 2ff361e756d6..216881455364 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -874,6 +874,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> + rc = cxl_dev_dynamic_capacity_identify(mds);
> + if (rc)
> + return rc;
> +
> rc = cxl_mem_create_range_info(mds);
> if (rc)
> return rc;
>
> --
> 2.44.0
>
>

2024-04-02 14:00:26

by Jørgen Hansen

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

On 3/25/24 00:18, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case creation of a new
> region on top of the DC partition (region) is expected to expose those
> extents for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized read the device extent list. For ease of review, this patch
> stops after reading the extent list and leaves realization of the region
> extents to a future patch.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: remove extent list xarray]
> [iweiny: Update spec references to 3.1]
> [iweiny: use struct range in extents]
> [iweiny: remove all reference tracking and let regions track extents
> through the extent devices.]
> [djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
> ---
> drivers/cxl/core/core.h | 9 +++
> drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 29 +++++++
> drivers/cxl/cxlmem.h | 49 ++++++++++++
> 4 files changed, 279 insertions(+)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 91abeffbe985..119b12362977 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,6 +4,8 @@
> #ifndef __CXL_CORE_H__
> #define __CXL_CORE_H__
>
> +#include <cxlmem.h>
> +
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> extern const struct device_type cxl_pmu_type;
> @@ -28,6 +30,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
> int cxl_region_init(void);
> void cxl_region_exit(void);
> int cxl_get_poison_by_endpoint(struct cxl_port *port);
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent);
> #else
> static inline int cxl_get_poison_by_endpoint(struct cxl_port *port)
> {
> @@ -43,6 +47,11 @@ static inline int cxl_region_init(void)
> static inline void cxl_region_exit(void)
> {
> }
> +static inline int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> #define CXL_REGION_ATTR(x) NULL
> #define CXL_REGION_TYPE(x) NULL
> #define SET_CXL_REGION_ATTR(x)
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 58b31fa47b93..9e33a0976828 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct device *dev = mds->cxlds.dev;
> + uint64_t start, len;
> +
> + start = le64_to_cpu(dc_extent->start_dpa);
> + len = le64_to_cpu(dc_extent->length);
> +
> + /* Extents must not cross region boundary's */
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + if (dcr->base <= start &&
> + (start + len) <= (dcr->base + dcr->decode_len)) {
> + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> + start, start + len - 1, i, start - dcr->base);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> + start, start + len - 1);
> + return -EINVAL;
> +}
> +
> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *extent)
> +{
> + uint64_t start = le64_to_cpu(extent->start_dpa);
> + uint64_t length = le64_to_cpu(extent->length);
> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
> + struct range ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> +
> + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
> + cxled->dpa_res, start, length);
> +
> + return range_contains(&ed_range, &ext_range);
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> @@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> + unsigned int *extent_gen_num)
> +{
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + struct cxl_mbox_get_dc_extent_out dc_extents;
> + struct cxl_mbox_cmd mbox_cmd;
> + unsigned int count;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(0),
> + .start_extent_index = cpu_to_le32(0),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = sizeof(dc_extents),
> + .payload_out = &dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + count = le32_to_cpu(dc_extents.total_extent_cnt);
> + *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> +
> + return count;
> +}
> +
> +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
> + unsigned int start_gen_num,
> + unsigned int exp_cnt)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + unsigned int start_index, total_read;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> +
> + struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
> + kvmalloc(mds->payload_size, GFP_KERNEL);
> + if (!dc_extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + start_index = 0;
> + do {
> + unsigned int nr_ext, total_extent_cnt, gen_num;
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(exp_cnt - start_index),
> + .start_extent_index = cpu_to_le32(start_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = mds->payload_size,
> + .payload_out = dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
> + total_read += nr_ext;
> + total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> + gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +
> + dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> + total_extent_cnt, gen_num);
> +
> + if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> + dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
> + gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> + return -EIO;
> + }
> +
> + for (int i = 0; i < nr_ext ; i++) {
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + start_index + i, exp_cnt);
> + rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
> + if (rc)
> + continue;
> + if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
> + continue;
> + rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
> + if (rc)
> + return rc;
> + }
> +
> + start_index += nr_ext;
> + } while (exp_cnt > total_read);
> +
> + return 0;
> +}
> +
> +/**
> + * cxl_read_dc_extents() - Read any existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add any existing extents found which belong to this decoder.
> + *
> + * Return: 0 if command was executed successfully, -ERRNO on error.
> + */
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + unsigned int extent_gen_num;
> + int rc;
> +
> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD unsupported\n");
> + return 0;
> + }
> +
> + rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> + dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",
> + rc, extent_gen_num);
> + if (rc <= 0) /* 0 == no records found */
> + return rc;
> +
> + return cxl_dev_get_dc_extents(cxled, extent_gen_num, rc);

Is it necessary to spend a device interaction to get the generation
number? Couldn't cxl_dev_get_dc_extents obtain that as part of the first
call to the device, and then use it to ensure the consistency of any
remaining calls, if any are necessary?

Thanks,
Jørgen

> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_read_dc_extents, CXL);
> +
> static int add_dpa_res(struct device *dev, struct resource *parent,
> struct resource *res, resource_size_t start,
> resource_size_t size, const char *type)
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 0d7b09a49dcf..3e563ab29afe 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,6 +1450,13 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> +
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder *cxled,
> @@ -2773,6 +2780,22 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
> return rc;
> }
>
> +static int cxl_region_read_extents(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int i;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + int rc;
> +
> + rc = cxl_read_dc_extents(p->targets[i]);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> @@ -2807,6 +2830,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> dev_name(dev));
>
> + if (cxlr->mode == CXL_REGION_DC) {
> + rc = cxl_region_read_extents(cxlr);
> + if (rc)
> + goto err;
> + }
> +
> return devm_add_action_or_reset(&cxlr->dev, cxlr_dax_unregister,
> cxlr_dax);
> err:
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 01bee6eedff3..8f2d8944d334 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -604,6 +604,54 @@ enum cxl_opcode {
> UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
> 0x40, 0x3d, 0x86)
>
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> + __le32 extent_list_size;
> + u8 flags;
> + u8 reserved[3];
> + struct updated_extent_list {
> + __le64 dpa_start;
> + __le64 length;
> + u8 reserved[8];
> + } __packed extent_list[];
> +} __packed;
> +
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +#define CXL_DC_EXTENT_TAG_LEN 0x10
> +struct cxl_dc_extent {
> + __le64 start_dpa;
> + __le64 length;
> + u8 tag[CXL_DC_EXTENT_TAG_LEN];
> + __le16 shared_extn_seq;
> + u8 reserved[6];
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Input Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-166
> + */
> +struct cxl_mbox_get_dc_extent_in {
> + __le32 extent_cnt;
> + __le32 start_extent_index;
> +} __packed;
> +
> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_dc_extent_out {
> + __le32 ret_extent_cnt;
> + __le32 total_extent_cnt;
> + __le32 extent_list_num;
> + u8 rsvd[4];
> + struct cxl_dc_extent extent[];
> +} __packed;
> +
> struct cxl_mbox_get_supported_logs {
> __le16 entries;
> u8 rsvd[6];
> @@ -879,6 +927,7 @@ int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
>
> --
> 2.44.0
>
>

2024-04-02 22:30:46

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)

Dave Jiang wrote:
>
>
> On 3/24/24 4:18 PM, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >
> > Per the CXL 3.1 specification software must check the Command Effects
> > Log (CEL) to know if a device supports dynamic capacity (DC). If the
> > device does support DC the specifics of the DC Regions (0-7) are read
> > through the mailbox.
> >
> > Flag DC Device (DCD) commands in a device if they are supported.
> > Subsequent patches will key off these bits to configure DCD.
> >
> > Signed-off-by: Navneet Singh <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
>
> Reviewed-by: Dave Jiang <[email protected]>
>
> small formatting nit below
>
> > ---
> > Changes for v1
> > [iweiny: update to latest master]
> > [iweiny: update commit message]
> > [iweiny: Based on the fix:
> > https://lore.kernel.org/all/[email protected]/
> > [jonathan: remove unneeded format change]
> > [jonathan: don't split security code in mbox.c]
> > ---
> > drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
> > drivers/cxl/cxlmem.h | 15 +++++++++++++++
> > 2 files changed, 48 insertions(+)
> >
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 9adda4795eb7..ed4131c6f50b 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -161,6 +161,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
> > }
> > }
> >
> > +static bool cxl_is_dcd_command(u16 opcode)
> > +{
> > +#define CXL_MBOX_OP_DCD_CMDS 0x48
> > +
> > + return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> > +}
> > +
> > +static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> > + u16 opcode)
>
> This seems misaligned.

Fixed,
Ira

2024-04-02 22:38:32

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)

Davidlohr Bueso wrote:
> On Sun, 24 Mar 2024, [email protected] wrote:
>
> >From: Navneet Singh <[email protected]>
> >
> >Per the CXL 3.1 specification software must check the Command Effects
> >Log (CEL) to know if a device supports dynamic capacity (DC). If the
> >device does support DC the specifics of the DC Regions (0-7) are read
> >through the mailbox.
>
> I vote to fold this into patch 3, favoring reduced patch count in the
> series to trvially enlarging that particular patch.

I'll consider it. I've tried hard to split the original series up into
very easily reviewable chunks. So I'm inclined to leave this separate for
now.

>
> >Flag DC Device (DCD) commands in a device if they are supported.
> >Subsequent patches will key off these bits to configure DCD.
>
> It would be good to mention these here explicitly (if this patch will
> live on). For example, that config will be the driver's way of telling
> if dcd is enabled or disabled - we could have cases of that zeroed bit
> but the rest enabled.

It took me a bit to parse this but I see what you mean. Yes the
GET_CONFIG command is the one used specifically to determine if DCD is
supported.

I'll clean up the commit message. In retrospect it was wrong for me to
mention subsequent patches here. I'm going to move this final detail to
patch 3.

>
> lgtm otherwise.
>
> Reviewed-by: Davidlohr Bueso <[email protected]>

Thanks! :-D
Ira

>
> >Signed-off-by: Navneet Singh <[email protected]>
> >Co-developed-by: Ira Weiny <[email protected]>
> >Signed-off-by: Ira Weiny <[email protected]>
> >---
> >Changes for v1
> >[iweiny: update to latest master]
> >[iweiny: update commit message]
> >[iweiny: Based on the fix:
> > https://lore.kernel.org/all/[email protected]/
> >[jonathan: remove unneeded format change]
> >[jonathan: don't split security code in mbox.c]
> >---
> > drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
> > drivers/cxl/cxlmem.h | 15 +++++++++++++++
> > 2 files changed, 48 insertions(+)
> >
> >diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> >index 9adda4795eb7..ed4131c6f50b 100644
> >--- a/drivers/cxl/core/mbox.c
> >+++ b/drivers/cxl/core/mbox.c
> >@@ -161,6 +161,34 @@ static void cxl_set_security_cmd_enabled(struct cxl_security_state *security,
> > }
> > }
> >
> >+static bool cxl_is_dcd_command(u16 opcode)
> >+{
> >+#define CXL_MBOX_OP_DCD_CMDS 0x48
> >+
> >+ return (opcode >> 8) == CXL_MBOX_OP_DCD_CMDS;
> >+}
> >+
> >+static void cxl_set_dcd_cmd_enabled(struct cxl_memdev_state *mds,
> >+ u16 opcode)
> >+{
> >+ switch (opcode) {
> >+ case CXL_MBOX_OP_GET_DC_CONFIG:
> >+ set_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> >+ break;
> >+ case CXL_MBOX_OP_GET_DC_EXTENT_LIST:
> >+ set_bit(CXL_DCD_ENABLED_GET_EXTENT_LIST, mds->dcd_cmds);
> >+ break;
> >+ case CXL_MBOX_OP_ADD_DC_RESPONSE:
> >+ set_bit(CXL_DCD_ENABLED_ADD_RESPONSE, mds->dcd_cmds);
> >+ break;
> >+ case CXL_MBOX_OP_RELEASE_DC:
> >+ set_bit(CXL_DCD_ENABLED_RELEASE, mds->dcd_cmds);
> >+ break;
> >+ default:
> >+ break;
> >+ }
> >+}
> >+
> > static bool cxl_is_poison_command(u16 opcode)
> > {
> > #define CXL_MBOX_OP_POISON_CMDS 0x43
> >@@ -733,6 +761,11 @@ static void cxl_walk_cel(struct cxl_memdev_state *mds, size_t size, u8 *cel)
> > enabled++;
> > }
> >
> >+ if (cxl_is_dcd_command(opcode)) {
> >+ cxl_set_dcd_cmd_enabled(mds, opcode);
> >+ enabled++;
> >+ }
> >+
> > dev_dbg(dev, "Opcode 0x%04x %s\n", opcode,
> > enabled ? "enabled" : "unsupported by driver");
> > }
> >diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> >index 20fb3b35e89e..79a67cff9143 100644
> >--- a/drivers/cxl/cxlmem.h
> >+++ b/drivers/cxl/cxlmem.h
> >@@ -238,6 +238,15 @@ struct cxl_event_state {
> > struct mutex log_lock;
> > };
> >
> >+/* Device enabled DCD commands */
> >+enum dcd_cmd_enabled_bits {
> >+ CXL_DCD_ENABLED_GET_CONFIG,
> >+ CXL_DCD_ENABLED_GET_EXTENT_LIST,
> >+ CXL_DCD_ENABLED_ADD_RESPONSE,
> >+ CXL_DCD_ENABLED_RELEASE,
> >+ CXL_DCD_ENABLED_MAX
> >+};
> >+
> > /* Device enabled poison commands */
> > enum poison_cmd_enabled_bits {
> > CXL_POISON_ENABLED_LIST,
> >@@ -454,6 +463,7 @@ struct cxl_dev_state {
> > * (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> > * @mbox_mutex: Mutex to synchronize mailbox access.
> > * @firmware_version: Firmware version for the memory device.
> >+ * @dcd_cmds: List of DCD commands implemented by memory device
> > * @enabled_cmds: Hardware commands found enabled in CEL.
> > * @exclusive_cmds: Commands that are kernel-internal only
> > * @total_bytes: sum of all possible capacities
> >@@ -481,6 +491,7 @@ struct cxl_memdev_state {
> > size_t lsa_size;
> > struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> > char firmware_version[0x10];
> >+ DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> > DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> > DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> > u64 total_bytes;
> >@@ -551,6 +562,10 @@ enum cxl_opcode {
> > CXL_MBOX_OP_UNLOCK = 0x4503,
> > CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
> > CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
> >+ CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
> >+ CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
> >+ CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
> >+ CXL_MBOX_OP_RELEASE_DC = 0x4803,
> > CXL_MBOX_OP_MAX = 0x10000
> > };
> >
> >
> >--
> >2.44.0
> >

2024-04-02 23:25:20

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

Jonathan Cameron wrote:
> On Sun, 24 Mar 2024 16:18:05 -0700
> [email protected] wrote:

[snip]

> >
> > ---
> > Changes for v1
> > <none>
> > ---
> > drivers/cxl/core/region.c | 77 +++++++++++++++++++++++++++++++++++------------
> > drivers/cxl/cxl.h | 26 ++++++++++++++--
> > 2 files changed, 81 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 4c7fd2d5cccb..1723d17f121e 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
>
>
> > @@ -2800,6 +2814,24 @@ static int match_region_by_range(struct device *dev, void *data)
> > return rc;
> > }
> >
> > +static enum cxl_region_mode
> > +cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> > +{
> > + switch (mode) {
> > + case CXL_DECODER_NONE:
> > + return CXL_REGION_NONE;
> > + case CXL_DECODER_RAM:
> > + return CXL_REGION_RAM;
> > + case CXL_DECODER_PMEM:
> > + return CXL_REGION_PMEM;
> > + case CXL_DECODER_MIXED:
> > + default:
> > + return CXL_REGION_MIXED;
> > + }
> > +
>
> Dead code.

Fixed thanks.

>
> > + return CXL_REGION_MIXED;
> > +}
> > +
> > /* Establish an empty region covering the given HPA range */
> > static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> > struct cxl_endpoint_decoder *cxled)
> > @@ -2808,12 +2840,17 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> > struct cxl_port *port = cxlrd_to_port(cxlrd);
> > struct range *hpa = &cxled->cxld.hpa_range;
> > struct cxl_region_params *p;
> > + enum cxl_region_mode mode;
> > struct cxl_region *cxlr;
> > struct resource *res;
> > int rc;
> >
> > + if (cxled->mode == CXL_DECODER_DEAD)
> > + return ERR_PTR(-EINVAL);
>
> Not a bad thing necessarily, but why do we now need this and didn't before?

Ah. Because in devm_cxl_add_region() the mode of CXL_DECODER_DEAD used to
return -EINVAL.

There is no logical equivalent to decoder dead in the region mode (regions
don't need it). So this correctly flags the error based on the decoder
mode rather than introduce a mode for regions which does not make sense.

I'll update the commit message because that is hard to see.

>
> > +
> > + mode = cxl_decoder_to_region_mode(cxled->mode);
> > do {
> > - cxlr = __create_region(cxlrd, cxled->mode,
> > + cxlr = __create_region(cxlrd, mode,
> > atomic_read(&cxlrd->region_id));
> > } while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
>
>
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 003feebab79b..9a0cce1e6fca 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
>
>
> > /*
> > * Track whether this decoder is reserved for region autodiscovery, or
> > * free for userspace provisioning.
> > @@ -511,7 +532,8 @@ struct cxl_region_params {
> > * struct cxl_region - CXL region
> > * @dev: This region's device
> > * @id: This region's id. Id is globally unique across all regions
> > - * @mode: Endpoint decoder allocation / access mode
> > + * @mode: Region mode which defines which endpoint decoder mode the region is
> mode or potentially modes?
>
> If region is mixed, I guess that means endpoint could be pmem or ram in theory?
> Don't think anyone has implemented anything yet, but is the potential there?

Yes the potential is there. The endpoint decoder is set to
CXL_DECODER_MIXED in __cxl_dpa_reserve(). But I am unclear how that will
ever be executed except if the BIOS sets up a decoder to span ram/pmem.
In this case the rest of the stack is not going to work and will complain
about mixed mode.

Ok Dan clued me in. Check out cxl_dpa_alloc(). Spanning partitions is
not allowed.

So the comment is targeted toward the 'normal' case even though the region
could be created incorrectly via BIOS.

Ira

2024-04-02 23:26:24

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

Davidlohr Bueso wrote:
> On Sun, 24 Mar 2024, [email protected] wrote:
>
> >From: Navneet Singh <[email protected]>
> >
> >Until now region modes and decoder modes were equivalent in that they
> >were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
> >regions (which will represent an array of device regions [better named
> >partitions] the index of which could be different on different
> >interleaved devices), the mode of an endpoint decoder and a region will
> >no longer be equivalent.
> >
> >Define a new region mode enumeration and adjust the code for it.
>
> Could this could also be picked up regardless of dcd?

It could be. But there is no practical need for it without the addition of
DCD. So I don't think it should be.

Ira

2024-04-02 23:27:23

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

Dave Jiang wrote:
>
>
> On 3/27/24 10:22 PM, Ira Weiny wrote:
> > Davidlohr Bueso wrote:
> >> On Sun, 24 Mar 2024, [email protected] wrote:
> >>
> >>> From: Navneet Singh <[email protected]>
> >>>
> >>> Until now region modes and decoder modes were equivalent in that they
> >>> were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
> >>> regions (which will represent an array of device regions [better named
> >>> partitions] the index of which could be different on different
> >>> interleaved devices), the mode of an endpoint decoder and a region will
> >>> no longer be equivalent.
> >>>
> >>> Define a new region mode enumeration and adjust the code for it.
> >>
> >> Could this could also be picked up regardless of dcd?
> >
> > It could but there is no need for it without DCD.
> >
> > I will work on re-ordering the cleanups if Dave will agree to take them
> > early.
>
> There's no reason for the change unless it comes with DCD right? And probably no urgent need to taking it ahead then?
> >

I think I just replied for a 2nd time to this... yea.

LOL I have to do better...

Ira

2024-04-03 22:23:32

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

Jonathan Cameron wrote:
> On Sun, 24 Mar 2024 16:18:06 -0700
> [email protected] wrote:
>

[snip]

> > +
> > + /* Check regions are in increasing DPA order */
> > + if (index > 0) {
> > + struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> > +
> > + if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base) {
> > + dev_err(dev,
> > + "DPA ordering violation for DC region %d and %d\n",
> > + index - 1, index);
> > + return -EINVAL;
> > + }
> > + }
> > +
> > + if (!IS_ALIGNED(dcr->base, SZ_256M) ||
> > + !IS_ALIGNED(dcr->base, dcr->blk_size)) {
> > + dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n", index,
>
> Odd choice of line wrap. I'd drag index onto the line below.

fixed.

>
> > + dcr->base, dcr->blk_size);
> > + return -EINVAL;
> > + }
> > +
> > + if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
> > + !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> > + dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
> > + index, dcr->decode_len, dcr->len, dcr->blk_size);
> > + return -EINVAL;
> > + }
> > +
> > + if (dcr->blk_size == 0 || dcr->blk_size % 0x40 ||
>
> Hmm. I thought we had a define for CXL 'cacheline' size, but can't find it now.
> If not we should add one (and find a better name than that).

Asking me to add a define is fine... Asking me to name said define is...
The issue... I am absolute rubbish at picking names... :-/ ;-) :-D

>
> > + !is_power_of_2(dcr->blk_size)) {
> > + dev_err(dev, "DC region %d invalid block size; %#llx\n",
> > + index, dcr->blk_size);
> > + return -EINVAL;
> > + }
> > +
> > + dev_dbg(dev,
> > + "DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> > + dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> > +
> > + return 0;
> > +}
> > +
> > +/* Returns the number of regions in dc_resp or -ERRNO */
> > +static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> > + struct cxl_mbox_get_dc_config_out *dc_resp,
> > + size_t dc_resp_size)
> > +{
> > + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> > + .region_count = CXL_MAX_DC_REGION,
> > + .start_region_index = start_region,
> > + };
> > + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> > + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> > + .payload_in = &get_dc,
> > + .size_in = sizeof(get_dc),
> > + .size_out = dc_resp_size,
> > + .payload_out = dc_resp,
> > + .min_out = 1,
> > + };
> > + struct device *dev = mds->cxlds.dev;
> > + int rc;
> > +
> > + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > + if (rc < 0)
> > + return rc;
> > +
> > + rc = dc_resp->avail_region_count - start_region;
> > +
> > + /*
> > + * The number of regions in the payload may have been truncated due to
> > + * payload_size limits; if so adjust the returned count to match.
> > + */
> > + if (mbox_cmd.size_out < sizeof(*dc_resp))
> > + rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
>
> Why not always return this? If there was space, doesn't it equal
> the value set above anyway?

I've been looking at this more carefully and there is a bigger issue with
this. I need to update this code to handle the regions_returned which was
added in the errata and get rid of this macro.

>
> > +
> > + dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> > +
> > + return rc;
> > +}
>
> > +/**
> > + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> > + * information from the device.
> > + * @mds: The memory device state
> > + *
> > + * Read Dynamic Capacity information from the device and populate the state
> > + * structures for later use.
> > + *
> > + * Return: 0 if identify was executed successfully, -ERRNO on error.
> > + */
> > +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > +{
> > + size_t dc_resp_size = mds->payload_size;
> > + struct device *dev = mds->cxlds.dev;
> > + u8 start_region, i;
> > + int rc = 0;
>
> Is this used before being set?

nope...

>
> > +
> > + for (i = 0; i < CXL_MAX_DC_REGION; i++)
> > + snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> > +
> > + /* Check GET_DC_CONFIG is supported by device */
> > + if (!cxl_dcd_supported(mds)) {
> > + dev_dbg(dev, "DCD not supported\n");
> > + return 0;
> > + }
> > +
> > + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> > + kvmalloc(dc_resp_size, GFP_KERNEL);
> > + if (!dc_resp)
> > + return -ENOMEM;
> > +
> > + start_region = 0;
> > + do {
> > + int j;
> > +
> > + rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> > + if (rc < 0) {
> > + dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> > + return rc;
> > + }
> > +
> > + mds->nr_dc_region += rc;
> > +
> > + if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> > + dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> > + mds->nr_dc_region);
> > + return -EINVAL;
> > + }
> > +
> > + for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> > + rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> > + if (rc) {
> > + dev_dbg(dev, "Failed to save region info: %d\n", rc);
> > + return rc;
> > + }
> > + }
> > +
> > + start_region = mds->nr_dc_region;
> > +
> > + } while (mds->nr_dc_region < dc_resp->avail_region_count);
> > +
> > + mds->dynamic_cap =
> > + mds->dc_region[mds->nr_dc_region - 1].base +
> > + mds->dc_region[mds->nr_dc_region - 1].decode_len -
> > + mds->dc_region[0].base;
> > + dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
>
>
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 79a67cff9143..4624cf612c1e 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
>
> > /**
> > * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> > *
> > @@ -467,6 +482,8 @@ struct cxl_dev_state {
> > * @enabled_cmds: Hardware commands found enabled in CEL.
> > * @exclusive_cmds: Commands that are kernel-internal only
> > * @total_bytes: sum of all possible capacities
> > + * @static_cap: Sum of static RAM and PMEM capacities
> > + * @dynamic_cap: Complete DPA range occupied by DC regions
> > * @volatile_only_bytes: hard volatile capacity
> > * @persistent_only_bytes: hard persistent capacity
> > * @partition_align_bytes: alignment size for partition-able capacity
> > @@ -474,6 +491,8 @@ struct cxl_dev_state {
> > * @active_persistent_bytes: sum of hard + soft persistent
> > * @next_volatile_bytes: volatile capacity change pending device reset
> > * @next_persistent_bytes: persistent capacity change pending device reset
>
> Looks like we have some ordering issues ram_perf and pmem_perf (at least)
> that we should fix up as a precursor. I sent a reply to the QoS patch
> that added these.

I see. That will likely resolve out when I rebase. But seems nothing to
be done for this patch and best left as a separate patch from this series.

>
> > + * @nr_dc_region: number of DC regions implemented in the memory device
> > + * @dc_region: array containing info about the DC regions
> > * @event: event log driver state
> > * @poison: poison driver state info
> > * @security: security driver state info
> > @@ -494,7 +513,10 @@ struct cxl_memdev_state {
> > DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> > DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> > DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> > +
> Trivial but this is an unrelated change and shouldn't be in this patch.
>
> > u64 total_bytes;
> > + u64 static_cap;
> > + u64 dynamic_cap;
> > u64 volatile_only_bytes;
> > u64 persistent_only_bytes;
> > u64 partition_align_bytes;
> > @@ -506,6 +528,9 @@ struct cxl_memdev_state {
> > struct cxl_dpa_perf ram_perf;
> > struct cxl_dpa_perf pmem_perf;
> >
> > + u8 nr_dc_region;
> > + struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> > +
> > struct cxl_event_state event;
> > struct cxl_poison_state poison;
> > struct cxl_security_state security;
>
> > +
> > +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_get_dc_config_out {
> > + u8 avail_region_count;
> > + u8 rsvd[7];
> > + struct cxl_dc_region_config {
> > + __le64 region_base;
> > + __le64 region_decode_length;
> > + __le64 region_length;
> > + __le64 region_block_size;
> > + __le32 region_dsmad_handle;
> > + u8 flags;
> > + u8 rsvd[3];
> > + } __packed region[];
> > +} __packed;
> > +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> > +#define CXL_REGIONS_RETURNED(size_out) \
> > + ((size_out - 8) / sizeof(struct cxl_dc_region_config))
>
> Can we make that 8 self documenting?
> offsetof(struct cxl_dc_region_config, region) perhaps?

As I said above I think this macro is wrong I'm adjusting to remove it.

Thanks,
Ira

2024-04-03 22:41:53

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

fan wrote:
> On Sun, Mar 24, 2024 at 04:18:06PM -0700, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >

[snip]

> >
> > +struct cxl_mbox_get_dc_config_in {
> > + u8 region_count;
> > + u8 start_region_index;
> > +} __packed;
> > +
> > +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> > +struct cxl_mbox_get_dc_config_out {
> > + u8 avail_region_count;
> > + u8 rsvd[7];
> > + struct cxl_dc_region_config {
> > + __le64 region_base;
> > + __le64 region_decode_length;
> > + __le64 region_length;
> > + __le64 region_block_size;
> > + __le32 region_dsmad_handle;
> > + u8 flags;
> > + u8 rsvd[3];
> > + } __packed region[];
> > +} __packed;
> > +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> > +#define CXL_REGIONS_RETURNED(size_out) \
> > + ((size_out - 8) / sizeof(struct cxl_dc_region_config))
>
> Although the result may be unchanged, but in cxl spec r3.1, there are four
> fields after the region configuration structure.

Yes. This macro is not needed.

The fields after the structure are of little use to the host at this time.
So I'm going to leave them out until a use can be found for them.

Ira

2024-04-04 09:09:15

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 08/26] cxl/mem: Expose device dynamic capacity capabilities

On Sun, 24 Mar 2024 16:18:11 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> To properly configure CXL regions on Dynamic Capacity Devices (DCD),
> user space will need to know the details of the DC Regions available on
> a device.
>
> Expose driver dynamic capacity capabilities through sysfs
> attributes.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

Trivial comments inline.
Whilst I'd like the directory hidden as per the other suggestions,
I don't mind that much.

Reviewed-by: Jonathan Cameron <[email protected]>


>
> ---
> Changes for v1:
> [iweiny: remove review tags]
> [iweiny: mark sysfs for 6.10 kernel]
> ---
> Documentation/ABI/testing/sysfs-bus-cxl | 17 ++++++++
> drivers/cxl/core/memdev.c | 76 +++++++++++++++++++++++++++++++++
> 2 files changed, 93 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 8b3efaf6563c..8a4f572c8498 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -54,6 +54,23 @@ Description:
> identically named field in the Identify Memory Device Output
> Payload in the CXL-2.0 specification.
>
> +What: /sys/bus/cxl/devices/memX/dc/region_count
> +Date: June, 2024
> +KernelVersion: v6.10
> +Contact: [email protected]
> +Description:
> + (RO) Number of Dynamic Capacity (DC) regions supported on the
> + device. May be 0 if the device does not support Dynamic
> + Capacity.

That will change if you go ahead and hide the directory as per suggestions.

> +
> +What: /sys/bus/cxl/devices/memX/dc/regionY_size
> +Date: June, 2024
> +KernelVersion: v6.10
> +Contact: [email protected]
> +Description:
> + (RO) Size of the Dynamic Capacity (DC) region Y. Only

Units always good to have in docs even if somewhat obvious.

> + available on devices which support DC and only for those
> + region indexes supported by the device.
>
> What: /sys/bus/cxl/devices/memX/pmem/qos_class
> Date: May, 2023
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index d4e259f3a7e9..a7b880e33a7e 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -101,6 +101,18 @@ static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,

..

> +static struct attribute *cxl_memdev_dc_attributes[] = {
> + &dev_attr_region0_size.attr,
> + &dev_attr_region1_size.attr,
> + &dev_attr_region2_size.attr,
> + &dev_attr_region3_size.attr,
> + &dev_attr_region4_size.attr,
> + &dev_attr_region5_size.attr,
> + &dev_attr_region6_size.attr,
> + &dev_attr_region7_size.attr,
> + &dev_attr_region_count.attr,
> + NULL,
> +};
> +
> +static umode_t cxl_dc_visible(struct kobject *kobj, struct attribute *a, int n)
> +{
> + struct device *dev = kobj_to_dev(kobj);
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
> +
> + /* Not a memory device */
> + if (!mds)
> + return 0;
> +
> + if (a == &dev_attr_region_count.attr)
> + return a->mode;
> +
> + /* Show only the regions supported */
> + if (n < mds->nr_dc_region)
> + return a->mode;

This feels a bit fragile if anyone adds new attrs in future and for whatever reason
does it before these.

Maybe add a comment at top of cxl_memdev_dc_attributes()? Say they must be first.

> +
> + return 0;
> +}
> +

>


2024-04-04 10:20:58

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)


I haven't quite gone through the whole patch set yet, so ignore me
if you already have this somewhere, but I'd like to see this info
captured somewhere other than the cover letter. Either some Documentation
files or maybe a bit code comment in appropriate location?

> A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> device that allows the memory capacity to change dynamically, without
> the need for resetting the device, reconfiguring HDM decoders, or
> reconfiguring software DAX regions.
>
> One of the biggest use cases for Dynamic Capacity is to allow hosts to
> share memory dynamically within a data center without increasing the

Probably good to rephrase to avoid share - given 'sharing' isn't
yet supported.
"to access a common pool of memory dynamically ..." maybe?

> per-host attached memory.
>
> The general flow for the addition or removal of memory is to have an
> orchestrator coordinate the use of the memory. Generally there are 5
> actors in such a system, the Orchestrator, Fabric Manager, the Device
> the host sees, the Host Kernel, and a Host User.
>
> Typical work flows are shown below.
>
> Orchestrator FM Device Host Kernel Host User
>
> | | | | |
> |-------------- Create region ----------------------->|
> | | | | |
> | | | |<-- Create ---|
> | | | | Region |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create --->|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> | | | | |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |

Missing a signal from the FM to orchestrator to say release done.

> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create ----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |

As above, need to let the Orchestrator know it's done (the FM
knows so can pass the info onwards

> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | |<- Create ----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> |-- Remove -->|- Release->|- Release ->| | |
> | Capacity | Extent | Extent | | |
> | | | | | |
> | | | (Release Ignored) | |
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> |<------------- Signal done --------------------------|
> | | | | |
> | |- Release->|- Release ->| |
> | | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |

My guess is FM would let the orchestrator know it had some capacity
back?

> | | | |<- Destroy ---|
> | | | | Region |
> | | | | |

No path for async release yet? I think we will want to add that
soon. Host rebooting etc may not care to talk directly to the
orchestrator.

>
> Previous RFCs of this series[0] resulted in significant architectural
> comments. Previous versions allowed memory capacity to be accepted by
> the host regardless of the existence of a software region being mapped.
>
> With this new patch set the order of the create region and DAX device
> creation must be synchronized with the Orchestrator adding/removing
> capacity. The host kernel will reject an add extent event if the region
> is not created yet. It will also ignore a release if the DAX device is
> created and referencing an extent.
>
> Neither of these synchronizations are anticipated to be an issue with
> real applications.
>
> In order to allow for capacity to be added and removed a new concept of
> a sparse DAX region is introduced. A sparse DAX region may have 0 or
> more bytes of available space. The total space depends on the number
> and size of the extents which have been added.
>
> Initially it is anticipated that users of the memory will carefully
> coordinate the surfacing of additional capacity with the creation of DAX
> devices which use that capacity. Therefore, the allocation of the
> memory to DAX devices does not allow for specific associations between
> DAX device and extent. This keeps allocations very similar to existing
> DAX region behavior.

I don't quite follow. Is the point that there is only one active dax
dev to which any new extents are added? Or that a particular set
of extents offered together get put into a new device and next set get
another new device?

>
> Great care was taken to greatly simplify extent tracking. Specifically,
> in comparison to previous versions of the patch set, all extent tracking
> xarrays have been eliminated from the code. In addition, most of the
> extra software objects and associated referenced counts have been
> eliminated.
>
> In this version, extents are tracked purely as sub-devices of the
> region. This ensures that the region destruction cleans up all extent
> allocations properly. Device managed callbacks are wired to ensure any
> additional data required for DAX device references are handled
> correctly.
>
> Due to these major changes I'm setting this new series to V1.
>
> In summary the major functionality of this series includes:
>
> - Getting the dynamic capacity (DC) configuration information from cxl
> devices
>
> - Configuring the DC regions reported by hardware
>
> - Enhancing the CXL and DAX regions for dynamic capacity support
> a. Maintain a logical separation between hardware extents and
> software managed region extents. This provides an
> abstraction between the layers and should allow for
> interleaving in the future
>
> - Get hardware extent lists for endpoint decoders upon
> region creation.
>
> - Adjust extent/region memory available on the following events.
> a. Add capacity Events
> b. Release capacity events
Trivial but fix the indent

>
> - Host response for add capacity
> a. do not accept the extent if:
> If the region does not exist
> or an error occurs realizing the extent
> B. If the region does exist
b.
> realize a DAX region extent with 1:1 mapping (no
> interleave yet)
>
> - Host response for remove capacity
> a. If no DAX devices reference the extent release the extent
> b. If a reference does exist, ignore the request.
> (Require FM to issue release again.)
>
> - Modify DAX device creation/resize to account for extents within a
> sparse DAX region
>
> - Trace Dynamic Capacity events for debugging
>
> - Add cxl-test infrastructure to allow for faster unit testing
> (See new ndctl branch for cxl-dcd.sh test[1])
>
> Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
>
> Remaining work:
>
> 1) Integrate the QoS work from Dave Jiang
> 2) Interleave support
>
> Possible additional work depending on requirements:
>
> 1) Allow mapping to specific extents (perhaps based on
> label/tag)
> 2) Release extents when DAX devices are released if a release
> was previously seen from the device
> 3) Accept a new extent which extends (but overlaps) an existing
> extent(s)

>
> [0] RFC v2: https://lore.kernel.org/r/[email protected]
> [1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-03-22
> [2] https://lore.kernel.org/all/[email protected]/
>
> ---
> Changes for v1:
> - iweiny: Largely new series
> - iweiny: Remove review tags due to the series being a major rework
> - iweiny: Fix authorship for Navneet patches
> - iweiny: Remove extent xarrays
> - iweiny: Remove kreferences, replace with 1 use count protected under dax_rwsem
> - iweiny: Mark all sysfs entries for the 6.10 June 2024 kernel
> - iweiny: Remove gotos
> - iweiny: Fix 0day issues
> - Jonathan Cameron: address comments
> - Navneet Singh: address comments
> - Dan Williams: address comments
> - Dave Jiang: address comments
> - Fan Ni: address comments
> - J?rgen Hansen: address comments
> - Link to RFC v2: https://lore.kernel.org/r/[email protected]
>
> ---
> Ira Weiny (12):
> cxl/core: Simplify cxl_dpa_set_mode()
> cxl/events: Factor out event msgnum configuration
> cxl/pci: Delay event buffer allocation
> cxl/pci: Factor out interrupt policy check
> range: Add range_overlaps()
> dax/bus: Factor out dev dax resize logic
> dax: Document dax dev range tuple
> dax/region: Prevent range mapping allocation on sparse regions
> dax/region: Support DAX device creation on sparse DAX regions
> tools/testing/cxl: Make event logs dynamic
> tools/testing/cxl: Add DC Regions to mock mem data
> tools/testing/cxl: Add Dynamic Capacity events
>
> Navneet Singh (14):
> cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
> cxl/core: Separate region mode from decoder mode
> cxl/mem: Read dynamic capacity configuration from the device
> cxl/region: Add dynamic capacity decoder and region modes
> cxl/port: Add Dynamic Capacity mode support to endpoint decoders
> cxl/port: Add dynamic capacity size support to endpoint decoders
> cxl/mem: Expose device dynamic capacity capabilities
> cxl/region: Add Dynamic Capacity CXL region support
> cxl/mem: Configure dynamic capacity interrupts
> cxl/region: Read existing extents on region creation
> cxl/extent: Realize extent devices
> dax/region: Create extent resources on DAX region driver load
> cxl/mem: Handle DCD add & release capacity events.
> cxl/mem: Trace Dynamic capacity Event Record
>
> Documentation/ABI/testing/sysfs-bus-cxl | 60 ++-
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/core.h | 10 +
> drivers/cxl/core/extent.c | 145 +++++
> drivers/cxl/core/hdm.c | 254 +++++++--
> drivers/cxl/core/mbox.c | 591 ++++++++++++++++++++-
> drivers/cxl/core/memdev.c | 76 +++
> drivers/cxl/core/port.c | 19 +
> drivers/cxl/core/region.c | 334 +++++++++++-
> drivers/cxl/core/trace.h | 65 +++
> drivers/cxl/cxl.h | 127 ++++-
> drivers/cxl/cxlmem.h | 114 ++++
> drivers/cxl/mem.c | 45 ++
> drivers/cxl/pci.c | 122 +++--
> drivers/dax/bus.c | 353 +++++++++---
> drivers/dax/bus.h | 4 +-
> drivers/dax/cxl.c | 127 ++++-
> drivers/dax/dax-private.h | 40 +-
> drivers/dax/hmem/hmem.c | 2 +-
> drivers/dax/pmem.c | 2 +-
> fs/btrfs/ordered-data.c | 10 +-
> include/linux/cxl-event.h | 31 ++
> include/linux/range.h | 7 +
> tools/testing/cxl/Kbuild | 1 +
> tools/testing/cxl/test/mem.c | 914 ++++++++++++++++++++++++++++----
> 25 files changed, 3152 insertions(+), 302 deletions(-)
> ---
> base-commit: dff54316795991e88a453a095a9322718a34034a
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
>
> Best regards,


2024-04-04 10:26:47

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 09/26] cxl/region: Add Dynamic Capacity CXL region support

On Sun, 24 Mar 2024 16:18:12 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> CXL devices optionally support dynamic capacity. CXL Regions must be
> configured correctly to access this capacity. Similar to ram and pmem
> partitions, DC Regions, as they are called in CXL 3.1, represent
> different partitions of the DPA space.
>
> Introduce the concept of a sparse DAX region. Add the create_dc_region
> sysfs entry to create sparse DC DAX regions. Special case DC capable
> regions to create a 0 sized seed DAX device to maintain backwards
> compatibility with older software which needs a default DAX device to
> hold the region reference.
>
> Flag sparse DAX regions to indicate 0 capacity available until such time
> as DC capacity is added.
>
> Interleaving is deferred in this series. Add an early check.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
With the -EBUSY others addressed LGTM. Fan's duplication comment
might be something to tidy up later.

Reviewed-by: Jonathan Cameron <[email protected]>

2024-04-04 15:08:13

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 10/26] cxl/events: Factor out event msgnum configuration

On Sun, 24 Mar 2024 16:18:13 -0700
Ira Weiny <[email protected]> wrote:

> Dynamic Capacity Devices (DCD) require events to process extent addition
> or removal. BIOS may have control over memory event processing.
>
> Factor out cxl_event_config_msgnums() in preparation for setting up DCD
> event interrupts separate from memory events.

>
> Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>

2024-04-04 15:09:17

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 11/26] cxl/pci: Delay event buffer allocation

On Sun, 24 Mar 2024 16:18:14 -0700
Ira Weiny <[email protected]> wrote:

> The event buffer does not need to be allocated if something has failed
> in setting up event irq's.
>
> In prep for adjusting event configuration for DCD events move the buffer
> allocation to the end of the event configuration.
>
> Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>

2024-04-04 15:10:47

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 12/26] cxl/pci: Factor out interrupt policy check

On Sun, 24 Mar 2024 16:18:15 -0700
Ira Weiny <[email protected]> wrote:

> Dynamic capacity devices (DCD) require interrupts to notify the host of
> events in the DCD log. The interrupts for DCD may be supported despite
> FW control of memory event logs.
>
> Prepare to support DCD event interrupts separate from other event
> interrupts by factoring out the check for event interrupt settings.
>
> Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>

2024-04-04 15:23:44

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts

On Sun, 24 Mar 2024 16:18:16 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism. The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications.
>
> Firmware can't configure DCD events to be FW controlled but can retain
> control of memory events. Split irq configuration of memory events and
> DCD events to allow for FW control of memory events while DCD is host
> controlled.
>
> Configure DCD event log interrupts on devices supporting dynamic
> capacity. Disable DCD if interrupts are not supported.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

Trivial comment inline and Fan's suggestion on the debug print seems sensible
to me. Either way

Reviewed-by: Jonathan Cameron <[email protected]>

> /**
> * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 15d418b3bc9b..d585f5fdd3ae 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -164,11 +164,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
> #define CXLDEV_EVENT_STATUS_WARN BIT(1)
> #define CXLDEV_EVENT_STATUS_FAIL BIT(2)
> #define CXLDEV_EVENT_STATUS_FATAL BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD BIT(4)
>
> #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
> CXLDEV_EVENT_STATUS_WARN | \
> CXLDEV_EVENT_STATUS_FAIL | \
> - CXLDEV_EVENT_STATUS_FATAL)
> + CXLDEV_EVENT_STATUS_FATAL| \

Space after L
You could realign the others but I wouldn't bother.


> + CXLDEV_EVENT_STATUS_DCD)




2024-04-04 16:07:38

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 15/26] range: Add range_overlaps()

On Sun, 24 Mar 2024 16:18:18 -0700
Ira Weiny <[email protected]> wrote:

> Code to support CXL Dynamic Capacity devices will have extent ranges
> which need to be compared for intersection not a subset as is being
> checked in range_contains().
>
> range_overlaps() is defined in btrfs with a different meaning from what
> is required in the standard range code. Dan Williams pointed this out
> in [1]. Adjust the btrfs call according to his suggestion there.
>
> Then add a generic range_overlaps().
>
> Cc: Dan Williams <[email protected]>
> Cc: Chris Mason <[email protected]>
> Cc: Josef Bacik <[email protected]>
> Cc: David Sterba <[email protected]>
> Cc: [email protected]
> Signed-off-by: Ira Weiny <[email protected]>
FWIW given it's well review already.

Reviewed-by: Jonathan Cameron <[email protected]>
>
> [1] https://lore.kernel.org/all/[email protected]/
> ---
> fs/btrfs/ordered-data.c | 10 +++++-----
> include/linux/range.h | 7 +++++++
> 2 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 59850dc17b22..032d30a49edc 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -111,8 +111,8 @@ static struct rb_node *__tree_search(struct rb_root *root, u64 file_offset,
> return NULL;
> }
>
> -static int range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
> - u64 len)
> +static int btrfs_range_overlaps(struct btrfs_ordered_extent *entry, u64 file_offset,
> + u64 len)
> {
> if (file_offset + len <= entry->file_offset ||
> entry->file_offset + entry->num_bytes <= file_offset)
> @@ -914,7 +914,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
>
> while (1) {
> entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> break;
>
> if (entry->file_offset >= file_offset + len) {
> @@ -1043,12 +1043,12 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
> }
> if (prev) {
> entry = rb_entry(prev, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> goto out;
> }
> if (next) {
> entry = rb_entry(next, struct btrfs_ordered_extent, rb_node);
> - if (range_overlaps(entry, file_offset, len))
> + if (btrfs_range_overlaps(entry, file_offset, len))
> goto out;
> }
> /* No ordered extent in the range */
> diff --git a/include/linux/range.h b/include/linux/range.h
> index 6ad0b73cb7ad..9a46f3212965 100644
> --- a/include/linux/range.h
> +++ b/include/linux/range.h
> @@ -13,11 +13,18 @@ static inline u64 range_len(const struct range *range)
> return range->end - range->start + 1;
> }
>
> +/* True if r1 completely contains r2 */
> static inline bool range_contains(struct range *r1, struct range *r2)
> {
> return r1->start <= r2->start && r1->end >= r2->end;
> }
>
> +/* True if any part of r1 overlaps r2 */
> +static inline bool range_overlaps(struct range *r1, struct range *r2)
> +{
> + return r1->start <= r2->end && r1->end >= r2->start;
> +}
> +
> int add_range(struct range *range, int az, int nr_range,
> u64 start, u64 end);
>
>


2024-04-04 16:14:05

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation


> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *extent)
> +{
> + uint64_t start = le64_to_cpu(extent->start_dpa);
> + uint64_t length = le64_to_cpu(extent->length);
> + struct range ext_range = (struct range){
space ) {

> + .start = start,
> + .end = start + length - 1,
> + };
> + struct range ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };

2024-04-04 16:33:02

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices

On Sun, 24 Mar 2024 16:18:19 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> Once all extents of an interleave set are present a region must
> surface an extent to the region.
>
> Without interleaving; endpoint decoder and region extents have a 1:1
> relationship. Future support for IW > 1 will maintain a N:1
> relationship between the device extents and region extents.
>
> Create a region extent device for every device extent found. Release of
> the extent device triggers a response to the underlying hardware extent.
>
> There is no strong use case to support the addition of extents which
> overlap previously accepted extent ranges. Reject such new extents
> until such time as a good use case emerges.
>
> Expose the necessary details of region extents by creating the following
> sysfs entries.
>
> /sys/bus/cxl/devices/dax_regionX/extentY
> /sys/bus/cxl/devices/dax_regionX/extentY/offset
> /sys/bus/cxl/devices/dax_regionX/extentY/length
> /sys/bus/cxl/devices/dax_regionX/extentY/label

Docs? The label in particular worries me a little as I'm not sure what
is in it. If it's the tag one possible format is a uuid (not a coincidence
that it is the same length) and interpreting that as characters isn't
going to get us far. I wonder if we have to treat it as a binary attr
given we have no idea what it is.

Otherwise a query inline that may well be answered in later patches.

>
> The use of the extent devices by the DAX layer is deferred to later
> patches.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>



> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct region_extent *reg_ext;
> + struct device *dev;
> + int rc, id;
> +
> + id = ida_alloc(&cxl_extent_ida, GFP_KERNEL);
> + if (id < 0)
> + return -ENOMEM;

Whilst it doesn't matter hugely, it's nice if the release does things
in opposite order of the creation. So perhaps move the ida_alloc
after kzalloc or reg_ext?

> +
> + reg_ext = kzalloc(sizeof(*reg_ext), GFP_KERNEL);
> + if (!reg_ext)
> + return -ENOMEM;
> +
> + reg_ext->hpa_range = *hpa_range;
> + reg_ext->ed_ext.dpa_range = *dpa_range;
> + reg_ext->ed_ext.cxled = cxled;
> + snprintf(reg_ext->label, DAX_EXTENT_LABEL_LEN, "%s", label);
> +
> + dev = &reg_ext->dev;
> + device_initialize(dev);
> + dev->id = id;
> + device_set_pm_not_required(dev);
> + dev->parent = &cxlr_dax->dev;
> + dev->type = &region_extent_type;
> + rc = dev_set_name(dev, "extent%d", dev->id);
> + if (rc)
> + goto err;
> +
> + rc = device_add(dev);
> + if (rc)
> + goto err;
> +
> + dev_dbg(dev, "DAX region extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
> + reg_ext);

Indent

> +
> +err:
> + dev_err(&cxlr_dax->dev, "Failed to initialize DAX extent dev HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + put_device(dev);
> + return rc;
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9e33a0976828..6b00e717e42b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> + struct range *extent, int opcode)
> +{
> + struct cxl_mbox_cmd mbox_cmd;
> + size_t size;
> +
> + struct cxl_mbox_dc_response *dc_res __free(kfree);
> + size = struct_size(dc_res, extent_list, 1);
> + dc_res = kzalloc(size, GFP_KERNEL);
> + if (!dc_res)
> + return -ENOMEM;
> +
> + dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
> + memset(dc_res->extent_list[0].reserved, 0, 8);
> + dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
> + dc_res->extent_list_size = cpu_to_le32(1);

I guess this comes up later, but such a response means that if we are offered
multiple extents in an add with the more flag set then we always reject all
but the first one.

> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = size,
> + .payload_in = dc_res,
> + };
> +
> + return cxl_internal_send_cmd(mds, &mbox_cmd);
> +}
> +
> static struct cxl_memdev_state *
> cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> {
> @@ -1029,6 +1055,23 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> return container_of(cxlds, struct cxl_memdev_state, cxlds);
> }
>
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent)
> +{
> + struct cxl_endpoint_decoder *cxled = extent->cxled;
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + dev_dbg(dev, "Releasing DC extent DPA %#llx - %#llx\n",
> + extent->dpa_range.start, extent->dpa_range.end);
> +
> + rc = cxl_send_dc_cap_response(mds, &extent->dpa_range, CXL_MBOX_OP_RELEASE_DC);

Long line that doesn't really need to be.

> + if (rc)
> + dev_dbg(dev, "Failed to respond releasing extent DPA %#llx - %#llx; %d\n",
> + extent->dpa_range.start, extent->dpa_range.end, rc);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_release_ed_extent, CXL);
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3e563ab29afe..7635ff109578 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,11 +1450,81 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }

> static int cxl_region_attach_position(struct cxl_region *cxlr,
> @@ -2684,6 +2754,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>
> dev = &cxlr_dax->dev;
> cxlr_dax->cxlr = cxlr;
> + cxlr->cxlr_dax = cxlr_dax;
> device_initialize(dev);
> lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> device_set_pm_not_required(dev);
> @@ -2799,7 +2870,10 @@ static int cxl_region_read_extents(struct cxl_region *cxlr)
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> + struct cxl_region *cxlr = cxlr_dax->cxlr;
>
> + cxlr->cxlr_dax = NULL;
> + cxlr_dax->cxlr = NULL;
> device_unregister(&cxlr_dax->dev);
> }
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d585f5fdd3ae..5379ad7f5852 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h


> +/**
> + * struct region_extent - CXL DAX region extent
> + * @dev: device representing this extent
> + * @hpa_range: HPA range of this extent
> + * @label: label of the extent
> + * @ed_ext: Endpoint decoder extent which backs this extent
> + */
> +#define DAX_EXTENT_LABEL_LEN 64

Something called DAX_* doesn't belong in this header...
Either give a CXL_DAX_ prefix or move the definition if appropriate.

2024-04-04 16:35:47

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

On Sun, 24 Mar 2024 16:18:17 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case creation of a new
> region on top of the DC partition (region) is expected to expose those
> extents for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized read the device extent list. For ease of review, this patch
> stops after reading the extent list and leaves realization of the region
> extents to a future patch.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

A few things inline.

J
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> + unsigned int *extent_gen_num)
> +{
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + struct cxl_mbox_get_dc_extent_out dc_extents;
> + struct cxl_mbox_cmd mbox_cmd;
> + unsigned int count;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(0),
> + .start_extent_index = cpu_to_le32(0),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = sizeof(dc_extents),
> + .payload_out = &dc_extents,
> + .min_out = 1,

Why 1?

> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + count = le32_to_cpu(dc_extents.total_extent_cnt);
> + *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> +
> + return count;
> +}
> +
> +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
> + unsigned int start_gen_num,
> + unsigned int exp_cnt)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + unsigned int start_index, total_read;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> +
> + struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
> + kvmalloc(mds->payload_size, GFP_KERNEL);
> + if (!dc_extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + start_index = 0;
> + do {
> + unsigned int nr_ext, total_extent_cnt, gen_num;
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(exp_cnt - start_index),
> + .start_extent_index = cpu_to_le32(start_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = mds->payload_size,
> + .payload_out = dc_extents,
> + .min_out = 1,

Why 1?

> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
> + total_read += nr_ext;
> + total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> + gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +
> + dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> + total_extent_cnt, gen_num);
> +
> + if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> + dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
> + gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> + return -EIO;
> + }
> +
> + for (int i = 0; i < nr_ext ; i++) {
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + start_index + i, exp_cnt);
> + rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
> + if (rc)
> + continue;

A blank line here

> + if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
> + continue;
and here would make this more readable I think.

> + rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
> + if (rc)
> + return rc;
> + }
> +
> + start_index += nr_ext;
> + } while (exp_cnt > total_read);
> +
> + return 0;
> +}

> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 0d7b09a49dcf..3e563ab29afe 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c

>
> +static int cxl_region_read_extents(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int i;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + int rc;
Maybe worth giving up early if we see nr_targets > 1?

If nothing else it saves people trying to figure out what happens if we
reboot into an older kernel that doesn't support interleave (from one
that does)

> +
> + rc = cxl_read_dc_extents(p->targets[i]);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +

> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 01bee6eedff3..8f2d8944d334 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h

> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51

Throw a table name or section name into the reference so people
can find it in CXL rNext.

> + */
> +#define CXL_DC_EXTENT_TAG_LEN 0x10
> +struct cxl_dc_extent {
> + __le64 start_dpa;
> + __le64 length;
> + u8 tag[CXL_DC_EXTENT_TAG_LEN];
> + __le16 shared_extn_seq;
> + u8 reserved[6];
> +} __packed;
> +

> +/*
> + * Get Dynamic Capacity Extent List; Output Payload
> + * CXL rev 3.1 section 8.2.9.9.9.2; Table 8-167
> + */
> +struct cxl_mbox_get_dc_extent_out {
> + __le32 ret_extent_cnt;
> + __le32 total_extent_cnt;
> + __le32 extent_list_num;

Naming isn't that clear given generation bit missing.

> + u8 rsvd[4];
> + struct cxl_dc_extent extent[];
> +} __packed;
> +


2024-04-04 17:03:47

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 17/26] dax/region: Create extent resources on DAX region driver load

On Sun, 24 Mar 2024 16:18:20 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> DAX regions mapping dynamic capacity partitions introduce a requirement
> for the memory backing the region to come and go as required. This
> results in a DAX region with sparse areas of memory backing. To track
> the sparseness of the region, DAX extent objects need to track
> sub-resource information as a new layer between the DAX region resource
> and DAX device range resources.
>
> Recall that DCD extents may be accepted when a region is first created.
> Extend this support on region driver load. Scan existing extents and
> create DAX extent resources as a first step to DAX extent realization.
>
> The lifetime of a DAX extent is tricky to manage because the extent life
> may end in one of two ways. First, the device may request the extent be
> released. Second, the region may release the extent when it is
> destroyed without hardware involvement. Support extent release without
> hardware involvement first. Subsequent patches will provide for
> hardware to request extent removal.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
LGTM though I'm far from an expert on DAX..

Reviewed-by: Jonathan Cameron <[email protected]>

>
> ---
> Changes for v1
> [iweiny: remove xarrays]
> [iweiny: remove as much of extra reference stuff as possible]
> [iweiny: Move extent resource handling to core DAX code]
> ---
> drivers/dax/bus.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/dax/cxl.c | 43 ++++++++++++++++++++++++++++++++++--
> drivers/dax/dax-private.h | 12 +++++++++++
> 3 files changed, 108 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 903566aff5eb..4d5ed7ab6537 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -186,6 +186,61 @@ static bool is_sparse(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> }
>
> +static int dax_region_add_resource(struct dax_region *dax_region,
> + struct dax_extent *dax_ext,
> + resource_size_t start,
> + resource_size_t length)
> +{
> + struct resource *ext_res;
> +
> + dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
> + ext_res = __request_region(&dax_region->res, start, length, "extent", 0);
> + if (!ext_res) {
> + dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
> + &start, &length);
> + return -ENOSPC;
> + }
> +
> + dax_ext->region = dax_region;
> + dax_ext->res = ext_res;
> + dev_dbg(dax_region->dev, "Extent add resource %pr\n", ext_res);
> +
> + return 0;
> +}


2024-04-04 17:05:30

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 18/26] cxl/mem: Handle DCD add & release capacity events.

On Sun, 24 Mar 2024 16:18:21 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> A dynamic capacity devices (DCD) send events to signal the host about
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents, the addition or removal of which may occur at
> any time.
>
> Adding memory is straight forward. If no region exists the extent is
> rejected. If a region does exist, a region extent is formed and
> surfaced.
>
> Removing memory requires checking if the memory is currently in use.
> Memory use tracking is added in a subsequent patch so here the memory is
> never in use and the removal occurs immediately.
>
> Most often extents will be offered to and accepted by the host in well
> defined chunks. However, part of an extent may be requested for
> release. Simplify extent tracking by signaling removal of any extent
> which overlaps the requested release range.
>
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive or otherwise broken.
> Purposely ignore force removal events.
>
> Process DCD extents.
>
> Recall that all devices of an interleave set must offer a corresponding
> extent for the region extent to be realized. This patch limits
> interleave to 1. Thus the 1:1 mapping between device extent and DAX
> region extent allows immediate surfacing.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

..

> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 6b00e717e42b..7babac2d1c95 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -870,6 +870,37 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_notify_dc_extent(struct cxl_memdev_state *mds,
> + enum dc_event event,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct cxl_drv_nd nd = (struct cxl_drv_nd) {
> + .event = event,
> + .dc_extent = dc_extent
> + };
> + struct device *dev;
> + int rc = -ENXIO;
> +
> + dev = &mds->cxlds.cxlmd->dev;
> + dev_dbg(dev, "Notify: type %d DPA:%#llx LEN:%#llx\n",
> + event, le64_to_cpu(dc_extent->start_dpa),
> + le64_to_cpu(dc_extent->length));
> +
> + device_lock(dev);

guard(device)(dev);
if (!dev->driver)
return -ENXIO;

...


> + if (dev->driver) {
> + struct cxl_driver *mem_drv = to_cxl_drv(dev->driver);
> +
> + if (mem_drv->notify) {
> + dev_dbg(dev, "Notify driver type %d DPA:%#llx LEN:%#llx\n",
> + event, le64_to_cpu(dc_extent->start_dpa),
> + le64_to_cpu(dc_extent->length));
> + rc = mem_drv->notify(dev, &nd);
> + }
> + }
> + device_unlock(dev);
> + return rc;
> +}


..

> +static int cxl_handle_dcd_add_event(struct cxl_memdev_state *mds,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct range alloc_range, *resp_range;
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + alloc_range = (struct range){
> + .start = le64_to_cpu(dc_extent->start_dpa),
> + .end = le64_to_cpu(dc_extent->start_dpa) +
> + le64_to_cpu(dc_extent->length) - 1,
> + };
> + resp_range = &alloc_range;
Code structure is a little odd to follow as sets up a bunch of stuff
that may or may not be used, perhaps duplicate final call.
I'm not 100% convinced it is worth it though.


rc = cxl_notify_dc_extents(mds, DCD_ADD_CAPACITY, dc_extent);
if (rc) {
dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
le64_to_cpu(dc_extent->start_dpa),
le64_to_cpu(dc_extent->length));
return cxl_send_dc_cap_response(mds, NULL,
CXL_MBOX_OP_ADD_DC_RESPONSE);
}

alloc_range = (struct range){
.start = le64_to_cpu(dc_extent->start_dpa),
.end = le64_to_cpu(dc_extent->start_dpa) +
le64_to_cpu(dc_extent->length) - 1,
};

return cxl_send_dc_cap_response(mds, &alloc_range,
CXL_MBOX_OP_ADD_DC_RESPONSE);


> +
> + rc = cxl_notify_dc_extent(mds, DCD_ADD_CAPACITY, dc_extent);
> + if (rc) {
> + dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
> + le64_to_cpu(dc_extent->start_dpa),
> + le64_to_cpu(dc_extent->length));
> + resp_range = NULL;
> + }
> +
> + return cxl_send_dc_cap_response(mds, resp_range,
> + CXL_MBOX_OP_ADD_DC_RESPONSE);
> +}

> +static int cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> + struct cxl_event_record_raw *raw_rec)
> +{
> + struct cxl_event_dcd *event = &raw_rec->event.dcd;
> + struct cxl_dc_extent *dc_extent = &event->extent;
> + struct device *dev = mds->cxlds.dev;
> + uuid_t *id = &raw_rec->id;
> +
> + if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
> + return -EINVAL;
> +
> + dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
> + cxl_dcd_evt_type_str(event->event_type),
> + le64_to_cpu(dc_extent->start_dpa),
> + le64_to_cpu(dc_extent->length));
> +
> + switch (event->event_type) {
> + case DCD_ADD_CAPACITY:
> + return cxl_handle_dcd_add_event(mds, dc_extent);
> + case DCD_RELEASE_CAPACITY:
> + return cxl_handle_dcd_release_event(mds, dc_extent);
> + case DCD_FORCED_CAPACITY_RELEASE:
> + dev_err_ratelimited(dev, "Forced release event ignored.\n");
> + return 0;
> + default:
> + return -EINVAL;
> + }
> +
> + return 0;

dead code.

> +}
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1109,9 +1225,17 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> if (!nr_rec)
> break;
>
> - for (i = 0; i < nr_rec; i++)
> + for (i = 0; i < nr_rec; i++) {
> __cxl_event_trace_record(cxlmd, type,
> &payload->records[i]);
> + if (type == CXL_EVENT_TYPE_DCD) {

Perhaps flip condition so we can reduce indent.

if (type != CXL_EVENT_TYPE_DCD)
continue;
rc =
> + rc = cxl_handle_dcd_event_records(mds,
> + &payload->records[i]);
> + if (rc)
> + dev_err_ratelimited(dev, "dcd event failed: %d\n",
> + rc);
> + }
> + }
>
> if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
> trace_cxl_overflow(cxlmd, type, payload);

> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 7635ff109578..a07d95136f0d 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,6 +1450,57 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +int cxl_region_notify_extent(struct cxl_region *cxlr, enum dc_event event,
> + struct region_extent *reg_ext)
> +{
> + struct cxl_dax_region *cxlr_dax;
> + struct device *dev;
> + int rc = -ENXIO;
> +
> + cxlr_dax = cxlr->cxlr_dax;
> + dev = &cxlr_dax->dev;
> + dev_dbg(dev, "Trying notify: type %d HPA %#llx - %#llx\n",
> + event, reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + device_lock(dev);

guard(device)(dev);
or scoped_guard() if you are adding things in later patches (I haven't checked yet)

> + if (dev->driver) {
> + struct cxl_driver *reg_drv = to_cxl_drv(dev->driver);
> + struct cxl_drv_nd nd = (struct cxl_drv_nd) {
> + .event = event,
> + .reg_ext = reg_ext,
> + };
> +
> + if (reg_drv->notify) {
> + dev_dbg(dev, "Notify: type %d HPA %#llx - %#llx\n",
> + event, reg_ext->hpa_range.start,
> + reg_ext->hpa_range.end);
> + rc = reg_drv->notify(dev, &nd);
> + }
> + }
> + device_unlock(dev);
> + return rc;
> +}
> +
> +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,

I'd be tempted to drag this to earlier patch.
Whilst it may look over the top there to have a separate function
I think that is cleaner than introducing the code and then factoring
it out in a patch doing lots of stuff like this one.

> + struct cxl_dax_region *cxlr_dax,
> + struct cxl_dc_extent *dc_extent,
> + struct range *dpa_range,
> + struct range *hpa_range)
> +{
> + resource_size_t dpa_offset, hpa;
> +
> + /*
> + * Without interleave...
> + * HPA offset == DPA offset
> + * ... but do the math anyway
> + */
> + dpa_offset = dpa_range->start - cxled->dpa_res->start;
> + hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> + hpa_range->start = hpa - cxlr_dax->hpa_range.start;
> + hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> +}
> +
> static int extent_check_overlap(struct device *dev, void *arg)
> {
> struct range *new_range = arg;
> @@ -1480,7 +1531,6 @@ int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> struct cxl_region *cxlr = cxled->cxld.region;
> struct range ext_dpa_range, ext_hpa_range;
> struct device *dev = &cxlr->dev;
> - resource_size_t dpa_offset, hpa;
>
> /*
> * Interleave ways == 1 means this coresponds to a 1:1 mapping between
> @@ -1502,18 +1552,7 @@ int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> dev_dbg(dev, "Adding DC extent DPA %#llx - %#llx\n",
> ext_dpa_range.start, ext_dpa_range.end);
>
> - /*
> - * Without interleave...
> - * HPA offset == DPA offset
> - * ... but do the math anyway
> - */
> - dpa_offset = ext_dpa_range.start - cxled->dpa_res->start;
> - hpa = cxled->cxld.hpa_range.start + dpa_offset;
> -
> - ext_hpa_range = (struct range) {
> - .start = hpa - cxlr->cxlr_dax->hpa_range.start,
> - .end = ext_hpa_range.start + range_len(&ext_dpa_range) - 1,
> - };
> + calc_hpa_range(cxled, cxlr->cxlr_dax, dc_extent, &ext_dpa_range, &ext_hpa_range);
>
> if (extent_overlaps(cxlr->cxlr_dax, &ext_hpa_range))
> return -EINVAL;
> @@ -1527,6 +1566,80 @@ int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> cxled);
> }

> +static int cxl_rm_reg_ext_by_range(struct device *dev, void *data)
> +{
> + struct rm_data *rm_data = data;
> + struct region_extent *reg_ext;
> +
> + if (!is_region_extent(dev))
> + return 0;
> + reg_ext = to_region_extent(dev);
> +
> + /*
> + * Any extent which 'touches' the released range is notified
> + * for removal. No partials of the extent are released.
> + */
> + if (range_overlaps(rm_data->range, &reg_ext->hpa_range)) {
> + struct cxl_region *cxlr = rm_data->cxlr;
> +
> + dev_dbg(dev, "Remove DAX region ext HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> + cxl_ed_rm_region_extent(cxlr, reg_ext);

Is it worth improving efficiency if we have a precise match and returning 1
to stop iterating? Perhaps premature optimization.

> + }
> + return 0;
> +}
> +
> +static int cxl_ed_rm_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct cxl_region *cxlr = cxled->cxld.region;
> + struct range hpa_range;
> +
> + struct range rel_dpa_range = {
> + .start = le64_to_cpu(dc_extent->start_dpa),
> + .end = le64_to_cpu(dc_extent->start_dpa) +
> + le64_to_cpu(dc_extent->length) - 1,
> + };
> +
> + calc_hpa_range(cxled, cxlr->cxlr_dax, dc_extent, &rel_dpa_range, &hpa_range);
> +
> + struct rm_data rm_data = {
> + .cxlr = cxlr,
> + .range = &hpa_range,
> + };
> +
> + return device_for_each_child(&cxlr->cxlr_dax->dev, &rm_data,
> + cxl_rm_reg_ext_by_range);
> +}
> +

> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5379ad7f5852..156d7c9a8de5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -10,6 +10,7 @@
> #include <linux/log2.h>
> #include <linux/node.h>
> #include <linux/io.h>
> +#include <linux/cxl-event.h>
>
> /**
> * DOC: cxl objects
> @@ -613,6 +614,14 @@ struct cxl_pmem_region {
> struct cxl_pmem_region_mapping mapping[];
> };
>
> +/* See CXL 3.0 8.2.9.2.1.5 */

Add a name for the section to help searching future spec

> +enum dc_event {
> + DCD_ADD_CAPACITY,
> + DCD_RELEASE_CAPACITY,
> + DCD_FORCED_CAPACITY_RELEASE,
> + DCD_REGION_CONFIGURATION_UPDATED,
> +};

> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 0c79d9ce877c..20832f09c40c 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -103,6 +103,50 @@ static int cxl_debugfs_poison_clear(void *data, u64 dpa)
> DEFINE_DEBUGFS_ATTRIBUTE(cxl_poison_clear_fops, NULL,
> cxl_debugfs_poison_clear, "%llx\n");
>

> +static int cxl_mem_notify(struct device *dev, struct cxl_drv_nd *nd)
> +{
> + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> + struct cxl_port *endpoint = cxlmd->endpoint;
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_dc_extent *dc_extent;
> + struct device *ep_dev;
> + int rc;
> +
> + dc_extent = nd->dc_extent;
> + dev_dbg(dev, "notify DC action %d DPA:%#llx LEN:%#llx\n",
> + nd->event, le64_to_cpu(dc_extent->start_dpa),
> + le64_to_cpu(dc_extent->length));
> +
> + ep_dev = device_find_child(&endpoint->dev, dc_extent,

Can use __free(put_device) magic here to deal with the trailing put device.
Minor tidy up, but nice to avoid the rc = / put / return rc dance

> + match_ep_decoder_by_range);
> + if (!ep_dev) {
> + dev_dbg(dev, "Extent DPA:%#llx LEN:%#llx not mapped; evt %d\n",
> + le64_to_cpu(dc_extent->start_dpa),
> + le64_to_cpu(dc_extent->length), nd->event);
> + return -ENXIO;
> + }
> +
> + cxled = to_cxl_endpoint_decoder(ep_dev);
> + rc = cxl_ed_notify_extent(cxled, nd);
> + put_device(ep_dev);
> + return rc;
> +}


> diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h
> index 03fa6d50d46f..6b745c913f96 100644
> --- a/include/linux/cxl-event.h
> +++ b/include/linux/cxl-event.h
> @@ -91,11 +91,42 @@ struct cxl_event_mem_module {
> u8 reserved[0x3d];
> } __packed;
>
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51

Carries forward from earlier. Throw a table heading
in there for easy of searching future specs.

> + */
> +#define CXL_DC_EXTENT_TAG_LEN 0x10
> +struct cxl_dc_extent {
> + __le64 start_dpa;
> + __le64 length;
> + u8 tag[CXL_DC_EXTENT_TAG_LEN];
> + __le16 shared_extn_seq;
> + u8 reserved[0x6];
> +} __packed;
> +
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> + */
> +struct cxl_event_dcd {
> + struct cxl_event_record_hdr hdr;
> + u8 event_type;
> + u8 validity_flags;
> + __le16 host_id;

Could perhaps add a comment that this field isn't ever set for host.
It's there for FM event records when the host has sent the device
an Add capacity response or Capacity is released.

> + u8 region_index;
> + u8 flags;
> + u8 reserved1[0x2];
> + struct cxl_dc_extent extent;
> + u8 reserved2[0x18];
> + __le32 num_avail_extents;
> + __le32 num_avail_tags;
> +} __packed;
> +
> union cxl_event {
> struct cxl_event_generic generic;
> struct cxl_event_gen_media gen_media;
> struct cxl_event_dram dram;
> struct cxl_event_mem_module mem_module;
> + struct cxl_event_dcd dcd;
> } __packed;
>
> /*
>


2024-04-04 17:15:44

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 19/26] dax/bus: Factor out dev dax resize logic

On Sun, 24 Mar 2024 16:18:22 -0700
Ira Weiny <[email protected]> wrote:

> Dynamic Capacity regions must limit dev dax resources to those areas
> which have extents backing real memory. Such DAX regions are dubbed
> 'sparse' regions. In order to manage where memory is available four
> alternatives were considered:
>
> 1) Create a single region resource child on region creation which
> reserves the entire region. Then as extents are added punch holes in
> this reservation. This requires new resource manipulation to punch
> the holes and still requires an additional iteration over the extent
> areas which may already have existing dev dax resources used.
>
> 2) Maintain an ordered xarray of extents which can be queried while
> processing the resize logic. The issue is that existing region->res
> children may artificially limit the allocation size sent to
> alloc_dev_dax_range(). IE the resource children can't be directly
> used in the resize logic to find where space in the region is. This
> also poses a problem of managing the available size in 2 places.
>
> 3) Maintain a separate resource tree with extents. This option is the
> same as 2) but with the different data structure. Most ideally there
> should be a unified representation of the resource tree not two places
> to look for space.
>
> 4) Create region resource children for each extent. Manage the dax dev
> resize logic in the same way as before but use a region child
> (extent) resource as the parents to find space within each extent.
>
> Option 4 can leverage the existing resize algorithm to find space within
> the extents. It manages the available space in a singular resource tree
> which is less complicated for finding space.
>
> In preparation for this change, factor out the dev_dax_resize logic.
> For static regions use dax_region->res as the parent to find space for
> the dax ranges. Future patches will use the same algorithm with
> individual extent resources as the parent.
>
> Signed-off-by: Ira Weiny <[email protected]>
Seems like a straight forward refactor to me. Some trivial comments inline.

However, maybe move this discussion to a different patch. It's a lot
to have here when code it really refers to is later (22 I think?)

>
> ---
> Changes for V1
> [iweiny: Rebase on new DAX region locking]
> [iweiny: Reword commit message]
> [iweiny: Drop reviews]
> ---
> drivers/dax/bus.c | 129 +++++++++++++++++++++++++++++++++---------------------
> 1 file changed, 79 insertions(+), 50 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 4d5ed7ab6537..bab19fc578d0 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c

>
> -static ssize_t dev_dax_resize(struct dax_region *dax_region,
> - struct dev_dax *dev_dax, resource_size_t size)
> +/**
> + * dev_dax_resize_static - Expand the device into the unused portion of the
> + * region. This may involve adjusting the end of an existing resource, or
> + * allocating a new resource.
> + *
> + * @parent: parent resource to allocate this range in
> + * @dev_dax: DAX device to be expanded
> + * @to_alloc: amount of space to alloc; must be <= space available in @parent
> + *
> + * Return the amount of space allocated or -ERRNO on failure
> + */
> +static ssize_t dev_dax_resize_static(struct resource *parent,
> + struct dev_dax *dev_dax,
> + resource_size_t to_alloc)
> {
> - resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
> - resource_size_t dev_size = dev_dax_size(dev_dax);
> - struct resource *region_res = &dax_region->res;
> - struct device *dev = &dev_dax->dev;
> struct resource *res, *first;
> - resource_size_t alloc = 0;
> int rc;
>
> - if (dev->driver)
> - return -EBUSY;
> - if (size == dev_size)
> - return 0;
> - if (size > dev_size && size - dev_size > avail)
> - return -ENOSPC;
> - if (size < dev_size)
> - return dev_dax_shrink(dev_dax, size);
> -
> - to_alloc = size - dev_size;
> - if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
> - "resize of %pa misaligned\n", &to_alloc))
> - return -ENXIO;
> -
> - /*
> - * Expand the device into the unused portion of the region. This
> - * may involve adjusting the end of an existing resource, or
> - * allocating a new resource.
> - */
> -retry:
> - first = region_res->child;
> - if (!first)
> - return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
> + first = parent->child;
> + if (!first) {
> + rc = alloc_dev_dax_range(parent, dev_dax,
> + parent->start, to_alloc);

Quirky indent I think.

..

> +static ssize_t dev_dax_resize(struct dax_region *dax_region,
> + struct dev_dax *dev_dax, resource_size_t size)
> +{
> + resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
Trivial...
Whilst here nice to tidy up and move that to_alloc to it's own line as
not nice to mix declarations with assignments with those that aren't.


> + resource_size_t dev_size = dev_dax_size(dev_dax);
> + struct device *dev = &dev_dax->dev;
> + resource_size_t alloc = 0;
> +
> + if (dev->driver)
> + return -EBUSY;
> + if (size == dev_size)
> + return 0;
> + if (size > dev_size && size - dev_size > avail)
> + return -ENOSPC;
> + if (size < dev_size)
> + return dev_dax_shrink(dev_dax, size);
> +
> + to_alloc = size - dev_size;
> + if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
> + "resize of %pa misaligned\n", &to_alloc))
> + return -ENXIO;
> +
> +retry:
> + alloc = dev_dax_resize_static(&dax_region->res, dev_dax, to_alloc);
> + if (alloc <= 0)
> + return alloc;
> to_alloc -= alloc;
> if (to_alloc)
> goto retry;
> @@ -1283,7 +1310,8 @@ static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
>
> to_alloc = range_len(&r);
> if (alloc_is_aligned(dev_dax, to_alloc))
> - rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
> + rc = alloc_dev_dax_range(&dax_region->res, dev_dax, r.start,
> + to_alloc);
> up_write(&dax_dev_rwsem);
> up_write(&dax_region_rwsem);
>
> @@ -1506,7 +1534,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
> device_initialize(dev);
> dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
>
> - rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
> + rc = alloc_dev_dax_range(&dax_region->res, dev_dax, dax_region->res.start,
> + data->size);
> if (rc)
> goto err_range;
>
>


2024-04-04 17:37:46

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 22/26] dax/region: Support DAX device creation on sparse DAX regions

On Sun, 24 Mar 2024 16:18:25 -0700
Ira Weiny <[email protected]> wrote:

> Previous patches introduced a new sparse DAX region type. This region
> type may have 0 or more bytes of backing memory.
>
> DAX devices already have the ability to reference sparse ranges of a DAX
> region. Leverage the range support of DAX devices to track memory
> across a sparse set of region extents.
>
> Requests for extent removal can be received from the device at any time.
> But the host is not obliged to release that memory until it is finished
> with it. Introduce a use count to track how many DAX devices are using
> an extent. If that extent is in use reject the removal of the extent.
>
> Leverage the region RW semaphore to protect the extent data as any
> changes to the use of the extent require DAX device, DAX region, and
> extent stability during those operations.
>
> Signed-off-by: Ira Weiny <[email protected]>
Comments are minor. I'm not 100% confident on this yet, but
that's more a case of I need to look at the end result of the whole
series. Fairly happy though so...

Reviewed-by: Jonathan Cameron <[email protected]>

>
> ---
> Changes for v3
> [iweiny: simplify the extent objects]
> [iweiny: refactor based on the new extent objects created]
> [iweiny: remove xarray]
> [iweiny: use lock/invalidate/cnt rather than kref]
> ---
> drivers/cxl/core/extent.c | 8 ++
> drivers/cxl/core/region.c | 6 +-
> drivers/cxl/cxl.h | 1 +
> drivers/dax/bus.c | 191 +++++++++++++++++++++++++++++++++++++++-------
> drivers/dax/bus.h | 3 +-
> drivers/dax/cxl.c | 55 ++++++++++++-
> drivers/dax/dax-private.h | 23 ++++++
> drivers/dax/hmem/hmem.c | 2 +-
> drivers/dax/pmem.c | 2 +-
> 9 files changed, 258 insertions(+), 33 deletions(-)
>


> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 56dddaceeccb..70a559763e8c 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -236,11 +236,32 @@ int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
> if (rc)
> return rc;
>
> - return devm_add_action_or_reset(ext_dev, dax_region_release_extent,
> + /* Assume the devm action will be configured without error */
> + dev_set_drvdata(ext_dev, dax_ext);
> + rc = devm_add_action_or_reset(ext_dev, dax_region_release_extent,
> no_free_ptr(dax_ext));

Indent needs tweaking.

> + if (rc)
> + dev_set_drvdata(ext_dev, NULL);
> + return rc;
> }
> EXPORT_SYMBOL_GPL(dax_region_add_extent);

> @@ -507,15 +553,26 @@ EXPORT_SYMBOL_GPL(kill_dev_dax);
> static void trim_dev_dax_range(struct dev_dax *dev_dax)
> {
> int i = dev_dax->nr_range - 1;
> - struct range *range = &dev_dax->ranges[i].range;
> + struct dev_dax_range *dev_range = &dev_dax->ranges[i];
> + struct range *range = &dev_range->range;
> struct dax_region *dax_region = dev_dax->region;
> + struct resource *res = &dax_region->res;

>
> WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
> dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
> (unsigned long long)range->start,
> (unsigned long long)range->end);
>
> - __release_region(&dax_region->res, range->start, range_len(range));
> + if (dev_range->dax_ext) {
> + res = dev_range->dax_ext->res;
> + dev_dbg(&dev_dax->dev, "Trim sparse extent %pr\n", res);
> + }
> +
> + __release_region(res, range->start, range_len(range));
> +
> + if (dev_range->dax_ext)

May be worth considering splitting this core bit into two
functions for dev_range->dax_ext and not. The overriding of res
is not giving nice readable code.


> + dev_range->dax_ext->use_cnt--;
> +
> if (--dev_dax->nr_range == 0) {
> kfree(dev_dax->ranges);
> dev_dax->ranges = NULL;

>
> /**
> - * dev_dax_resize_static - Expand the device into the unused portion of the
> - * region. This may involve adjusting the end of an existing resource, or
> - * allocating a new resource.
> + * __dev_dax_resize - Expand the device into the unused portion of the region.
> + * This may involve adjusting the end of an existing resource, or allocating a
> + * new resource.
> *
> * @parent: parent resource to allocate this range in
> * @dev_dax: DAX device to be expanded
> * @to_alloc: amount of space to alloc; must be <= space available in @parent
> + * @dax_ext: if sparse; the extent containing parent

If not, what? NULL, but maybe docs should make that explicit.

> *
> * Return the amount of space allocated or -ERRNO on failure
> */

> +static ssize_t dev_dax_resize_sparse(struct dax_region *dax_region,
> + struct dev_dax *dev_dax,
> + resource_size_t to_alloc)
> +{
> + struct dax_extent *dax_ext;
> + resource_size_t extent_max;
> + struct device *ext_dev;
> + ssize_t alloc;
> +
> + ext_dev = dax_region->sparse_ops->find_ext(dax_region, &extent_max,
> + dax_ext_match_avail_size);
> + if (!ext_dev)
> + return -ENOSPC;
> +
> + dax_ext = dev_get_drvdata(ext_dev);
> + if (!dax_ext)
> + return -ENOSPC;
> +
> + to_alloc = min(extent_max, to_alloc);
> + alloc = __dev_dax_resize(dax_ext->res, dev_dax, to_alloc, dax_ext);
> + if (alloc < 0)
> + dax_ext->use_cnt--;

Maybe define a put_dax_ext() / get_dax_ext() given this is operating somewhat like that
in that find_ext takes a reference and that is dropped on error.

> + return alloc;
> +}

> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 83ee45aff69a..3cb95e5988ae 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c

> +/**
> + * find_ext - Match Extent callback
> + * @dax_region: region to search
> + * @size_avail: the available size if an extent is found
> + * @match_fn: match function
> + *
> + * Callback to itterate through the child devices of the DAX region calling

Spell check. Iterate

> + * match_fn only on those devices which are extents.
> + *
> + * If a match is found match_fn is responsible for locking or reference
> + * counting dax_ext as needed.
> + */
> +static struct device *find_ext(struct dax_region *dax_region,
> + resource_size_t *size_avail,
> + match_cb match_fn)
> +{
> + struct match_data md = {
> + .match_fn = match_fn,
> + .size_avail = size_avail,
> + };
> + struct device *ext_dev;
> +
> + ext_dev = device_find_child(dax_region->dev, &md, cxl_dax_match_ext);
> +

Trivial but I'd drop this blank line to closely group the check with the find.

> + if (!ext_dev)
> + return NULL;
> +
> + /* caller must hold a count on extent data */
> + put_device(ext_dev);
> + return ext_dev;
> +}


2024-04-04 17:39:19

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 23/26] cxl/mem: Trace Dynamic capacity Event Record

On Sun, 24 Mar 2024 16:18:26 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
> Notify the host of extents being added or removed. User space has
> little use for these events other than for debugging.
>
> Add DC trace points to the trace log for debugging purposes.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>

One trivial thing

Reviewed-by: Jonathan Cameron <[email protected]>

> +/*
> + * DYNAMIC CAPACITY Event Record - DER
> + *
> + * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
> + */
> +
> +#define CXL_DC_ADD_CAPACITY 0x00
> +#define CXL_DC_REL_CAPACITY 0x01
> +#define CXL_DC_FORCED_REL_CAPACITY 0x02
> +#define CXL_DC_REG_CONF_UPDATED 0x03
> +#define show_dc_evt_type(type) __print_symbolic(type, \
> + { CXL_DC_ADD_CAPACITY, "Add capacity"}, \
> + { CXL_DC_REL_CAPACITY, "Release capacity"}, \
> + { CXL_DC_FORCED_REL_CAPACITY, "Forced capacity release"}, \
> + { CXL_DC_REG_CONF_UPDATED, "Region Configuration Updated" } \
Tidy up indents and alignment etc. Doesn't look consistent.


2024-04-04 17:49:21

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)


>
> Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
Hi Ira, Navneet.
>
> Remaining work:
>
> 1) Integrate the QoS work from Dave Jiang
> 2) Interleave support


More flag. This one I think is potentially important and don't
see any handling in here.

Whilst an FM could in theory be careful to avoid sending a
sparse set of extents, if the device is managing the memory range
(which is possible all it supports) and the FM issues an Initiate Dynamic
Capacity Add with Free (again may be all device supports) then we
can't stop the device issuing a bunch of sparse extents.

Now it won't be broken as such without this, but every time we
accept the first extent that will implicitly reject the rest.
That will look very ugly to an FM which has to poke potentially many
times to successfully allocate memory to a host.

I also don't think it will be that hard to support, but maybe I'm
missing something?

My first thought is it's just a loop in cxl_handle_dcd_add_extent()
over a list of extents passed in then slightly more complex response
generation.

I don't want this to block getting initial DCD support in but it
will be a bit ugly if we quickly support the more flag and then end
up with just one kernel that an FM has to be careful with...

Jonathan

2024-04-04 17:49:28

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 20/26] dax: Document dax dev range tuple

On Sun, 24 Mar 2024 16:18:23 -0700
Ira Weiny <[email protected]> wrote:

> The device DAX structure is being enhanced to track additional DCD
> information.
>
> The current range tuple was not fully documented. Document it prior to
> adding information for DC.
>
> Suggested-by: Jonathan Cameron <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>

There is a style convention for nested structs.
Maybe needs tweaking for a pointer like this though...

Perhaps poke it with kernel-doc script an see what comes out.

https://docs.kernel.org/doc-guide/kernel-doc.html#nested-structs-unions
>
> ---
> Changes for v1
> [iweiny: new patch]
> ---
> drivers/dax/dax-private.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index c6319c6567fb..ac1ccf158650 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -70,7 +70,10 @@ struct dax_mapping {
> * @dev - device core
> * @pgmap - pgmap for memmap setup / lifetime (driver owned)
> * @nr_range: size of @ranges
> - * @ranges: resource-span + pgoff tuples for the instance
> + * @ranges: range tuples of memory used
> + * @pgoff: page offset
> + * @range: resource-span
> + * @mapping: device to assist in interrogating the range layout
> */
> struct dev_dax {
> struct dax_region *region;
>


2024-04-04 08:59:04

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 08/26] cxl/mem: Expose device dynamic capacity capabilities

On Tue, 26 Mar 2024 11:30:38 -0700
fan <[email protected]> wrote:

> On Mon, Mar 25, 2024 at 04:40:16PM -0700, Davidlohr Bueso wrote:
> > On Sun, 24 Mar 2024, [email protected] wrote:
> >
> > > +What: /sys/bus/cxl/devices/memX/dc/region_count
> > > +Date: June, 2024
> > > +KernelVersion: v6.10
> > > +Contact: [email protected]
> > > +Description:
> > > + (RO) Number of Dynamic Capacity (DC) regions supported on the
> > > + device. May be 0 if the device does not support Dynamic
> > > + Capacity.
> >
> > If dcd is not supported then we should not have the dc/ directory
> > altogether.
> >
> > Thanks,
> > Davidlohr
>
> I also think so. However, I also noticed one thing (not DCD related).
> Even for a PMEM device, for example, we have a ram directory under the
> device directory.

True, but it's new ABI so we don't have to copy and Dan's patch to
allow for hiding static attribute directories has landed in the meantime.
So I vote for hiding it.

>
> ===================
> root@DT:~# cxl list
> [
> {
> "memdev":"mem0",
> "pmem_size":536870912,
> "serial":0,
> "host":"0000:0d:00.0"
> }
> ]
> root@DT:~# ls /sys/bus/cxl/devices/mem0/
> dc dev driver firmware firmware_version label_storage_size numa_node payload_max pmem pmem0 ram security serial subsystem trigger_poison_list uevent
> root@DT:~#
> ===================
>
> Fan
>


2024-04-03 20:46:34

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

On Wed, 27 Mar 2024 22:20:45 -0700
Ira Weiny <[email protected]> wrote:

> fan wrote:
> > On Sun, Mar 24, 2024 at 04:18:03PM -0700, [email protected] wrote:
> > > A git tree of this series can be found here:
>
> [snip]
>
> > >
> >
> > Hi Ira,
> > Have not got a chance to check the code yet, but I noticed one thing
> > when testing with my DCD emulation code.
> > Currently, if we do partial release, it seems the whole extent will be
> > removed. Is it designed intentionally?
> >
>
> Yes that is my intent. I specifically called that out in patch 18.
>
> https://lore.kernel.org/all/[email protected]/
>
> I thought we discussed this in one of the collaboration calls. Mainly
> this is to simplify by not attempting any split of the extents the host is
> tracking. It really is expected that the FM/device is going to keep those
> extents offered and release them in their entirety. I understand this may
> complicate the device because it may see a release of memory prior to the
> request of that release. And perhaps this complicates the device. But in
> that case it (or the FM really) should not attempt to release partial
> extents.

It was discussed at some point as you say. Feels like something that might not
be set in stone for ever, but for now it is a reasonable simplifying assumption.
The device might not maintain the separation of neighboring extents
but the FM probably will. If it turns out real use models are different,
then we 'guessed' wrong and get to write more complex code.

Device always has to cope with unsolicited release so don't think this adds
any burden. That includes a race where the host releases capacity when
it hasn't yet seen the event the device has sent to release part of the same
capacity. There is text about async release always being possible in the
spec to cover these overlapping cases but upshot of that one is it must be
permissible to release a containing capacity as you are doing.

Jonathan

> Ira
>
> [snip]


2024-04-05 13:55:17

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 07/26] cxl/port: Add dynamic capacity size support to endpoint decoders

On Sun, 24 Mar 2024 16:18:10 -0700
[email protected] wrote:

> From: Navneet Singh <[email protected]>
>
> To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> map DC partitions (regions). In addition to assigning the size of the
> DC partition, the decoder must assign any skip value from the previous
> decoder. This must be done within a contiguous DPA space.
>
> Two complications arise with Dynamic Capacity regions which did not
> exist with Ram and PMEM partitions. First, gaps in the DPA space can
> exist between and around the DC Regions. Second, the Linux resource
> tree does not allow a resource to be marked across existing nodes within
> a tree.
>
> For clarity, below is an example of an 60GB device with 10GB of RAM,
> 10GB of PMEM and 10GB for each of 2 DC Regions. The desired CXL mapping
> is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.
>
> DPA RANGE
> (dpa_res)
> 0GB 10GB 20GB 30GB 40GB 50GB 60GB
> |----------|----------|----------|----------|----------|----------|
>
> RAM PMEM DC0 DC1
> (ram_res) (pmem_res) (dc_res[0]) (dc_res[1])
> |----------|----------| <gap> |----------| <gap> |----------|
>
> RAM PMEM DC1
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> 0GB 5GB 10GB 15GB 20GB 30GB 40GB 50GB 60GB


To add another corner to the example, maybe map only part of DC1?

>
> The previous skip resource between RAM and PMEM was always a child of
> the RAM resource and fit nicely [see (S) below]. Because of this
> simplicity this skip resource reference was not stored in any CXL state.
> On release the skip range could be calculated based on the endpoint
> decoders stored values.
>
> Now when DC1 is being mapped 4 skip resources must be created as
> children. One for the PMEM resource (A), two of the parent DPA resource
> (B,D), and one more child of the DC0 resource (C).
>
> 0GB 10GB 20GB 30GB 40GB 50GB 60GB
> |----------|----------|----------|----------|----------|----------|
> | |
> |----------|----------| | |----------| | |----------|
> | | | | |
> (S) (A) (B) (C) (D)
> v v v v v
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> skip skip skip skip skip
>
> Expand the calculation of DPA freespace and enhance the logic to support
> mapping/unmapping DC DPA space. To track the potential of multiple skip
> resources an xarray is attached to the endpoint decoder. The existing
> algorithm between RAM and PMEM is consolidated within the new one to
> streamline the code even though the result is the storage of a single
> skip resource in the xarray.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: Update cover letter]
> ---
> drivers/cxl/core/hdm.c | 192 +++++++++++++++++++++++++++++++++++++++++++-----
> drivers/cxl/core/port.c | 2 +
> drivers/cxl/cxl.h | 2 +
> 3 files changed, 179 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index e22b6f4f7145..da7d58184490 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -210,6 +210,25 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);
>
> +static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> + struct cxl_port *port = cxled_to_port(cxled);
> + struct device *dev = &port->dev;
> + unsigned long index;
> + void *entry;
> +
> + xa_for_each(&cxled->skip_res, index, entry) {
> + struct resource *res = entry;
> +
> + dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
> + port->id, cxled->cxld.id, res);
> + __release_region(&cxlds->dpa_res, res->start,
> + resource_size(res));
> + xa_erase(&cxled->skip_res, index);
> + }
> +}
> +
> /*
> * Must be called in a context that synchronizes against this decoder's
> * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -220,15 +239,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> struct cxl_port *port = cxled_to_port(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct resource *res = cxled->dpa_res;
> - resource_size_t skip_start;
>
> lockdep_assert_held_write(&cxl_dpa_rwsem);
>
> - /* save @skip_start, before @res is released */
> - skip_start = res->start - cxled->skip;
> __release_region(&cxlds->dpa_res, res->start, resource_size(res));
> - if (cxled->skip)
> - __release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> + cxl_skip_release(cxled);
> cxled->skip = 0;
> cxled->dpa_res = NULL;
> put_device(&cxled->cxld.dev);
> @@ -263,6 +278,100 @@ static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> return mode - CXL_DECODER_DC0;
> }
>
> +static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
> + resource_size_t skip_base, resource_size_t skip_len)
> +{
> + struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> + const char *name = dev_name(&cxled->cxld.dev);
> + struct cxl_port *port = cxled_to_port(cxled);
> + struct resource *dpa_res = &cxlds->dpa_res;
> + struct device *dev = &port->dev;
> + struct resource *res;
> + int rc;
> +
> + res = __request_region(dpa_res, skip_base, skip_len, name, 0);
> + if (!res)
> + return -EBUSY;
> +
> + rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
> + if (rc) {
> + __release_region(dpa_res, skip_base, skip_len);
> + return rc;
> + }
> +
> + dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
> + port->id, cxled->cxld.id, res);
> + return 0;
> +}
> +
> +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> + resource_size_t base, resource_size_t skipped)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_port *port = cxled_to_port(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + resource_size_t skip_base = base - skipped;
> + struct device *dev = &port->dev;
> + resource_size_t skip_len = 0;
> + int rc, index;
> +
> + if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
> + skip_len = cxlds->ram_res.end - skip_base + 1;
> + rc = cxl_request_skip(cxled, skip_base, skip_len);
> + if (rc)
> + return rc;
> + skip_base += skip_len;
> + }
> +
> + if (skip_base == base) {
> + dev_dbg(dev, "skip done ram!\n");
> + return 0;
> + }
> +
> + if (resource_size(&cxlds->pmem_res) &&
> + skip_base <= cxlds->pmem_res.end) {
> + skip_len = cxlds->pmem_res.end - skip_base + 1;
> + rc = cxl_request_skip(cxled, skip_base, skip_len);
> + if (rc)
> + return rc;
> + skip_base += skip_len;
> + }
> +
> + index = dc_mode_to_region_index(cxled->mode);
> + for (int i = 0; i <= index; i++) {
> + struct resource *dcr = &cxlds->dc_res[i];
> +
> + if (skip_base < dcr->start) {
> + skip_len = dcr->start - skip_base;
> + rc = cxl_request_skip(cxled, skip_base, skip_len);
> + if (rc)
> + return rc;
> + skip_base += skip_len;
> + }
> +
> + if (skip_base == base) {
> + dev_dbg(dev, "skip done DC region %d!\n", i);
> + break;
> + }
> +
> + if (resource_size(dcr) && skip_base <= dcr->end) {
> + if (skip_base > base) {
> + dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
> + i, &skip_base, &base);
> + return -ENXIO;
> + }
> +
> + skip_len = dcr->end - skip_base + 1;
> + rc = cxl_request_skip(cxled, skip_base, skip_len);
> + if (rc)
> + return rc;
> + skip_base += skip_len;
> + }
> + }
> +
> + return 0;
> +}
> +
> static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> resource_size_t base, resource_size_t len,
> resource_size_t skipped)
> @@ -300,13 +409,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> }
>
> if (skipped) {
> - res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> - dev_name(&cxled->cxld.dev), 0);
> - if (!res) {
> - dev_dbg(dev,
> - "decoder%d.%d: failed to reserve skipped space\n",
> - port->id, cxled->cxld.id);
> - return -EBUSY;
> + int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
> +
> + if (rc) {
> + dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
> + port->id, cxled->cxld.id, &base, &skipped);
> + return rc;
> }
> }
> res = __request_region(&cxlds->dpa_res, base, len,
> @@ -314,14 +422,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> if (!res) {
> dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> port->id, cxled->cxld.id);
> - if (skipped)
> - __release_region(&cxlds->dpa_res, base - skipped,
> - skipped);
> + cxl_skip_release(cxled);
> return -EBUSY;
> }
> cxled->dpa_res = res;
> cxled->skip = skipped;
>
> + for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> + int index = dc_mode_to_region_index(mode);
> +
> + if (resource_contains(&cxlds->dc_res[index], res)) {
> + cxled->mode = mode;
> + goto success;
> + }
> + }
> if (resource_contains(&cxlds->pmem_res, res))
> cxled->mode = CXL_DECODER_PMEM;
> else if (resource_contains(&cxlds->ram_res, res))
> @@ -332,6 +446,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> cxled->mode = CXL_DECODER_MIXED;
> }
>
> +success:
> + dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
> + cxled->dpa_res, cxled->mode);
> port->hdm_end++;
> get_device(&cxled->cxld.dev);
> return 0;
> @@ -463,14 +580,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>
> int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> {
> + resource_size_t free_ram_start, free_pmem_start, free_dc_start;
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> - resource_size_t free_ram_start, free_pmem_start;
> struct cxl_port *port = cxled_to_port(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &cxled->cxld.dev;
> resource_size_t start, avail, skip;
> struct resource *p, *last;
> - int rc;
> + int rc, dc_index;
>
> down_write(&cxl_dpa_rwsem);

Obviously not related to this patch as such, but maybe a good place
for scoped_guard() to avoid the dance around unlocking the rwsem
and allow some early returns on the error paths.


> if (cxled->cxld.region) {
> @@ -500,6 +617,21 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> else
> free_pmem_start = cxlds->pmem_res.start;
>
> + /*
> + * Limit each decoder to a single DC region to map memory with
> + * different DSMAS entry.
> + */
> + dc_index = dc_mode_to_region_index(cxled->mode);
> + if (dc_index >= 0) {
> + if (cxlds->dc_res[dc_index].child) {
> + dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
> + dc_index);
> + rc = -EINVAL;
> + goto out;
> + }
> + free_dc_start = cxlds->dc_res[dc_index].start;
> + }
> +
> if (cxled->mode == CXL_DECODER_RAM) {
> start = free_ram_start;
> avail = cxlds->ram_res.end - start + 1;
> @@ -521,12 +653,38 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> else
> skip_end = start - 1;
> skip = skip_end - skip_start + 1;
> + } else if (cxl_decoder_mode_is_dc(cxled->mode)) {
> + resource_size_t skip_start, skip_end;
> +
> + start = free_dc_start;
> + avail = cxlds->dc_res[dc_index].end - start + 1;
> + if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> + skip_start = free_ram_start;
> + else
> + skip_start = free_pmem_start;
> + /*
> + * If any dc region is already mapped, then that allocation
> + * already handled the RAM and PMEM skip. Check for DC region
> + * skip.
> + */
> + for (int i = dc_index - 1; i >= 0 ; i--) {
> + if (cxlds->dc_res[i].child) {
> + skip_start = cxlds->dc_res[i].child->end + 1;
> + break;
> + }
> + }
> +
> + skip_end = start - 1;
> + skip = skip_end - skip_start + 1;

I notice in the pmem equivalent there is a case for part of the region already mapped.
Can that not happen for a DC region as well?

> } else {
> dev_dbg(dev, "mode not set\n");
> rc = -EINVAL;
> goto out;
> }
>
> + dev_dbg(dev, "DPA Allocation start: %pa len: %#llx Skip: %pa\n",
> + &start, size, &skip);
> +
> if (size > avail) {
> dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
> cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",


2024-04-05 18:09:50

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

J?rgen Hansen wrote:
> On 3/25/24 00:18, [email protected] wrote:
>
> > From: Navneet Singh <[email protected]>
> >

[snip]

> > /**
> > * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> > *
> > @@ -467,6 +482,8 @@ struct cxl_dev_state {
> > * @enabled_cmds: Hardware commands found enabled in CEL.
> > * @exclusive_cmds: Commands that are kernel-internal only
> > * @total_bytes: sum of all possible capacities
> > + * @static_cap: Sum of static RAM and PMEM capacities
> > + * @dynamic_cap: Complete DPA range occupied by DC regions
>
> How about naming these total_range, static_cap and dynamic_range to make
> it clear that the DPA range occupied by DC regions isn't necessarily
> usable capacity (as opposed to the static_cap where the spec defines it
> as usable capacity).

I thought this was a good idea but on second thought these are not range
variables at all. They really represent the various lengths of the
resources.

For total_bytes the documentation already says 'sum of all __possible__
capacities'.

I think you have a point for the new fields though. They should all be
named in some consistent manner and documented as such.

So I propose:

diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 94531af018f8..9c18b229f69a 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -481,9 +481,9 @@ struct cxl_dc_region_info {
* @dcd_cmds: List of DCD commands implemented by memory device
* @enabled_cmds: Hardware commands found enabled in CEL.
* @exclusive_cmds: Commands that are kernel-internal only
- * @total_bytes: sum of all possible capacities
- * @static_cap: Sum of static RAM and PMEM capacities
- * @dynamic_cap: Complete DPA range occupied by DC regions
+ * @total_bytes: length of all possible capacities
+ * @static_bytes: length of possible static RAM and PMEM partitions
+ * @dynamic_bytes: length of possible DC partitions (DC Regions)
* @volatile_only_bytes: hard volatile capacity
* @persistent_only_bytes: hard persistent capacity
* @partition_align_bytes: alignment size for partition-able capacity
@@ -515,8 +515,8 @@ struct cxl_memdev_state {
DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);

u64 total_bytes;
- u64 static_cap;
- u64 dynamic_cap;
+ u64 static_bytes;
+ u64 dynamic_bytes;
u64 volatile_only_bytes;
u64 persistent_only_bytes;
u64 partition_align_bytes;

2024-04-05 18:20:00

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 04/26] cxl/region: Add dynamic capacity decoder and region modes

Dave Jiang wrote:
>
>
> On 3/24/24 4:18 PM, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >
> > Region mode must reflect a general dynamic capacity type which is
> > associated with a specific Dynamic Capacity (DC) partitions in each
> > device decoder within the region. DC partitions are also know as DC
> > regions per CXL 3.1.
>
> This section reads somewhat awkward to me. Does this read any better?
>
> One or more Dynamic Capacity (DC) partitions (and decoders) form a CXL
> software region. The region mode reflects composition of that entire software
> region. Decoder mode reflects a specific DC partition. DC partitions are also
> known as DC regions per CXL specification r3.1 but is not the same entity as
> CXL software regions.

Yea that does sound better but I think this builds on your text and is even
more clear.

<commit>
cxl/region: Add dynamic capacity decoder and region modes

One or more decoders each pointing to a Dynamic Capacity (DC) partition form a
CXL software region. The region mode reflects composition of that entire
software region. Decoder mode reflects a specific DC partition. DC partitions
are also known as DC regions per CXL specification r3.1 but they are not the
same entity as CXL software regions.

Define the new modes and helper functions required to make the association
between these new modes.

</commit>


Ira

2024-04-05 19:22:32

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

Dave Jiang wrote:
>
>
> On 3/24/24 4:18 PM, Ira Weiny wrote:
> > cxl_dpa_set_mode() checks the mode for validity two times, once outside
> > of the DPA RW semaphore and again within. The function is not in a
> > critical path. Prior to Dynamic Capacity the extra check was not much
> > of an issue. The addition of DC modes increases the complexity of
> > the check.
> >
> > Simplify the mode check before adding the more complex DC modes.
>
> I would augment this by saying simplify "by using scope-based resource menagement".

However, using the guard cleanup is not really the simplification here. It is
more about checking the mode a single time.

That said I will change this to:

Simplify the mode check and convert to use of a cleanup guard.

Ira

2024-04-06 00:01:45

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 04/26] cxl/region: Add dynamic capacity decoder and region modes



On 4/5/24 11:19 AM, Ira Weiny wrote:
> Dave Jiang wrote:
>>
>>
>> On 3/24/24 4:18 PM, [email protected] wrote:
>>> From: Navneet Singh <[email protected]>
>>>
>>> Region mode must reflect a general dynamic capacity type which is
>>> associated with a specific Dynamic Capacity (DC) partitions in each
>>> device decoder within the region. DC partitions are also know as DC
>>> regions per CXL 3.1.
>>
>> This section reads somewhat awkward to me. Does this read any better?
>>
>> One or more Dynamic Capacity (DC) partitions (and decoders) form a CXL
>> software region. The region mode reflects composition of that entire software
>> region. Decoder mode reflects a specific DC partition. DC partitions are also
>> known as DC regions per CXL specification r3.1 but is not the same entity as
>> CXL software regions.
>
> Yea that does sound better but I think this builds on your text and is even
> more clear.
>
> <commit>
> cxl/region: Add dynamic capacity decoder and region modes
>
> One or more decoders each pointing to a Dynamic Capacity (DC) partition form a
> CXL software region. The region mode reflects composition of that entire
> software region. Decoder mode reflects a specific DC partition. DC partitions
> are also known as DC regions per CXL specification r3.1 but they are not the
> same entity as CXL software regions.
>
> Define the new modes and helper functions required to make the association
> between these new modes.
>
> </commit>
>

LGTM
>
> Ira

2024-04-06 00:03:15

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()



On 4/5/24 12:21 PM, Ira Weiny wrote:
> Dave Jiang wrote:
>>
>>
>> On 3/24/24 4:18 PM, Ira Weiny wrote:
>>> cxl_dpa_set_mode() checks the mode for validity two times, once outside
>>> of the DPA RW semaphore and again within. The function is not in a
>>> critical path. Prior to Dynamic Capacity the extra check was not much
>>> of an issue. The addition of DC modes increases the complexity of
>>> the check.
>>>
>>> Simplify the mode check before adding the more complex DC modes.
>>
>> I would augment this by saying simplify "by using scope-based resource menagement".
>
> However, using the guard cleanup is not really the simplification here. It is
> more about checking the mode a single time.
>
> That said I will change this to:
>
> Simplify the mode check and convert to use of a cleanup guard.

Ok

>
> Ira

2024-04-09 00:44:04

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

On Sun, Mar 24, 2024 at 04:18:08PM -0700, Ira Weiny wrote:
> cxl_dpa_set_mode() checks the mode for validity two times, once outside
> of the DPA RW semaphore and again within.

Not true. It only checks mode once before the lock. It checks for
capacity after the lock. If it didn't check mode before the lock,
then unsupported modes would fall through.

> The function is not in a critical path.

Implying what here? OK to check twice (even though it wasn't)
or OK to expand scope of locking.

> Prior to Dynamic Capacity the extra check was not much
> of an issue. The addition of DC modes increases the complexity of
> the check.
>
> Simplify the mode check before adding the more complex DC modes.
>

The addition of the DC mode check doesn't seem complex.

Pardon my picking at the words, but if you'd like to refactor the
function, just say so. The final result is a bit more readable, but
also adding the DC mode checks without refactoring would read fine
also.

and...a bit spacing nit below -

> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: new patch]
> [Jonathan: based on getting rid of the loop in cxl_dpa_set_mode]
> [Jonathan: standardize on resource_size() == 0]
> ---
> drivers/cxl/core/hdm.c | 45 ++++++++++++++++++---------------------------
> 1 file changed, 18 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7d97790b893d..66b8419fd0c3 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -411,44 +411,35 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &cxled->cxld.dev;
> - int rc;
>
> + guard(rwsem_write)(&cxl_dpa_rwsem);
> + if (cxled->cxld.flags & CXL_DECODER_F_ENABLE)
> + return -EBUSY;
> +
> + /*
> + * Check that the mode is supported by the current partition
> + * configuration
> + */
> switch (mode) {
> case CXL_DECODER_RAM:
> + if (!resource_size(&cxlds->ram_res)) {
> + dev_dbg(dev, "no available ram capacity\n");
> + return -ENXIO;
> + }
> + break;
> case CXL_DECODER_PMEM:
> + if (!resource_size(&cxlds->pmem_res)) {
> + dev_dbg(dev, "no available pmem capacity\n");
> + return -ENXIO;
> + }
> break;
> default:
> dev_dbg(dev, "unsupported mode: %d\n", mode);
> return -EINVAL;
> }
>

delete extra line

> - down_write(&cxl_dpa_rwsem);
> - if (cxled->cxld.flags & CXL_DECODER_F_ENABLE) {
> - rc = -EBUSY;
> - goto out;
> - }
> -
> - /*
> - * Only allow modes that are supported by the current partition
> - * configuration
> - */
> - if (mode == CXL_DECODER_PMEM && !resource_size(&cxlds->pmem_res)) {
> - dev_dbg(dev, "no available pmem capacity\n");
> - rc = -ENXIO;
> - goto out;
> - }
> - if (mode == CXL_DECODER_RAM && !resource_size(&cxlds->ram_res)) {
> - dev_dbg(dev, "no available ram capacity\n");
> - rc = -ENXIO;
> - goto out;
> - }
> -
> cxled->mode = mode;
> - rc = 0;
> -out:
> - up_write(&cxl_dpa_rwsem);
> -
> - return rc;
insert blank line
> + return 0;
> }
>
> int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
>
> --
> 2.44.0
>

2024-04-09 02:00:43

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

On Sun, Mar 24, 2024 at 04:18:06PM -0700, Ira Weiny wrote:
> From: Navneet Singh <[email protected]>
>
> Devices can optionally support Dynamic Capacity (DC). These devices are
> known as Dynamic Capacity Devices (DCD).
>
> Implement the DC mailbox commands as specified in CXL 3.1 section
> 8.2.9.9.9 (opcodes 48XXh). Read the DC configuration and store the DC
> region information in the device state.

It seems worth mentioning that it validates against a bunch of
alignment rules. Speaking of which...


>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [J?rgen: ensure CXL 2.0 device support by removing dc_event_log_size]
> [iweiny/J?rgen: use get DC config command to signal DCD support]
> [djiang: fix subject]
> [Fan: add additional region configuration checks]
> [Jonathan/djiang: split out region mode changes]
> [Jonathan: fix up comments/kdoc]
> [Jonathan: s/cxl_get_dc_id/cxl_get_dc_config/]
> [Jonathan: use __free() in identify call]
> [Jonathan: remove unneeded formatting changes]
> [Jonathan: s/cxl_mbox_dynamic_capacity/cxl_mbox_get_dc_config_out/]
> [Jonathan: s/cxl_mbox_get_dc_config/cxl_mbox_get_dc_config_in/]
> [iweiny: remove type2 work dependancy/rebase on master]
> [iweiny: fix 0day build issues]
> ---
> drivers/cxl/core/mbox.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++-
> drivers/cxl/cxlmem.h | 49 +++++++++++++
> drivers/cxl/pci.c | 4 ++
> 3 files changed, 236 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index ed4131c6f50b..14e8a7528a8b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1123,7 +1123,7 @@ int cxl_dev_state_identify(struct cxl_memdev_state *mds)
> if (rc < 0)
> return rc;
>
> - mds->total_bytes =
> + mds->static_cap =
> le64_to_cpu(id.total_capacity) * CXL_CAPACITY_MULTIPLIER;
> mds->volatile_only_bytes =
> le64_to_cpu(id.volatile_capacity) * CXL_CAPACITY_MULTIPLIER;
> @@ -1230,6 +1230,175 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> return rc;
> }
>
> +static int cxl_dc_save_region_info(struct cxl_memdev_state *mds, u8 index,
> + struct cxl_dc_region_config *region_config)
> +{
> + struct cxl_dc_region_info *dcr = &mds->dc_region[index];
> + struct device *dev = mds->cxlds.dev;
> +
> + dcr->base = le64_to_cpu(region_config->region_base);
> + dcr->decode_len = le64_to_cpu(region_config->region_decode_length);
> + dcr->decode_len *= CXL_CAPACITY_MULTIPLIER;
> + dcr->len = le64_to_cpu(region_config->region_length);
> + dcr->blk_size = le64_to_cpu(region_config->region_block_size);
> + dcr->dsmad_handle = le32_to_cpu(region_config->region_dsmad_handle);
> + dcr->flags = region_config->flags;
> + snprintf(dcr->name, CXL_DC_REGION_STRLEN, "dc%d", index);
> +

Below - where are these rules defined in CXL spec?
Maybe one general comment referring to a CXL spec section if available?

> + /* Check regions are in increasing DPA order */

Better to state the rule and who's rule it is:
/* CXL spec mandates increasing DPA order */

> + /* Check regions are in increasing DPA order */
> + if (index > 0) {
> + struct cxl_dc_region_info *prev_dcr = &mds->dc_region[index - 1];
> +
> + if ((prev_dcr->base + prev_dcr->decode_len) > dcr->base)

Is that allowing overlap at dcr->base?

> + dev_err(dev,
> + "DPA ordering violation for DC region %d and %d\n",
> + index - 1, index);
> + return -EINVAL;
> + }
> + }
> +
> + if (!IS_ALIGNED(dcr->base, SZ_256M) ||
> + !IS_ALIGNED(dcr->base, dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid base %#llx blk size %#llx\n", index,
> + dcr->base, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + if (dcr->decode_len == 0 || dcr->len == 0 || dcr->decode_len < dcr->len ||
> + !IS_ALIGNED(dcr->len, dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid length; decode %#llx len %#llx blk size %#llx\n",
> + index, dcr->decode_len, dcr->len, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + if (dcr->blk_size == 0 || dcr->blk_size % 0x40 ||

OK - I know 0x40 must be cache align, but only because I saw
Jonathans comment. Please comment or macro.


> + !is_power_of_2(dcr->blk_size)) {
> + dev_err(dev, "DC region %d invalid block size; %#llx\n",
> + index, dcr->blk_size);
> + return -EINVAL;
> + }
> +
> + dev_dbg(dev,
> + "DC region %s DPA: %#llx LEN: %#llx BLKSZ: %#llx\n",
> + dcr->name, dcr->base, dcr->decode_len, dcr->blk_size);
> +
> + return 0;
> +}
> +
> +/* Returns the number of regions in dc_resp or -ERRNO */
> +static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> + struct cxl_mbox_get_dc_config_out *dc_resp,
> + size_t dc_resp_size)
> +{
> + struct cxl_mbox_get_dc_config_in get_dc = (struct cxl_mbox_get_dc_config_in) {
> + .region_count = CXL_MAX_DC_REGION,
> + .start_region_index = start_region,
> + };
> + struct cxl_mbox_cmd mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_CONFIG,
> + .payload_in = &get_dc,
> + .size_in = sizeof(get_dc),
> + .size_out = dc_resp_size,
> + .payload_out = dc_resp,
> + .min_out = 1,
> + };
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + rc = dc_resp->avail_region_count - start_region;
> +
> + /*
> + * The number of regions in the payload may have been truncated due to
> + * payload_size limits; if so adjust the returned count to match.
> + */
> + if (mbox_cmd.size_out < sizeof(*dc_resp))
> + rc = CXL_REGIONS_RETURNED(mbox_cmd.size_out);
> +
> + dev_dbg(dev, "Read %d/%d DC regions\n", rc, dc_resp->avail_region_count);
> +
> + return rc;
> +}
> +
> +static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +{
> + return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +
> +/**
> + * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> + * information from the device.
> + * @mds: The memory device state
> + *
> + * Read Dynamic Capacity information from the device and populate the state
> + * structures for later use.
> + *
> + * Return: 0 if identify was executed successfully, -ERRNO on error.
> + */
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> +{
> + size_t dc_resp_size = mds->payload_size;
> + struct device *dev = mds->cxlds.dev;
> + u8 start_region, i;
> + int rc = 0;
> +
> + for (i = 0; i < CXL_MAX_DC_REGION; i++)
> + snprintf(mds->dc_region[i].name, CXL_DC_REGION_STRLEN, "<nil>");
> +
> + /* Check GET_DC_CONFIG is supported by device */

Needless comment above due to nicely named cxl_dcd_supported() below

> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD not supported\n");
> + return 0;
> + }
> +
> + struct cxl_mbox_get_dc_config_out *dc_resp __free(kfree) =
> + kvmalloc(dc_resp_size, GFP_KERNEL);
> + if (!dc_resp)
> + return -ENOMEM;
> +
> + start_region = 0;
> + do {
> + int j;
> +
> + rc = cxl_get_dc_config(mds, start_region, dc_resp, dc_resp_size);
> + if (rc < 0) {
> + dev_dbg(dev, "Failed to get DC config: %d\n", rc);
> + return rc;
> + }
> +
> + mds->nr_dc_region += rc;
> +
> + if (mds->nr_dc_region < 1 || mds->nr_dc_region > CXL_MAX_DC_REGION) {
> + dev_err(dev, "Invalid num of dynamic capacity regions %d\n",
> + mds->nr_dc_region);
> + return -EINVAL;
> + }
> +
> + for (i = start_region, j = 0; i < mds->nr_dc_region; i++, j++) {
> + rc = cxl_dc_save_region_info(mds, i, &dc_resp->region[j]);
> + if (rc) {
> + dev_dbg(dev, "Failed to save region info: %d\n", rc);
> + return rc;
> + }
> + }
> +
> + start_region = mds->nr_dc_region;
> +
> + } while (mds->nr_dc_region < dc_resp->avail_region_count);
> +
> + mds->dynamic_cap =
> + mds->dc_region[mds->nr_dc_region - 1].base +
> + mds->dc_region[mds->nr_dc_region - 1].decode_len -
> + mds->dc_region[0].base;
> + dev_dbg(dev, "Total dynamic capacity: %#llx\n", mds->dynamic_cap);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> +
> static int add_dpa_res(struct device *dev, struct resource *parent,
> struct resource *res, resource_size_t start,
> resource_size_t size, const char *type)
> @@ -1260,8 +1429,12 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> {
> struct cxl_dev_state *cxlds = &mds->cxlds;
> struct device *dev = cxlds->dev;
> + size_t untenanted_mem;
> int rc;
>
> + untenanted_mem = mds->dc_region[0].base - mds->static_cap;
> + mds->total_bytes = mds->static_cap + untenanted_mem + mds->dynamic_cap;
> +
> if (!cxlds->media_ready) {
> cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> cxlds->ram_res = DEFINE_RES_MEM(0, 0);
> @@ -1271,6 +1444,15 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>
> cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
>
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->dc_res[i],
> + dcr->base, dcr->decode_len, dcr->name);
> + if (rc)
> + return rc;
> + }
> +
> if (mds->partition_align_bytes == 0) {
> rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->ram_res, 0,
> mds->volatile_only_bytes, "ram");
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 79a67cff9143..4624cf612c1e 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -402,6 +402,7 @@ enum cxl_devtype {
> CXL_DEVTYPE_CLASSMEM,
> };
>
> +#define CXL_MAX_DC_REGION 8
> /**
> * struct cxl_dpa_perf - DPA performance property entry
> * @dpa_range - range for DPA address
> @@ -431,6 +432,8 @@ struct cxl_dpa_perf {
> * @dpa_res: Overall DPA resource tree for the device
> * @pmem_res: Active Persistent memory capacity configuration
> * @ram_res: Active Volatile memory capacity configuration
> + * @dc_res: Active Dynamic Capacity memory configuration for each possible
> + * region
> * @serial: PCIe Device Serial Number
> * @type: Generic Memory Class device or Vendor Specific Memory device
> */
> @@ -445,10 +448,22 @@ struct cxl_dev_state {
> struct resource dpa_res;
> struct resource pmem_res;
> struct resource ram_res;
> + struct resource dc_res[CXL_MAX_DC_REGION];
> u64 serial;
> enum cxl_devtype type;
> };
>
> +#define CXL_DC_REGION_STRLEN 8
> +struct cxl_dc_region_info {
> + u64 base;
> + u64 decode_len;
> + u64 len;
> + u64 blk_size;
> + u32 dsmad_handle;
> + u8 flags;
> + u8 name[CXL_DC_REGION_STRLEN];
> +};
> +
> /**
> * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
> *
> @@ -467,6 +482,8 @@ struct cxl_dev_state {
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> * @total_bytes: sum of all possible capacities
> + * @static_cap: Sum of static RAM and PMEM capacities
> + * @dynamic_cap: Complete DPA range occupied by DC regions
> * @volatile_only_bytes: hard volatile capacity
> * @persistent_only_bytes: hard persistent capacity
> * @partition_align_bytes: alignment size for partition-able capacity
> @@ -474,6 +491,8 @@ struct cxl_dev_state {
> * @active_persistent_bytes: sum of hard + soft persistent
> * @next_volatile_bytes: volatile capacity change pending device reset
> * @next_persistent_bytes: persistent capacity change pending device reset
> + * @nr_dc_region: number of DC regions implemented in the memory device
> + * @dc_region: array containing info about the DC regions
> * @event: event log driver state
> * @poison: poison driver state info
> * @security: security driver state info
> @@ -494,7 +513,10 @@ struct cxl_memdev_state {
> DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> +
> u64 total_bytes;
> + u64 static_cap;
> + u64 dynamic_cap;
> u64 volatile_only_bytes;
> u64 persistent_only_bytes;
> u64 partition_align_bytes;
> @@ -506,6 +528,9 @@ struct cxl_memdev_state {
> struct cxl_dpa_perf ram_perf;
> struct cxl_dpa_perf pmem_perf;
>
> + u8 nr_dc_region;
> + struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
> +
> struct cxl_event_state event;
> struct cxl_poison_state poison;
> struct cxl_security_state security;
> @@ -705,6 +730,29 @@ struct cxl_mbox_set_partition_info {
>
> #define CXL_SET_PARTITION_IMMEDIATE_FLAG BIT(0)
>
> +struct cxl_mbox_get_dc_config_in {
> + u8 region_count;
> + u8 start_region_index;
> +} __packed;
> +
> +/* See CXL 3.0 Table 125 get dynamic capacity config Output Payload */
> +struct cxl_mbox_get_dc_config_out {
> + u8 avail_region_count;
> + u8 rsvd[7];
> + struct cxl_dc_region_config {
> + __le64 region_base;
> + __le64 region_decode_length;
> + __le64 region_length;
> + __le64 region_block_size;
> + __le32 region_dsmad_handle;
> + u8 flags;
> + u8 rsvd[3];
> + } __packed region[];
> +} __packed;
> +#define CXL_DYNAMIC_CAPACITY_SANITIZE_ON_RELEASE_FLAG BIT(0)
> +#define CXL_REGIONS_RETURNED(size_out) \
> + ((size_out - 8) / sizeof(struct cxl_dc_region_config))
> +
> /* Set Timestamp CXL 3.0 Spec 8.2.9.4.2 */
> struct cxl_mbox_set_timestamp_in {
> __le64 timestamp;
> @@ -828,6 +876,7 @@ enum {
> int cxl_internal_send_cmd(struct cxl_memdev_state *mds,
> struct cxl_mbox_cmd *cmd);
> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
> +int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds);
> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
> int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 2ff361e756d6..216881455364 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -874,6 +874,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> + rc = cxl_dev_dynamic_capacity_identify(mds);
> + if (rc)
> + return rc;
> +
> rc = cxl_mem_create_range_info(mds);
> if (rc)
> return rc;
>
> --
> 2.44.0
>

2024-04-09 08:43:12

by Jørgen Hansen

[permalink] [raw]
Subject: Re: [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device

On 4/5/24 20:09, Ira Weiny wrote:
> Jørgen Hansen wrote:
>> On 3/25/24 00:18, [email protected] wrote:
>>
>>> From: Navneet Singh <[email protected]>
>>>
>
> [snip]
>
>>> /**
>>> * struct cxl_memdev_state - Generic Type-3 Memory Device Class driver data
>>> *
>>> @@ -467,6 +482,8 @@ struct cxl_dev_state {
>>> * @enabled_cmds: Hardware commands found enabled in CEL.
>>> * @exclusive_cmds: Commands that are kernel-internal only
>>> * @total_bytes: sum of all possible capacities
>>> + * @static_cap: Sum of static RAM and PMEM capacities
>>> + * @dynamic_cap: Complete DPA range occupied by DC regions
>>
>> How about naming these total_range, static_cap and dynamic_range to make
>> it clear that the DPA range occupied by DC regions isn't necessarily
>> usable capacity (as opposed to the static_cap where the spec defines it
>> as usable capacity).
>
> I thought this was a good idea but on second thought these are not range
> variables at all. They really represent the various lengths of the
> resources.
>
> For total_bytes the documentation already says 'sum of all __possible__
> capacities >
> I think you have a point for the new fields though. They should all be
> named in some consistent manner and documented as such.
>
> So I propose:
>
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 94531af018f8..9c18b229f69a 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -481,9 +481,9 @@ struct cxl_dc_region_info {
> * @dcd_cmds: List of DCD commands implemented by memory device
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only
> - * @total_bytes: sum of all possible capacities
> - * @static_cap: Sum of static RAM and PMEM capacities
> - * @dynamic_cap: Complete DPA range occupied by DC regions
> + * @total_bytes: length of all possible capacities
> + * @static_bytes: length of possible static RAM and PMEM partitions
> + * @dynamic_bytes: length of possible DC partitions (DC Regions)
> * @volatile_only_bytes: hard volatile capacity
> * @persistent_only_bytes: hard persistent capacity
> * @partition_align_bytes: alignment size for partition-able capacity
> @@ -515,8 +515,8 @@ struct cxl_memdev_state {
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
>
> u64 total_bytes;
> - u64 static_cap;
> - u64 dynamic_cap;
> + u64 static_bytes;
> + u64 dynamic_bytes;
> u64 volatile_only_bytes;
> u64 persistent_only_bytes;
> u64 partition_align_bytes;

That looks good. My main concern was that the DC regions may be
separated by gaps that take up part of the DPA range but isn't part of
the usable capacity. Pre-DCD, total_bytes was in fact all the usable
capacity of the device as reported by the device itself, but now it
includes the potential gaps between DC regions as well as the potential
gap between static and dynamic regions.

Thanks,
Jørgen

2024-04-09 16:22:45

by fan

[permalink] [raw]
Subject: Re: [PATCH 17/26] dax/region: Create extent resources on DAX region driver load

On Sun, Mar 24, 2024 at 04:18:20PM -0700, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> DAX regions mapping dynamic capacity partitions introduce a requirement
> for the memory backing the region to come and go as required. This
> results in a DAX region with sparse areas of memory backing. To track
> the sparseness of the region, DAX extent objects need to track
> sub-resource information as a new layer between the DAX region resource
> and DAX device range resources.
>
> Recall that DCD extents may be accepted when a region is first created.
> Extend this support on region driver load. Scan existing extents and
> create DAX extent resources as a first step to DAX extent realization.
>
> The lifetime of a DAX extent is tricky to manage because the extent life
> may end in one of two ways. First, the device may request the extent be
> released. Second, the region may release the extent when it is
> destroyed without hardware involvement. Support extent release without
> hardware involvement first. Subsequent patches will provide for
> hardware to request extent removal.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
Trivial comments inline.
> ---
> Changes for v1
> [iweiny: remove xarrays]
> [iweiny: remove as much of extra reference stuff as possible]
> [iweiny: Move extent resource handling to core DAX code]
> ---
> drivers/dax/bus.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/dax/cxl.c | 43 ++++++++++++++++++++++++++++++++++--
> drivers/dax/dax-private.h | 12 +++++++++++
> 3 files changed, 108 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 903566aff5eb..4d5ed7ab6537 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -186,6 +186,61 @@ static bool is_sparse(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> }
>
> +static int dax_region_add_resource(struct dax_region *dax_region,
> + struct dax_extent *dax_ext,
> + resource_size_t start,
> + resource_size_t length)
> +{
> + struct resource *ext_res;
> +
> + dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
> + ext_res = __request_region(&dax_region->res, start, length, "extent", 0);
> + if (!ext_res) {
> + dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
> + &start, &length);
> + return -ENOSPC;
> + }
> +
> + dax_ext->region = dax_region;
> + dax_ext->res = ext_res;
> + dev_dbg(dax_region->dev, "Extent add resource %pr\n", ext_res);
> +
> + return 0;
> +}
> +
> +static void dax_region_release_extent(void *ext)
> +{
> + struct dax_extent *dax_ext = ext;
> + struct dax_region *dax_region = dax_ext->region;
> +
> + dev_dbg(dax_region->dev, "Extent release resource %pr\n", dax_ext->res);
> + if (dax_ext->res)
> + __release_region(&dax_region->res, dax_ext->res->start,
> + resource_size(dax_ext->res));
> +
> + kfree(dax_ext);
> +}
> +
> +int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
> + resource_size_t start, resource_size_t length)
> +{
> + int rc;
> +
> + struct dax_extent *dax_ext __free(kfree) = kzalloc(sizeof(*dax_ext),
> + GFP_KERNEL);
> + if (!dax_ext)
> + return -ENOMEM;
> +
> + guard(rwsem_write)(&dax_region_rwsem);
> + rc = dax_region_add_resource(dax_region, dax_ext, start, length);
> + if (rc)
> + return rc;
> +
> + return devm_add_action_or_reset(ext_dev, dax_region_release_extent,
> + no_free_ptr(dax_ext));
> +}
> +EXPORT_SYMBOL_GPL(dax_region_add_extent);
> +
> bool static_dev_dax(struct dev_dax *dev_dax)
> {
> return is_static(dev_dax->region);
> diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> index 415d03fbf9b6..70bdc7a878ab 100644
> --- a/drivers/dax/cxl.c
> +++ b/drivers/dax/cxl.c
> @@ -5,6 +5,42 @@
>
> #include "../cxl/cxl.h"
> #include "bus.h"
> +#include "dax-private.h"
> +
> +static int __cxl_dax_region_add_extent(struct dax_region *dax_region,
> + struct region_extent *reg_ext)
> +{
> + struct device *ext_dev = &reg_ext->dev;
> + resource_size_t start, length;
> +
> + dev_dbg(dax_region->dev, "Adding extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + start = dax_region->res.start + reg_ext->hpa_range.start;
> + length = reg_ext->hpa_range.end - reg_ext->hpa_range.start + 1;
use range_len() instead?

Fan

> +
> + return dax_region_add_extent(dax_region, ext_dev, start, length);
> +}
> +
> +static int cxl_dax_region_add_extent(struct device *dev, void *data)
> +{
> + struct dax_region *dax_region = data;
> + struct region_extent *reg_ext;
> +
> + if (!is_region_extent(dev))
> + return 0;
> +
> + reg_ext = to_region_extent(dev);
> +
> + return __cxl_dax_region_add_extent(dax_region, reg_ext);
> +}
> +
> +static void cxl_dax_region_add_extents(struct cxl_dax_region *cxlr_dax,
> + struct dax_region *dax_region)
> +{
> + dev_dbg(&cxlr_dax->dev, "Adding extents\n");
> + device_for_each_child(&cxlr_dax->dev, dax_region, cxl_dax_region_add_extent);
> +}
>
> static int cxl_dax_region_probe(struct device *dev)
> {
> @@ -29,9 +65,12 @@ static int cxl_dax_region_probe(struct device *dev)
> return -ENOMEM;
>
> dev_size = range_len(&cxlr_dax->hpa_range);
> - /* Add empty seed dax device */
> - if (cxlr->mode == CXL_REGION_DC)
> + if (cxlr->mode == CXL_REGION_DC) {
> + /* NOTE: Depends on dax_region being set in driver data */
> + cxl_dax_region_add_extents(cxlr_dax, dax_region);
> + /* Add empty seed dax device */
> dev_size = 0;
> + }
>
> data = (struct dev_dax_data) {
> .dax_region = dax_region,
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 446617b73aea..c6319c6567fb 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -16,6 +16,18 @@ struct inode *dax_inode(struct dax_device *dax_dev);
> int dax_bus_init(void);
> void dax_bus_exit(void);
>
> +/**
> + * struct dax_extent - For sparse regions; an active extent
> + * @region: dax_region this resources is in
> + * @res: resource this extent covers
> + */
> +struct dax_extent {
> + struct dax_region *region;
> + struct resource *res;
> +};
> +int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
> + resource_size_t start, resource_size_t length);
> +
> /**
> * struct dax_region - mapping infrastructure for dax devices
> * @id: kernel-wide unique region for a memory range
>
> --
> 2.44.0
>

2024-04-10 04:26:02

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 09/26] cxl/region: Add Dynamic Capacity CXL region support

fan wrote:
> On Sun, Mar 24, 2024 at 04:18:12PM -0700, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >
> > CXL devices optionally support dynamic capacity. CXL Regions must be
> > configured correctly to access this capacity. Similar to ram and pmem
> > partitions, DC Regions, as they are called in CXL 3.1, represent
> > different partitions of the DPA space.
> >
> > Introduce the concept of a sparse DAX region. Add the create_dc_region
> > sysfs entry to create sparse DC DAX regions. Special case DC capable
> > regions to create a 0 sized seed DAX device to maintain backwards
> > compatibility with older software which needs a default DAX device to
> > hold the region reference.
> >
> > Flag sparse DAX regions to indicate 0 capacity available until such time
> > as DC capacity is added.
> >
> > Interleaving is deferred in this series. Add an early check.
> >
> > Signed-off-by: Navneet Singh <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
> >
> > ---
> > Changes for v1:
> > [djiang: mark sysfs entries to be in 6.10 kernel including date]
> > [djbw: change dax region typing to be 'sparse' rather than 'dynamic']
> > [iweiny: rebase changes to master instead of type2 patches]
> > ---
> > Documentation/ABI/testing/sysfs-bus-cxl | 22 +++++++++++-----------
> > drivers/cxl/core/core.h | 1 +
> > drivers/cxl/core/port.c | 1 +
> > drivers/cxl/core/region.c | 33 +++++++++++++++++++++++++++++++++
> > drivers/dax/bus.c | 8 ++++++++
> > drivers/dax/bus.h | 1 +
> > drivers/dax/cxl.c | 15 +++++++++++++--
> > 7 files changed, 68 insertions(+), 13 deletions(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 8a4f572c8498..f0cf52fff9fa 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -411,20 +411,20 @@ Description:
> > interleave_granularity).
> >
> >
> > -What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> > -Date: May, 2022, January, 2023
> > -KernelVersion: v6.0 (pmem), v6.3 (ram)
> > +What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
> > +Date: May, 2022, January, 2023, June 2024
> > +KernelVersion: v6.0 (pmem), v6.3 (ram), v6.10 (dc)
> > Contact: [email protected]
> > Description:
> > (RW) Write a string in the form 'regionZ' to start the process
> > - of defining a new persistent, or volatile memory region
> > - (interleave-set) within the decode range bounded by root decoder
> > - 'decoderX.Y'. The value written must match the current value
> > - returned from reading this attribute. An atomic compare exchange
> > - operation is done on write to assign the requested id to a
> > - region and allocate the region-id for the next creation attempt.
> > - EBUSY is returned if the region name written does not match the
> > - current cached value.
> > + of defining a new persistent, volatile, or Dynamic Capacity
> > + (DC) memory region (interleave-set) within the decode range
> > + bounded by root decoder 'decoderX.Y'. The value written must
> > + match the current value returned from reading this attribute.
> > + An atomic compare exchange operation is done on write to assign
> > + the requested id to a region and allocate the region-id for the
> > + next creation attempt. EBUSY is returned if the region name
> > + written does not match the current cached value.
> >
> >
> > What: /sys/bus/cxl/devices/decoderX.Y/delete_region
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index 3b64fb1b9ed0..91abeffbe985 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -13,6 +13,7 @@ extern struct attribute_group cxl_base_attribute_group;
> > #ifdef CONFIG_CXL_REGION
> > extern struct device_attribute dev_attr_create_pmem_region;
> > extern struct device_attribute dev_attr_create_ram_region;
> > +extern struct device_attribute dev_attr_create_dc_region;
> > extern struct device_attribute dev_attr_delete_region;
> > extern struct device_attribute dev_attr_region;
> > extern const struct device_type cxl_pmem_region_type;
> > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> > index 036b61cb3007..661177b575f7 100644
> > --- a/drivers/cxl/core/port.c
> > +++ b/drivers/cxl/core/port.c
> > @@ -335,6 +335,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
> > &dev_attr_qos_class.attr,
> > SET_CXL_REGION_ATTR(create_pmem_region)
> > SET_CXL_REGION_ATTR(create_ram_region)
> > + SET_CXL_REGION_ATTR(create_dc_region)
> > SET_CXL_REGION_ATTR(delete_region)
> > NULL,
> > };
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index ec3b8c6948e9..0d7b09a49dcf 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -2205,6 +2205,7 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
> > switch (mode) {
> > case CXL_REGION_RAM:
> > case CXL_REGION_PMEM:
> > + case CXL_REGION_DC:
> > break;
> > default:
> > dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %s\n",
> > @@ -2314,6 +2315,32 @@ static ssize_t create_ram_region_store(struct device *dev,
> > }
> > DEVICE_ATTR_RW(create_ram_region);
> >
> > +static ssize_t create_dc_region_show(struct device *dev,
> > + struct device_attribute *attr, char *buf)
> > +{
> > + return __create_region_show(to_cxl_root_decoder(dev), buf);
> > +}
> > +
> > +static ssize_t create_dc_region_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf, size_t len)
> > +{
> > + struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> > + struct cxl_region *cxlr;
> > + int rc, id;
> > +
> > + rc = sscanf(buf, "region%d\n", &id);
> > + if (rc != 1)
> > + return -EINVAL;
> > +
> > + cxlr = __create_region(cxlrd, CXL_REGION_DC, id);
> > + if (IS_ERR(cxlr))
> > + return PTR_ERR(cxlr);
> > +
> > + return len;
> > +}
> > +DEVICE_ATTR_RW(create_dc_region);
>
> create_ram_region_store, create_pmem_region_store and
> create_dc_region_store have mostly duplicate code, should we consider
> extracting out as a helper function and pass region type for ram/pmem/dc region
> store?

That is mostly done with __create_region(). But the following is a nice cleanup.

I'll test it.

Ira

21:24:37 > git di
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e8549af47da7..8a83a415fd0b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2638,9 +2638,8 @@ static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_HOSTONLYMEM);
}

-static ssize_t create_pmem_region_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t len)
+static ssize_t create_region_store(struct device *dev, const char *buf,
+ size_t len, enum cxl_region_mode mode)
{
struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
struct cxl_region *cxlr;
@@ -2650,31 +2649,26 @@ static ssize_t create_pmem_region_store(struct device *dev,
if (rc != 1)
return -EINVAL;

- cxlr = __create_region(cxlrd, CXL_REGION_PMEM, id);
+ cxlr = __create_region(cxlrd, mode, id);
if (IS_ERR(cxlr))
return PTR_ERR(cxlr);

return len;
}
+
+static ssize_t create_pmem_region_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ return create_region_store(dev, buf, len, CXL_REGION_PMEM);
+}
DEVICE_ATTR_RW(create_pmem_region);

static ssize_t create_ram_region_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
{
- struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
- struct cxl_region *cxlr;
- int rc, id;
-
- rc = sscanf(buf, "region%d\n", &id);
- if (rc != 1)
- return -EINVAL;
-
- cxlr = __create_region(cxlrd, CXL_REGION_RAM, id);
- if (IS_ERR(cxlr))
- return PTR_ERR(cxlr);
-
- return len;
+ return create_region_store(dev, buf, len, CXL_REGION_RAM);
}
DEVICE_ATTR_RW(create_ram_region);

@@ -2688,19 +2682,7 @@ static ssize_t create_dc_region_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
{
- struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
- struct cxl_region *cxlr;
- int rc, id;
-
- rc = sscanf(buf, "region%d\n", &id);
- if (rc != 1)
- return -EINVAL;
-
- cxlr = __create_region(cxlrd, CXL_REGION_DC, id);
- if (IS_ERR(cxlr))
- return PTR_ERR(cxlr);
-
- return len;
+ return create_region_store(dev, buf, len, CXL_REGION_DC);
}
DEVICE_ATTR_RW(create_dc_region);

2024-04-10 04:37:06

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 09/26] cxl/region: Add Dynamic Capacity CXL region support

Dave Jiang wrote:
>
>
> On 3/24/24 4:18 PM, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >

[snip]

> >
> > -What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> > -Date: May, 2022, January, 2023
> > -KernelVersion: v6.0 (pmem), v6.3 (ram)
> > +What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram,dc}_region
> > +Date: May, 2022, January, 2023, June 2024
> > +KernelVersion: v6.0 (pmem), v6.3 (ram), v6.10 (dc)
> > Contact: [email protected]
> > Description:
> > (RW) Write a string in the form 'regionZ' to start the process
> > - of defining a new persistent, or volatile memory region
> > - (interleave-set) within the decode range bounded by root decoder
> > - 'decoderX.Y'. The value written must match the current value
> > - returned from reading this attribute. An atomic compare exchange
> > - operation is done on write to assign the requested id to a
> > - region and allocate the region-id for the next creation attempt.
> > - EBUSY is returned if the region name written does not match the
> > - current cached value.
> > + of defining a new persistent, volatile, or Dynamic Capacity
> > + (DC) memory region (interleave-set) within the decode range
> > + bounded by root decoder 'decoderX.Y'. The value written must
> > + match the current value returned from reading this attribute.
> > + An atomic compare exchange operation is done on write to assign
> > + the requested id to a region and allocate the region-id for the
> > + next creation attempt. EBUSY is returned if the region name
>
> -EBUSY?
>

To match the other documentation I would say no. The other docs show
ENXIO/EBUSY/EINVAL without the negative indicator.


[snip]

> > diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c
> > index c696837ab23c..415d03fbf9b6 100644
> > --- a/drivers/dax/cxl.c
> > +++ b/drivers/dax/cxl.c
> > @@ -13,19 +13,30 @@ static int cxl_dax_region_probe(struct device *dev)
> > struct cxl_region *cxlr = cxlr_dax->cxlr;
> > struct dax_region *dax_region;
> > struct dev_dax_data data;
> > + resource_size_t dev_size;
> > + unsigned long flags;
> >
> > if (nid == NUMA_NO_NODE)
> > nid = memory_add_physaddr_to_nid(cxlr_dax->hpa_range.start);
> >
> > + flags = IORESOURCE_DAX_KMEM;
> > + if (cxlr->mode == CXL_REGION_DC)
> > + flags |= IORESOURCE_DAX_SPARSE_CAP;
> > +
> > dax_region = alloc_dax_region(dev, cxlr->id, &cxlr_dax->hpa_range, nid,
> > - PMD_SIZE, IORESOURCE_DAX_KMEM);
> > + PMD_SIZE, flags);
> > if (!dax_region)
> > return -ENOMEM;
> >
> > + dev_size = range_len(&cxlr_dax->hpa_range);
> > + /* Add empty seed dax device */
> > + if (cxlr->mode == CXL_REGION_DC)
> > + dev_size = 0;
>
> Nit. Just do if/else so dev_size isn't set twice if mode is DC.

Ok yea.

Ira

2024-04-10 04:43:40

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 09/26] cxl/region: Add Dynamic Capacity CXL region support

Jonathan Cameron wrote:
> On Sun, 24 Mar 2024 16:18:12 -0700
> [email protected] wrote:
>
> > From: Navneet Singh <[email protected]>
> >
> > CXL devices optionally support dynamic capacity. CXL Regions must be
> > configured correctly to access this capacity. Similar to ram and pmem
> > partitions, DC Regions, as they are called in CXL 3.1, represent
> > different partitions of the DPA space.
> >
> > Introduce the concept of a sparse DAX region. Add the create_dc_region
> > sysfs entry to create sparse DC DAX regions. Special case DC capable
> > regions to create a 0 sized seed DAX device to maintain backards
> > compatibility with older software which needs a default DAX device to
> > hold the region reference.
> >
> > Flag sparse DAX regions to indicate 0 capacity available until such time
> > as DC capacity is added.
> >
> > Interleaving is deferred in this series. Add an early check.
> >
> > Signed-off-by: Navneet Singh <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
> With the -EBUSY others addressed LGTM.

But the other docs do not have that notation. Also the EBUSY was not changed
from the previous documentation. Only the text for DC was added.

I'm inclined to leave this.

> Fan's duplication comment
> might be something to tidy up later.
>
> Reviewed-by: Jonathan Cameron <[email protected]>

Thanks,
Ira



2024-04-10 04:49:55

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts

fan wrote:
> On Sun, Mar 24, 2024 at 04:18:16PM -0700, [email protected] wrote:
> > From: Navneet Singh <[email protected]>

[snip]

> > @@ -786,12 +830,15 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> > if (rc)
> > return rc;
> >
> > - rc = cxl_event_irqsetup(mds, &policy);
> > + rc = cxl_irqsetup(mds, &policy, native_cxl);
> > if (rc)
> > return rc;
> >
> > cxl_mem_get_event_records(mds, CXLDEV_EVENT_STATUS_ALL);
> >
> > + dev_dbg(mds->cxlds.dev, "Event config : %d %d\n",
> > + native_cxl, cxl_dcd_supported(mds));
>
> The message will print out two numbers, seems not very clear. Should we
> translate to more straightforward message, like
> native_cxl? "OS...":""
> cxl_dcd_supported(msd)? "DCD supported": "DCD not supported"?

Perhaps but it is just a debug message to know if something is wonky with a BIOS configuration.

So I'm inclined to leave it alone.
Ira

2024-04-10 05:27:33

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts

Dave Jiang wrote:
>
>
> On 3/24/24 4:18 PM, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >
> > Dynamic Capacity Devices (DCD) support extent change notifications
> > through the event log mechanism. The interrupt mailbox commands were
> > extended in CXL 3.1 to support these notifications.
> >
> > Firmware can't configure DCD events to be FW controlled but can retain
> > control of memory events. Split irq configuration of memory events and
> > DCD events to allow for FW control of memory events while DCD is host
> > controlled.
> >
> > Configure DCD event log interrupts on devices supporting dynamic
> > capacity. Disable DCD if interrupts are not supported.
> >
> > Signed-off-by: Navneet Singh <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
>
> A few minor comments. The rest LGTM.
> >
> > ---
> > Changes for v1
> > [iweiny: rebase to upstream irq code]
> > [iweiny: disable DCD if irqs not supported]
> > ---
> > drivers/cxl/core/mbox.c | 9 ++++++-
> > drivers/cxl/cxl.h | 4 ++-
> > drivers/cxl/cxlmem.h | 4 +++
> > drivers/cxl/pci.c | 71 ++++++++++++++++++++++++++++++++++++++++---------
> > 4 files changed, 74 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 14e8a7528a8b..58b31fa47b93 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1323,10 +1323,17 @@ static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> > return rc;
> > }
> >
> > -static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> > +bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> > {
> > return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > }
> > +EXPORT_SYMBOL_NS_GPL(cxl_dcd_supported, CXL);
> > +
> > +void cxl_disable_dcd(struct cxl_memdev_state *mds)
> > +{
> > + clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_disable_dcd, CXL);
>
> Should these one-liners just go into a header file?

Yea they could.

>
> >
> > /**
> > * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 15d418b3bc9b..d585f5fdd3ae 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -164,11 +164,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
> > #define CXLDEV_EVENT_STATUS_WARN BIT(1)
> > #define CXLDEV_EVENT_STATUS_FAIL BIT(2)
> > #define CXLDEV_EVENT_STATUS_FATAL BIT(3)
> > +#define CXLDEV_EVENT_STATUS_DCD BIT(4)
>
> extra tab?

It does not look like it on my end... :-/

#define CXLDEV_DEV_EVENT_STATUS_OFFSET>->-------0x00$
#define CXLDEV_EVENT_STATUS_INFO>------->-------BIT(0)$
#define CXLDEV_EVENT_STATUS_WARN>------->-------BIT(1)$
#define CXLDEV_EVENT_STATUS_FAIL>------->-------BIT(2)$
#define CXLDEV_EVENT_STATUS_FATAL>------>-------BIT(3)$
#define CXLDEV_EVENT_STATUS_DCD>>------->-------BIT(4)$


Ira

2024-04-10 05:34:35

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts

Jonathan Cameron wrote:
> On Sun, 24 Mar 2024 16:18:16 -0700
> [email protected] wrote:
>
> > From: Navneet Singh <[email protected]>
> >
> > Dynamic Capacity Devices (DCD) support extent change notifications
> > through the event log mechanism. The interrupt mailbox commands were
> > extended in CXL 3.1 to support these notifications.
> >
> > Firmware can't configure DCD events to be FW controlled but can retain
> > control of memory events. Split irq configuration of memory events and
> > DCD events to allow for FW control of memory events while DCD is host
> > controlled.
> >
> > Configure DCD event log interrupts on devices supporting dynamic
> > capacity. Disable DCD if interrupts are not supported.
> >
> > Signed-off-by: Navneet Singh <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
>
> Trivial comment inline and Fan's suggestion on the debug print seems sensible
> to me.

Ok I went ahead and added it.

> Either way
>
> Reviewed-by: Jonathan Cameron <[email protected]>

Thanks,
Ira

2024-04-10 05:47:02

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

fan wrote:
> On Sun, Mar 24, 2024 at 04:18:17PM -0700, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >

[snip]

> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 58b31fa47b93..9e33a0976828 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> > }
> > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
> >
> > +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> > + struct cxl_dc_extent *dc_extent)
> > +{
> > + struct device *dev = mds->cxlds.dev;
> > + uint64_t start, len;
> > +
> > + start = le64_to_cpu(dc_extent->start_dpa);
> > + len = le64_to_cpu(dc_extent->length);
> > +
> > + /* Extents must not cross region boundary's */
> > + for (int i = 0; i < mds->nr_dc_region; i++) {
> > + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > + if (dcr->base <= start &&
> > + (start + len) <= (dcr->base + dcr->decode_len)) {
>
> Why not use range_contains here as below?

Because when I initially wrote this I (or perhaps Navneet, I can't remember) we
were not using ranges. This version I tried to convert to ranges and I missed
this one.

Good catch!

Ira

2024-04-10 06:10:50

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

Dave Jiang wrote:
>
>
> On 3/24/24 4:18 PM, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >

[snip]

> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 58b31fa47b93..9e33a0976828 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> > }
> > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
> >
> > +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> > + struct cxl_dc_extent *dc_extent)
> > +{
> > + struct device *dev = mds->cxlds.dev;
> > + uint64_t start, len;
> u64
>

Yep

>
> > +
> > + start = le64_to_cpu(dc_extent->start_dpa);
> > + len = le64_to_cpu(dc_extent->length);
> > +
> > + /* Extents must not cross region boundary's */
> > + for (int i = 0; i < mds->nr_dc_region; i++) {
> > + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > + if (dcr->base <= start &&
> > + (start + len) <= (dcr->base + dcr->decode_len)) {
>
> Can range_contains() be used here as well?

Yep and done.

>
> > + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> > + start, start + len - 1, i, start - dcr->base);
> > + return 0;
> > + }
> > + }
> > +
> > + dev_err_ratelimited(dev,
> > + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> > + start, start + len - 1);
> > + return -EINVAL;
> > +}
> > +
> > +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
>
> cxl_dc_extent_in_endpoint_decoder() is more readable

Sure but we are getting awful close to Java naming there... j/k ;-) Changed.

>
> > + struct cxl_dc_extent *extent)
> > +{
> > + uint64_t start = le64_to_cpu(extent->start_dpa);
> > + uint64_t length = le64_to_cpu(extent->length);
> u64
>

Yep


[snip]

> >
> > +static struct cxl_memdev_state *
> > +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> > +{
> > + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +
> > + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> > +}
> > +
> > static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> > enum cxl_event_log_type type)
> > {
> > @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > }
> > EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> >
> > +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
>
> cxl_dev_get_dc_extent_generation()? or spell out count

I'll spell out count because that is the primary goal. The generation number
is just to be able to check each query to ensure the list does not change
whilst reading.

Ira

2024-04-10 06:20:20

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

fan wrote:
> On Sun, Mar 24, 2024 at 04:18:17PM -0700, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >

[snip]

> > +
> > +/**
> > + * cxl_read_dc_extents() - Read any existing extents
> > + * @cxled: Endpoint decoder which is part of a region
> > + *
> > + * Issue the Get Dynamic Capacity Extent List command to the device
> > + * and add any existing extents found which belong to this decoder.
> > + *
> > + * Return: 0 if command was executed successfully, -ERRNO on error.
> > + */
> > +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> > +{
> > + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> > + struct device *dev = mds->cxlds.dev;
> > + unsigned int extent_gen_num;
> > + int rc;
> > +
> > + if (!cxl_dcd_supported(mds)) {
> > + dev_dbg(dev, "DCD unsupported\n");
> > + return 0;
> > + }
> > +
> > + rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> > + dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",
> > + rc, extent_gen_num);
> > + if (rc <= 0) /* 0 == no records found */
> > + return rc;
> > +
> > + return cxl_dev_get_dc_extents(cxled, extent_gen_num, rc);
>
> Not sure about the behaviour here. From the cxl_dev_get_dc_extents
> implementation below, if gen_num changed or the expected extent count
> changed, it will return error.

yep.

> If I understand it correctly, if the above two values change, it means
> the extent list has been updated due to extent add/release since last
> time we read the extent list info (cxl_dev_get_dc_extent_cnt), do we
> need to fail the operation or try again?

The original series was safe to fail the operation because the list was read on
memory device driver load and not when the regions were created. This is an
oversight with the new architecture. Now that regions query for the list
independent of other regions being active the list could indeed change during
this operation. :-/ So a retry is necessary.

Let me work on the retry because some of the extents may have been surfaced
during the list processing which means a re-read of the list will need to
properly ignore those already found. Or some other tracking needs to be put in
place.

Ira

2024-04-10 06:30:51

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

J?rgen Hansen wrote:
> On 3/25/24 00:18, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >
> > Dynamic capacity device extents may be left in an accepted state on a
> > device due to an unexpected host crash. In this case creation of a new
> > region on top of the DC partition (region) is expected to expose those
> > extents for continued use.
> >
> > Once all endpoint decoders are part of a region and the region is being
> > realized read the device extent list. For ease of review, this patch
> > stops after reading the extent list and leaves realization of the region
> > extents to a future patch.
> >
> > Signed-off-by: Navneet Singh <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
> >
> > ---
> > Changes for v1:
> > [iweiny: remove extent list xarray]
> > [iweiny: Update spec references to 3.1]
> > [iweiny: use struct range in extents]
> > [iweiny: remove all reference tracking and let regions track extents
> > through the extent devices.]
> > [djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
> > ---
> > drivers/cxl/core/core.h | 9 +++
> > drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
> > drivers/cxl/core/region.c | 29 +++++++
> > drivers/cxl/cxlmem.h | 49 ++++++++++++
> > 4 files changed, 279 insertions(+)
> >
> > diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> > index 91abeffbe985..119b12362977 100644
> > --- a/drivers/cxl/core/core.h
> > +++ b/drivers/cxl/core/core.h
> > @@ -4,6 +4,8 @@
> > #ifndef __CXL_CORE_H__
> > #define __CXL_CORE_H__
> >
> > +#include <cxlmem.h>
> > +
> > extern const struct device_type cxl_nvdimm_bridge_type;
> > extern const struct device_type cxl_nvdimm_type;
> > extern const struct device_type cxl_pmu_type;
> > @@ -28,6 +30,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
> > int cxl_region_init(void);
> > void cxl_region_exit(void);
> > int cxl_get_poison_by_endpoint(struct cxl_port *port);
> > +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> > + struct cxl_dc_extent *dc_extent);
> > #else
> > static inline int cxl_get_poison_by_endpoint(struct cxl_port *port)
> > {
> > @@ -43,6 +47,11 @@ static inline int cxl_region_init(void)
> > static inline void cxl_region_exit(void)
> > {
> > }
> > +static inline int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> > + struct cxl_dc_extent *dc_extent)
> > +{
> > + return 0;
> > +}
> > #define CXL_REGION_ATTR(x) NULL
> > #define CXL_REGION_TYPE(x) NULL
> > #define SET_CXL_REGION_ATTR(x)
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 58b31fa47b93..9e33a0976828 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> > }
> > EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
> >
> > +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> > + struct cxl_dc_extent *dc_extent)
> > +{
> > + struct device *dev = mds->cxlds.dev;
> > + uint64_t start, len;
> > +
> > + start = le64_to_cpu(dc_extent->start_dpa);
> > + len = le64_to_cpu(dc_extent->length);
> > +
> > + /* Extents must not cross region boundary's */
> > + for (int i = 0; i < mds->nr_dc_region; i++) {
> > + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> > +
> > + if (dcr->base <= start &&
> > + (start + len) <= (dcr->base + dcr->decode_len)) {
> > + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> > + start, start + len - 1, i, start - dcr->base);
> > + return 0;
> > + }
> > + }
> > +
> > + dev_err_ratelimited(dev,
> > + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> > + start, start + len - 1);
> > + return -EINVAL;
> > +}
> > +
> > +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> > + struct cxl_dc_extent *extent)
> > +{
> > + uint64_t start = le64_to_cpu(extent->start_dpa);
> > + uint64_t length = le64_to_cpu(extent->length);
> > + struct range ext_range = (struct range){
> > + .start = start,
> > + .end = start + length - 1,
> > + };
> > + struct range ed_range = (struct range) {
> > + .start = cxled->dpa_res->start,
> > + .end = cxled->dpa_res->end,
> > + };
> > +
> > + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
> > + cxled->dpa_res, start, length);
> > +
> > + return range_contains(&ed_range, &ext_range);
> > +}
> > +
> > void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> > enum cxl_event_log_type type,
> > enum cxl_event_type event_type,
> > @@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> > return rc;
> > }
> >
> > +static struct cxl_memdev_state *
> > +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> > +{
> > + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > +
> > + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> > +}
> > +
> > static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> > enum cxl_event_log_type type)
> > {
> > @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> > }
> > EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
> >
> > +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> > + unsigned int *extent_gen_num)
> > +{
> > + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> > + struct cxl_mbox_get_dc_extent_out dc_extents;
> > + struct cxl_mbox_cmd mbox_cmd;
> > + unsigned int count;
> > + int rc;
> > +
> > + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> > + .extent_cnt = cpu_to_le32(0),
> > + .start_extent_index = cpu_to_le32(0),
> > + };
> > +
> > + mbox_cmd = (struct cxl_mbox_cmd) {
> > + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > + .payload_in = &get_dc_extent,
> > + .size_in = sizeof(get_dc_extent),
> > + .size_out = sizeof(dc_extents),
> > + .payload_out = &dc_extents,
> > + .min_out = 1,
> > + };
> > +
> > + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > + if (rc < 0)
> > + return rc;
> > +
> > + count = le32_to_cpu(dc_extents.total_extent_cnt);
> > + *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);
> > +
> > + return count;
> > +}
> > +
> > +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
> > + unsigned int start_gen_num,
> > + unsigned int exp_cnt)
> > +{
> > + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> > + unsigned int start_index, total_read;
> > + struct device *dev = mds->cxlds.dev;
> > + struct cxl_mbox_cmd mbox_cmd;
> > +
> > + struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
> > + kvmalloc(mds->payload_size, GFP_KERNEL);
> > + if (!dc_extents)
> > + return -ENOMEM;
> > +
> > + total_read = 0;
> > + start_index = 0;
> > + do {
> > + unsigned int nr_ext, total_extent_cnt, gen_num;
> > + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> > + int rc;
> > +
> > + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> > + .extent_cnt = cpu_to_le32(exp_cnt - start_index),
> > + .start_extent_index = cpu_to_le32(start_index),
> > + };
> > +
> > + mbox_cmd = (struct cxl_mbox_cmd) {
> > + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> > + .payload_in = &get_dc_extent,
> > + .size_in = sizeof(get_dc_extent),
> > + .size_out = mds->payload_size,
> > + .payload_out = dc_extents,
> > + .min_out = 1,
> > + };
> > +
> > + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> > + if (rc < 0)
> > + return rc;
> > +
> > + nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);
> > + total_read += nr_ext;
> > + total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> > + gen_num = le32_to_cpu(dc_extents->extent_list_num);
> > +
> > + dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> > + total_extent_cnt, gen_num);
> > +
> > + if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> > + dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
> > + gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> > + return -EIO;
> > + }
> > +
> > + for (int i = 0; i < nr_ext ; i++) {
> > + dev_dbg(dev, "Processing extent %d/%d\n",
> > + start_index + i, exp_cnt);
> > + rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
> > + if (rc)
> > + continue;
> > + if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
> > + continue;
> > + rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
> > + if (rc)
> > + return rc;
> > + }
> > +
> > + start_index += nr_ext;
> > + } while (exp_cnt > total_read);
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * cxl_read_dc_extents() - Read any existing extents
> > + * @cxled: Endpoint decoder which is part of a region
> > + *
> > + * Issue the Get Dynamic Capacity Extent List command to the device
> > + * and add any existing extents found which belong to this decoder.
> > + *
> > + * Return: 0 if command was executed successfully, -ERRNO on error.
> > + */
> > +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> > +{
> > + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> > + struct device *dev = mds->cxlds.dev;
> > + unsigned int extent_gen_num;
> > + int rc;
> > +
> > + if (!cxl_dcd_supported(mds)) {
> > + dev_dbg(dev, "DCD unsupported\n");
> > + return 0;
> > + }
> > +
> > + rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> > + dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",
> > + rc, extent_gen_num);
> > + if (rc <= 0) /* 0 == no records found */
> > + return rc;
> > +
> > + return cxl_dev_get_dc_extents(cxled, extent_gen_num, rc);
>
> Is it necessary to spend a device interaction to get the generation
> number?

Not completely necessary no.

> Couldn't cxl_dev_get_dc_extents obtain that as part of the first
> call to the device, and then use it to ensure the consistency of any
> remaining calls, if any are necessary?

.. However, this is not a critical path and the extra query to hardware makes
the code a bit easier to follow IMO. There are 2 distinct steps.

1) get expected number of extents and the current generation number
2) query for that number whilst checking that the gen number is stable

Doing what you suggest results in special casing the first query within the
loop which is kind of ugly IMO.

That said, with the new retry requirement Fan pointed out I'll consider this in
that new algorithm context.

Ira

2024-04-10 17:17:05

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 23/26] cxl/mem: Trace Dynamic capacity Event Record

On Sun, Mar 24, 2024 at 04:18:26PM -0700, Ira Weiny wrote:
> From: Navneet Singh <[email protected]>
>
> CXL rev 3.1 section 8.2.9.2.1 adds the Dynamic Capacity Event Records.
> Notify the host of extents being added or removed. User space has
> little use for these events other than for debugging.

Is there really any 'Notify' going on here?

Can you state the usage in the positive, rather than saying it has
little use. Is this the only method for users to track this activity.

If it were just for your kernel debugging, I'd guess you'd just throw
in more dev_dbg() messages.

I see 'dpa_start'. Will we need to do any dpa->hpa translation work
here?

--Alison


>
> Add DC trace points to the trace log for debugging purposes.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [iweiny: Adjust to new trace code]
> ---
> drivers/cxl/core/mbox.c | 4 +++
> drivers/cxl/core/trace.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 69 insertions(+)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 7babac2d1c95..cb4576890187 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -978,6 +978,10 @@ static void __cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> ev_type = CXL_CPER_EVENT_DRAM;
> else if (uuid_equal(uuid, &CXL_EVENT_MEM_MODULE_UUID))
> ev_type = CXL_CPER_EVENT_MEM_MODULE;
> + else if (uuid_equal(uuid, &CXL_EVENT_DC_EVENT_UUID)) {
> + trace_cxl_dynamic_capacity(cxlmd, type, &record->event.dcd);
> + return;
> + }
>
> cxl_event_trace_record(cxlmd, type, ev_type, uuid, &record->event);
> }
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index bdf117a33744..7646fdd9aee3 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -707,6 +707,71 @@ TRACE_EVENT(cxl_poison,
> )
> );
>
> +/*
> + * DYNAMIC CAPACITY Event Record - DER
> + *
> + * CXL rev 3.0 section 8.2.9.2.1.5 Table 8-47
> + */
> +
> +#define CXL_DC_ADD_CAPACITY 0x00
> +#define CXL_DC_REL_CAPACITY 0x01
> +#define CXL_DC_FORCED_REL_CAPACITY 0x02
> +#define CXL_DC_REG_CONF_UPDATED 0x03
> +#define show_dc_evt_type(type) __print_symbolic(type, \
> + { CXL_DC_ADD_CAPACITY, "Add capacity"}, \
> + { CXL_DC_REL_CAPACITY, "Release capacity"}, \
> + { CXL_DC_FORCED_REL_CAPACITY, "Forced capacity release"}, \
> + { CXL_DC_REG_CONF_UPDATED, "Region Configuration Updated" } \
> +)
> +
> +TRACE_EVENT(cxl_dynamic_capacity,
> +
> + TP_PROTO(const struct cxl_memdev *cxlmd, enum cxl_event_log_type log,
> + struct cxl_event_dcd *rec),
> +
> + TP_ARGS(cxlmd, log, rec),
> +
> + TP_STRUCT__entry(
> + CXL_EVT_TP_entry
> +
> + /* Dynamic capacity Event */
> + __field(u8, event_type)
> + __field(u16, hostid)
> + __field(u8, region_id)
> + __field(u64, dpa_start)
> + __field(u64, length)
> + __array(u8, tag, CXL_DC_EXTENT_TAG_LEN)
> + __field(u16, sh_extent_seq)
> + ),
> +
> + TP_fast_assign(
> + CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +
> + /* Dynamic_capacity Event */
> + __entry->event_type = rec->event_type;
> +
> + /* DCD event record data */
> + __entry->hostid = le16_to_cpu(rec->host_id);
> + __entry->region_id = rec->region_index;
> + __entry->dpa_start = le64_to_cpu(rec->extent.start_dpa);
> + __entry->length = le64_to_cpu(rec->extent.length);
> + memcpy(__entry->tag, &rec->extent.tag, CXL_DC_EXTENT_TAG_LEN);
> + __entry->sh_extent_seq = le16_to_cpu(rec->extent.shared_extn_seq);
> + ),
> +
> + CXL_EVT_TP_printk("event_type='%s' host_id='%d' region_id='%d' " \
> + "starting_dpa=%llx length=%llx tag=%s " \
> + "shared_extent_sequence=%d",
> + show_dc_evt_type(__entry->event_type),
> + __entry->hostid,
> + __entry->region_id,
> + __entry->dpa_start,
> + __entry->length,
> + __print_hex(__entry->tag, CXL_DC_EXTENT_TAG_LEN),
> + __entry->sh_extent_seq
> + )
> +);
> +
> #endif /* _CXL_EVENTS_H */
>
> #define TRACE_INCLUDE_FILE trace
>
> --
> 2.44.0
>

2024-04-10 17:44:20

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 14/26] cxl/region: Read existing extents on region creation

On Sun, Mar 24, 2024 at 04:18:17PM -0700, Ira Weiny wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case creation of a new
> region on top of the DC partition (region) is expected to expose those
> extents for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized read the device extent list. For ease of review, this patch
> stops after reading the extent list and leaves realization of the region
> extents to a future patch.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: remove extent list xarray]
> [iweiny: Update spec references to 3.1]
> [iweiny: use struct range in extents]
> [iweiny: remove all reference tracking and let regions track extents
> through the extent devices.]
> [djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
> ---
> drivers/cxl/core/core.h | 9 +++
> drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 29 +++++++
> drivers/cxl/cxlmem.h | 49 ++++++++++++
> 4 files changed, 279 insertions(+)

snip

>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct device *dev = mds->cxlds.dev;
> + uint64_t start, len;
> +
> + start = le64_to_cpu(dc_extent->start_dpa);
> + len = le64_to_cpu(dc_extent->length);
> +
> + /* Extents must not cross region boundary's */
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +

I think you already got range_contains suggestion

> + if (dcr->base <= start &&
> + (start + len) <= (dcr->base + dcr->decode_len)) {
> + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> + start, start + len - 1, i, start - dcr->base);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> + start, start + len - 1);

Need some clarification.
Isn't this checking that the extent is fully contained within a region?
And then, it dev_err's if not fully contained. There is not actually
a check and an error message about crossing region boundary's as the
comment suggests. Maybe update the comment to reflect the work.. like:

/* Extent must be fully contained in a region */


> + return -EINVAL;
> +}
> +
> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *extent)
> +{
> + uint64_t start = le64_to_cpu(extent->start_dpa);
> + uint64_t length = le64_to_cpu(extent->length);

u64 here (and in other places too)


> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
> + struct range ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> +
> + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
> + cxled->dpa_res, start, length);
> +
> + return range_contains(&ed_range, &ext_range);
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> @@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}
> +

That's nice!


> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,

Perhaps drop the _dev_ from this (and other, like below) function names.

> +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,


snip

> +/**
> + * cxl_read_dc_extents() - Read any existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add any existing extents found which belong to this decoder.
> + *
> + * Return: 0 if command was executed successfully, -ERRNO on error.
> + */
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + unsigned int extent_gen_num;
> + int rc;
> +
> + if (!cxl_dcd_supported(mds)) {
> + dev_dbg(dev, "DCD unsupported\n");
> + return 0;
> + }
> +
> + rc = cxl_dev_get_dc_extent_cnt(mds, &extent_gen_num);
> + dev_dbg(mds->cxlds.dev, "Extent count: %d Generation Num: %d\n",

Either use the *dev defined in both dev_dbg()'s or get rid of it use mds->cxlds.dev.

> + rc, extent_gen_num);
> + if (rc <= 0) /* 0 == no records found */
> + return rc;
> +
> + return cxl_dev_get_dc_extents(cxled, extent_gen_num, rc);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_read_dc_extents, CXL);
> +

snip

>
> +static int cxl_region_read_extents(struct cxl_region *cxlr)
> +{

How about:
static int cxl_region_read_extents(struct cxl_region_params *p)


> + struct cxl_region_params *p = &cxlr->params;
> + int i;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + int rc;
> +
> + rc = cxl_read_dc_extents(p->targets[i]);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}

snip to end

2024-04-10 18:01:44

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

On Sun, Mar 24, 2024 at 04:18:03PM -0700, Ira Weiny wrote:
> A git tree of this series can be found here:
>
> https://github.com/weiny2/linux-kernel/tree/dcd-2024-03-24

This would benefit from another checkpatch run and cleanup.

--Alison

2024-04-10 18:16:11

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 01/26] cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)

On Sun, Mar 24, 2024 at 04:18:04PM -0700, Ira Weiny wrote:
> From: Navneet Singh <[email protected]>
>
> Per the CXL 3.1 specification software must check the Command Effects
> Log (CEL) to know if a device supports dynamic capacity (DC). If the
> device does support DC the specifics of the DC Regions (0-7) are read
> through the mailbox.

Do I need to know this 'If the device...' piece to understand this
patch? I like that below you say 'Subsequent patches will...'
That seems enough to set the scene.

>
> Flag DC Device (DCD) commands in a device if they are supported.
Why be vague w 'Flag'. How about 'Add a bitmap of DCD enabled
commands to the driver device state structure.'

> Subsequent patches will key off these bits to configure DCD.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
> ---
> Changes for v1
> [iweiny: update to latest master]
> [iweiny: update commit message]
> [iweiny: Based on the fix:
> https://lore.kernel.org/all/[email protected]/
> [jonathan: remove unneeded format change]
> [jonathan: don't split security code in mbox.c]
> ---
> drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
> drivers/cxl/cxlmem.h | 15 +++++++++++++++
> 2 files changed, 48 insertions(+)
>

snip


> /* Device enabled poison commands */
> enum poison_cmd_enabled_bits {
> CXL_POISON_ENABLED_LIST,
> @@ -454,6 +463,7 @@ struct cxl_dev_state {
> * (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> * @mbox_mutex: Mutex to synchronize mailbox access.
> * @firmware_version: Firmware version for the memory device.
> + * @dcd_cmds: List of DCD commands implemented by memory device
> * @enabled_cmds: Hardware commands found enabled in CEL.
> * @exclusive_cmds: Commands that are kernel-internal only

It's not a 'List' it's a bitmap. How about mimicing 'enabled_cmds'
description:

* @dcd_cmds: DCD commands found enabled in CEL

> * @total_bytes: sum of all possible capacities
> @@ -481,6 +491,7 @@ struct cxl_memdev_state {
> size_t lsa_size;
> struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> char firmware_version[0x10];
> + DECLARE_BITMAP(dcd_cmds, CXL_DCD_ENABLED_MAX);
> DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
> DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
> u64 total_bytes;
> @@ -551,6 +562,10 @@ enum cxl_opcode {
> CXL_MBOX_OP_UNLOCK = 0x4503,
> CXL_MBOX_OP_FREEZE_SECURITY = 0x4504,
> CXL_MBOX_OP_PASSPHRASE_SECURE_ERASE = 0x4505,
> + CXL_MBOX_OP_GET_DC_CONFIG = 0x4800,
> + CXL_MBOX_OP_GET_DC_EXTENT_LIST = 0x4801,
> + CXL_MBOX_OP_ADD_DC_RESPONSE = 0x4802,
> + CXL_MBOX_OP_RELEASE_DC = 0x4803,
> CXL_MBOX_OP_MAX = 0x10000
> };
>
>
> --
> 2.44.0
>

2024-04-10 19:17:14

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

On Sun, Mar 24, 2024 at 04:18:05PM -0700, Ira Weiny wrote:
> From: Navneet Singh <[email protected]>
>
> Until now region modes and decoder modes were equivalent in that they
> were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
> regions (which will represent an array of device regions [better named
> partitions] the index of which could be different on different
> interleaved devices), the mode of an endpoint decoder and a region will
> no longer be equivalent.
>
> Define a new region mode enumeration and adjust the code for it.
>
> Suggested-by: Jonathan Cameron <[email protected]>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> <none>
> ---
> drivers/cxl/core/region.c | 77 +++++++++++++++++++++++++++++++++++------------
> drivers/cxl/cxl.h | 26 ++++++++++++++--
> 2 files changed, 81 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 4c7fd2d5cccb..1723d17f121e 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -40,7 +40,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> rc = down_read_interruptible(&cxl_region_rwsem);
> if (rc)
> return rc;
> - if (cxlr->mode != CXL_DECODER_PMEM)
> + if (cxlr->mode != CXL_REGION_PMEM)
> rc = sysfs_emit(buf, "\n");
> else
> rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> @@ -353,7 +353,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
> * Support tooling that expects to find a 'uuid' attribute for all
> * regions regardless of mode.
> */
> - if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> + if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_REGION_PMEM)
> return 0444;
> return a->mode;
> }
> @@ -516,7 +516,7 @@ static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
> {
> struct cxl_region *cxlr = to_cxl_region(dev);
>
> - return sysfs_emit(buf, "%s\n", cxl_decoder_mode_name(cxlr->mode));
> + return sysfs_emit(buf, "%s\n", cxl_region_mode_name(cxlr->mode));
> }
> static DEVICE_ATTR_RO(mode);
>
> @@ -542,7 +542,7 @@ static int alloc_hpa(struct cxl_region *cxlr, resource_size_t size)
>
> /* ways, granularity and uuid (if PMEM) need to be set before HPA */
> if (!p->interleave_ways || !p->interleave_granularity ||
> - (cxlr->mode == CXL_DECODER_PMEM && uuid_is_null(&p->uuid)))
> + (cxlr->mode == CXL_REGION_PMEM && uuid_is_null(&p->uuid)))
> return -ENXIO;
>
> div64_u64_rem(size, (u64)SZ_256M * p->interleave_ways, &remainder);
> @@ -1683,6 +1683,17 @@ static int cxl_region_sort_targets(struct cxl_region *cxlr)
> return rc;
> }
>
> +static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> + enum cxl_decoder_mode dmode)
> +{

Perhaps is_region_mode_compatible() ?

Seems we have precedence for asking these questions that have
boolean responses. I picked 'region' because it is the region
we are trying to construct.


> + if (rmode == CXL_REGION_RAM && dmode == CXL_DECODER_RAM)
> + return true;
> + if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> + return true;
> +
> + return false;
> +}
> +
> static int cxl_region_attach(struct cxl_region *cxlr,
> struct cxl_endpoint_decoder *cxled, int pos)
> {
> @@ -1693,9 +1704,11 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> struct cxl_dport *dport;
> int rc = -ENXIO;
>
> - if (cxled->mode != cxlr->mode) {
> - dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> - dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> + if (!cxl_modes_compatible(cxlr->mode, cxled->mode)) {
> + dev_dbg(&cxlr->dev, "%s region mode: %s mismatch decoder: %s\n",
> + dev_name(&cxled->cxld.dev),
> + cxl_region_mode_name(cxlr->mode),
> + cxl_decoder_mode_name(cxled->mode));
> return -EINVAL;
> }

Does the above return bypass this next code segment (not in your diff):

if (cxled->mode == CXL_DECODER_DEAD) {
dev_dbg(&cxlr->dev, "%s dead\n", dev_name(&cxled->cxld.dev));
return -ENODEV;
}

It seems we are changing the return value on DEAD.

More below where a new check for DEAD is added ...

snip

> /* Establish an empty region covering the given HPA range */
> static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder *cxled)
> @@ -2808,12 +2840,17 @@ static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> struct cxl_port *port = cxlrd_to_port(cxlrd);
> struct range *hpa = &cxled->cxld.hpa_range;
> struct cxl_region_params *p;
> + enum cxl_region_mode mode;
> struct cxl_region *cxlr;
> struct resource *res;
> int rc;
>
> + if (cxled->mode == CXL_DECODER_DEAD)
> + return ERR_PTR(-EINVAL);

I see this addition, but it is in a different place and with
a different return value. Help me understand that this is no
change in behavior.


> +
> + mode = cxl_decoder_to_region_mode(cxled->mode);
> do {
> - cxlr = __create_region(cxlrd, cxled->mode,
> + cxlr = __create_region(cxlrd, mode,
> atomic_read(&cxlrd->region_id));
> } while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
>

snip

> /*
> * Track whether this decoder is reserved for region autodiscovery, or
> * free for userspace provisioning.
> @@ -511,7 +532,8 @@ struct cxl_region_params {
> * struct cxl_region - CXL region
> * @dev: This region's device
> * @id: This region's id. Id is globally unique across all regions
> - * @mode: Endpoint decoder allocation / access mode
> + * @mode: Region mode which defines which endpoint decoder mode the region is
> + * compatible with

Maybe...
@mode: Region mode used for decoder compatibility check

snip to end

--Alison

>

2024-04-10 23:08:02

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 07/26] cxl/port: Add dynamic capacity size support to endpoint decoders

On Sun, Mar 24, 2024 at 04:18:10PM -0700, Ira Weiny wrote:
> From: Navneet Singh <[email protected]>
>
> To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> map DC partitions (regions). In addition to assigning the size of the
> DC partition, the decoder must assign any skip value from the previous
> decoder. This must be done within a contiguous DPA space.
>
> Two complications arise with Dynamic Capacity regions which did not
> exist with Ram and PMEM partitions. First, gaps in the DPA space can

RAM

> exist between and around the DC Regions. Second, the Linux resource
> tree does not allow a resource to be marked across existing nodes within
> a tree.
>
> For clarity, below is an example of an 60GB device with 10GB of RAM,
> 10GB of PMEM and 10GB for each of 2 DC Regions. The desired CXL mapping
> is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.
>
> DPA RANGE
> (dpa_res)
> 0GB 10GB 20GB 30GB 40GB 50GB 60GB
> |----------|----------|----------|----------|----------|----------|
>
> RAM PMEM DC0 DC1
> (ram_res) (pmem_res) (dc_res[0]) (dc_res[1])
> |----------|----------| <gap> |----------| <gap> |----------|
>
> RAM PMEM DC1
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> 0GB 5GB 10GB 15GB 20GB 30GB 40GB 50GB 60GB
>
> The previous skip resource between RAM and PMEM was always a child of
> the RAM resource and fit nicely [see (S) below]. Because of this
> simplicity this skip resource reference was not stored in any CXL state.
> On release the skip range could be calculated based on the endpoint
> decoders stored values.
>
> Now when DC1 is being mapped 4 skip resources must be created as
> children. One for the PMEM resource (A), two of the parent DPA resource
> (B,D), and one more child of the DC0 resource (C).
>
> 0GB 10GB 20GB 30GB 40GB 50GB 60GB
> |----------|----------|----------|----------|----------|----------|
> | |
> |----------|----------| | |----------| | |----------|
> | | | | |
> (S) (A) (B) (C) (D)
> v v v v v
> |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> skip skip skip skip skip
>

Nice art!


> Expand the calculation of DPA freespace and enhance the logic to support
> mapping/unmapping DC DPA space. To track the potential of multiple skip
> resources an xarray is attached to the endpoint decoder. The existing
> algorithm between RAM and PMEM is consolidated within the new one to
> streamline the code even though the result is the storage of a single
> skip resource in the xarray.

This passed the unit test cxl-poison.sh that relies on you not totally
breaking the cxled->skip here. Not exactly a tested by, but something!


>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: Update cover letter]
> ---
> drivers/cxl/core/hdm.c | 192 +++++++++++++++++++++++++++++++++++++++++++-----
> drivers/cxl/core/port.c | 2 +
> drivers/cxl/cxl.h | 2 +
> 3 files changed, 179 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index e22b6f4f7145..da7d58184490 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -210,6 +210,25 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, CXL);
>
> +static void cxl_skip_release(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> + struct cxl_port *port = cxled_to_port(cxled);
> + struct device *dev = &port->dev;

Here and below it's probably needless to define dev.
Just &port->dev in your single dev_dbg()
This is something to check for across the patchset.


> + unsigned long index;
> + void *entry;
> +
> + xa_for_each(&cxled->skip_res, index, entry) {
> + struct resource *res = entry;
> +
> + dev_dbg(dev, "decoder%d.%d: releasing skipped space; %pr\n",
> + port->id, cxled->cxld.id, res);
> + __release_region(&cxlds->dpa_res, res->start,
> + resource_size(res));
> + xa_erase(&cxled->skip_res, index);
> + }
> +}
> +
> /*
> * Must be called in a context that synchronizes against this decoder's
> * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -220,15 +239,11 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> struct cxl_port *port = cxled_to_port(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct resource *res = cxled->dpa_res;
> - resource_size_t skip_start;
>
> lockdep_assert_held_write(&cxl_dpa_rwsem);
>
> - /* save @skip_start, before @res is released */
> - skip_start = res->start - cxled->skip;
> __release_region(&cxlds->dpa_res, res->start, resource_size(res));
> - if (cxled->skip)
> - __release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> + cxl_skip_release(cxled);
> cxled->skip = 0;
> cxled->dpa_res = NULL;
> put_device(&cxled->cxld.dev);
> @@ -263,6 +278,100 @@ static int dc_mode_to_region_index(enum cxl_decoder_mode mode)
> return mode - CXL_DECODER_DC0;
> }
>
> +static int cxl_request_skip(struct cxl_endpoint_decoder *cxled,
> + resource_size_t skip_base, resource_size_t skip_len)
> +{
> + struct cxl_dev_state *cxlds = cxled_to_memdev(cxled)->cxlds;
> + const char *name = dev_name(&cxled->cxld.dev);
> + struct cxl_port *port = cxled_to_port(cxled);
> + struct resource *dpa_res = &cxlds->dpa_res;
> + struct device *dev = &port->dev;

again

> + struct resource *res;
> + int rc;
> +
> + res = __request_region(dpa_res, skip_base, skip_len, name, 0);
> + if (!res)
> + return -EBUSY;
> +
> + rc = xa_insert(&cxled->skip_res, skip_base, res, GFP_KERNEL);
> + if (rc) {
> + __release_region(dpa_res, skip_base, skip_len);
> + return rc;
> + }
> +
> + dev_dbg(dev, "decoder%d.%d: skipped space; %pr\n",
> + port->id, cxled->cxld.id, res);
> + return 0;
> +}
> +
> +static int cxl_reserve_dpa_skip(struct cxl_endpoint_decoder *cxled,
> + resource_size_t base, resource_size_t skipped)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_port *port = cxled_to_port(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + resource_size_t skip_base = base - skipped;
> + struct device *dev = &port->dev;
> + resource_size_t skip_len = 0;
> + int rc, index;
> +
> + if (resource_size(&cxlds->ram_res) && skip_base <= cxlds->ram_res.end) {
> + skip_len = cxlds->ram_res.end - skip_base + 1;
> + rc = cxl_request_skip(cxled, skip_base, skip_len);
> + if (rc)
> + return rc;
> + skip_base += skip_len;
> + }
> +
> + if (skip_base == base) {
> + dev_dbg(dev, "skip done ram!\n");
> + return 0;
> + }
> +
> + if (resource_size(&cxlds->pmem_res) &&
> + skip_base <= cxlds->pmem_res.end) {
> + skip_len = cxlds->pmem_res.end - skip_base + 1;
> + rc = cxl_request_skip(cxled, skip_base, skip_len);
> + if (rc)
> + return rc;
> + skip_base += skip_len;
> + }
> +
> + index = dc_mode_to_region_index(cxled->mode);
> + for (int i = 0; i <= index; i++) {
> + struct resource *dcr = &cxlds->dc_res[i];
> +
> + if (skip_base < dcr->start) {
> + skip_len = dcr->start - skip_base;
> + rc = cxl_request_skip(cxled, skip_base, skip_len);
> + if (rc)
> + return rc;
> + skip_base += skip_len;
> + }
> +
> + if (skip_base == base) {
> + dev_dbg(dev, "skip done DC region %d!\n", i);
> + break;
> + }
> +
> + if (resource_size(dcr) && skip_base <= dcr->end) {
> + if (skip_base > base) {
> + dev_err(dev, "Skip error DC region %d; skip_base %pa; base %pa\n",
> + i, &skip_base, &base);
> + return -ENXIO;
> + }
> +
> + skip_len = dcr->end - skip_base + 1;
> + rc = cxl_request_skip(cxled, skip_base, skip_len);
> + if (rc)
> + return rc;
> + skip_base += skip_len;
> + }
> + }
> +
> + return 0;
> +}
> +
> static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> resource_size_t base, resource_size_t len,
> resource_size_t skipped)
> @@ -300,13 +409,12 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> }
>
> if (skipped) {
> - res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> - dev_name(&cxled->cxld.dev), 0);
> - if (!res) {
> - dev_dbg(dev,
> - "decoder%d.%d: failed to reserve skipped space\n",
> - port->id, cxled->cxld.id);
> - return -EBUSY;
> + int rc = cxl_reserve_dpa_skip(cxled, base, skipped);
> +
> + if (rc) {
> + dev_dbg(dev, "decoder%d.%d: failed to reserve skipped space; %pa - %pa\n",
> + port->id, cxled->cxld.id, &base, &skipped);
> + return rc;
> }
> }
> res = __request_region(&cxlds->dpa_res, base, len,
> @@ -314,14 +422,20 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> if (!res) {
> dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> port->id, cxled->cxld.id);
> - if (skipped)
> - __release_region(&cxlds->dpa_res, base - skipped,
> - skipped);
> + cxl_skip_release(cxled);
> return -EBUSY;
> }
> cxled->dpa_res = res;
> cxled->skip = skipped;
>
> + for (int mode = CXL_DECODER_DC0; mode <= CXL_DECODER_DC7; mode++) {
> + int index = dc_mode_to_region_index(mode);
> +
> + if (resource_contains(&cxlds->dc_res[index], res)) {
> + cxled->mode = mode;
> + goto success;
> + }
> + }
> if (resource_contains(&cxlds->pmem_res, res))
> cxled->mode = CXL_DECODER_PMEM;
> else if (resource_contains(&cxlds->ram_res, res))
> @@ -332,6 +446,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> cxled->mode = CXL_DECODER_MIXED;
> }
>
> +success:
> + dev_dbg(dev, "decoder%d.%d: %pr mode: %d\n", port->id, cxled->cxld.id,
> + cxled->dpa_res, cxled->mode);
> port->hdm_end++;
> get_device(&cxled->cxld.dev);
> return 0;
> @@ -463,14 +580,14 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
>
> int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> {
> + resource_size_t free_ram_start, free_pmem_start, free_dc_start;
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> - resource_size_t free_ram_start, free_pmem_start;
> struct cxl_port *port = cxled_to_port(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &cxled->cxld.dev;
> resource_size_t start, avail, skip;
> struct resource *p, *last;
> - int rc;
> + int rc, dc_index;
>
> down_write(&cxl_dpa_rwsem);
> if (cxled->cxld.region) {
> @@ -500,6 +617,21 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> else
> free_pmem_start = cxlds->pmem_res.start;
>
> + /*
> + * Limit each decoder to a single DC region to map memory with
> + * different DSMAS entry.
> + */
> + dc_index = dc_mode_to_region_index(cxled->mode);
> + if (dc_index >= 0) {
> + if (cxlds->dc_res[dc_index].child) {
> + dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
> + dc_index);
> + rc = -EINVAL;
> + goto out;
> + }
> + free_dc_start = cxlds->dc_res[dc_index].start;
> + }

From the "Limit each decoder" comment to here please explain.
I'm reading we cannot alloc dpa from this DC region because
is has a child? And a child is a region? Maybe I got it ;)


snip to end

--Alison

2024-04-10 23:17:36

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 21/26] dax/region: Prevent range mapping allocation on sparse regions

On Sun, Mar 24, 2024 at 04:18:24PM -0700, Ira Weiny wrote:

Perhaps lead w some words from prior patch to provide context:

"DAX regions mapping dynamic capacity partitions introduce a requirement
for the memory backing the region to come and go as required. This
results in a DAX region with sparse areas of memory backing."

Or should this fold into:
dax/region: Create extent resources on DAX region driver load

> Sparse regions are not fully populated with memory and this complicates
> range mapping of dax devices on those regions. There is no use case for
> range mapping on sparse regions.
>
> Avoid the complication by prevent range mapping of dax devices on sparse
> regions.

>
> Signed-off-by: Ira Weiny <[email protected]>
> ---
> drivers/dax/bus.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index bab19fc578d0..56dddaceeccb 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1452,6 +1452,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
> return 0;
> if (a == &dev_attr_mapping.attr && is_static(dax_region))
> return 0;
> + if (a == &dev_attr_mapping.attr && is_sparse(dax_region))
> + return 0;
> if ((a == &dev_attr_align.attr ||
> a == &dev_attr_size.attr) && is_static(dax_region))
> return 0444;
>
> --
> 2.44.0
>

2024-04-10 23:24:04

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts

On Sun, Mar 24, 2024 at 04:18:16PM -0700, Ira Weiny wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism. The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications.
>
> Firmware can't configure DCD events to be FW controlled but can retain
> control of memory events. Split irq configuration of memory events and
> DCD events to allow for FW control of memory events while DCD is host
> controlled.
>
> Configure DCD event log interrupts on devices supporting dynamic
> capacity. Disable DCD if interrupts are not supported.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---

snip

> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 12cd5d399230..ef482eae09e9 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c

snip

>
> +static int cxl_irqsetup(struct cxl_memdev_state *mds,
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> +{
> + struct cxl_dev_state *cxlds = &mds->cxlds;
> + int rc;
> +
> + if (native_cxl) {
> + rc = cxl_event_irqsetup(mds, policy);
> + if (rc)
> + return rc;
> + }
> +
> + if (cxl_dcd_supported(mds)) {
> + rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
> + if (rc) {
> + dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
move this..

> + cxl_disable_dcd(mds);

to after you've done the disabling...

dev_err(cxlds->dev, "DCD disabled: failed to get interrupt for event log\n");
> + return rc;

not sure I got the words right.

> + }
> + }
> +
> + return 0;
> +}
> +
> static bool cxl_event_int_is_fw(u8 setting)
> {
> u8 mode = FIELD_GET(CXLDEV_EVENT_INT_MODE_MASK, setting);
> @@ -757,17 +793,25 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
> struct cxl_memdev_state *mds, bool irq_avail)
> {
> struct cxl_event_interrupt_policy policy = { 0 };
> + bool native_cxl = host_bridge->native_cxl_error;
> int rc;
>
> /*
> * When BIOS maintains CXL error reporting control, it will process
> * event records. Only one agent can do so.
> + *
> + * If BIOS has control of events and DCD is not supported skip event
> + * configuration.
> */
> - if (!host_bridge->native_cxl_error)
> + if (!native_cxl && !cxl_dcd_supported(mds))
> return 0;
>
> if (!irq_avail) {
> dev_info(mds->cxlds.dev, "No interrupt support, disable event processing.\n");
> + if (cxl_dcd_supported(mds)) {
> + dev_info(mds->cxlds.dev, "DCD requires interrupts, disable DCD\n");

Similar here -
Maybe better to disable, and just say it's done because this sounds a bit like a request to the user.

> + cxl_disable_dcd(mds);

dev_info(mds->cxlds.dev, "DCD disabled: no interrupt support\n");

How come this one is dev_info() and prior case of disabling was a
dev_err()/


snip to end

-- Alison


2024-04-11 00:24:25

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices

On Sun, Mar 24, 2024 at 04:18:19PM -0700, Ira Weiny wrote:
> From: Navneet Singh <[email protected]>
>
> Once all extents of an interleave set are present a region must
> surface an extent to the region.

Why the vague words - realize and surface?

Maybe slip the word DCD in the commit msg:
cxl/extent: Create DCD region extent devices

And be more explicit about where we are in the setup process:

Once the region driver discovers all the extents of an interleave set
a region extent device must be created for every device extent found.

Maybe rough example, but my intent is to seqway from the fact that
the region driver has done it's part, now we are here, creating the
region extent devices. Should that be 'DAX' region extent devices?

>
> Without interleaving; endpoint decoder and region extents have a 1:1
> relationship. Future support for IW > 1 will maintain a N:1
> relationship between the device extents and region extents.
>
> Create a region extent device for every device extent found. Release of
> the extent device triggers a response to the underlying hardware extent.
>
> There is no strong use case to support the addition of extents which
> overlap previously accepted extent ranges. Reject such new extents
> until such time as a good use case emerges.
>
> Expose the necessary details of region extents by creating the following
> sysfs entries.
>
> /sys/bus/cxl/devices/dax_regionX/extentY
> /sys/bus/cxl/devices/dax_regionX/extentY/offset
> /sys/bus/cxl/devices/dax_regionX/extentY/length
> /sys/bus/cxl/devices/dax_regionX/extentY/label
>
> The use of the extent devices by the DAX layer is deferred to later
> patches.

You mean later in this set, right?


>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [iweiny: new patch]
> [iweiny: Rename 'dr_extent' to 'region_extent']
> ---
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/extent.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 43 +++++++++++++++
> drivers/cxl/core/region.c | 76 +++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 37 +++++++++++++
> tools/testing/cxl/Kbuild | 1 +
> 6 files changed, 290 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..35c5c76bfcf1 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -14,5 +14,6 @@ cxl_core-y += pci.o
> cxl_core-y += hdm.o
> cxl_core-y += pmu.o
> cxl_core-y += cdat.o
> +cxl_core-y += extent.o
> cxl_core-$(CONFIG_TRACING) += trace.o
> cxl_core-$(CONFIG_CXL_REGION) += region.o
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..487c220f1c3c
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <cxl.h>
> +
> +static DEFINE_IDA(cxl_extent_ida);
> +
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + return sysfs_emit(buf, "%pa\n", &reg_ext->hpa_range.start);
> +}
> +static DEVICE_ATTR_RO(offset);
> +
> +static ssize_t length_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> + u64 length = range_len(&reg_ext->hpa_range);
> +
> + return sysfs_emit(buf, "%pa\n", &length);
> +}
> +static DEVICE_ATTR_RO(length);
> +
> +static ssize_t label_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + return sysfs_emit(buf, "%s\n", reg_ext->label);
> +}
> +static DEVICE_ATTR_RO(label);
> +
> +static struct attribute *region_extent_attrs[] = {
> + &dev_attr_offset.attr,
> + &dev_attr_length.attr,
> + &dev_attr_label.attr,
> + NULL,
> +};
> +
> +static const struct attribute_group region_extent_attribute_group = {
> + .attrs = region_extent_attrs,
> +};
> +
> +static const struct attribute_group *region_extent_attribute_groups[] = {
> + &region_extent_attribute_group,
> + NULL,
> +};
> +
> +static void region_extent_release(struct device *dev)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + cxl_release_ed_extent(&reg_ext->ed_ext);
> + ida_free(&cxl_extent_ida, reg_ext->dev.id);
> + kfree(reg_ext);
> +}
> +
> +static const struct device_type region_extent_type = {
> + .name = "extent",
> + .release = region_extent_release,
> + .groups = region_extent_attribute_groups,
> +};
> +
> +bool is_region_extent(struct device *dev)
> +{
> + return dev->type == &region_extent_type;
> +}
> +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL);
> +
> +static void region_extent_unregister(void *ext)
> +{
> + struct region_extent *reg_ext = ext;
> +
> + dev_dbg(&reg_ext->dev, "DAX region rm extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> + device_unregister(&reg_ext->dev);
> +}
> +
> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct region_extent *reg_ext;
> + struct device *dev;
> + int rc, id;
> +
> + id = ida_alloc(&cxl_extent_ida, GFP_KERNEL);
> + if (id < 0)
> + return -ENOMEM;
> +
> + reg_ext = kzalloc(sizeof(*reg_ext), GFP_KERNEL);
> + if (!reg_ext)
> + return -ENOMEM;
> +
> + reg_ext->hpa_range = *hpa_range;
> + reg_ext->ed_ext.dpa_range = *dpa_range;
> + reg_ext->ed_ext.cxled = cxled;
> + snprintf(reg_ext->label, DAX_EXTENT_LABEL_LEN, "%s", label);
> +
> + dev = &reg_ext->dev;
> + device_initialize(dev);
> + dev->id = id;
> + device_set_pm_not_required(dev);
> + dev->parent = &cxlr_dax->dev;
> + dev->type = &region_extent_type;
> + rc = dev_set_name(dev, "extent%d", dev->id);
> + if (rc)
> + goto err;
> +
> + rc = device_add(dev);
> + if (rc)
> + goto err;
> +
> + dev_dbg(dev, "DAX region extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
> + reg_ext);
> +
> +err:
> + dev_err(&cxlr_dax->dev, "Failed to initialize DAX extent dev HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + put_device(dev);
> + return rc;
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9e33a0976828..6b00e717e42b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> + struct range *extent, int opcode)
> +{
> + struct cxl_mbox_cmd mbox_cmd;
> + size_t size;
> +
> + struct cxl_mbox_dc_response *dc_res __free(kfree);
> + size = struct_size(dc_res, extent_list, 1);
> + dc_res = kzalloc(size, GFP_KERNEL);
> + if (!dc_res)
> + return -ENOMEM;
> +
> + dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
> + memset(dc_res->extent_list[0].reserved, 0, 8);
> + dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
> + dc_res->extent_list_size = cpu_to_le32(1);

Is the cpu_to_le32(1) necessary?
I notice similar for .extent_cnt, .start_extent_index in mbox.c


> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = size,
> + .payload_in = dc_res,
> + };
> +
> + return cxl_internal_send_cmd(mds, &mbox_cmd);
> +}
> +
> static struct cxl_memdev_state *
> cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> {
> @@ -1029,6 +1055,23 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> return container_of(cxlds, struct cxl_memdev_state, cxlds);
> }
>
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent)
> +{
> + struct cxl_endpoint_decoder *cxled = extent->cxled;
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + dev_dbg(dev, "Releasing DC extent DPA %#llx - %#llx\n",
> + extent->dpa_range.start, extent->dpa_range.end);
> +
> + rc = cxl_send_dc_cap_response(mds, &extent->dpa_range, CXL_MBOX_OP_RELEASE_DC);
> + if (rc)
> + dev_dbg(dev, "Failed to respond releasing extent DPA %#llx - %#llx; %d\n",
> + extent->dpa_range.start, extent->dpa_range.end, rc);

Don't repeat the start/end details on the second dev_dbg(), add rc value
only.

> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_release_ed_extent, CXL);
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3e563ab29afe..7635ff109578 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,11 +1450,81 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +static int extent_check_overlap(struct device *dev, void *arg)
> +{
> + struct range *new_range = arg;
> + struct region_extent *ext;
> +
> + if (!is_region_extent(dev))
> + return 0;
> +
> + ext = to_region_extent(dev);
> + return range_overlaps(&ext->hpa_range, new_range);
> +}
> +
> +static int extent_overlaps(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range)
> +{
> + struct device *dev __free(put_device) =
> + device_find_child(&cxlr_dax->dev, hpa_range, extent_check_overlap);
> +
> + if (dev)
> + return -EINVAL;
> + return 0;
> +}
> +
> /* Callers are expected to ensure cxled has been attached to a region */
> int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> struct cxl_dc_extent *dc_extent)
> {
> - return 0;
> + struct cxl_region *cxlr = cxled->cxld.region;
> + struct range ext_dpa_range, ext_hpa_range;
> + struct device *dev = &cxlr->dev;
> + resource_size_t dpa_offset, hpa;
> +
> + /*
> + * Interleave ways == 1 means this coresponds to a 1:1 mapping between
> + * device extents and DAX region extents. Future implementations
> + * should hold DC region extents here until the full dax region extent
> + * can be realized.
> + */
> + if (cxlr->params.interleave_ways != 1) {
> + dev_err(dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> + ext_dpa_range = (struct range) {
> + .start = le64_to_cpu(dc_extent->start_dpa),
> + .end = le64_to_cpu(dc_extent->start_dpa) +
> + le64_to_cpu(dc_extent->length) - 1,
> + };
> +
> + dev_dbg(dev, "Adding DC extent DPA %#llx - %#llx\n",
> + ext_dpa_range.start, ext_dpa_range.end);
> +
> + /*
> + * Without interleave...
> + * HPA offset == DPA offset
> + * ... but do the math anyway
> + */
> + dpa_offset = ext_dpa_range.start - cxled->dpa_res->start;
> + hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> + ext_hpa_range = (struct range) {
> + .start = hpa - cxlr->cxlr_dax->hpa_range.start,
> + .end = ext_hpa_range.start + range_len(&ext_dpa_range) - 1,
> + };
> +
> + if (extent_overlaps(cxlr->cxlr_dax, &ext_hpa_range))
> + return -EINVAL;
> +
> + dev_dbg(dev, "Realizing region extent at HPA %#llx - %#llx\n",
> + ext_hpa_range.start, ext_hpa_range.end);
> +
> + return dax_region_create_ext(cxlr->cxlr_dax, &ext_hpa_range,
> + (char *)dc_extent->tag,
> + &ext_dpa_range,
> + cxled);
> }
>
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> @@ -2684,6 +2754,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>
> dev = &cxlr_dax->dev;
> cxlr_dax->cxlr = cxlr;
> + cxlr->cxlr_dax = cxlr_dax;
> device_initialize(dev);
> lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> device_set_pm_not_required(dev);
> @@ -2799,7 +2870,10 @@ static int cxl_region_read_extents(struct cxl_region *cxlr)
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> + struct cxl_region *cxlr = cxlr_dax->cxlr;
>
> + cxlr->cxlr_dax = NULL;
> + cxlr_dax->cxlr = NULL;
> device_unregister(&cxlr_dax->dev);
> }
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d585f5fdd3ae..5379ad7f5852 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -564,6 +564,7 @@ struct cxl_region_params {
> * @type: Endpoint decoder target type
> * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
> * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
> * @flags: Region state flags
> * @params: active + config params for the region
> */
> @@ -574,6 +575,7 @@ struct cxl_region {
> enum cxl_decoder_type type;
> struct cxl_nvdimm_bridge *cxl_nvb;
> struct cxl_pmem_region *cxlr_pmem;
> + struct cxl_dax_region *cxlr_dax;
> unsigned long flags;
> struct cxl_region_params params;
> };
> @@ -617,6 +619,41 @@ struct cxl_dax_region {
> struct range hpa_range;
> };
>
> +/**
> + * struct cxl_ed_extent - Extent within an endpoint decoder
> + * @dpa_range: DPA range this extent covers within the decoder
> + * @cxled: reference to the endpoint decoder
> + */
> +struct cxl_ed_extent {
> + struct range dpa_range;
> + struct cxl_endpoint_decoder *cxled;
> +};
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent);
> +
> +/**
> + * struct region_extent - CXL DAX region extent
> + * @dev: device representing this extent
> + * @hpa_range: HPA range of this extent
> + * @label: label of the extent
> + * @ed_ext: Endpoint decoder extent which backs this extent
> + */
> +#define DAX_EXTENT_LABEL_LEN 64
> +struct region_extent {
> + struct device dev;
> + struct range hpa_range;
> + char label[DAX_EXTENT_LABEL_LEN];
> + struct cxl_ed_extent ed_ext;
> +};
> +
> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled);
> +
> +bool is_region_extent(struct device *dev);
> +#define to_region_extent(dev) container_of(dev, struct region_extent, dev)
> +
> /**
> * struct cxl_port - logical collection of upstream port devices and
> * downstream port devices to construct a CXL memory
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index 030b388800f0..dc0cc1d5e6a0 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -60,6 +60,7 @@ cxl_core-y += $(CXL_CORE_SRC)/pci.o
> cxl_core-y += $(CXL_CORE_SRC)/hdm.o
> cxl_core-y += $(CXL_CORE_SRC)/pmu.o
> cxl_core-y += $(CXL_CORE_SRC)/cdat.o
> +cxl_core-y += $(CXL_CORE_SRC)/extent.o
> cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
> cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
> cxl_core-y += config_check.o
>
> --
> 2.44.0
>

2024-04-24 17:58:46

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 02/26] cxl/core: Separate region mode from decoder mode

Dave Jiang wrote:
>
>
> On 3/27/24 10:22 PM, Ira Weiny wrote:
> > Davidlohr Bueso wrote:
> >> On Sun, 24 Mar 2024, [email protected] wrote:
> >>
> >>> From: Navneet Singh <[email protected]>
> >>>
> >>> Until now region modes and decoder modes were equivalent in that they
> >>> were either PMEM or RAM. With the upcoming addition of Dynamic Capacity
> >>> regions (which will represent an array of device regions [better named
> >>> partitions] the index of which could be different on different
> >>> interleaved devices), the mode of an endpoint decoder and a region will
> >>> no longer be equivalent.
> >>>
> >>> Define a new region mode enumeration and adjust the code for it.
> >>
> >> Could this could also be picked up regardless of dcd?
> >
> > It could but there is no need for it without DCD.
> >
> > I will work on re-ordering the cleanups if Dave will agree to take them
> > early.
>
> There's no reason for the change unless it comes with DCD right? And probably no urgent need to taking it ahead then?

No I don't think so.
Ira

> >
> > Ira



2024-04-24 20:24:22

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices

Dave Jiang wrote:
>
>
> On 3/24/24 4:18 PM, [email protected] wrote:
> > From: Navneet Singh <[email protected]>
> >

[snip]

> > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> > new file mode 100644
> > index 000000000000..487c220f1c3c
> > --- /dev/null
> > +++ b/drivers/cxl/core/extent.c
> > @@ -0,0 +1,133 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
> > +
> > +#include <linux/device.h>
> > +#include <linux/slab.h>
> > +#include <cxl.h>
> > +
> > +static DEFINE_IDA(cxl_extent_ida);
>
> According to Documentation/core-api/idr.rst, IDR interface is deprecated and
> xarray usage is preferred.

IDA != IDR

ida_alloc() provides a unique, unused id for the device. I worked hard to
eliminate all extra references to the extent objects so as to ensure object
lifetimes. So I'm keeping this for now.

> > +
> > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > + char *buf)
>
> Parameter alignment a bit off here? and some of the other functions as well.

Thanks, fixed.

[snip]

> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 9e33a0976828..6b00e717e42b 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> > return rc;
> > }
> >
> > +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> > + struct range *extent, int opcode)
> > +{
> > + struct cxl_mbox_cmd mbox_cmd;
> > + size_t size;
> > +
> > + struct cxl_mbox_dc_response *dc_res __free(kfree);
> > + size = struct_size(dc_res, extent_list, 1);
> > + dc_res = kzalloc(size, GFP_KERNEL);
> > + if (!dc_res)
> > + return -ENOMEM;
> > +
> > + dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
> > + memset(dc_res->extent_list[0].reserved, 0, 8);
>
> Not needed. kzalloc already zeroed.

Thanks, Fan mentioned it too.

Ira

2024-04-30 03:24:55

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices

Jonathan Cameron wrote:
> On Sun, 24 Mar 2024 16:18:19 -0700
> [email protected] wrote:
>
> > From: Navneet Singh <[email protected]>
> >
> > Once all extents of an interleave set are present a region must
> > surface an extent to the region.
> >
> > Without interleaving; endpoint decoder and region extents have a 1:1
> > relationship. Future support for IW > 1 will maintain a N:1
> > relationship between the device extents and region extents.
> >
> > Create a region extent device for every device extent found. Release of
> > the extent device triggers a response to the underlying hardware extent.
> >
> > There is no strong use case to support the addition of extents which
> > overlap previously accepted extent ranges. Reject such new extents
> > until such time as a good use case emerges.
> >
> > Expose the necessary details of region extents by creating the following
> > sysfs entries.
> >
> > /sys/bus/cxl/devices/dax_regionX/extentY
> > /sys/bus/cxl/devices/dax_regionX/extentY/offset
> > /sys/bus/cxl/devices/dax_regionX/extentY/length
> > /sys/bus/cxl/devices/dax_regionX/extentY/label
>
> Docs?

That is a good idea.

> The label in particular worries me a little as I'm not sure what
> is in it.

I envisioned a pass through of the tag.

>
> If it's the tag one possible format is a uuid (not a coincidence
> that it is the same length) and interpreting that as characters isn't
> going to get us far. I wonder if we have to treat it as a binary attr
> given we have no idea what it is.

In thinking about this more (and running some experiments): none of these are
strictly necessary in this initial implementation. No code currently uses
them directly.

I questioned these in the past and I've done so again over the weekend.

I was about to rip them out entirely when I remembered Gregory Price's
comments on Discord. There he indicating a desire to very carefully place
dax devices. Without at least the offset and length above (and to a
lesser extent the label) this can't be done.

One still has to create and delete dax devices carefully
to place a dax device in a specific place. But the above give the user
the information to do so. Without it the user must coordinate with the FM
even more (which we could require initially).

On particular issue is the simplification I made within the kernel to
track extents. The extents are no longer ordered within an xarray.

This means a user can't accurately predict which extent will be used when
allocating a dax device. One has to experiment and look at the resulting
mappings of the dax device to see if it got allocated in the right place.

For example:


| DC region |
|-----------------------------------------------------|
|--------| |--------| |
| (ext0) | | (ext1) | |
| (1G) | | (1G) | |

If the above extents were surfaced in the following order:

ext1
ext0

Then a dax device of size 1G was created. The dax mapping would be:


| DC region |
|-----------------------------------------------------|
|--------| |--------| |
| (ext0) | | (ext1) | |
| (1G) | | (1G) | |
| |(daxX.1)| |

Allocating another dax device would result in:

|(daxX.2)} |(daxX.1)| |

I don't think this is exactly what the user is going to expect. This can
be resolved by by looking at the dax device mappings though.[0] So I'm going
to leave this for now. But I expect some additional porcelain is going to
be required to fully meet Gregory's requirements.

[0]
/sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
/sys/bus/dax/devices/daxX.Y/mapping[0..N]/end

Back to the label field: It is currently just the 'tag' of the individual
extent (because no interleaving). My vision for the interleave case would
be for the kernel to assemble device extents into a region extent only if
the tags match and export that.

Thinking on it more though we should leave label out for now. This is the
second time it has been questioned.

> Otherwise a query inline that may well be answered in later patches.
>
> >
> > The use of the extent devices by the DAX layer is deferred to later
> > patches.
> >
> > Signed-off-by: Navneet Singh <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
> >

[snip]

> > +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> > + struct range *hpa_range,
> > + const char *label,
> > + struct range *dpa_range,
> > + struct cxl_endpoint_decoder *cxled)
> > +{
> > + struct region_extent *reg_ext;
> > + struct device *dev;
> > + int rc, id;
> > +
> > + id = ida_alloc(&cxl_extent_ida, GFP_KERNEL);
> > + if (id < 0)
> > + return -ENOMEM;
>
> Whilst it doesn't matter hugely, it's nice if the release does things
> in opposite order of the creation. So perhaps move the ida_alloc
> after kzalloc or reg_ext?

Actually there is an ida resource leak here if the alloc fails. I'll fix
that too.

>
> > +
> > + reg_ext = kzalloc(sizeof(*reg_ext), GFP_KERNEL);
> > + if (!reg_ext)
> > + return -ENOMEM;
> > +
> > + reg_ext->hpa_range = *hpa_range;
> > + reg_ext->ed_ext.dpa_range = *dpa_range;
> > + reg_ext->ed_ext.cxled = cxled;
> > + snprintf(reg_ext->label, DAX_EXTENT_LABEL_LEN, "%s", label);
> > +
> > + dev = &reg_ext->dev;
> > + device_initialize(dev);
> > + dev->id = id;
> > + device_set_pm_not_required(dev);
> > + dev->parent = &cxlr_dax->dev;
> > + dev->type = &region_extent_type;
> > + rc = dev_set_name(dev, "extent%d", dev->id);
> > + if (rc)
> > + goto err;
> > +
> > + rc = device_add(dev);
> > + if (rc)
> > + goto err;
> > +
> > + dev_dbg(dev, "DAX region extent HPA %#llx - %#llx\n",
> > + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> > +
> > + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,
> > + reg_ext);
>
> Indent

Yep.

>
> > +
> > +err:
> > + dev_err(&cxlr_dax->dev, "Failed to initialize DAX extent dev HPA %#llx - %#llx\n",
> > + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> > +
> > + put_device(dev);
> > + return rc;
> > +}
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 9e33a0976828..6b00e717e42b 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> > return rc;
> > }
> >
> > +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> > + struct range *extent, int opcode)
> > +{
> > + struct cxl_mbox_cmd mbox_cmd;
> > + size_t size;
> > +
> > + struct cxl_mbox_dc_response *dc_res __free(kfree);
> > + size = struct_size(dc_res, extent_list, 1);
> > + dc_res = kzalloc(size, GFP_KERNEL);
> > + if (!dc_res)
> > + return -ENOMEM;
> > +
> > + dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
> > + memset(dc_res->extent_list[0].reserved, 0, 8);
> > + dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
> > + dc_res->extent_list_size = cpu_to_le32(1);
>
> I guess this comes up later, but such a response means that if we are offered
> multiple extents in an add with the more flag set then we always reject all
> but the first one.

I've thought about how to best support for the more flag without major
complications. So far the use of the more flag is IMO more trouble than
it is worth. I agree that the spec is clear WRT the grouping of a
response with the more flag set but it is very vague on __why__.

>
> > +
> > + mbox_cmd = (struct cxl_mbox_cmd) {
> > + .opcode = opcode,
> > + .size_in = size,
> > + .payload_in = dc_res,
> > + };
> > +
> > + return cxl_internal_send_cmd(mds, &mbox_cmd);
> > +}
> > +
> > static struct cxl_memdev_state *
> > cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> > {
> > @@ -1029,6 +1055,23 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> > return container_of(cxlds, struct cxl_memdev_state, cxlds);
> > }
> >
> > +void cxl_release_ed_extent(struct cxl_ed_extent *extent)
> > +{
> > + struct cxl_endpoint_decoder *cxled = extent->cxled;
> > + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> > + struct device *dev = mds->cxlds.dev;
> > + int rc;
> > +
> > + dev_dbg(dev, "Releasing DC extent DPA %#llx - %#llx\n",
> > + extent->dpa_range.start, extent->dpa_range.end);
> > +
> > + rc = cxl_send_dc_cap_response(mds, &extent->dpa_range, CXL_MBOX_OP_RELEASE_DC);
>
> Long line that doesn't really need to be.

Yep

>
> > + if (rc)
> > + dev_dbg(dev, "Failed to respond releasing extent DPA %#llx - %#llx; %d\n",
> > + extent->dpa_range.start, extent->dpa_range.end, rc);
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_release_ed_extent, CXL);
> > +
> > static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> > enum cxl_event_log_type type)
> > {
> > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> > index 3e563ab29afe..7635ff109578 100644
> > --- a/drivers/cxl/core/region.c
> > +++ b/drivers/cxl/core/region.c
> > @@ -1450,11 +1450,81 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> > return 0;
> > }
>
> > static int cxl_region_attach_position(struct cxl_region *cxlr,
> > @@ -2684,6 +2754,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
> >
> > dev = &cxlr_dax->dev;
> > cxlr_dax->cxlr = cxlr;
> > + cxlr->cxlr_dax = cxlr_dax;
> > device_initialize(dev);
> > lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> > device_set_pm_not_required(dev);
> > @@ -2799,7 +2870,10 @@ static int cxl_region_read_extents(struct cxl_region *cxlr)
> > static void cxlr_dax_unregister(void *_cxlr_dax)
> > {
> > struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> > + struct cxl_region *cxlr = cxlr_dax->cxlr;
> >
> > + cxlr->cxlr_dax = NULL;
> > + cxlr_dax->cxlr = NULL;
> > device_unregister(&cxlr_dax->dev);
> > }
> >
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index d585f5fdd3ae..5379ad7f5852 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
>
>
> > +/**
> > + * struct region_extent - CXL DAX region extent
> > + * @dev: device representing this extent
> > + * @hpa_range: HPA range of this extent
> > + * @label: label of the extent
> > + * @ed_ext: Endpoint decoder extent which backs this extent
> > + */
> > +#define DAX_EXTENT_LABEL_LEN 64
>
> Something called DAX_* doesn't belong in this header...
> Either give a CXL_DAX_ prefix or move the definition if appropriate.
>

I've remove this as well as the label sysfs instead.

Ira

2024-05-01 23:49:47

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

Jonathan Cameron wrote:
>
> >
> > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
> Hi Ira, Navneet.
> >
> > Remaining work:
> >
> > 1) Integrate the QoS work from Dave Jiang
> > 2) Interleave support
>
>
> More flag. This one I think is potentially important and don't
> see any handling in here.

Nope I admit I missed the spec requirement.

>
> Whilst an FM could in theory be careful to avoid sending a
> sparse set of extents, if the device is managing the memory range
> (which is possible all it supports) and the FM issues an Initiate Dynamic
> Capacity Add with Free (again may be all device supports) then we
> can't stop the device issuing a bunch of sparse extents.
>
> Now it won't be broken as such without this, but every time we
> accept the first extent that will implicitly reject the rest.
> That will look very ugly to an FM which has to poke potentially many
> times to successfully allocate memory to a host.

This helps me to see see why the more bit is useful.

>
> I also don't think it will be that hard to support, but maybe I'm
> missing something?

Just a bunch of code and refactoring busy work. ;-) It's not rocket
science but does fundamentally change the arch again.

>
> My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> over a list of extents passed in then slightly more complex response
> generation.

Not exactly 'just a loop'. No matter how I work this out there is the
possibility that some extents get surfaced and then the kernel tries to
remove them because it should not have.

To be most safe the cxl core is going to have to make 2 round trips to the
cxl region layer for each extent. The first determines if the extent is
valid and creates the extent as much as possible. The second actually
surfaces the extents. However, if the surface fails then you might not
get the extents back. So now we are in an invalid state. :-/ WARN and
continue I guess?!??!

I think the safest way to handle this is add a new kernel notify event
called 'extent create' which stops short of surfacing the extent. [I'm
not 100% sure how this is going to affect interleave.]

I think the safest logic for add is something like:

cxl_handle_dcd_add_event()
add_extent(squirl_list, extent);

if (more bit) /* wait for more */
return;

/* Create extents to hedge the bets against failure */
for_each(squirl_list)
if (notify 'extent create' != ok)
send_response(fail);
return;

for_each(squirl_list)
if (notify 'surface' != ok)
/*
* If the more bit was set, some extents
* have been surfaced and now need to be
* removed...
*
* Try to remove them and hope...
*/
WARN_ON('surface extents failed');
for_each(squirl_list)
notify 'remove without response'
send_response(fail);
return;

send_response(squirl_list, accept);

The logic for remove is not changed AFAICS because the device must allow
for memory to be released at any time so the host is free to release each
of the extents individually despite the 'more' bit???

>
> I don't want this to block getting initial DCD support in but it
> will be a bit ugly if we quickly support the more flag and then end
> up with just one kernel that an FM has to be careful with...

I'm not sure which is worse. Given your use case above it seems like the
more bit may be more important for 'dumb' devices which want to add
extents in blocks before responding to the FM. Thus complicating the FM.

It seems 'smarter' devices which could figure this out (not requiring the
more bit) are the ones which will be developed later. So it seems the use
case time line is the opposite of what we need right now.

For that reason I'm inclined to try and get this in.

Ira

2024-05-02 21:13:23

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices

Ira Weiny wrote:
> Jonathan Cameron wrote:
> > On Sun, 24 Mar 2024 16:18:19 -0700
> > [email protected] wrote:
> >
> > > From: Navneet Singh <[email protected]>
> > >
> > > Once all extents of an interleave set are present a region must
> > > surface an extent to the region.
> > >
> > > Without interleaving; endpoint decoder and region extents have a 1:1
> > > relationship. Future support for IW > 1 will maintain a N:1
> > > relationship between the device extents and region extents.
> > >
> > > Create a region extent device for every device extent found. Release of
> > > the extent device triggers a response to the underlying hardware extent.
> > >
> > > There is no strong use case to support the addition of extents which
> > > overlap previously accepted extent ranges. Reject such new extents
> > > until such time as a good use case emerges.
> > >
> > > Expose the necessary details of region extents by creating the following
> > > sysfs entries.
> > >
> > > /sys/bus/cxl/devices/dax_regionX/extentY
> > > /sys/bus/cxl/devices/dax_regionX/extentY/offset
> > > /sys/bus/cxl/devices/dax_regionX/extentY/length
> > > /sys/bus/cxl/devices/dax_regionX/extentY/label
> >
> > Docs?
>
> That is a good idea.
>
> > The label in particular worries me a little as I'm not sure what
> > is in it.
>
> I envisioned a pass through of the tag.
>
> >
> > If it's the tag one possible format is a uuid (not a coincidence
> > that it is the same length) and interpreting that as characters isn't
> > going to get us far. I wonder if we have to treat it as a binary attr
> > given we have no idea what it is.
>
> In thinking about this more (and running some experiments): none of these are
> strictly necessary in this initial implementation. No code currently uses
> them directly.
>
> I questioned these in the past and I've done so again over the weekend.
>
> I was about to rip them out entirely when I remembered Gregory Price's
> comments on Discord. There he indicating a desire to very carefully place
> dax devices. Without at least the offset and length above (and to a
> lesser extent the label) this can't be done.

Careful placement of dax-devices physically requires an entirely new
allocation ABI. There is the mapping_store() interface that was added
for a specific kexec / VMM fast restore use case, but that never
envisioned the sparse region case. So I do think it is worthwhile to
punt on that question to a later add-on feature.

[..]
> I don't think this is exactly what the user is going to expect. This can
> be resolved by by looking at the dax device mappings though.[0] So I'm going
> to leave this for now. But I expect some additional porcelain is going to
> be required to fully meet Gregory's requirements.

Not sure what the exact requirement is, but if it's the typical, "I want
to allocate by tag", then I think there is another potential coarse
grained solution that probably covers most cases. Allow multiple
dax_regions per cxl_dcd_regions, where each dax_region manages an
exclusive set of tags.

The host negotiates a dax_region tag layout with the orchestrator and
can then trust that all of the extents that show up in a given dax_region
belong to a given tag or set of tags.

This is not something that needs to be considered in the initial
enabling, but is potentially a way to avoid bolting-on a new fine grained
allocation api after the fact.


> [0]
> /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
> /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
>
> Back to the label field: It is currently just the 'tag' of the individual
> extent (because no interleaving). My vision for the interleave case would
> be for the kernel to assemble device extents into a region extent only if
> the tags match and export that.
>
> Thinking on it more though we should leave label out for now. This is the
> second time it has been questioned.

I don't understand the issue. That is a critical piece of information
and it is at the cxl device level

/sys/bus/cxl/devices/dax_regionX/extentY/label

..now I would just call that "tag" and UUID format it (to Jonathan's
point), but I see no rationale to hide what is most likely the most
useful information about an extent.

2024-05-03 09:33:28

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

On Wed, 1 May 2024 16:49:24 -0700
Ira Weiny <[email protected]> wrote:

> Jonathan Cameron wrote:
> >
> > >
> > > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
> > Hi Ira, Navneet.
> > >
> > > Remaining work:
> > >
> > > 1) Integrate the QoS work from Dave Jiang
> > > 2) Interleave support
> >
> >
> > More flag. This one I think is potentially important and don't
> > see any handling in here.
>
> Nope I admit I missed the spec requirement.
>
> >
> > Whilst an FM could in theory be careful to avoid sending a
> > sparse set of extents, if the device is managing the memory range
> > (which is possible all it supports) and the FM issues an Initiate Dynamic
> > Capacity Add with Free (again may be all device supports) then we
> > can't stop the device issuing a bunch of sparse extents.
> >
> > Now it won't be broken as such without this, but every time we
> > accept the first extent that will implicitly reject the rest.
> > That will look very ugly to an FM which has to poke potentially many
> > times to successfully allocate memory to a host.
>
> This helps me to see see why the more bit is useful.
>
> >
> > I also don't think it will be that hard to support, but maybe I'm
> > missing something?
>
> Just a bunch of code and refactoring busy work. ;-) It's not rocket
> science but does fundamentally change the arch again.
>
> >
> > My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> > over a list of extents passed in then slightly more complex response
> > generation.
>
> Not exactly 'just a loop'. No matter how I work this out there is the
> possibility that some extents get surfaced and then the kernel tries to
> remove them because it should not have.

Lets consider why it might need to back out.
1) Device sends an invalid set of extents - so maybe one in a later message
overlaps with an already allocated extent. Device bug, handling can
be extremely inelegant - up to crashing the kernel. Worst that happens
due to race is probably a poison storm / machine check fun? Not our
responsibility to deal with something that broken (in my view!) Best effort
only.

2) Host can't handle the extent for some reason and didn't know that until
later - can just reject the ones it can't handle.

>
> To be most safe the cxl core is going to have to make 2 round trips to the
> cxl region layer for each extent. The first determines if the extent is
> valid and creates the extent as much as possible. The second actually
> surfaces the extents. However, if the surface fails then you might not
> get the extents back. So now we are in an invalid state. :-/ WARN and
> continue I guess?!??!

Yes. Orchestrator can decide how to handle - probably reboot server in as
gentle a fashion as possible.


>
> I think the safest way to handle this is add a new kernel notify event
> called 'extent create' which stops short of surfacing the extent. [I'm
> not 100% sure how this is going to affect interleave.]
>
> I think the safest logic for add is something like:
>
> cxl_handle_dcd_add_event()
> add_extent(squirl_list, extent);
>
> if (more bit) /* wait for more */
> return;
>
> /* Create extents to hedge the bets against failure */
> for_each(squirl_list)
> if (notify 'extent create' != ok)
> send_response(fail);
> return;
>
> for_each(squirl_list)
> if (notify 'surface' != ok)
> /*
> * If the more bit was set, some extents
> * have been surfaced and now need to be
> * removed...
> *
> * Try to remove them and hope...
> */

If we failed to surface them all another option is just tell the device
that. Responds with the extents that successfully surfaced and reject
all others (or all after the one that failed?) So for the lower layers
send the device a response that says "thanks but I only took these ones"
and for the upper layers pretend "I was only offered these ones"

> WARN_ON('surface extents failed');
> for_each(squirl_list)
> notify 'remove without response'
> send_response(fail);
> return;
>
> send_response(squirl_list, accept);
>
> The logic for remove is not changed AFAICS because the device must allow
> for memory to be released at any time so the host is free to release each
> of the extents individually despite the 'more' bit???

Yes, but only after it accepted them - which needs to be done in one go.
So you can't just send releases before that (the device will return an
error and keep them in the pending list I think...)

>
> >
> > I don't want this to block getting initial DCD support in but it
> > will be a bit ugly if we quickly support the more flag and then end
> > up with just one kernel that an FM has to be careful with...
>
> I'm not sure which is worse. Given your use case above it seems like the
> more bit may be more important for 'dumb' devices which want to add
> extents in blocks before responding to the FM. Thus complicating the FM.
>
> It seems 'smarter' devices which could figure this out (not requiring the
> more bit) are the ones which will be developed later. So it seems the use
> case time line is the opposite of what we need right now.

Once we hit shareable capacity (which the smarter devices will use) then
this become the dominant approach to non contiguous allocations because
you can't add extents with a given tag in multiple goes.

So I'd expect the more flag to be more common not less over time.
>
> For that reason I'm inclined to try and get this in.
>

Great - but I'd not worry too much about bad effects if you get invalid
lists from the device. If the only option is shout and panic, then fine
though I'd imagine we can do slightly better than that, so maybe warn
extensively and don't let the region be used.

Jonathan

> Ira
>


2024-05-03 17:10:17

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 07/26] cxl/port: Add dynamic capacity size support to endpoint decoders

Jonathan Cameron wrote:
> On Sun, 24 Mar 2024 16:18:10 -0700
> [email protected] wrote:
>
> > From: Navneet Singh <[email protected]>
> >
> > To support Dynamic Capacity Devices (DCD) endpoint decoders will need to
> > map DC partitions (regions). In addition to assigning the size of the
> > DC partition, the decoder must assign any skip value from the previous
> > decoder. This must be done within a contiguous DPA space.
> >
> > Two complications arise with Dynamic Capacity regions which did not
> > exist with Ram and PMEM partitions. First, gaps in the DPA space can
> > exist between and around the DC Regions. Second, the Linux resource
> > tree does not allow a resource to be marked across existing nodes within
> > a tree.
> >
> > For clarity, below is an example of an 60GB device with 10GB of RAM,
> > 10GB of PMEM and 10GB for each of 2 DC Regions. The desired CXL mapping
> > is 5GB of RAM, 5GB of PMEM, and all 10GB of DC1.
> >
> > DPA RANGE
> > (dpa_res)
> > 0GB 10GB 20GB 30GB 40GB 50GB 60GB
> > |----------|----------|----------|----------|----------|----------|
> >
> > RAM PMEM DC0 DC1
> > (ram_res) (pmem_res) (dc_res[0]) (dc_res[1])
> > |----------|----------| <gap> |----------| <gap> |----------|
> >
> > RAM PMEM DC1
> > |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXXXXXXX|
> > 0GB 5GB 10GB 15GB 20GB 30GB 40GB 50GB 60GB
>
>
> To add another corner to the example, maybe map only part of DC1?

Maybe. See below.


[snip]

> > @@ -500,6 +617,21 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> > else
> > free_pmem_start = cxlds->pmem_res.start;
> >
> > + /*
> > + * Limit each decoder to a single DC region to map memory with
> > + * different DSMAS entry.

This prevents more than 1 region per DC partition (region).

> > + */
> > + dc_index = dc_mode_to_region_index(cxled->mode);
> > + if (dc_index >= 0) {
> > + if (cxlds->dc_res[dc_index].child) {
> > + dev_err(dev, "Cannot allocate DPA from DC Region: %d\n",
> > + dc_index);
> > + rc = -EINVAL;
> > + goto out;
> > + }
> > + free_dc_start = cxlds->dc_res[dc_index].start;
> > + }
> > +
> > if (cxled->mode == CXL_DECODER_RAM) {
> > start = free_ram_start;
> > avail = cxlds->ram_res.end - start + 1;
> > @@ -521,12 +653,38 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> > else
> > skip_end = start - 1;
> > skip = skip_end - skip_start + 1;
> > + } else if (cxl_decoder_mode_is_dc(cxled->mode)) {
> > + resource_size_t skip_start, skip_end;
> > +
> > + start = free_dc_start;
> > + avail = cxlds->dc_res[dc_index].end - start + 1;
> > + if ((resource_size(&cxlds->pmem_res) == 0) || !cxlds->pmem_res.child)
> > + skip_start = free_ram_start;
> > + else
> > + skip_start = free_pmem_start;
> > + /*
> > + * If any dc region is already mapped, then that allocation
> > + * already handled the RAM and PMEM skip. Check for DC region
> > + * skip.
> > + */
> > + for (int i = dc_index - 1; i >= 0 ; i--) {
> > + if (cxlds->dc_res[i].child) {
> > + skip_start = cxlds->dc_res[i].child->end + 1;
> > + break;
> > + }
> > + }
> > +
> > + skip_end = start - 1;
> > + skip = skip_end - skip_start + 1;
>
> I notice in the pmem equivalent there is a case for part of the region already mapped.
> Can that not happen for a DC region as well?

See above check. Each DC region (partition) was to be associated with a
single DSMAS entry. I'm unclear now why that decision was made.

It does not seem hard to add this though. Do we really need that ability
considering dax devices are likely going to be the main boundry for users
of a DC region?

Ira

2024-05-03 17:23:17

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 07/26] cxl/port: Add dynamic capacity size support to endpoint decoders

Ira Weiny wrote:
> > > else
> > > free_pmem_start = cxlds->pmem_res.start;
> > >
> > > + /*
> > > + * Limit each decoder to a single DC region to map memory with
> > > + * different DSMAS entry.
>
> This prevents more than 1 region per DC partition (region).

Why? Multiple regions per partition is the current status quo for other
partition types.

> > I notice in the pmem equivalent there is a case for part of the region already mapped.
> > Can that not happen for a DC region as well?
>
> See above check. Each DC region (partition) was to be associated with a
> single DSMAS entry. I'm unclear now why that decision was made.

The limitation of one DSMAS per partition makes sense otherwise that
would indicate performance across different spans of the partition.

> It does not seem hard to add this though. Do we really need that ability
> considering dax devices are likely going to be the main boundry for users
> of a DC region?

It seems like extra work to make DCD a special case compared to the
other partition types, so the burden of proof is the other way. Why
tolerate DCD divergence from the status quo?

2024-05-03 19:10:43

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

Alison Schofield wrote:
> On Sun, Mar 24, 2024 at 04:18:08PM -0700, Ira Weiny wrote:
> > cxl_dpa_set_mode() checks the mode for validity two times, once outside
> > of the DPA RW semaphore and again within.
>
> Not true.

Sorry for not being clear. It does check the mode 2x but not for
validity. I'll clarify.

> It only checks mode once before the lock. It checks for
> capacity after the lock. If it didn't check mode before the lock,
> then unsupported modes would fall through.

But we can check the mode 1 time and either check the size or fail.

>
> > The function is not in a critical path.
>
> Implying what here? OK to check twice (even though it wasn't)
> or OK to expand scope of locking.

Implying that checking the mode outside the lock is not required.

>
> > Prior to Dynamic Capacity the extra check was not much
> > of an issue. The addition of DC modes increases the complexity of
> > the check.
> >
> > Simplify the mode check before adding the more complex DC modes.
> >
>
> The addition of the DC mode check doesn't seem complex.

It is if you have to check it 2 times.

>
> Pardon my picking at the words, but if you'd like to refactor the
> function, just say so. The final result is a bit more readable, but
> also adding the DC mode checks without refactoring would read fine
> also.

When I added the DC mode to this function without this refactoring it was
quite a bit more code and ugly IMO. So this cleanup helped. If I were
not adding the DC code there would be much less reason to change this
function.


[snip]

> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index 7d97790b893d..66b8419fd0c3 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -411,44 +411,35 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> > struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > struct device *dev = &cxled->cxld.dev;
> > - int rc;
> >
> > + guard(rwsem_write)(&cxl_dpa_rwsem);
> > + if (cxled->cxld.flags & CXL_DECODER_F_ENABLE)
> > + return -EBUSY;
> > +
> > + /*
> > + * Check that the mode is supported by the current partition
> > + * configuration
> > + */
> > switch (mode) {
> > case CXL_DECODER_RAM:
> > + if (!resource_size(&cxlds->ram_res)) {
> > + dev_dbg(dev, "no available ram capacity\n");
> > + return -ENXIO;
> > + }
> > + break;
> > case CXL_DECODER_PMEM:
> > + if (!resource_size(&cxlds->pmem_res)) {
> > + dev_dbg(dev, "no available pmem capacity\n");
> > + return -ENXIO;
> > + }
> > break;
> > default:
> > dev_dbg(dev, "unsupported mode: %d\n", mode);
> > return -EINVAL;
> > }
> >
>
> delete extra line

You don't like the space following the switch?

Ira

2024-05-03 20:33:23

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

On Fri, May 03, 2024 at 12:09:27PM -0700, Ira Weiny wrote:

snip

>
> > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > > index 7d97790b893d..66b8419fd0c3 100644
> > > --- a/drivers/cxl/core/hdm.c
> > > +++ b/drivers/cxl/core/hdm.c
> > > @@ -411,44 +411,35 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> > > struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > > struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > > struct device *dev = &cxled->cxld.dev;
> > > - int rc;
> > >
> > > + guard(rwsem_write)(&cxl_dpa_rwsem);
> > > + if (cxled->cxld.flags & CXL_DECODER_F_ENABLE)
> > > + return -EBUSY;
> > > +
> > > + /*
> > > + * Check that the mode is supported by the current partition
> > > + * configuration
> > > + */
> > > switch (mode) {
> > > case CXL_DECODER_RAM:
> > > + if (!resource_size(&cxlds->ram_res)) {
> > > + dev_dbg(dev, "no available ram capacity\n");
> > > + return -ENXIO;
> > > + }
> > > + break;
> > > case CXL_DECODER_PMEM:
> > > + if (!resource_size(&cxlds->pmem_res)) {
> > > + dev_dbg(dev, "no available pmem capacity\n");
> > > + return -ENXIO;
> > > + }
> > > break;
> > > default:
> > > dev_dbg(dev, "unsupported mode: %d\n", mode);
> > > return -EINVAL;
> > > }
> > >
> >
> > delete extra line
>
> You don't like the space following the switch?
>
> Ira

Sorry - looks fine. Ignore my comment.


2024-05-04 01:20:01

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

Ira Weiny wrote:
> Alison Schofield wrote:
> > On Sun, Mar 24, 2024 at 04:18:08PM -0700, Ira Weiny wrote:
> > > cxl_dpa_set_mode() checks the mode for validity two times, once outside
> > > of the DPA RW semaphore and again within.
> >
> > Not true.
>
> Sorry for not being clear. It does check the mode 2x but not for
> validity. I'll clarify.
>
> > It only checks mode once before the lock. It checks for
> > capacity after the lock. If it didn't check mode before the lock,
> > then unsupported modes would fall through.
>
> But we can check the mode 1 time and either check the size or fail.
>
> >
> > > The function is not in a critical path.
> >
> > Implying what here? OK to check twice (even though it wasn't)
> > or OK to expand scope of locking.
>
> Implying that checking the mode outside the lock is not required.

The @mode check outside the lock is there to taking the lock when not
necessary because the passed in mode is already bogus.

The lock is about making sure the write of cxled->mode relative to the
state of the dpa partitions is an atomic check-and-set.

So this makes the function unconditionally take the lock when it might
be bogus to do so. The value of reorganizing this is questionable.

2024-05-04 04:13:44

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

Ira Weiny wrote:
> Alison Schofield wrote:
> > On Sun, Mar 24, 2024 at 04:18:08PM -0700, Ira Weiny wrote:
> > > cxl_dpa_set_mode() checks the mode for validity two times, once outside
> > > of the DPA RW semaphore and again within.
> >
> > Not true.
>
> Sorry for not being clear. It does check the mode 2x but not for
> validity. I'll clarify.
>
> > It only checks mode once before the lock. It checks for
> > capacity after the lock. If it didn't check mode before the lock,
> > then unsupported modes would fall through.
>
> But we can check the mode 1 time and either check the size or fail.
>
> >
> > > The function is not in a critical path.
> >
> > Implying what here? OK to check twice (even though it wasn't)
> > or OK to expand scope of locking.
>
> Implying that checking the mode outside the lock is not required.
>
> >
> > > Prior to Dynamic Capacity the extra check was not much
> > > of an issue. The addition of DC modes increases the complexity of
> > > the check.
> > >
> > > Simplify the mode check before adding the more complex DC modes.
> > >
> >
> > The addition of the DC mode check doesn't seem complex.
>
> It is if you have to check it 2 times.
>
> >
> > Pardon my picking at the words, but if you'd like to refactor the
> > function, just say so. The final result is a bit more readable, but
> > also adding the DC mode checks without refactoring would read fine
> > also.
>
> When I added the DC mode to this function without this refactoring it was
> quite a bit more code and ugly IMO. So this cleanup helped. If I were
> not adding the DC code there would be much less reason to change this
> function.

Where did the "quite a bit more code" come from? A change that moves
unnecessary code under a lock and is larger than just incrementally
extending the status quo does not feel like a cleanup.

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 7d97790b893d..0dc886bc22c6 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -411,11 +411,12 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
struct device *dev = &cxled->cxld.dev;
- int rc;
+ int rc, dcd;

switch (mode) {
case CXL_DECODER_RAM:
case CXL_DECODER_PMEM:
+ case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
break;
default:
dev_dbg(dev, "unsupported mode: %d\n", mode);
@@ -442,6 +443,11 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
rc = -ENXIO;
goto out;
}
+ dcd = dc_mode_to_region_index(mode);
+ if (resource_size(&cxlds->dc_res[dcd]) == 0) {
+ dev_dbg(dev, "no available dynamic capacity\n");
+ goto out;
+ }

cxled->mode = mode;
rc = 0;

2024-05-06 03:47:32

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

Dan Williams wrote:
> Ira Weiny wrote:
> > Alison Schofield wrote:
> > > On Sun, Mar 24, 2024 at 04:18:08PM -0700, Ira Weiny wrote:
> > > > cxl_dpa_set_mode() checks the mode for validity two times, once outside
> > > > of the DPA RW semaphore and again within.
> > >
> > > Not true.
> >
> > Sorry for not being clear. It does check the mode 2x but not for
> > validity. I'll clarify.
> >
> > > It only checks mode once before the lock. It checks for
> > > capacity after the lock. If it didn't check mode before the lock,
> > > then unsupported modes would fall through.
> >
> > But we can check the mode 1 time and either check the size or fail.
> >
> > >
> > > > The function is not in a critical path.
> > >
> > > Implying what here? OK to check twice (even though it wasn't)
> > > or OK to expand scope of locking.
> >
> > Implying that checking the mode outside the lock is not required.
> >
> > >
> > > > Prior to Dynamic Capacity the extra check was not much
> > > > of an issue. The addition of DC modes increases the complexity of
> > > > the check.
> > > >
> > > > Simplify the mode check before adding the more complex DC modes.
> > > >
> > >
> > > The addition of the DC mode check doesn't seem complex.
> >
> > It is if you have to check it 2 times.
> >
> > >
> > > Pardon my picking at the words, but if you'd like to refactor the
> > > function, just say so. The final result is a bit more readable, but
> > > also adding the DC mode checks without refactoring would read fine
> > > also.
> >
> > When I added the DC mode to this function without this refactoring it was
> > quite a bit more code and ugly IMO. So this cleanup helped. If I were
> > not adding the DC code there would be much less reason to change this
> > function.
>
> Where did the "quite a bit more code" come from? A change that moves
> unnecessary code under a lock and is larger than just incrementally
> extending the status quo does not feel like a cleanup.

I'll drop the patch.

Ira

>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7d97790b893d..0dc886bc22c6 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -411,11 +411,12 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &cxled->cxld.dev;
> - int rc;
> + int rc, dcd;
>
> switch (mode) {
> case CXL_DECODER_RAM:
> case CXL_DECODER_PMEM:
> + case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> break;
> default:
> dev_dbg(dev, "unsupported mode: %d\n", mode);
> @@ -442,6 +443,11 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> rc = -ENXIO;
> goto out;
> }
> + dcd = dc_mode_to_region_index(mode);
> + if (resource_size(&cxlds->dc_res[dcd]) == 0) {
> + dev_dbg(dev, "no available dynamic capacity\n");
> + goto out;
> + }
>
> cxled->mode = mode;
> rc = 0;



2024-05-06 04:06:25

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode()

Dan Williams wrote:
> Ira Weiny wrote:
> > Alison Schofield wrote:
> > > On Sun, Mar 24, 2024 at 04:18:08PM -0700, Ira Weiny wrote:
> > > > cxl_dpa_set_mode() checks the mode for validity two times, once outside
> > > > of the DPA RW semaphore and again within.
> > >
> > > Not true.
> >
> > Sorry for not being clear. It does check the mode 2x but not for
> > validity. I'll clarify.
> >
> > > It only checks mode once before the lock. It checks for
> > > capacity after the lock. If it didn't check mode before the lock,
> > > then unsupported modes would fall through.
> >
> > But we can check the mode 1 time and either check the size or fail.
> >
> > >
> > > > The function is not in a critical path.
> > >
> > > Implying what here? OK to check twice (even though it wasn't)
> > > or OK to expand scope of locking.
> >
> > Implying that checking the mode outside the lock is not required.
>
> The @mode check outside the lock is there to taking the lock when not
> necessary because the passed in mode is already bogus.

Sorry I meant to say 'is required'.

>
> The lock is about making sure the write of cxled->mode relative to the
> state of the dpa partitions is an atomic check-and-set.
>
> So this makes the function unconditionally take the lock when it might
> be bogus to do so. The value of reorganizing this is questionable.

Why would it be bogus? I don't see that. Regardless I dropped the patch as it
is not worth spending more time on. There are bigger issues to resolve with
this series.

Ira

2024-05-06 04:08:05

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 07/26] cxl/port: Add dynamic capacity size support to endpoint decoders

Dan Williams wrote:
> Ira Weiny wrote:
> > > > else
> > > > free_pmem_start = cxlds->pmem_res.start;
> > > >
> > > > + /*
> > > > + * Limit each decoder to a single DC region to map memory with
> > > > + * different DSMAS entry.
> >
> > This prevents more than 1 region per DC partition (region).
>
> Why? Multiple regions per partition is the current status quo for other
> partition types.
>
> > > I notice in the pmem equivalent there is a case for part of the region already mapped.
> > > Can that not happen for a DC region as well?
> >
> > See above check. Each DC region (partition) was to be associated with a
> > single DSMAS entry. I'm unclear now why that decision was made.
>
> The limitation of one DSMAS per partition makes sense otherwise that
> would indicate performance across different spans of the partition.
>
> > It does not seem hard to add this though. Do we really need that ability
> > considering dax devices are likely going to be the main boundry for users
> > of a DC region?
>
> It seems like extra work to make DCD a special case compared to the
> other partition types, so the burden of proof is the other way. Why
> tolerate DCD divergence from the status quo?

I don't remember the details. But this was discussed before in a call. I'm ok
adding this support. I have a test for it now. Just need to tweek a couple of
things.

Ira

2024-05-06 04:24:42

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

Jonathan Cameron wrote:
> On Wed, 1 May 2024 16:49:24 -0700
> Ira Weiny <[email protected]> wrote:
>
> > Jonathan Cameron wrote:
> > >
> > > >
> > > > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
> > > Hi Ira, Navneet.
> > > >
> > > > Remaining work:
> > > >
> > > > 1) Integrate the QoS work from Dave Jiang
> > > > 2) Interleave support
> > >
> > >
> > > More flag. This one I think is potentially important and don't
> > > see any handling in here.
> >
> > Nope I admit I missed the spec requirement.
> >
> > >
> > > Whilst an FM could in theory be careful to avoid sending a
> > > sparse set of extents, if the device is managing the memory range
> > > (which is possible all it supports) and the FM issues an Initiate Dynamic
> > > Capacity Add with Free (again may be all device supports) then we
> > > can't stop the device issuing a bunch of sparse extents.
> > >
> > > Now it won't be broken as such without this, but every time we
> > > accept the first extent that will implicitly reject the rest.
> > > That will look very ugly to an FM which has to poke potentially many
> > > times to successfully allocate memory to a host.
> >
> > This helps me to see see why the more bit is useful.
> >
> > >
> > > I also don't think it will be that hard to support, but maybe I'm
> > > missing something?
> >
> > Just a bunch of code and refactoring busy work. ;-) It's not rocket
> > science but does fundamentally change the arch again.
> >
> > >
> > > My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> > > over a list of extents passed in then slightly more complex response
> > > generation.
> >
> > Not exactly 'just a loop'. No matter how I work this out there is the
> > possibility that some extents get surfaced and then the kernel tries to
> > remove them because it should not have.
>
> Lets consider why it might need to back out.
> 1) Device sends an invalid set of extents - so maybe one in a later message
> overlaps with an already allocated extent. Device bug, handling can
> be extremely inelegant - up to crashing the kernel. Worst that happens
> due to race is probably a poison storm / machine check fun? Not our
> responsibility to deal with something that broken (in my view!) Best effort
> only.
>
> 2) Host can't handle the extent for some reason and didn't know that until
> later - can just reject the ones it can't handle.

3) Something in the host fails like ENOMEM on a later extent surface which
requires the host to back out of all of them.

3 should be rare and I'm working toward it. But it is possible this will
happen.

If you have a 'prepare' notify it should avoid most of these because the
extents will be mostly formed. But there are some error paths on the actual
surface code path.

>
> >
> > To be most safe the cxl core is going to have to make 2 round trips to the
> > cxl region layer for each extent. The first determines if the extent is
> > valid and creates the extent as much as possible. The second actually
> > surfaces the extents. However, if the surface fails then you might not
> > get the extents back. So now we are in an invalid state. :-/ WARN and
> > continue I guess?!??!
>
> Yes. Orchestrator can decide how to handle - probably reboot server in as
> gentle a fashion as possible.
>

Ok

>
> >
> > I think the safest way to handle this is add a new kernel notify event
> > called 'extent create' which stops short of surfacing the extent. [I'm
> > not 100% sure how this is going to affect interleave.]
> >
> > I think the safest logic for add is something like:
> >
> > cxl_handle_dcd_add_event()
> > add_extent(squirl_list, extent);
> >
> > if (more bit) /* wait for more */
> > return;
> >
> > /* Create extents to hedge the bets against failure */
> > for_each(squirl_list)
> > if (notify 'extent create' != ok)
> > send_response(fail);
> > return;
> >
> > for_each(squirl_list)
> > if (notify 'surface' != ok)
> > /*
> > * If the more bit was set, some extents
> > * have been surfaced and now need to be
> > * removed...
> > *
> > * Try to remove them and hope...
> > */
>
> If we failed to surface them all another option is just tell the device
> that. Responds with the extents that successfully surfaced and reject
> all others (or all after the one that failed?) So for the lower layers
> send the device a response that says "thanks but I only took these ones"
> and for the upper layers pretend "I was only offered these ones"
>

But doesn't that basically break the more bit? I'm willing to do that as it is
easier for the host.

> > WARN_ON('surface extents failed');
> > for_each(squirl_list)
> > notify 'remove without response'
> > send_response(fail);
> > return;
> >
> > send_response(squirl_list, accept);
> >
> > The logic for remove is not changed AFAICS because the device must allow
> > for memory to be released at any time so the host is free to release each
> > of the extents individually despite the 'more' bit???
>
> Yes, but only after it accepted them - which needs to be done in one go.
> So you can't just send releases before that (the device will return an
> error and keep them in the pending list I think...)

:-( OK so this more bit is really more... no pun intended. Because this
breaks the entire model I have if I have to treat these as a huge atomic unit.

Let me think on that a bit more. Obviously it is just tagging an iterating the
extents to find those associated with a more bit on accept. But it will take
some time to code up.

>
> >
> > >
> > > I don't want this to block getting initial DCD support in but it
> > > will be a bit ugly if we quickly support the more flag and then end
> > > up with just one kernel that an FM has to be careful with...
> >
> > I'm not sure which is worse. Given your use case above it seems like the
> > more bit may be more important for 'dumb' devices which want to add
> > extents in blocks before responding to the FM. Thus complicating the FM.
> >
> > It seems 'smarter' devices which could figure this out (not requiring the
> > more bit) are the ones which will be developed later. So it seems the use
> > case time line is the opposite of what we need right now.
>
> Once we hit shareable capacity (which the smarter devices will use) then
> this become the dominant approach to non contiguous allocations because
> you can't add extents with a given tag in multiple goes.

Why not? Sharing is going to require some synchronization with the
orchestrator and can't the user app just report it did not get all it's memory
and wait for more? With the same tag?

>
> So I'd expect the more flag to be more common not less over time.
> >
> > For that reason I'm inclined to try and get this in.
> >
>
> Great - but I'd not worry too much about bad effects if you get invalid
> lists from the device. If the only option is shout and panic, then fine
> though I'd imagine we can do slightly better than that, so maybe warn
> extensively and don't let the region be used.

It is not just about invalid lists. It is that setting up the extent devices
may fail and waiting for the devices to be set up means that they are user
visible. So that is the chicken and the egg...

This is unlikely and perhaps the partials should just be surfaced and accept
whatever works. Then let it all tear down later if it does not all go.

But I was trying to honor the accept 'all or nothing' as that is what has been
stated as the requirement of the more bit.

But it seems that it does not __have__ to be atomic. Or at least the partials
can be cleaned up and all tried again.

Ira

>
> Jonathan
>
> > Ira
> >
>



2024-05-06 04:36:03

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices

Dan Williams wrote:
> Ira Weiny wrote:
> > Jonathan Cameron wrote:
> > > On Sun, 24 Mar 2024 16:18:19 -0700
> > > [email protected] wrote:
> > >
> > > > From: Navneet Singh <[email protected]>
> > > >
> > > > Once all extents of an interleave set are present a region must
> > > > surface an extent to the region.
> > > >
> > > > Without interleaving; endpoint decoder and region extents have a 1:1
> > > > relationship. Future support for IW > 1 will maintain a N:1
> > > > relationship between the device extents and region extents.
> > > >
> > > > Create a region extent device for every device extent found. Release of
> > > > the extent device triggers a response to the underlying hardware extent.
> > > >
> > > > There is no strong use case to support the addition of extents which
> > > > overlap previously accepted extent ranges. Reject such new extents
> > > > until such time as a good use case emerges.
> > > >
> > > > Expose the necessary details of region extents by creating the following
> > > > sysfs entries.
> > > >
> > > > /sys/bus/cxl/devices/dax_regionX/extentY
> > > > /sys/bus/cxl/devices/dax_regionX/extentY/offset
> > > > /sys/bus/cxl/devices/dax_regionX/extentY/length
> > > > /sys/bus/cxl/devices/dax_regionX/extentY/label
> > >
> > > Docs?
> >
> > That is a good idea.
> >
> > > The label in particular worries me a little as I'm not sure what
> > > is in it.
> >
> > I envisioned a pass through of the tag.
> >
> > >
> > > If it's the tag one possible format is a uuid (not a coincidence
> > > that it is the same length) and interpreting that as characters isn't
> > > going to get us far. I wonder if we have to treat it as a binary attr
> > > given we have no idea what it is.
> >
> > In thinking about this more (and running some experiments): none of these are
> > strictly necessary in this initial implementation. No code currently uses
> > them directly.
> >
> > I questioned these in the past and I've done so again over the weekend.
> >
> > I was about to rip them out entirely when I remembered Gregory Price's
> > comments on Discord. There he indicating a desire to very carefully place
> > dax devices. Without at least the offset and length above (and to a
> > lesser extent the label) this can't be done.
>
> Careful placement of dax-devices physically requires an entirely new
> allocation ABI. There is the mapping_store() interface that was added
> for a specific kexec / VMM fast restore use case, but that never
> envisioned the sparse region case. So I do think it is worthwhile to
> punt on that question to a later add-on feature.

Agreed.

>
> [..]
> > I don't think this is exactly what the user is going to expect. This can
> > be resolved by by looking at the dax device mappings though.[0] So I'm going
> > to leave this for now. But I expect some additional porcelain is going to
> > be required to fully meet Gregory's requirements.
>
> Not sure what the exact requirement is, but if it's the typical, "I want
> to allocate by tag",

I'm extrapolating that it will be. I want to allocate on the first extent.
Tags were removed from the series a while ago.

> then I think there is another potential coarse
> grained solution that probably covers most cases. Allow multiple
> dax_regions per cxl_dcd_regions, where each dax_region manages an
> exclusive set of tags.

I'll have to think on that because don't dax regions map to specific dpas on
creation?

>
> The host negotiates a dax_region tag layout with the orchestrator and
> can then trust that all of the extents that show up in a given dax_region
> belong to a given tag or set of tags.
>
> This is not something that needs to be considered in the initial
> enabling, but is potentially a way to avoid bolting-on a new fine grained
> allocation api after the fact.
>

The point is I'm not trying to bolt anything on. Just trying to explain what
can and can't be done. The purpose of these entries was to give the user the
ability to see what extents existed and by correlating the dax mappings could
see where their dax mappings landed. Careful allocation of dax devices could
result in the use of some extents and not others. But this was __not__ at all
intended to be done initially. Just use the space as it come available without
any tag use at all.

>
> > [0]
> > /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
> > /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
> >
> > Back to the label field: It is currently just the 'tag' of the individual
> > extent (because no interleaving). My vision for the interleave case would
> > be for the kernel to assemble device extents into a region extent only if
> > the tags match and export that.
> >
> > Thinking on it more though we should leave label out for now. This is the
> > second time it has been questioned.
>
> I don't understand the issue. That is a critical piece of information
> and it is at the cxl device level
>
> /sys/bus/cxl/devices/dax_regionX/extentY/label
>
> ...now I would just call that "tag" and UUID format it (to Jonathan's
> point), but I see no rationale to hide what is most likely the most
> useful information about an extent.

The rationale is that the user was not going to use it. So no use case == no
reason to have it... yet. I had code a while back to allocate dax devices on
specific tags. But that was deemed to different from the current dax
allocation mechanism so it was scrapped... for now.

I can add it back... But I'm just getting a bit testy about who wants what and
how this is all going to get used.

Ira

2024-05-06 16:59:47

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts

ira.weiny@ wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic Capacity Devices (DCD) support extent change notifications
> through the event log mechanism. The interrupt mailbox commands were
> extended in CXL 3.1 to support these notifications.
>
> Firmware can't configure DCD events to be FW controlled but can retain
> control of memory events. Split irq configuration of memory events and
> DCD events to allow for FW control of memory events while DCD is host
> controlled.
>
> Configure DCD event log interrupts on devices supporting dynamic
> capacity. Disable DCD if interrupts are not supported.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [iweiny: rebase to upstream irq code]
> [iweiny: disable DCD if irqs not supported]
> ---
> drivers/cxl/core/mbox.c | 9 ++++++-
> drivers/cxl/cxl.h | 4 ++-
> drivers/cxl/cxlmem.h | 4 +++
> drivers/cxl/pci.c | 71 ++++++++++++++++++++++++++++++++++++++++---------
> 4 files changed, 74 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 14e8a7528a8b..58b31fa47b93 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1323,10 +1323,17 @@ static int cxl_get_dc_config(struct cxl_memdev_state *mds, u8 start_region,
> return rc;
> }
>
> -static bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> +bool cxl_dcd_supported(struct cxl_memdev_state *mds)
> {
> return test_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> }
> +EXPORT_SYMBOL_NS_GPL(cxl_dcd_supported, CXL);
> +
> +void cxl_disable_dcd(struct cxl_memdev_state *mds)
> +{
> + clear_bit(CXL_DCD_ENABLED_GET_CONFIG, mds->dcd_cmds);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_disable_dcd, CXL);

Just use the open-coded bit ops, or local / static helpers because these
helpers do not consume any other infra from core/mbox.c.

> /**
> * cxl_dev_dynamic_capacity_identify() - Reads the dynamic capacity
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 15d418b3bc9b..d585f5fdd3ae 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -164,11 +164,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
> #define CXLDEV_EVENT_STATUS_WARN BIT(1)
> #define CXLDEV_EVENT_STATUS_FAIL BIT(2)
> #define CXLDEV_EVENT_STATUS_FATAL BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD BIT(4)
>
> #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO | \
> CXLDEV_EVENT_STATUS_WARN | \
> CXLDEV_EVENT_STATUS_FAIL | \
> - CXLDEV_EVENT_STATUS_FATAL)
> + CXLDEV_EVENT_STATUS_FATAL| \
> + CXLDEV_EVENT_STATUS_DCD)
>
> /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
> #define CXLDEV_EVENT_INT_MODE_MASK GENMASK(1, 0)
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 4624cf612c1e..01bee6eedff3 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -225,7 +225,9 @@ struct cxl_event_interrupt_policy {
> u8 warn_settings;
> u8 failure_settings;
> u8 fatal_settings;
> + u8 dcd_settings;
> } __packed;
> +#define CXL_EVENT_INT_POLICY_BASE_SIZE 4 /* info, warn, failure, fatal */
>
> /**
> * struct cxl_event_state - Event log driver state
> @@ -890,6 +892,8 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> const uuid_t *uuid, union cxl_event *evt);
> +bool cxl_dcd_supported(struct cxl_memdev_state *mds);
> +void cxl_disable_dcd(struct cxl_memdev_state *mds);
> int cxl_set_timestamp(struct cxl_memdev_state *mds);
> int cxl_poison_state_init(struct cxl_memdev_state *mds);
> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 12cd5d399230..ef482eae09e9 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -669,22 +669,33 @@ static int cxl_event_get_int_policy(struct cxl_memdev_state *mds,
> }
>
> static int cxl_event_config_msgnums(struct cxl_memdev_state *mds,
> - struct cxl_event_interrupt_policy *policy)
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> {
> struct cxl_mbox_cmd mbox_cmd;
> + size_t size_in;
> int rc;
>
> - *policy = (struct cxl_event_interrupt_policy) {
> - .info_settings = CXL_INT_MSI_MSIX,
> - .warn_settings = CXL_INT_MSI_MSIX,
> - .failure_settings = CXL_INT_MSI_MSIX,
> - .fatal_settings = CXL_INT_MSI_MSIX,
> - };
> + if (native_cxl) {
> + *policy = (struct cxl_event_interrupt_policy) {
> + .info_settings = CXL_INT_MSI_MSIX,
> + .warn_settings = CXL_INT_MSI_MSIX,
> + .failure_settings = CXL_INT_MSI_MSIX,
> + .fatal_settings = CXL_INT_MSI_MSIX,
> + .dcd_settings = 0,

No need to initialize dcd_settings.

> + };
> + }
> + size_in = CXL_EVENT_INT_POLICY_BASE_SIZE;

Let's skip adding this new #define and make it explicit, i.e. wihtout
needing to crack open the spec, that the dcd settings are incremental to
the original payload size:

size_in = offsetof(typeof(*policy), dcd_settings);

> +
> + if (cxl_dcd_supported(mds)) {
> + policy->dcd_settings = CXL_INT_MSI_MSIX;
> + size_in += sizeof(policy->dcd_settings);

..and then this can just be:

size_in = sizeof(*policy);

> + }
>
> mbox_cmd = (struct cxl_mbox_cmd) {
> .opcode = CXL_MBOX_OP_SET_EVT_INT_POLICY,
> .payload_in = policy,
> - .size_in = sizeof(*policy),
> + .size_in = size_in,
> };
>
> rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> @@ -731,6 +742,31 @@ static int cxl_event_irqsetup(struct cxl_memdev_state *mds,
> return 0;
> }
>
> +static int cxl_irqsetup(struct cxl_memdev_state *mds,
> + struct cxl_event_interrupt_policy *policy,
> + bool native_cxl)
> +{
> + struct cxl_dev_state *cxlds = &mds->cxlds;
> + int rc;
> +
> + if (native_cxl) {
> + rc = cxl_event_irqsetup(mds, policy);
> + if (rc)
> + return rc;
> + }
> +
> + if (cxl_dcd_supported(mds)) {
> + rc = cxl_event_req_irq(cxlds, policy->dcd_settings);
> + if (rc) {
> + dev_err(cxlds->dev, "Failed to get interrupt for DCD event log\n");
> + cxl_disable_dcd(mds);
> + return rc;
> + }
> + }

I think this could be simplified if cxl_event_req_irq() simply skipped
the non CXL_INT_MSI_MSIX modes. I.e. cxl_event_req_irq() is being too
strict after the policy settings have already run the gauntlet.

2024-05-06 18:34:51

by Dan Williams

[permalink] [raw]
Subject: RE: [PATCH 14/26] cxl/region: Read existing extents on region creation

ira.weiny@ wrote:
> From: Navneet Singh <[email protected]>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case creation of a new
> region on top of the DC partition (region) is expected to expose those
> extents for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized read the device extent list. For ease of review, this patch
> stops after reading the extent list and leaves realization of the region
> extents to a future patch.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1:
> [iweiny: remove extent list xarray]
> [iweiny: Update spec references to 3.1]
> [iweiny: use struct range in extents]
> [iweiny: remove all reference tracking and let regions track extents
> through the extent devices.]
> [djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
> ---
> drivers/cxl/core/core.h | 9 +++
> drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 29 +++++++
> drivers/cxl/cxlmem.h | 49 ++++++++++++
> 4 files changed, 279 insertions(+)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 91abeffbe985..119b12362977 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,6 +4,8 @@
> #ifndef __CXL_CORE_H__
> #define __CXL_CORE_H__
>
> +#include <cxlmem.h>
> +
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> extern const struct device_type cxl_pmu_type;
> @@ -28,6 +30,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
> int cxl_region_init(void);
> void cxl_region_exit(void);
> int cxl_get_poison_by_endpoint(struct cxl_port *port);
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent);

There are already functions called "cxled_", so lets not invent the
"cxl_ed_" prefix.

[..]
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 58b31fa47b93..9e33a0976828 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct device *dev = mds->cxlds.dev;
> + uint64_t start, len;
> +
> + start = le64_to_cpu(dc_extent->start_dpa);
> + len = le64_to_cpu(dc_extent->length);
> +
> + /* Extents must not cross region boundary's */
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + if (dcr->base <= start &&
> + (start + len) <= (dcr->base + dcr->decode_len)) {
> + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> + start, start + len - 1, i, start - dcr->base);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> + start, start + len - 1);

If the goal is give the admin an answer to the question "hey what
happened to the capacity I was expecting?", then this should include the
tag. Also, this is a warning, not an error, right? I.e. the driver
continues with the validated extents.

> + return -EINVAL;

This value is not returned up the stack, however, I expect EINVAL on
user input errors. For misaligned device-internal addressing, ENXIO is
more appropriate.

> +}
> +
> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *extent)

How about cxled_contains_extent()?

There's no other "extents" besides "dc_extents" in the driver, and once
a symbol name goes over 2 underscores it starts to be too many tokens.

> +{
> + uint64_t start = le64_to_cpu(extent->start_dpa);
> + uint64_t length = le64_to_cpu(extent->length);
> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
> + struct range ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> +
> + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
> + cxled->dpa_res, start, length);

ED is not a standalone abbreviation anywhere else, it's either "cxled" or
"endpoint decoder". I am open to renames, but not mixed names.

For this one the decoder name is already in the printout, so no real
need to redundantly mention "ED".

Lastly, I think continued use of 'struct range' is begging for a new
enlightened format specifier. I am thinking "%par" since these things
are usually some kind of physical address, and I do not see an easy way
to extend the existing "%pr/%pR" to accommodate ranges.

> +
> + return range_contains(&ed_range, &ext_range);
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> @@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}

Looks good, makes me wonder if a cxled_to_devstate() would be a net positive
reduction in code. I think most of the current cxled_to_memdev(), just
do cxlmd->cxlds with the result.

> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> + unsigned int *extent_gen_num)

I know the spec has this behavior where asking for zero extents returns
the total pending, but that does not really justify having this extra
step before retrieving extents.

> +{
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + struct cxl_mbox_get_dc_extent_out dc_extents;

The more I look at these patches the more I think a s/dc_extent/extent/
change is warranted to cut down on the visual token parsing reading this
code.

> + struct cxl_mbox_cmd mbox_cmd;
> + unsigned int count;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(0),
> + .start_extent_index = cpu_to_le32(0),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = sizeof(dc_extents),
> + .payload_out = &dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + count = le32_to_cpu(dc_extents.total_extent_cnt);
> + *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);

Setting aside that this function likely serves no incremental purpose,
why is the number of extents stored in a variable called "gen_num"?

> +
> + return count;
> +}
> +
> +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
> + unsigned int start_gen_num,
> + unsigned int exp_cnt)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + unsigned int start_index, total_read;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> +
> + struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
> + kvmalloc(mds->payload_size, GFP_KERNEL);
> + if (!dc_extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + start_index = 0;
> + do {
> + unsigned int nr_ext, total_extent_cnt, gen_num;
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(exp_cnt - start_index),

Shouldn't this be something like:

.extent_cnt = cpu_to_le32(start_index ? remaining : 1),

..where @remaining is initialized at the end of the first iteration?

> + .start_extent_index = cpu_to_le32(start_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = mds->payload_size,
> + .payload_out = dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);

It occurs to me that usage of "nr_" outnumbers "_cnt" in the driver, lets
stick to the predominate style and just use "nr_" for symbol names that
represent counts and just call this nr_returned, or similar.

> + total_read += nr_ext;
> + total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> + gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +
> + dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> + total_extent_cnt, gen_num);
> +
> + if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> + dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
> + gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> + return -EIO;

Why fail? When the generation number has changed I would only hope that
means that the number of extents in the list has gone up, not that
previously retrieved extents have been invalidated.

So a generation number change event likely just means to retry the
retrieval starting from the end of the last generation.

> + }
> +
> + for (int i = 0; i < nr_ext ; i++) {
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + start_index + i, exp_cnt);
> + rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
> + if (rc)
> + continue;
> + if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
> + continue;
> + rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
> + if (rc)
> + return rc;

I would rather this patch just claim to only validate all present
extents rather than pretend to add it. I.e. defer
cxl_ed_add_one_extent() to be defined and called later. When it comes
back a name with less tokens like cxled_add_extent() would be nice.
"one" is already assumed by non-plural "extent".

> + }
> +
> + start_index += nr_ext;
> + } while (exp_cnt > total_read);
> +
> + return 0;
> +}
> +
> +/**
> + * cxl_read_dc_extents() - Read any existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add any existing extents found which belong to this decoder.
> + *
> + * Return: 0 if command was executed successfully, -ERRNO on error.
> + */
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + unsigned int extent_gen_num;
> + int rc;
> +
> + if (!cxl_dcd_supported(mds)) {

Why is "dcd_supported" being checked again so deep in the stack? How
does an upper layer get this far into the driver without something
already noticing that dcd support is not present?

[..]
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 0d7b09a49dcf..3e563ab29afe 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,6 +1450,13 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> +
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder *cxled,
> @@ -2773,6 +2780,22 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
> return rc;
> }
>
> +static int cxl_region_read_extents(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int i;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + int rc;
> +
> + rc = cxl_read_dc_extents(p->targets[i]);

Per comment above, the targets should have already been checked for dcd
support before being added to the region.


> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> @@ -2807,6 +2830,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> dev_name(dev));
>
> + if (cxlr->mode == CXL_REGION_DC) {
> + rc = cxl_region_read_extents(cxlr);

devm_cxl_add_dax_region() happens way after the region parameters have
been validated. I would have expected that initial extent list
validation happens earlier during region attach. This reorganization
also more naturally fits the interleave case where there will need be
cross device-validation before cxl_region_probe() runs.

[..]

2024-05-07 01:31:01

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 16/26] cxl/extent: Realize extent devices

ira.weiny@ wrote:
> From: Navneet Singh <[email protected]>
>
> Once all extents of an interleave set are present a region must
> surface an extent to the region.
>
> Without interleaving; endpoint decoder and region extents have a 1:1
> relationship. Future support for IW > 1 will maintain a N:1
> relationship between the device extents and region extents.
>
> Create a region extent device for every device extent found. Release of
> the extent device triggers a response to the underlying hardware extent.
>
> There is no strong use case to support the addition of extents which
> overlap previously accepted extent ranges. Reject such new extents
> until such time as a good use case emerges.
>
> Expose the necessary details of region extents by creating the following
> sysfs entries.
>
> /sys/bus/cxl/devices/dax_regionX/extentY
> /sys/bus/cxl/devices/dax_regionX/extentY/offset
> /sys/bus/cxl/devices/dax_regionX/extentY/length
> /sys/bus/cxl/devices/dax_regionX/extentY/label
>
> The use of the extent devices by the DAX layer is deferred to later
> patches.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [iweiny: new patch]
> [iweiny: Rename 'dr_extent' to 'region_extent']
> ---
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/extent.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/mbox.c | 43 +++++++++++++++
> drivers/cxl/core/region.c | 76 +++++++++++++++++++++++++-
> drivers/cxl/cxl.h | 37 +++++++++++++
> tools/testing/cxl/Kbuild | 1 +
> 6 files changed, 290 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..35c5c76bfcf1 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -14,5 +14,6 @@ cxl_core-y += pci.o
> cxl_core-y += hdm.o
> cxl_core-y += pmu.o
> cxl_core-y += cdat.o
> +cxl_core-y += extent.o
> cxl_core-$(CONFIG_TRACING) += trace.o
> cxl_core-$(CONFIG_CXL_REGION) += region.o
> diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> new file mode 100644
> index 000000000000..487c220f1c3c
> --- /dev/null
> +++ b/drivers/cxl/core/extent.c
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <cxl.h>
> +
> +static DEFINE_IDA(cxl_extent_ida);

I would have expected this to be region scoped, that would also allow
them to all be listed as extents on the "cxl" bus because they are CXL
objects.

So at a minimum they would be named:

dev_set_name(dev, "extent%d.%d", cxlr->id, extent_id)

..but given the idea to have multiple dax_region as the management
mechanism for coarse routing of tags by regions this might want to be a
triplet of

cxlr->id, dcd_dax_region_id, extent_id

..at a minimum lets give some freedom to figure out the tag routing
mechanism especially with the threat of multiple dax_region's per
cxl_region.

> +
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> + char *buf)

For something called offset I would expect the output to be relative to
the base HPA of the parent cxl_region. Implementation below looks like
an absolute, not an offset.

Now it might be an offset but "hpa_range" refers to an absolute range
everywhere else it appears in the driver.

> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + return sysfs_emit(buf, "%pa\n", &reg_ext->hpa_range.start);

Note "%pa" is for "phys_addr_t" or "resource_size_t" because those
alternate between "unsigned long" and "unsigned long long". Likely this
is subtly safe because there is currently no place where a phys_addr_t
is larger than a u64, but I would feel better if this assigned to
phys_addr_t or just did %#llx since I think u64 is always an "unsigned
long long"

> +}
> +static DEVICE_ATTR_RO(offset);
> +
> +static ssize_t length_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> + u64 length = range_len(&reg_ext->hpa_range);
> +
> + return sysfs_emit(buf, "%pa\n", &length);

Same %pa vs phys_addr_t vs u64 comment.

> +}
> +static DEVICE_ATTR_RO(length);
> +
> +static ssize_t label_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);
> +
> + return sysfs_emit(buf, "%s\n", reg_ext->label);

I like Jonathan's suggestion of just uuid formatting this and calling it
"tag" since these things are CXL device objects they can have CXL
attributes.

If the tag is empty then just hide this attribute.

> +}
> +static DEVICE_ATTR_RO(label);
> +
> +static struct attribute *region_extent_attrs[] = {
> + &dev_attr_offset.attr,
> + &dev_attr_length.attr,
> + &dev_attr_label.attr,
> + NULL,
> +};
> +
> +static const struct attribute_group region_extent_attribute_group = {
> + .attrs = region_extent_attrs,
> +};
> +
> +static const struct attribute_group *region_extent_attribute_groups[] = {
> + &region_extent_attribute_group,
> + NULL,
> +};

Just use __ATTRIBUTE_GROUPS() helper for this. Note I recommended that
over ATTRIBUTE_GROUPS() since the latter does not allow an is_visible()
callback to be specified.

> +static void region_extent_release(struct device *dev)
> +{
> + struct region_extent *reg_ext = to_region_extent(dev);

Does the reg_ext abbreviation really buy that much versus just typing
out region_extent?

> +
> + cxl_release_ed_extent(&reg_ext->ed_ext);

No, Linux object release time is too late to touch the hardware state.
Unregister time is more appropriate. However, this gets back to whole
question about "why should the kernel automatically send a release for
extents that it found present at the beginning of time?"

For example the driver does not reset endpoint decoders on driver
shutdown, why should it free extents just because the driver got
unbound?

Now the question becomes when should it free extents? I am thinking the
policy to start should be identical to endpoint decoders. I.e. explicit
decommitting the region causes all decode state including extents to be
freed and FM release dynamic capacity events to idle capacity are the
only methods that extents get released.

> + ida_free(&cxl_extent_ida, reg_ext->dev.id);
> + kfree(reg_ext);
> +}
> +
> +static const struct device_type region_extent_type = {
> + .name = "extent",
> + .release = region_extent_release,
> + .groups = region_extent_attribute_groups,
> +};
> +
> +bool is_region_extent(struct device *dev)
> +{
> + return dev->type == &region_extent_type;
> +}
> +EXPORT_SYMBOL_NS_GPL(is_region_extent, CXL);

What outside the core needs this export?

> +static void region_extent_unregister(void *ext)
> +{
> + struct region_extent *reg_ext = ext;
> +
> + dev_dbg(&reg_ext->dev, "DAX region rm extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);

Another use for a %par range print helper.

> + device_unregister(&reg_ext->dev);
> +}
> +
> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled)
> +{
> + struct region_extent *reg_ext;
> + struct device *dev;
> + int rc, id;
> +
> + id = ida_alloc(&cxl_extent_ida, GFP_KERNEL);
> + if (id < 0)
> + return -ENOMEM;
> +
> + reg_ext = kzalloc(sizeof(*reg_ext), GFP_KERNEL);
> + if (!reg_ext)
> + return -ENOMEM;

@id leak?

> +
> + reg_ext->hpa_range = *hpa_range;
> + reg_ext->ed_ext.dpa_range = *dpa_range;
> + reg_ext->ed_ext.cxled = cxled;
> + snprintf(reg_ext->label, DAX_EXTENT_LABEL_LEN, "%s", label);

another instance of the "tag as uuid" feedback.

> +
> + dev = &reg_ext->dev;
> + device_initialize(dev);
> + dev->id = id;
> + device_set_pm_not_required(dev);
> + dev->parent = &cxlr_dax->dev;
> + dev->type = &region_extent_type;

Lets also place these objects on the cxl_bus_type alongside endpoint
decoders etc...

> + rc = dev_set_name(dev, "extent%d", dev->id);

..but that does require a naming convention that will not collide in
/sys/bus/cxl/devices

> + if (rc)
> + goto err;
> +
> + rc = device_add(dev);
> + if (rc)
> + goto err;
> +
> + dev_dbg(dev, "DAX region extent HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + return devm_add_action_or_reset(&cxlr_dax->dev, region_extent_unregister,

It is awkward to use &cxlr_dax->dev as the devm host here. How do you
know that that @cxlr_dax is or is not attached to its driver at this
point?

Likely if you want this to be the devm host this device enumeration
should be deferred to cxl_dax_region_probe() in drivers/dax/cxl.c.

> + reg_ext);
> +
> +err:
> + dev_err(&cxlr_dax->dev, "Failed to initialize DAX extent dev HPA %#llx - %#llx\n",
> + reg_ext->hpa_range.start, reg_ext->hpa_range.end);
> +
> + put_device(dev);
> + return rc;
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9e33a0976828..6b00e717e42b 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> + struct range *extent, int opcode)
> +{
> + struct cxl_mbox_cmd mbox_cmd;
> + size_t size;
> +
> + struct cxl_mbox_dc_response *dc_res __free(kfree);
> + size = struct_size(dc_res, extent_list, 1);
> + dc_res = kzalloc(size, GFP_KERNEL);
> + if (!dc_res)
> + return -ENOMEM;
> +
> + dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
> + memset(dc_res->extent_list[0].reserved, 0, 8);
> + dc_res->extent_list[0].length = cpu_to_le64(range_len(extent));
> + dc_res->extent_list_size = cpu_to_le32(1);
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = opcode,
> + .size_in = size,
> + .payload_in = dc_res,
> + };
> +
> + return cxl_internal_send_cmd(mds, &mbox_cmd);
> +}
> +
> static struct cxl_memdev_state *
> cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> {
> @@ -1029,6 +1055,23 @@ cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> return container_of(cxlds, struct cxl_memdev_state, cxlds);
> }
>
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent)

I am failing to grok this naming see comments on 'struct cxl_ed_extent'

> +{
> + struct cxl_endpoint_decoder *cxled = extent->cxled;
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + int rc;
> +
> + dev_dbg(dev, "Releasing DC extent DPA %#llx - %#llx\n",
> + extent->dpa_range.start, extent->dpa_range.end);
> +
> + rc = cxl_send_dc_cap_response(mds, &extent->dpa_range, CXL_MBOX_OP_RELEASE_DC);
> + if (rc)
> + dev_dbg(dev, "Failed to respond releasing extent DPA %#llx - %#llx; %d\n",
> + extent->dpa_range.start, extent->dpa_range.end, rc);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_release_ed_extent, CXL);
> +
> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 3e563ab29afe..7635ff109578 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,11 +1450,81 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +static int extent_check_overlap(struct device *dev, void *arg)
> +{
> + struct range *new_range = arg;
> + struct region_extent *ext;
> +
> + if (!is_region_extent(dev))
> + return 0;
> +
> + ext = to_region_extent(dev);
> + return range_overlaps(&ext->hpa_range, new_range);
> +}
> +
> +static int extent_overlaps(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range)
> +{
> + struct device *dev __free(put_device) =
> + device_find_child(&cxlr_dax->dev, hpa_range, extent_check_overlap);
> +
> + if (dev)
> + return -EINVAL;
> + return 0;
> +}
> +
> /* Callers are expected to ensure cxled has been attached to a region */
> int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> struct cxl_dc_extent *dc_extent)
> {
> - return 0;

..and here is the danger of predefining stub functions in one patch and
filling them in later. The validation of extents needed to be moved
earlier in the flow, closer to cxl_region_attach() time, and review of
dax_region_create_ext() identified it needed to be later in the flow.

Empty stubs like this run the risk of not having enough context to
justify when and where they are called.

> + struct cxl_region *cxlr = cxled->cxld.region;
> + struct range ext_dpa_range, ext_hpa_range;
> + struct device *dev = &cxlr->dev;
> + resource_size_t dpa_offset, hpa;
> +
> + /*
> + * Interleave ways == 1 means this coresponds to a 1:1 mapping between
> + * device extents and DAX region extents. Future implementations
> + * should hold DC region extents here until the full dax region extent
> + * can be realized.
> + */
> + if (cxlr->params.interleave_ways != 1) {
> + dev_err(dev, "Interleaving DC not supported\n");
> + return -EINVAL;
> + }
> +
> + ext_dpa_range = (struct range) {
> + .start = le64_to_cpu(dc_extent->start_dpa),
> + .end = le64_to_cpu(dc_extent->start_dpa) +
> + le64_to_cpu(dc_extent->length) - 1,
> + };
> +
> + dev_dbg(dev, "Adding DC extent DPA %#llx - %#llx\n",
> + ext_dpa_range.start, ext_dpa_range.end);
> +
> + /*
> + * Without interleave...
> + * HPA offset == DPA offset
> + * ... but do the math anyway

The full math would walk the extents of all targets in the
region_extent.

> + */
> + dpa_offset = ext_dpa_range.start - cxled->dpa_res->start;
> + hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> + ext_hpa_range = (struct range) {
> + .start = hpa - cxlr->cxlr_dax->hpa_range.start,
> + .end = ext_hpa_range.start + range_len(&ext_dpa_range) - 1,
> + };
> +
> + if (extent_overlaps(cxlr->cxlr_dax, &ext_hpa_range))
> + return -EINVAL;
> +
> + dev_dbg(dev, "Realizing region extent at HPA %#llx - %#llx\n",
> + ext_hpa_range.start, ext_hpa_range.end);
> +
> + return dax_region_create_ext(cxlr->cxlr_dax, &ext_hpa_range,
> + (char *)dc_extent->tag,
> + &ext_dpa_range,
> + cxled);
> }
>
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> @@ -2684,6 +2754,7 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>
> dev = &cxlr_dax->dev;
> cxlr_dax->cxlr = cxlr;
> + cxlr->cxlr_dax = cxlr_dax;
> device_initialize(dev);
> lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
> device_set_pm_not_required(dev);
> @@ -2799,7 +2870,10 @@ static int cxl_region_read_extents(struct cxl_region *cxlr)
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> + struct cxl_region *cxlr = cxlr_dax->cxlr;
>
> + cxlr->cxlr_dax = NULL;
> + cxlr_dax->cxlr = NULL;
> device_unregister(&cxlr_dax->dev);
> }
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d585f5fdd3ae..5379ad7f5852 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -564,6 +564,7 @@ struct cxl_region_params {
> * @type: Endpoint decoder target type
> * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
> * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
> * @flags: Region state flags
> * @params: active + config params for the region
> */
> @@ -574,6 +575,7 @@ struct cxl_region {
> enum cxl_decoder_type type;
> struct cxl_nvdimm_bridge *cxl_nvb;
> struct cxl_pmem_region *cxlr_pmem;
> + struct cxl_dax_region *cxlr_dax;
> unsigned long flags;
> struct cxl_region_params params;
> };
> @@ -617,6 +619,41 @@ struct cxl_dax_region {
> struct range hpa_range;
> };
>
> +/**
> + * struct cxl_ed_extent - Extent within an endpoint decoder
> + * @dpa_range: DPA range this extent covers within the decoder
> + * @cxled: reference to the endpoint decoder
> + */
> +struct cxl_ed_extent {

Why is _ed_ in the name. It feels like 'struct cxl_extent' is the lowest
level "extent" type to worry about, and a 'struct cxl_region_extent is
an object that represents an interleave-set of extents in HPA space.

> + struct range dpa_range;
> + struct cxl_endpoint_decoder *cxled;
> +};
> +void cxl_release_ed_extent(struct cxl_ed_extent *extent);
> +
> +/**
> + * struct region_extent - CXL DAX region extent
> + * @dev: device representing this extent
> + * @hpa_range: HPA range of this extent
> + * @label: label of the extent
> + * @ed_ext: Endpoint decoder extent which backs this extent
> + */
> +#define DAX_EXTENT_LABEL_LEN 64
> +struct region_extent {
> + struct device dev;
> + struct range hpa_range;
> + char label[DAX_EXTENT_LABEL_LEN];
> + struct cxl_ed_extent ed_ext;

This should always be an array, even if interleaves > 1 are not
supported in this initial enabling it should be an x1 interleave array
from day 1.

> +};
> +
> +int dax_region_create_ext(struct cxl_dax_region *cxlr_dax,
> + struct range *hpa_range,
> + const char *label,
> + struct range *dpa_range,
> + struct cxl_endpoint_decoder *cxled);
> +
> +bool is_region_extent(struct device *dev);
> +#define to_region_extent(dev) container_of(dev, struct region_extent, dev)

All the other to_<object>() helpers do runtime object type-safety which
is why all the is_<object>() helpers exist.

Maybe a future patch needs these to be used outside the core, but seems
a premature export at this point.

2024-05-07 02:31:52

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 17/26] dax/region: Create extent resources on DAX region driver load

ira.weiny@ wrote:
> From: Navneet Singh <[email protected]>
>
> DAX regions mapping dynamic capacity partitions introduce a requirement
> for the memory backing the region to come and go as required. This
> results in a DAX region with sparse areas of memory backing. To track
> the sparseness of the region, DAX extent objects need to track
> sub-resource information as a new layer between the DAX region resource
> and DAX device range resources.
>
> Recall that DCD extents may be accepted when a region is first created.
> Extend this support on region driver load. Scan existing extents and
> create DAX extent resources as a first step to DAX extent realization.
>
> The lifetime of a DAX extent is tricky to manage because the extent life
> may end in one of two ways. First, the device may request the extent be
> released. Second, the region may release the extent when it is
> destroyed without hardware involvement. Support extent release without
> hardware involvement first. Subsequent patches will provide for
> hardware to request extent removal.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [iweiny: remove xarrays]
> [iweiny: remove as much of extra reference stuff as possible]
> [iweiny: Move extent resource handling to core DAX code]
> ---
> drivers/dax/bus.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/dax/cxl.c | 43 ++++++++++++++++++++++++++++++++++--
> drivers/dax/dax-private.h | 12 +++++++++++
> 3 files changed, 108 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 903566aff5eb..4d5ed7ab6537 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -186,6 +186,61 @@ static bool is_sparse(struct dax_region *dax_region)
> return (dax_region->res.flags & IORESOURCE_DAX_SPARSE_CAP) != 0;
> }
>
> +static int dax_region_add_resource(struct dax_region *dax_region,
> + struct dax_extent *dax_ext,
> + resource_size_t start,
> + resource_size_t length)
> +{
> + struct resource *ext_res;
> +
> + dev_dbg(dax_region->dev, "DAX region resource %pr\n", &dax_region->res);
> + ext_res = __request_region(&dax_region->res, start, length, "extent", 0);
> + if (!ext_res) {
> + dev_err(dax_region->dev, "Failed to add region s:%pa l:%pa\n",
> + &start, &length);
> + return -ENOSPC;
> + }
> +
> + dax_ext->region = dax_region;
> + dax_ext->res = ext_res;
> + dev_dbg(dax_region->dev, "Extent add resource %pr\n", ext_res);

dax_ext is never used, it feels like this these helpers are in the wrong
patch. Like consumer side of dax_ext infrastructure lands *before* the
producer side.

Because the justification for this producer-side patch is after the case
for 'struct dax_extent' has been made.

> +int dax_region_add_extent(struct dax_region *dax_region, struct device *ext_dev,
> + resource_size_t start, resource_size_t length)
> +{
> + int rc;
> +
> + struct dax_extent *dax_ext __free(kfree) = kzalloc(sizeof(*dax_ext),
> + GFP_KERNEL);
> + if (!dax_ext)
> + return -ENOMEM;
> +
> + guard(rwsem_write)(&dax_region_rwsem);
> + rc = dax_region_add_resource(dax_region, dax_ext, start, length);
> + if (rc)
> + return rc;
> +
> + return devm_add_action_or_reset(ext_dev, dax_region_release_extent,
> + no_free_ptr(dax_ext));

This looks like an awkward rewrite of __devm_request_region(), but
likely that is because dax_ext is vestigial in this patch.

> +static void cxl_dax_region_add_extents(struct cxl_dax_region *cxlr_dax,
> + struct dax_region *dax_region)
> +{
> + dev_dbg(&cxlr_dax->dev, "Adding extents\n");
> + device_for_each_child(&cxlr_dax->dev, dax_region, cxl_dax_region_add_extent);

Per the comment on the last patch to move extent device creation to
cxl_dax_region_probe() that can get rid of looping over those devices
another time.

2024-05-07 05:04:55

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 18/26] cxl/mem: Handle DCD add & release capacity events.

ira.weiny@ wrote:
> From: Navneet Singh <[email protected]>
>
> A dynamic capacity devices (DCD) send events to signal the host about
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents, the addition or removal of which may occur at
> any time.
>
> Adding memory is straight forward. If no region exists the extent is
> rejected. If a region does exist, a region extent is formed and
> surfaced.
>
> Removing memory requires checking if the memory is currently in use.
> Memory use tracking is added in a subsequent patch so here the memory is
> never in use and the removal occurs immediately.
>
> Most often extents will be offered to and accepted by the host in well
> defined chunks. However, part of an extent may be requested for
> release. Simplify extent tracking by signaling removal of any extent
> which overlaps the requested release range.
>
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive or otherwise broken.
> Purposely ignore force removal events.
>
> Process DCD extents.
>
> Recall that all devices of an interleave set must offer a corresponding
> extent for the region extent to be realized. This patch limits
> interleave to 1. Thus the 1:1 mapping between device extent and DAX
> region extent allows immediate surfacing.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for v1
> [iweiny: remove all xarrays]
> [iweiny: entirely new architecture]
> ---
> drivers/cxl/core/extent.c | 4 ++
> drivers/cxl/core/mbox.c | 142 +++++++++++++++++++++++++++++++++++++++++++---
> drivers/cxl/core/region.c | 139 ++++++++++++++++++++++++++++++++++++++++-----
> drivers/cxl/cxl.h | 34 +++++++++++
> drivers/cxl/cxlmem.h | 21 +++----
> drivers/cxl/mem.c | 45 +++++++++++++++
> drivers/dax/cxl.c | 22 +++++++
> include/linux/cxl-event.h | 31 ++++++++++
> 8 files changed, 405 insertions(+), 33 deletions(-)
>
[..]
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 7635ff109578..a07d95136f0d 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
[..]
> @@ -1502,18 +1552,7 @@ int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> dev_dbg(dev, "Adding DC extent DPA %#llx - %#llx\n",
> ext_dpa_range.start, ext_dpa_range.end);
>
> - /*
> - * Without interleave...
> - * HPA offset == DPA offset
> - * ... but do the math anyway
> - */
> - dpa_offset = ext_dpa_range.start - cxled->dpa_res->start;
> - hpa = cxled->cxld.hpa_range.start + dpa_offset;
> -
> - ext_hpa_range = (struct range) {
> - .start = hpa - cxlr->cxlr_dax->hpa_range.start,
> - .end = ext_hpa_range.start + range_len(&ext_dpa_range) - 1,
> - };

Please don't refactor code that just got added in the same series. Upon
seeing that this wants a common helper in this patch, go back to the
original patch and put it in a helper from the beginning.

[..]
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5379ad7f5852..156d7c9a8de5 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
[..]
> @@ -891,10 +900,18 @@ bool is_cxl_region(struct device *dev);
>
> extern struct bus_type cxl_bus_type;

I skipped ahead here in the review since notification organization feels
wrong.

> +/* Driver Notifier Data */
> +struct cxl_drv_nd {

I never would have guessed that cxl_drv_nd meant cxl driver notifier
data, it might be able to be jettisoned.

> + enum dc_event event;
> + struct cxl_dc_extent *dc_extent;
> + struct region_extent *reg_ext;
> +};
> +
> struct cxl_driver {
> const char *name;
> int (*probe)(struct device *dev);
> void (*remove)(struct device *dev);
> + int (*notify)(struct device *dev, struct cxl_drv_nd *nd);

First, this feels an overly DCD specific mechanism to inflict on the core
generic 'struct cxl_driver'. Most 'struct cxl_driver' instances do not
need any 'notify' callback and 'struct cxl_drv_nd' makes this even less
relevant to the core 'struct cxl_driver' definition.

Second, it leads to 2 anonymous ->notify() callbacks wht too deep of a
stack. It feels as if the resulting code is being actively evasive.

Given that the event handling code already knows how to lookup a 'struct
cxl_region', as Alison demonstrated in her DPA->HPA series, it should be
straightforward to lookup a 'struct cxl_dax_region' without a notifying
the cxl_mem driver.

So my expectation is just enough DCD event parsing to determine when the
payload applies to given cxl_dax_region. Then define a:

struct cxl_dax_region_driver {
struct cxl_driver driver;
void (*notify)(struct cxl_dax_region *cxlr_dax, ...);
};

..to send the payload over for further processing. If a cxl_dax_region
device instance cannot be found, just drop the event record.

2024-05-08 14:59:12

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)

On Sun, 5 May 2024 21:24:14 -0700
Ira Weiny <[email protected]> wrote:

> Jonathan Cameron wrote:
> > On Wed, 1 May 2024 16:49:24 -0700
> > Ira Weiny <[email protected]> wrote:
> >
> > > Jonathan Cameron wrote:
> > > >
> > > > >
> > > > > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]
> > > > Hi Ira, Navneet.
> > > > >
> > > > > Remaining work:
> > > > >
> > > > > 1) Integrate the QoS work from Dave Jiang
> > > > > 2) Interleave support
> > > >
> > > >
> > > > More flag. This one I think is potentially important and don't
> > > > see any handling in here.
> > >
> > > Nope I admit I missed the spec requirement.
> > >
> > > >
> > > > Whilst an FM could in theory be careful to avoid sending a
> > > > sparse set of extents, if the device is managing the memory range
> > > > (which is possible all it supports) and the FM issues an Initiate Dynamic
> > > > Capacity Add with Free (again may be all device supports) then we
> > > > can't stop the device issuing a bunch of sparse extents.
> > > >
> > > > Now it won't be broken as such without this, but every time we
> > > > accept the first extent that will implicitly reject the rest.
> > > > That will look very ugly to an FM which has to poke potentially many
> > > > times to successfully allocate memory to a host.
> > >
> > > This helps me to see see why the more bit is useful.
> > >
> > > >
> > > > I also don't think it will be that hard to support, but maybe I'm
> > > > missing something?
> > >
> > > Just a bunch of code and refactoring busy work. ;-) It's not rocket
> > > science but does fundamentally change the arch again.
> > >
> > > >
> > > > My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> > > > over a list of extents passed in then slightly more complex response
> > > > generation.
> > >
> > > Not exactly 'just a loop'. No matter how I work this out there is the
> > > possibility that some extents get surfaced and then the kernel tries to
> > > remove them because it should not have.
> >
> > Lets consider why it might need to back out.
> > 1) Device sends an invalid set of extents - so maybe one in a later message
> > overlaps with an already allocated extent. Device bug, handling can
> > be extremely inelegant - up to crashing the kernel. Worst that happens
> > due to race is probably a poison storm / machine check fun? Not our
> > responsibility to deal with something that broken (in my view!) Best effort
> > only.
> >
> > 2) Host can't handle the extent for some reason and didn't know that until
> > later - can just reject the ones it can't handle.
>
> 3) Something in the host fails like ENOMEM on a later extent surface which
> requires the host to back out of all of them.
>
> 3 should be rare and I'm working toward it. But it is possible this will
> happen.
>
> If you have a 'prepare' notify it should avoid most of these because the
> extents will be mostly formed. But there are some error paths on the actual
> surface code path.

True. If these are really small allocations then elegant handling feels like
a nice to have rather than a requirement.

>
> >
> > >
> > > To be most safe the cxl core is going to have to make 2 round trips to the
> > > cxl region layer for each extent. The first determines if the extent is
> > > valid and creates the extent as much as possible. The second actually
> > > surfaces the extents. However, if the surface fails then you might not
> > > get the extents back. So now we are in an invalid state. :-/ WARN and
> > > continue I guess?!??!
> >
> > Yes. Orchestrator can decide how to handle - probably reboot server in as
> > gentle a fashion as possible.
> >
>
> Ok
>
> >
> > >
> > > I think the safest way to handle this is add a new kernel notify event
> > > called 'extent create' which stops short of surfacing the extent. [I'm
> > > not 100% sure how this is going to affect interleave.]
> > >
> > > I think the safest logic for add is something like:
> > >
> > > cxl_handle_dcd_add_event()
> > > add_extent(squirl_list, extent);
> > >
> > > if (more bit) /* wait for more */
> > > return;
> > >
> > > /* Create extents to hedge the bets against failure */
> > > for_each(squirl_list)
> > > if (notify 'extent create' != ok)
> > > send_response(fail);
> > > return;
> > >
> > > for_each(squirl_list)
> > > if (notify 'surface' != ok)
> > > /*
> > > * If the more bit was set, some extents
> > > * have been surfaced and now need to be
> > > * removed...
> > > *
> > > * Try to remove them and hope...
> > > */
> >
> > If we failed to surface them all another option is just tell the device
> > that. Responds with the extents that successfully surfaced and reject
> > all others (or all after the one that failed?) So for the lower layers
> > send the device a response that says "thanks but I only took these ones"
> > and for the upper layers pretend "I was only offered these ones"
> >
>
> But doesn't that basically break the more bit? I'm willing to do that as it is
> easier for the host.

Don't think so. We can always accept part of the offered extents in same
way we can accept part of a single offered extent if we like.
The more flag just means we only get to do that communication of what
we accepted once. So we have to reply with what we want and don't set
more flag in last message - thus indicating we don't want the rest.
(making sure we also tidy up the log for the ones we rejected)

>
> > > WARN_ON('surface extents failed');
> > > for_each(squirl_list)
> > > notify 'remove without response'
> > > send_response(fail);
> > > return;
> > >
> > > send_response(squirl_list, accept);
> > >
> > > The logic for remove is not changed AFAICS because the device must allow
> > > for memory to be released at any time so the host is free to release each
> > > of the extents individually despite the 'more' bit???
> >
> > Yes, but only after it accepted them - which needs to be done in one go.
> > So you can't just send releases before that (the device will return an
> > error and keep them in the pending list I think...)
>
> :-( OK so this more bit is really more... no pun intended. Because this
> breaks the entire model I have if I have to treat these as a huge atomic unit.
>
> Let me think on that a bit more. Obviously it is just tagging an iterating the
> extents to find those associated with a more bit on accept. But it will take
> some time to code up.

The ability to give up at any point (though you need to read and clear the extents
that are left) should get around a lot of the complexity but sure it's
not a trivial thing to support.

I'd flip a 'something went wrong flag' on the the first failure, carry on the
walk not surfacing anything else, but clearing the logs etc, then finally reply
with what succeeded before that 'went wrong' flag was set.

>
> >
> > >
> > > >
> > > > I don't want this to block getting initial DCD support in but it
> > > > will be a bit ugly if we quickly support the more flag and then end
> > > > up with just one kernel that an FM has to be careful with...
> > >
> > > I'm not sure which is worse. Given your use case above it seems like the
> > > more bit may be more important for 'dumb' devices which want to add
> > > extents in blocks before responding to the FM. Thus complicating the FM.
> > >
> > > It seems 'smarter' devices which could figure this out (not requiring the
> > > more bit) are the ones which will be developed later. So it seems the use
> > > case time line is the opposite of what we need right now.
> >
> > Once we hit shareable capacity (which the smarter devices will use) then
> > this become the dominant approach to non contiguous allocations because
> > you can't add extents with a given tag in multiple goes.
>
> Why not? Sharing is going to require some synchronization with the
> orchestrator and can't the user app just report it did not get all it's memory
> and wait for more? With the same tag?

Hmm. I was sure the spec said sharing did not allow addition of capacity after
first creation, but now can't find it. If you did do it though, fun
occurs when you then pass it on to the second device because you have
to do that via tag alone.

I believe this is about simplification on the device side because
offers of extents to other hosts are done by tag. If you allow extra ones
to turn up there are race conditions to potentially deal with.

7.6.7.6.5 Initiate Dynamic Capacity add.

"Enable shared Access" Enable access to extents previously added to another
host in a DC region that reports the "sharable" flag, as designated by the
specific tag value.

Note it is up to the device to offer the same capacity to all hosts for
which this is issued. There is no extent list or length provided.


>
> >
> > So I'd expect the more flag to be more common not less over time.
> > >
> > > For that reason I'm inclined to try and get this in.
> > >
> >
> > Great - but I'd not worry too much about bad effects if you get invalid
> > lists from the device. If the only option is shout and panic, then fine
> > though I'd imagine we can do slightly better than that, so maybe warn
> > extensively and don't let the region be used.
>
> It is not just about invalid lists. It is that setting up the extent devices
> may fail and waiting for the devices to be set up means that they are user
> visible. So that is the chicken and the egg...
>
> This is unlikely and perhaps the partials should just be surfaced and accept
> whatever works. Then let it all tear down later if it does not all go.
>
> But I was trying to honor the accept 'all or nothing' as that is what has been
> stated as the requirement of the more bit.

That's not quite true - for shared it is all or nothing (after first host anyway) but
for other capacity it is 'accept X and reject Y in one go'. You don't need to
take it all but you only get one go to say what you did accept.

>
> But it seems that it does not __have__ to be atomic. Or at least the partials
> can be cleaned up and all tried again.

With care you can accept up to a point, then give those back if you like - or
carry on and use them.

Jonathan

>
> Ira
>
> >
> > Jonathan
> >
> > > Ira
> > >
> >
>
>
>


2024-05-14 02:42:22

by Li Zhijian

[permalink] [raw]
Subject: Re: [PATCH 04/26] cxl/region: Add dynamic capacity decoder and region modes


The following change is preferred to show the correct mode name.


diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 35ee565e27c9..729006ca4997 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -684,7 +684,7 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)

if (size > avail) {
dev_dbg(dev, "%pa exceeds available %s capacity: %pa\n", &size,
- cxled->mode == CXL_DECODER_RAM ? "ram" : "pmem",
+ cxl_decoder_mode_name(cxled->mode),
&avail);
rc = -ENOSPC;
goto out;



On 25/03/2024 07:18, [email protected] wrote:
> From: Navneet Singh <[email protected]>
>
> Region mode must reflect a general dynamic capacity type which is
> associated with a specific Dynamic Capacity (DC) partitions in each
> device decoder within the region. DC partitions are also know as DC
> regions per CXL 3.1.
>
> Decoder mode reflects a specific DC partition.
>
> Define the new modes to use in subsequent patches and the helper
> functions required to make the association between these new modes.
>
> Signed-off-by: Navneet Singh <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
> ---
> Changes for v1
> [iweiny: split out from: Add dynamic capacity cxl region support.]
> ---
> drivers/cxl/core/region.c | 4 ++++
> drivers/cxl/cxl.h | 23 +++++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 1723d17f121e..ec3b8c6948e9 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1690,6 +1690,8 @@ static bool cxl_modes_compatible(enum cxl_region_mode rmode,
> return true;
> if (rmode == CXL_REGION_PMEM && dmode == CXL_DECODER_PMEM)
> return true;
> + if (rmode == CXL_REGION_DC && cxl_decoder_mode_is_dc(dmode))
> + return true;
>
> return false;
> }
> @@ -2824,6 +2826,8 @@ cxl_decoder_to_region_mode(enum cxl_decoder_mode mode)
> return CXL_REGION_RAM;
> case CXL_DECODER_PMEM:
> return CXL_REGION_PMEM;
> + case CXL_DECODER_DC0 ... CXL_DECODER_DC7:
> + return CXL_REGION_DC;
> case CXL_DECODER_MIXED:
> default:
> return CXL_REGION_MIXED;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 9a0cce1e6fca..3b8935089c0c 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -365,6 +365,14 @@ enum cxl_decoder_mode {
> CXL_DECODER_NONE,
> CXL_DECODER_RAM,
> CXL_DECODER_PMEM,
> + CXL_DECODER_DC0,
> + CXL_DECODER_DC1,
> + CXL_DECODER_DC2,
> + CXL_DECODER_DC3,
> + CXL_DECODER_DC4,
> + CXL_DECODER_DC5,
> + CXL_DECODER_DC6,
> + CXL_DECODER_DC7,
> CXL_DECODER_MIXED,
> CXL_DECODER_DEAD,
> };
> @@ -375,6 +383,14 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> [CXL_DECODER_NONE] = "none",
> [CXL_DECODER_RAM] = "ram",
> [CXL_DECODER_PMEM] = "pmem",
> + [CXL_DECODER_DC0] = "dc0",
> + [CXL_DECODER_DC1] = "dc1",
> + [CXL_DECODER_DC2] = "dc2",
> + [CXL_DECODER_DC3] = "dc3",
> + [CXL_DECODER_DC4] = "dc4",
> + [CXL_DECODER_DC5] = "dc5",
> + [CXL_DECODER_DC6] = "dc6",
> + [CXL_DECODER_DC7] = "dc7",
> [CXL_DECODER_MIXED] = "mixed",
> };
>
> @@ -383,10 +399,16 @@ static inline const char *cxl_decoder_mode_name(enum cxl_decoder_mode mode)
> return "mixed";
> }
>
> +static inline bool cxl_decoder_mode_is_dc(enum cxl_decoder_mode mode)
> +{
> + return (mode >= CXL_DECODER_DC0 && mode <= CXL_DECODER_DC7);
> +}
> +
> enum cxl_region_mode {
> CXL_REGION_NONE,
> CXL_REGION_RAM,
> CXL_REGION_PMEM,
> + CXL_REGION_DC,
> CXL_REGION_MIXED,
> };
>
> @@ -396,6 +418,7 @@ static inline const char *cxl_region_mode_name(enum cxl_region_mode mode)
> [CXL_REGION_NONE] = "none",
> [CXL_REGION_RAM] = "ram",
> [CXL_REGION_PMEM] = "pmem",
> + [CXL_REGION_DC] = "dc",
> [CXL_REGION_MIXED] = "mixed",
> };
>
>