From: Ira Weiny <[email protected]>
Changes from V11:[8]
The major change in this version is to remove the workqueue from the
internal implementation of the state machine; instead, tasks are
processed as work items on a single ordered workqueue within each
mailbox. This workqueue takes care of all locking and guarantees that
tasks are completed in the order submitted. Any synchronization
required between tasks must be handled by the user of the mailbox, but
the user can depend on work items completing in the order they are
submitted; a single-threaded submitter is therefore guaranteed to get
all work items completed in order. This also aids a single mailbox in
supporting multiple protocols: each protocol can have a separate
thread submitting tasks for that protocol, and the mailbox object
ensures each protocol task completes before another task starts, while
multiple user threads may submit tasks for different protocols at the
same time without regard to the other protocols in use.
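As a rough sketch only (using the pci_doe_task fields and
pci_doe_submit_task() introduced later in this series; error handling
elided), a single-threaded submitter can rely on the ordering
guarantee like this:

	/* Sketch, not part of the patches; tasks assumed filled in elsewhere */
	static void task_done(struct pci_doe_task *task)
	{
		complete(task->private);
	}

	static void submit_in_order(struct pci_doe_mb *mb,
				    struct pci_doe_task *a,
				    struct pci_doe_task *b)
	{
		DECLARE_COMPLETION_ONSTACK(done_a);
		DECLARE_COMPLETION_ONSTACK(done_b);

		a->complete = task_done;
		a->private = &done_a;
		b->complete = task_done;
		b->private = &done_b;

		/* the ordered workqueue completes 'a' before it starts 'b' */
		pci_doe_submit_task(mb, a);
		pci_doe_submit_task(mb, b);
		wait_for_completion(&done_b);
	}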
XArrays are used throughout the series.
Other minor changes are noted in the individual patches.
CXL drivers need various data which are provided through generic DOE
mailboxes as defined in the PCIe r6.0 specification.[1]
One such data item is the Coherent Device Attribute Table (CDAT). CDAT
data provides coherency information about the various devices in the
system. It was developed because systems no longer have a priori
knowledge of all coherent devices within a system. CDAT describes the
coherency characteristics of the components on the CXL bus separately
from system configurations. The OS can then, for example, use this
information to form correct interleave sets.
To begin reading the CDAT, the OS must be able to access the DOE
mailboxes provided by the CXL devices.
Because DOE is not specific to CXL but is provided within the PCI spec,
the series adds PCI DOE capability library functions. These functions
allow for iterating the DOE capabilities on a device as well as
creating pci_doe_mb structures which control the operation of the DOE
state machine.
For now, iteration and storage of the DOE mailboxes is done on memdev
objects within the CXL stack. This can be lifted into more generic
code later when needed.
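As a sketch of the intended usage (mirroring the cxl/pci.c hunk later
in this series; logging and cleanup trimmed):

	u16 off = 0;

	pci_doe_for_each_off(pdev, off) {
		struct pci_doe_mb *mb = pcim_doe_create_mb(pdev, off, -1);

		if (IS_ERR(mb))
			continue;	/* the real code logs the failure */
		/* stash mb, keyed by offset, for later task submission */
	}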
This work was tested using QEMU.
[0] https://lore.kernel.org/linux-cxl/[email protected]/
[1] https://pcisig.com/specifications
[2] https://lore.kernel.org/qemu-devel/[email protected]/
[3] https://lore.kernel.org/linux-cxl/[email protected]/
[4] https://lore.kernel.org/linux-cxl/[email protected]/
[5] https://lore.kernel.org/linux-cxl/[email protected]/
[6] https://lore.kernel.org/linux-cxl/[email protected]/
[7] https://lore.kernel.org/linux-cxl/[email protected]/
[8] https://lore.kernel.org/linux-cxl/[email protected]/
Previous changes
================
Changes from V10:[7]
Address Ben Widawsky's comments
Protect against potentially malicious devices.
Fix ownership issue of cdat_mb
Changes from V9:[6]
Address feedback from
Lukas Wunner, Davidlohr Bueso, Jonathan Cameron,
Alison Schofield, and Ben Widawsky
Details in each individual patch.
Changes from V8:[5]
For this version I've punted a bit to get it out and dropped the auxiliary
bus functionality. I like where Jonathan is going with the port driver
idea. I think eventually the irq/mailbox creation will need to be more
generic in a PCI port driver. I've modeled this version on such an
architecture but used the CXL port for the time being.
From Dan
Drop the auxiliary bus/device
From Jonathan
Cleanups
From Bjorn
Clean up commit messages
move pci-doe.c to doe.c
Clean up PCI spec references
Ensure all messages use pci_*()
Add offset to error messages to distinguish mailboxes
use hex for DOE offsets
Print 4 nibbles for Vendor ID and 2 for type.
s/irq/IRQ in comments
Fix long lines
Fix typos
Changes from V7:[4]
Avoid code bloat by making pci-doe.c conditional on CONFIG_PCI_DOE
which is auto selected by the CXL_PCI config option.
Minor code clean ups
Fix bug in pci_doe_supports_prot()
Rebase to cxl-pending
Changes from V6:[3]
The big change is the removal of the auxiliary bus code from the PCI
layer. The auxiliary bus usage is now in the CXL layer. The PCI layer
provides helpers for subsystems to utilize DOE mailboxes by creating a
pci_doe_mb object which controls a state machine for that mailbox
capability. The CXL layer wraps this object in an auxiliary device and
driver which can then be used to determine if the kernel is controlling
the capability or it is available to be used by user space. Reads from
user space via lspci are allowed. Writes are allowed but flagged by
tainting the kernel.
Feedback from Bjorn, Jonathan, and Dan
Details in each patch
Changes from V5:[0]
Rework the patch set to split PCI vs CXL changes
Also make each change a bit more stand alone for easier review
Add cxl_cdat structure
Put CDAT related data structures in cdat.h
Clarify some device lifetimes with comments
Incorporate feedback from Jonathan, Bjorn and Dan
The biggest change is placing the DOE scanning code into the
pci_doe driver (part of the PCI core).
Validate the CDAT when it is read rather than before DSMAS
parsing
Do not report DSMAS failure as an error, report a warning and
keep going.
Retry reading the table once.
Update commit messages and this cover letter
Ira Weiny (7):
PCI: Replace magic constant for PCI Sig Vendor ID
cxl/pci: Create PCI DOE mailboxes for memory devices
driver-core: Introduce BIN_ATTR_ADMIN_{RO,RW}
cxl/port: Read CDAT table
cxl/port: Introduce cxl_cdat_valid()
cxl/port: Retry reading CDAT on failure
cxl/port: Parse out DSMAS data from CDAT table
Jonathan Cameron (2):
PCI: Add vendor ID for the PCI SIG
PCI: Create PCIe library functions in support of DOE mailboxes.
.clang-format | 1 +
drivers/cxl/Kconfig | 1 +
drivers/cxl/cdat.h | 125 ++++++
drivers/cxl/core/pci.c | 293 +++++++++++++++
drivers/cxl/cxl.h | 5 +
drivers/cxl/cxlmem.h | 7 +
drivers/cxl/cxlpci.h | 2 +
drivers/cxl/mem.c | 1 +
drivers/cxl/pci.c | 37 ++
drivers/cxl/port.c | 51 +++
drivers/pci/Kconfig | 3 +
drivers/pci/Makefile | 1 +
drivers/pci/doe.c | 689 ++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 2 +-
include/linux/pci-doe.h | 81 ++++
include/linux/pci_ids.h | 1 +
include/linux/sysfs.h | 16 +
include/uapi/linux/pci_regs.h | 29 +-
18 files changed, 1343 insertions(+), 2 deletions(-)
create mode 100644 drivers/cxl/cdat.h
create mode 100644 drivers/pci/doe.c
create mode 100644 include/linux/pci-doe.h
base-commit: 34e37b4c432cd0f1842b352fde4b8878b4166888
--
2.35.3
From: Ira Weiny <[email protected]>
DOE mailbox objects will be needed for various mailbox communications
with each memory device.
Iterate each DOE mailbox capability and create PCI DOE mailbox objects
as found.
It is not anticipated that this is the final resting place for the
iteration of the DOE devices. The support of switch ports will drive
this code into the PCIe side. In this imagined architecture the CXL
port driver would then query into the PCI device for the DOE mailbox
array.
For now, creating the mailboxes in the CXL port is good enough for the
endpoints. Later, PCIe ports will need to support this more generically
to support switch ports.
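Consumers can then walk the xarray to find a mailbox which speaks a
given protocol; find_cdat_mb() later in this series does exactly this.
A sketch, where vid and type are placeholders for the wanted protocol:

	unsigned long index;
	void *entry;

	xa_for_each(&cxlds->doe_mbs, index, entry) {
		struct pci_doe_mb *mb = entry;

		if (pci_doe_supports_prot(mb, vid, type))
			return mb;	/* found a matching mailbox */
	}
	return NULL;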
Cc: Dan Williams <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Lukas Wunner <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Changes from V11:
Drop review from: Ben Widawsky <[email protected]>
Remove irq code for now
Adjust for pci_doe_get_int_msg_num()
Adjust for pcim_doe_create_mb()
(No longer need to handle the destroy.)
Use xarray for DOE mailbox array
Changes from V9:
Bug fix: ensure DOE mailboxes are iterated before memdev add
Ben Widawsky
Set use_irq to false and just return on error.
Don't return a value from devm_cxl_pci_create_doe()
Skip allocating doe_mb array if there are no mailboxes
Skip requesting irqs if none found.
Ben/Jonathan Cameron
s/num_irqs/max_irqs
Changes from V8:
Move PCI_DOE selection to CXL_BUS to support future patches
which move queries into the port code.
Remove Auxiliary device arch
Squash the functionality of the auxiliary driver into this
patch.
Split out the irq handling a bit.
Changes from V7:
Minor code clean ups
Rebased on cxl-pending
Changes from V6:
Move all the auxiliary device stuff to the CXL layer
Changes from V5:
Split the CXL specific stuff off from the PCI DOE create
auxiliary device code.
---
drivers/cxl/Kconfig | 1 +
drivers/cxl/cxlmem.h | 3 +++
drivers/cxl/pci.c | 37 +++++++++++++++++++++++++++++++++++++
3 files changed, 41 insertions(+)
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index f64e3984689f..7adaaf80b302 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -2,6 +2,7 @@
menuconfig CXL_BUS
tristate "CXL (Compute Express Link) Devices Support"
depends on PCI
+ select PCI_DOE
help
CXL is a bus that is electrically compatible with PCI Express, but
layers three protocols on that signalling (CXL.io, CXL.cache, and
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 60d10ee1e7fc..360f282ef80c 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -191,6 +191,7 @@ struct cxl_endpoint_dvsec_info {
* @component_reg_phys: register base of component registers
* @info: Cached DVSEC information about the device.
* @serial: PCIe Device Serial Number
+ * @doe_mbs: PCI DOE mailbox array
* @mbox_send: @dev specific transport for transmitting mailbox commands
*
* See section 8.2.9.5.2 Capacity Configuration and Label Storage for
@@ -224,6 +225,8 @@ struct cxl_dev_state {
resource_size_t component_reg_phys;
u64 serial;
+ struct xarray doe_mbs;
+
int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
};
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 5a0ae46d4989..5821e6c1253b 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -8,6 +8,7 @@
#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/pci.h>
+#include <linux/pci-doe.h>
#include <linux/io.h>
#include "cxlmem.h"
#include "cxlpci.h"
@@ -386,6 +387,37 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
return rc;
}
+static void cxl_pci_destroy_doe(void *mbs)
+{
+ struct xarray *xa = mbs;
+
+ xa_destroy(xa);
+}
+
+static void devm_cxl_pci_create_doe(struct cxl_dev_state *cxlds)
+{
+ struct device *dev = cxlds->dev;
+ struct pci_dev *pdev = to_pci_dev(dev);
+ u16 off = 0;
+
+ pci_doe_for_each_off(pdev, off) {
+ struct pci_doe_mb *doe_mb;
+
+ doe_mb = pcim_doe_create_mb(pdev, off, -1);
+ if (IS_ERR(doe_mb)) {
+ pci_err(pdev,
+ "Failed to create MB object for MB @ %x\n",
+ off);
+ doe_mb = NULL;
+ }
+
+ if (xa_insert(&cxlds->doe_mbs, off, doe_mb, GFP_KERNEL))
+ break;
+
+ pci_dbg(pdev, "Created DOE mailbox @%x\n", off);
+ }
+}
+
static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
struct cxl_register_map map;
@@ -408,6 +440,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (IS_ERR(cxlds))
return PTR_ERR(cxlds);
+ xa_init(&cxlds->doe_mbs);
+ devm_add_action(&pdev->dev, cxl_pci_destroy_doe, &cxlds->doe_mbs);
+
cxlds->serial = pci_get_dsn(pdev);
cxlds->cxl_dvsec = pci_find_dvsec_capability(
pdev, PCI_DVSEC_VENDOR_ID_CXL, CXL_DVSEC_PCIE_DEVICE);
@@ -434,6 +469,8 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
cxlds->component_reg_phys = cxl_regmap_to_base(pdev, &map);
+ devm_cxl_pci_create_doe(cxlds);
+
rc = cxl_pci_setup_mailbox(cxlds);
if (rc)
return rc;
--
2.35.3
From: Ira Weiny <[email protected]>
The OS will need CDAT data from CXL devices to properly set up
interleave sets. Currently this data is available through a DOE mailbox
which supports the CDAT protocol.
Search the DOE mailboxes available, query CDAT data, and cache the data
for later parsing.
Provide a sysfs binary attribute to allow dumping of the CDAT.
Binary dumping is modeled on /sys/firmware/acpi/tables/.
The ability to dump this table will be very useful for emulation of
real devices once they become available, as QEMU CXL type 3 device
emulation will be able to load this file in.
This does not support table updates at runtime. It will always provide
whatever was there when first cached. Handling of table updates can be
implemented later.
Finally create a complete list of CDAT defines within cdat.h for code
wishing to decode the CDAT table.
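For illustration only, a minimal userspace reader; the portN name
depends on enumeration order, so the path below is just an example:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		ssize_t n;
		/* example path; "CDAT/cdat" comes from the sysfs group below */
		int fd = open("/sys/bus/cxl/devices/port1/CDAT/cdat", O_RDONLY);

		if (fd < 0)
			return 1;
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			fwrite(buf, 1, n, stdout);
		close(fd);
		return 0;
	}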
Co-developed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Changes from V11:
Adjust for the use of DOE mailbox xarray
Dan Williams:
Remove unnecessary get/put device
Use new BIN_ATTR_ADMIN_RO macro
Flag that CDAT was supported
If there is a read error then the CDAT sysfs
will return a 0 length entry
Changes from V10:
Ben Widawsky
Failure to find CDAT should be a debug message not error
Remove reference to cdat_mb from the port object
Dropped [PATCH V10 5/9] cxl/port: Find a DOE mailbox which supports
CDAT
Iterate the mailboxes to find the CDAT mailbox each time it is
needed.
Define CXL_DOE_TABLE_ACCESS_LAST_ENTRY and add a comment about
its use.
Changes from V9:
Add debugging output
Jonathan Cameron
Move read_cdat to port probe by using dev_groups for the
sysfs attributes. This avoids issues with using devm
before the driver is loaded while making sure the CDAT
binary is available.
Changes from V8:
Fix length print format
Incorporate feedback from Jonathan
Move all this to cxl_port which can help support switches when
the time comes.
Changes from V6:
Fix issue with devm use
Move cached cdat data to cxl_dev_state
Use new pci_doe_submit_task()
Ensure the aux driver is locked while processing tasks
Rebased on cxl-pending
Changes from V5:
Add proper guards around cdat.h
Split out finding the CDAT DOE mailbox
Use cxl_cdat to group CDAT data together
Adjust to use auxiliary_find_device() to find the DOE device
which supplies the CDAT protocol.
Rebased to latest
Remove dev_dbg(length)
Remove unneeded DOE Table access defines
Move CXL_DOE_PROTOCOL_TABLE_ACCESS define into this patch where
it is used
Changes from V4:
Split this into its own patch
Rearchitect this such that the memdev driver calls into the DOE
driver via the cxl_mem state object. This allows CDAT data to
come from any type of cxl_mem object not just PCI DOE.
Rebase on new struct cxl_dev_state
---
drivers/cxl/cdat.h | 100 +++++++++++++++++++++++++
drivers/cxl/core/pci.c | 166 +++++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxl.h | 5 ++
drivers/cxl/cxlpci.h | 1 +
drivers/cxl/port.c | 51 +++++++++++++
5 files changed, 323 insertions(+)
create mode 100644 drivers/cxl/cdat.h
diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
new file mode 100644
index 000000000000..c6a48ab326bf
--- /dev/null
+++ b/drivers/cxl/cdat.h
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __CXL_CDAT_H__
+#define __CXL_CDAT_H__
+
+/*
+ * Coherent Device Attribute table (CDAT)
+ *
+ * Specification available from UEFI.org
+ *
+ * Whilst CDAT is defined as a single table, the access via DOE mailboxes is
+ * done one entry at a time, where the first entry is the header.
+ */
+
+#define CXL_DOE_TABLE_ACCESS_REQ_CODE 0x000000ff
+#define CXL_DOE_TABLE_ACCESS_REQ_CODE_READ 0
+#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE 0x0000ff00
+#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA 0
+#define CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE 0xffff0000
+#define CXL_DOE_TABLE_ACCESS_LAST_ENTRY 0xffff
+
+/*
+ * CDAT entries are little endian and are read from PCI config space which
+ * is also little endian.
+ * As such, on a big endian system these will have been reversed.
+ * This prevents us from making easy use of packed structures.
+ * Style from pci_regs.h
+ */
+
+#define CDAT_HEADER_LENGTH_DW 4
+#define CDAT_HEADER_LENGTH_BYTES (CDAT_HEADER_LENGTH_DW * sizeof(u32))
+#define CDAT_HEADER_DW0_LENGTH 0xffffffff
+#define CDAT_HEADER_DW1_REVISION 0x000000ff
+#define CDAT_HEADER_DW1_CHECKSUM 0x0000ff00
+/* CDAT_HEADER_DW2_RESERVED */
+#define CDAT_HEADER_DW3_SEQUENCE 0xffffffff
+
+/* All structures have a common first DW */
+#define CDAT_STRUCTURE_DW0_TYPE 0x000000ff
+#define CDAT_STRUCTURE_DW0_TYPE_DSMAS 0
+#define CDAT_STRUCTURE_DW0_TYPE_DSLBIS 1
+#define CDAT_STRUCTURE_DW0_TYPE_DSMSCIS 2
+#define CDAT_STRUCTURE_DW0_TYPE_DSIS 3
+#define CDAT_STRUCTURE_DW0_TYPE_DSEMTS 4
+#define CDAT_STRUCTURE_DW0_TYPE_SSLBIS 5
+
+#define CDAT_STRUCTURE_DW0_LENGTH 0xffff0000
+
+/* Device Scoped Memory Affinity Structure */
+#define CDAT_DSMAS_DW1_DSMAD_HANDLE 0x000000ff
+#define CDAT_DSMAS_DW1_FLAGS 0x0000ff00
+#define CDAT_DSMAS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMAS_DPA_LEN(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
+#define CDAT_DSMAS_NON_VOLATILE(flags)	(((flags) & 0x04) >> 2)
+
+/* Device Scoped Latency and Bandwidth Information Structure */
+#define CDAT_DSLBIS_DW1_HANDLE 0x000000ff
+#define CDAT_DSLBIS_DW1_FLAGS 0x0000ff00
+#define CDAT_DSLBIS_DW1_DATA_TYPE 0x00ff0000
+#define CDAT_DSLBIS_BASE_UNIT(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSLBIS_DW4_ENTRY_0 0x0000ffff
+#define CDAT_DSLBIS_DW4_ENTRY_1 0xffff0000
+#define CDAT_DSLBIS_DW5_ENTRY_2 0x0000ffff
+
+/* Device Scoped Memory Side Cache Information Structure */
+#define CDAT_DSMSCIS_DW1_HANDLE 0x000000ff
+#define CDAT_DSMSCIS_MEMORY_SIDE_CACHE_SIZE(entry) \
+ ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMSCIS_DW4_MEMORY_SIDE_CACHE_ATTRS 0xffffffff
+
+/* Device Scoped Initiator Structure */
+#define CDAT_DSIS_DW1_FLAGS 0x000000ff
+#define CDAT_DSIS_DW1_HANDLE 0x0000ff00
+
+/* Device Scoped EFI Memory Type Structure */
+#define CDAT_DSEMTS_DW1_HANDLE 0x000000ff
+#define CDAT_DSEMTS_DW1_EFI_MEMORY_TYPE_ATTR 0x0000ff00
+#define CDAT_DSEMTS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSEMTS_DPA_LENGTH(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
+
+/* Switch Scoped Latency and Bandwidth Information Structure */
+#define CDAT_SSLBIS_DW1_DATA_TYPE 0x000000ff
+#define CDAT_SSLBIS_BASE_UNIT(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_SSLBIS_ENTRY_PORT_X(entry, i) ((entry)[4 + (i) * 2] & 0x0000ffff)
+#define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
+#define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
+
+#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
+
+/**
+ * struct cxl_cdat - CXL CDAT data
+ *
+ * @table: cache of CDAT table
+ * @length: length of cached CDAT table
+ */
+struct cxl_cdat {
+ void *table;
+ size_t length;
+};
+
+#endif /* !__CXL_CDAT_H__ */
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c4c99ff7b55e..4bd479ec0253 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -4,10 +4,12 @@
#include <linux/device.h>
#include <linux/delay.h>
#include <linux/pci.h>
+#include <linux/pci-doe.h>
#include <cxlpci.h>
#include <cxlmem.h>
#include <cxl.h>
#include "core.h"
+#include "cdat.h"
/**
* DOC: cxl core pci
@@ -458,3 +460,167 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm)
return 0;
}
EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, CXL);
+
+static struct pci_doe_mb *find_cdat_mb(struct device *uport)
+{
+ struct cxl_memdev *cxlmd;
+ struct cxl_dev_state *cxlds;
+ unsigned long index;
+ void *entry;
+
+ if (!is_cxl_memdev(uport))
+ return NULL;
+
+ cxlmd = to_cxl_memdev(uport);
+ cxlds = cxlmd->cxlds;
+
+ xa_for_each(&cxlds->doe_mbs, index, entry) {
+ struct pci_doe_mb *cur = entry;
+
+ if (pci_doe_supports_prot(cur, PCI_DVSEC_VENDOR_ID_CXL,
+ CXL_DOE_PROTOCOL_TABLE_ACCESS))
+ return cur;
+ }
+
+ return NULL;
+}
+
+#define CDAT_DOE_REQ(entry_handle) \
+ (FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE, \
+ CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) | \
+ FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE, \
+ CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) | \
+ FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
+
+static void cxl_doe_task_complete(struct pci_doe_task *task)
+{
+ complete(task->private);
+}
+
+static int cxl_cdat_get_length(struct device *dev,
+ struct pci_doe_mb *cdat_mb,
+ size_t *length)
+{
+ u32 cdat_request_pl = CDAT_DOE_REQ(0);
+ u32 cdat_response_pl[32];
+ DECLARE_COMPLETION_ONSTACK(c);
+ struct pci_doe_task task = {
+ .prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+ .prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+ .request_pl = &cdat_request_pl,
+ .request_pl_sz = sizeof(cdat_request_pl),
+ .response_pl = cdat_response_pl,
+ .response_pl_sz = sizeof(cdat_response_pl),
+ .complete = cxl_doe_task_complete,
+ .private = &c,
+ };
+ int rc = 0;
+
+ rc = pci_doe_submit_task(cdat_mb, &task);
+ if (rc < 0) {
+ dev_err(dev, "DOE submit failed: %d", rc);
+ return rc;
+ }
+ wait_for_completion(&c);
+
+ if (task.rv < 1)
+ return -EIO;
+
+ *length = cdat_response_pl[1];
+ dev_dbg(dev, "CDAT length %zu\n", *length);
+
+ return rc;
+}
+
+static int cxl_cdat_read_table(struct device *dev,
+ struct pci_doe_mb *cdat_mb,
+ struct cxl_cdat *cdat)
+{
+ size_t length = cdat->length;
+ u32 *data = cdat->table;
+ int entry_handle = 0;
+ int rc = 0;
+
+ do {
+ u32 cdat_request_pl = CDAT_DOE_REQ(entry_handle);
+ u32 cdat_response_pl[32];
+ DECLARE_COMPLETION_ONSTACK(c);
+ struct pci_doe_task task = {
+ .prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+ .prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+ .request_pl = &cdat_request_pl,
+ .request_pl_sz = sizeof(cdat_request_pl),
+ .response_pl = cdat_response_pl,
+ .response_pl_sz = sizeof(cdat_response_pl),
+ .complete = cxl_doe_task_complete,
+ .private = &c,
+ };
+ size_t entry_dw;
+ u32 *entry;
+
+ rc = pci_doe_submit_task(cdat_mb, &task);
+ if (rc < 0) {
+ dev_err(dev, "DOE submit failed: %d", rc);
+ return rc;
+ }
+ wait_for_completion(&c);
+
+ /* Get the CXL table access header entry handle */
+ entry_handle = FIELD_GET(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE,
+ cdat_response_pl[0]);
+ entry = cdat_response_pl + 1;
+ entry_dw = task.rv / sizeof(u32);
+ /* Skip Header */
+ entry_dw -= 1;
+ entry_dw = min(length / 4, entry_dw);
+ memcpy(data, entry, entry_dw * sizeof(u32));
+ length -= entry_dw * sizeof(u32);
+ data += entry_dw;
+
+ } while (entry_handle != CXL_DOE_TABLE_ACCESS_LAST_ENTRY);
+
+ return rc;
+}
+
+/**
+ * read_cdat_data - Read the CDAT data on this port
+ * @port: Port to read data from
+ *
+ * This call will sleep waiting for responses from the DOE mailbox.
+ */
+void read_cdat_data(struct cxl_port *port)
+{
+	struct pci_doe_mb *cdat_mb;
+ struct device *dev = &port->dev;
+ struct device *uport = port->uport;
+ size_t cdat_length;
+ int ret;
+
+ cdat_mb = find_cdat_mb(uport);
+ if (!cdat_mb) {
+ dev_dbg(dev, "No CDAT mailbox\n");
+ return;
+ }
+
+ port->cdat_sup = true;
+
+ if (cxl_cdat_get_length(dev, cdat_mb, &cdat_length)) {
+ dev_dbg(dev, "No CDAT length\n");
+ return;
+ }
+
+ port->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
+ if (!port->cdat.table)
+ return;
+
+ port->cdat.length = cdat_length;
+ ret = cxl_cdat_read_table(dev, cdat_mb, &port->cdat);
+ if (ret) {
+ /* Don't leave table data allocated on error */
+ devm_kfree(dev, port->cdat.table);
+ port->cdat.table = NULL;
+ port->cdat.length = 0;
+ dev_err(dev, "CDAT data read error\n");
+ }
+}
+EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 140dc3278cde..9a08379000a0 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -8,6 +8,7 @@
#include <linux/bitfield.h>
#include <linux/bitops.h>
#include <linux/io.h>
+#include "cdat.h"
/**
* DOC: cxl objects
@@ -267,6 +268,8 @@ struct cxl_nvdimm {
* @component_reg_phys: component register capability base address (optional)
* @dead: last ep has been removed, force port re-creation
* @depth: How deep this port is relative to the root. depth 0 is the root.
+ * @cdat: Cached CDAT data
+ * @cdat_sup: Is CDAT supported
*/
struct cxl_port {
struct device dev;
@@ -278,6 +281,8 @@ struct cxl_port {
resource_size_t component_reg_phys;
bool dead;
unsigned int depth;
+ struct cxl_cdat cdat;
+ bool cdat_sup;
};
/**
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index fce1c11729c2..eec597dbe763 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -74,4 +74,5 @@ static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
int devm_cxl_port_enumerate_dports(struct cxl_port *port);
struct cxl_dev_state;
int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
+void read_cdat_data(struct cxl_port *port);
#endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index 3cf308f114c4..1ec34a159657 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -49,6 +49,9 @@ static int cxl_port_probe(struct device *dev)
if (IS_ERR(cxlhdm))
return PTR_ERR(cxlhdm);
+ /* Cache the data early to ensure is_visible() works */
+ read_cdat_data(port);
+
if (is_cxl_endpoint(port)) {
struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
struct cxl_dev_state *cxlds = cxlmd->cxlds;
@@ -78,10 +81,58 @@ static int cxl_port_probe(struct device *dev)
return 0;
}
+static ssize_t cdat_read(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t offset, size_t count)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct cxl_port *port = to_cxl_port(dev);
+
+ if (!port->cdat.table)
+ return 0;
+
+ return memory_read_from_buffer(buf, count, &offset,
+ port->cdat.table,
+ port->cdat.length);
+}
+
+static BIN_ATTR_ADMIN_RO(cdat, 0);
+
+static umode_t cxl_port_bin_attr_is_visible(struct kobject *kobj,
+ struct bin_attribute *attr, int i)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct cxl_port *port = to_cxl_port(dev);
+
+ if ((attr == &bin_attr_cdat) && port->cdat_sup)
+ return attr->attr.mode;
+
+ return 0;
+}
+
+static struct bin_attribute *cxl_cdat_bin_attributes[] = {
+ &bin_attr_cdat,
+ NULL,
+};
+
+static struct attribute_group cxl_cdat_attribute_group = {
+ .name = "CDAT",
+ .bin_attrs = cxl_cdat_bin_attributes,
+ .is_bin_visible = cxl_port_bin_attr_is_visible,
+};
+
+static const struct attribute_group *cxl_port_attribute_groups[] = {
+ &cxl_cdat_attribute_group,
+ NULL,
+};
+
static struct cxl_driver cxl_port_driver = {
.name = "cxl_port",
.probe = cxl_port_probe,
.id = CXL_DEVICE_PORT,
+ .drv = {
+ .dev_groups = cxl_port_attribute_groups,
+ },
};
module_cxl_driver(cxl_port_driver);
--
2.35.3
From: Ira Weiny <[email protected]>
CXL Ports with memory devices attached need the information from the
Device Scoped Memory Affinity Structure (DSMAS). This information is
contained within the CDAT table buffer which is cached in the port
device.
If CDAT data is available, parse and cache DSMAS data from the table.
Store this data in unmarshaled struct cxl_dsmas structures for ease of
later use. Ensure DSMAS headers are not malicious or ill-formed, which
could otherwise cause buffer overflow errors.
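For reference, a sketch (not part of this patch) of how a consumer
might walk the cached entries once parsed:

	int i;

	for (i = 0; i < cxlmd->nr_dsmas; i++) {
		struct cxl_dsmas *d = &cxlmd->dsmas_ary[i];

		/* e.g. feed the DPA range into region/interleave setup */
		dev_dbg(&cxlmd->dev, "DSMAS %d: %#llx +%#llx %s\n", i,
			d->dpa_base, d->length,
			d->non_volatile ? "persistent" : "volatile");
	}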
Signed-off-by: Ira Weiny <[email protected]>
---
Changes from V10
From Ben Widawsky
Check data lengths to protect against malicious devices
Changes from V8
Adjust to the cdat data being in cxl_port
Changes from V7
Rebased on cxl-pending
Changes from V6
Move to port.c
It is not an error if no DSMAS data is found
Changes from V5
Fix up sparse warnings
Split out cdat_hdr_valid()
Update cdat_hdr_valid()
Remove revision and cs field parsing
There is no point in these
Add seq check and debug print.
From Jonathan
Add spaces around '+' and '/'
use devm_krealloc() for dsmas_ary
---
drivers/cxl/cdat.h | 23 ++++++++++++++
drivers/cxl/core/pci.c | 72 ++++++++++++++++++++++++++++++++++++++++++
drivers/cxl/cxlmem.h | 4 +++
drivers/cxl/cxlpci.h | 1 +
drivers/cxl/mem.c | 1 +
5 files changed, 101 insertions(+)
diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
index 39eb561081f2..ca1f55762416 100644
--- a/drivers/cxl/cdat.h
+++ b/drivers/cxl/cdat.h
@@ -51,6 +51,7 @@
#define CDAT_DSMAS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
#define CDAT_DSMAS_DPA_LEN(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
#define CDAT_DSMAS_NON_VOLATILE(flags)	(((flags) & 0x04) >> 2)
+#define CDAT_DSMAS_ENTRY_SIZE (6 * sizeof(u32))
/* Device Scoped Latency and Bandwidth Information Structure */
#define CDAT_DSLBIS_DW1_HANDLE 0x000000ff
@@ -60,22 +61,26 @@
#define CDAT_DSLBIS_DW4_ENTRY_0 0x0000ffff
#define CDAT_DSLBIS_DW4_ENTRY_1 0xffff0000
#define CDAT_DSLBIS_DW5_ENTRY_2 0x0000ffff
+#define CDAT_DSLBIS_ENTRY_SIZE (6 * sizeof(u32))
/* Device Scoped Memory Side Cache Information Structure */
#define CDAT_DSMSCIS_DW1_HANDLE 0x000000ff
#define CDAT_DSMSCIS_MEMORY_SIDE_CACHE_SIZE(entry) \
((u64)((entry)[3]) << 32 | (entry)[2])
#define CDAT_DSMSCIS_DW4_MEMORY_SIDE_CACHE_ATTRS 0xffffffff
+#define CDAT_DSMSCIS_ENTRY_SIZE (5 * sizeof(u32))
/* Device Scoped Initiator Structure */
#define CDAT_DSIS_DW1_FLAGS 0x000000ff
#define CDAT_DSIS_DW1_HANDLE 0x0000ff00
+#define CDAT_DSIS_ENTRY_SIZE (2 * sizeof(u32))
/* Device Scoped EFI Memory Type Structure */
#define CDAT_DSEMTS_DW1_HANDLE 0x000000ff
#define CDAT_DSEMTS_DW1_EFI_MEMORY_TYPE_ATTR 0x0000ff00
#define CDAT_DSEMTS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
#define CDAT_DSEMTS_DPA_LENGTH(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
+#define CDAT_DSEMTS_ENTRY_SIZE (6 * sizeof(u32))
/* Switch Scoped Latency and Bandwidth Information Structure */
#define CDAT_SSLBIS_DW1_DATA_TYPE 0x000000ff
@@ -83,9 +88,27 @@
#define CDAT_SSLBIS_ENTRY_PORT_X(entry, i) ((entry)[4 + (i) * 2] & 0x0000ffff)
#define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
#define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
+#define CDAT_SSLBIS_HEADER_SIZE (6 * sizeof(u32))
#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
+/**
+ * struct cxl_dsmas - host unmarshaled version of DSMAS data
+ *
+ * As defined in the Coherent Device Attribute Table (CDAT) specification this
+ * represents a single DSMAS entry in that table.
+ *
+ * @dpa_base: The lowest Device Physical Address associated with this DSMAD
+ * @length: Length in bytes of this DSMAD
+ * @non_volatile: If set, the memory region represents Non-Volatile memory
+ */
+struct cxl_dsmas {
+ u64 dpa_base;
+ u64 length;
+ /* Flags */
+ u8 non_volatile:1;
+};
+
/**
* struct cxl_cdat - CXL CDAT data
*
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index d7c2a415cc5f..6d58fb1e46b0 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -679,3 +679,75 @@ void read_cdat_data(struct cxl_port *port)
retries);
}
EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
+
+void parse_dsmas(struct cxl_memdev *cxlmd, struct cxl_port *port)
+{
+ struct device *dev = &port->dev;
+ struct cxl_dsmas *dsmas_ary = NULL;
+ u32 *data = port->cdat.table;
+ int bytes_left = port->cdat.length;
+ int nr_dsmas = 0;
+
+ if (!data) {
+ dev_info(dev, "No CDAT data available for DSMAS\n");
+ return;
+ }
+
+ /* Skip header */
+ data += CDAT_HEADER_LENGTH_DW;
+ bytes_left -= CDAT_HEADER_LENGTH_BYTES;
+
+ while (bytes_left > 0) {
+ u32 *cur_rec = data;
+ u8 type = FIELD_GET(CDAT_STRUCTURE_DW0_TYPE, cur_rec[0]);
+ u16 length = FIELD_GET(CDAT_STRUCTURE_DW0_LENGTH, cur_rec[0]);
+
+ if (type == CDAT_STRUCTURE_DW0_TYPE_DSMAS) {
+ struct cxl_dsmas *new_ary;
+ u8 flags;
+
+ /* Protect against malicious devices */
+ if (bytes_left < CDAT_DSMAS_ENTRY_SIZE ||
+ length != CDAT_DSMAS_ENTRY_SIZE) {
+ dev_err(dev, "Invalid DSMAS data detected\n");
+ return;
+ }
+
+ new_ary = devm_krealloc(dev, dsmas_ary,
+ sizeof(*dsmas_ary) * (nr_dsmas + 1),
+ GFP_KERNEL);
+ if (!new_ary) {
+ dev_err(dev,
+ "Failed to allocate memory for DSMAS data (nr_dsmas %d)\n",
+ nr_dsmas);
+ return;
+ }
+ dsmas_ary = new_ary;
+
+ flags = FIELD_GET(CDAT_DSMAS_DW1_FLAGS, cur_rec[1]);
+
+ dsmas_ary[nr_dsmas].dpa_base = CDAT_DSMAS_DPA_OFFSET(cur_rec);
+ dsmas_ary[nr_dsmas].length = CDAT_DSMAS_DPA_LEN(cur_rec);
+ dsmas_ary[nr_dsmas].non_volatile = CDAT_DSMAS_NON_VOLATILE(flags);
+
+ dev_dbg(dev, "DSMAS %d: %llx:%llx %s\n",
+ nr_dsmas,
+ dsmas_ary[nr_dsmas].dpa_base,
+ dsmas_ary[nr_dsmas].dpa_base +
+ dsmas_ary[nr_dsmas].length,
+ (dsmas_ary[nr_dsmas].non_volatile ?
+ "Persistent" : "Volatile")
+ );
+
+ nr_dsmas++;
+ }
+
+ data += (length / sizeof(u32));
+ bytes_left -= length;
+ }
+
+ dev_dbg(dev, "Found %d DSMAS entries\n", nr_dsmas);
+ cxlmd->dsmas_ary = dsmas_ary;
+ cxlmd->nr_dsmas = nr_dsmas;
+}
+EXPORT_SYMBOL_NS_GPL(parse_dsmas, CXL);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 360f282ef80c..54231c26470c 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -36,6 +36,8 @@
* @cxlds: The device state backing this device
* @detach_work: active memdev lost a port in its ancestry
* @id: id number of this memdev instance.
+ * @dsmas_ary: Array of DSMAS entries as parsed from the CDAT table
+ * @nr_dsmas: Number of entries in dsmas_ary
*/
struct cxl_memdev {
struct device dev;
@@ -43,6 +45,8 @@ struct cxl_memdev {
struct cxl_dev_state *cxlds;
struct work_struct detach_work;
int id;
+ struct cxl_dsmas *dsmas_ary;
+ int nr_dsmas;
};
static inline struct cxl_memdev *to_cxl_memdev(struct device *dev)
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index eec597dbe763..3e68804d8935 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -75,4 +75,5 @@ int devm_cxl_port_enumerate_dports(struct cxl_port *port);
struct cxl_dev_state;
int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
void read_cdat_data(struct cxl_port *port);
+void parse_dsmas(struct cxl_memdev *cxlmd, struct cxl_port *port);
#endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index c310f1fd3db0..a8768df4ae38 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -35,6 +35,7 @@ static int create_endpoint(struct cxl_memdev *cxlmd,
if (IS_ERR(endpoint))
return PTR_ERR(endpoint);
+ parse_dsmas(cxlmd, endpoint);
dev_dbg(&cxlmd->dev, "add: %s\n", dev_name(&endpoint->dev));
if (!endpoint->dev.driver) {
--
2.35.3
From: Ira Weiny <[email protected]>
The CDAT read may fail for a number of reasons, but mainly it is
possible for a read to return parts of two different, individually
valid, table states. The checksum in the CDAT table protects against
this.
Now that the CDAT data is validated, retry the read if it fails. For
now, 5 retries are implemented.
Reviewed-by: Ben Widawsky <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Changes from V10
Pick up review tag and fix commit message
Changes from V9
Alison Schofield/Davidlohr Bueso
Print debug on each iteration and error only after failure
Changes from V8
Move code to cxl/core/pci.c
Changes from V6
Move to pci.c
Fix retries count
Change to 5 retries
Changes from V5:
New patch -- easy to push off or drop.
---
drivers/cxl/core/pci.c | 41 ++++++++++++++++++++++++++++++-----------
1 file changed, 30 insertions(+), 11 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 6d775cc3dca1..d7c2a415cc5f 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -618,36 +618,30 @@ static int cxl_cdat_read_table(struct device *dev,
return rc;
}
-/**
- * read_cdat_data - Read the CDAT data on this port
- * @port: Port to read data from
- *
- * This call will sleep waiting for responses from the DOE mailbox.
- */
-void read_cdat_data(struct cxl_port *port)
+static int __read_cdat_data(struct cxl_port *port)
{
struct pci_doe_mb *cdat_mb;
struct device *dev = &port->dev;
struct device *uport = port->uport;
size_t cdat_length;
- int ret;
+ int ret = 0;
cdat_mb = find_cdat_mb(uport);
if (!cdat_mb) {
dev_dbg(dev, "No CDAT mailbox\n");
- return;
+ return -EIO;
}
port->cdat_sup = true;
if (cxl_cdat_get_length(dev, cdat_mb, &cdat_length)) {
dev_dbg(dev, "No CDAT length\n");
- return;
+ return -EIO;
}
port->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
if (!port->cdat.table)
- return;
+ return -ENOMEM;
port->cdat.length = cdat_length;
ret = cxl_cdat_read_table(dev, cdat_mb, &port->cdat);
@@ -658,5 +652,30 @@ void read_cdat_data(struct cxl_port *port)
port->cdat.length = 0;
dev_err(dev, "CDAT data read error\n");
}
+
+ return ret;
+}
+
+/**
+ * read_cdat_data - Read the CDAT data on this port
+ * @port: Port to read data from
+ *
+ * This call will sleep waiting for responses from the DOE mailbox.
+ */
+void read_cdat_data(struct cxl_port *port)
+{
+ int retries = 5;
+ int rc;
+
+ while (retries--) {
+ rc = __read_cdat_data(port);
+ if (!rc)
+ return;
+ dev_dbg(&port->dev,
+ "CDAT data read error rc=%d (retries %d)\n",
+ rc, retries);
+ }
+	dev_err(&port->dev,
+		"CDAT data read failed after multiple retries\n");
}
EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
--
2.35.3
From: Jonathan Cameron <[email protected]>
This ID is used in DOE headers to identify protocols that are defined
within the PCI Express Base Specification, PCIe r6.0, sec 6.30.1.1 table
6-32.
Acked-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
---
include/linux/pci_ids.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 0178823ce8c2..8af3b86206b1 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -151,6 +151,7 @@
#define PCI_CLASS_OTHERS 0xff
/* Vendors and devices. Sort key: vendor first, device next. */
+#define PCI_VENDOR_ID_PCI_SIG 0x0001
#define PCI_VENDOR_ID_LOONGSON 0x0014
--
2.35.3
From: Ira Weiny <[email protected]>
The CDAT data is protected by a checksum and should be the proper
length.
Introduce cxl_cdat_valid() to validate the data: all bytes of the
table, including the checksum byte itself, must sum to zero modulo
256, and the header length must be sane relative to the amount of data
read. While at it, check and store the sequence number.
Signed-off-by: Ira Weiny <[email protected]>
---
Changes from V8
Move code to cxl/core/pci.c
Changes from V6
Change name to cxl_cdat_valid() as this validates all the CDAT
data not just the header
Add error and debug prints
Changes from V5
New patch, split out
Update cdat_hdr_valid()
Remove revision and cs field parsing
There is no point in these
Add seq check and debug print.
---
drivers/cxl/cdat.h | 2 ++
drivers/cxl/core/pci.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
index c6a48ab326bf..39eb561081f2 100644
--- a/drivers/cxl/cdat.h
+++ b/drivers/cxl/cdat.h
@@ -91,10 +91,12 @@
*
* @table: cache of CDAT table
* @length: length of cached CDAT table
+ * @seq: Last read Sequence number of the CDAT table
*/
struct cxl_cdat {
void *table;
size_t length;
+ u32 seq;
};
#endif /* !__CXL_CDAT_H__ */
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 4bd479ec0253..6d775cc3dca1 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -532,6 +532,40 @@ static int cxl_cdat_get_length(struct device *dev,
return rc;
}
+static bool cxl_cdat_valid(struct device *dev, struct cxl_cdat *cdat)
+{
+ u32 *table = cdat->table;
+ u8 *data8 = cdat->table;
+ u32 length, seq;
+ u8 check;
+ int i;
+
+ length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
+ if ((length < CDAT_HEADER_LENGTH_BYTES) || (length > cdat->length)) {
+ dev_err(dev, "CDAT Invalid length %u (%zu-%zu)\n", length,
+ CDAT_HEADER_LENGTH_BYTES, cdat->length);
+ return false;
+ }
+
+ for (check = 0, i = 0; i < length; i++)
+ check += data8[i];
+
+ dev_dbg(dev, "CDAT length %u CS %u\n", length, check);
+ if (check != 0) {
+ dev_err(dev, "CDAT Invalid checksum %u\n", check);
+ return false;
+ }
+
+ seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
+ /* Store the sequence for now. */
+ if (cdat->seq != seq) {
+ dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
+ cdat->seq = seq;
+ }
+
+ return true;
+}
+
static int cxl_cdat_read_table(struct device *dev,
struct pci_doe_mb *cdat_mb,
struct cxl_cdat *cdat)
@@ -579,6 +613,8 @@ static int cxl_cdat_read_table(struct device *dev,
} while (entry_handle != CXL_DOE_TABLE_ACCESS_LAST_ENTRY);
+ if (!rc && !cxl_cdat_valid(dev, cdat))
+ return -EIO;
return rc;
}
--
2.35.3
From: Ira Weiny <[email protected]>
Many binary attributes need to limit access to CAP_SYS_ADMIN only; i.e.,
many binary attributes specify is_visible with 0400 or 0600.
Make setting the permissions of such attributes more explicit by
defining BIN_ATTR_ADMIN_{RO,RW}.
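A usage sketch (the CDAT attribute later in this series follows the
same pattern; foo is a placeholder name):

	static ssize_t foo_read(struct file *filp, struct kobject *kobj,
				struct bin_attribute *bin_attr, char *buf,
				loff_t offset, size_t count)
	{
		/* fill buf; root-only 0400 access is implied by the macro */
		return 0;
	}
	static BIN_ATTR_ADMIN_RO(foo, 0);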
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Changes from V11:
New Patch
---
include/linux/sysfs.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index e3f1e8ac1f85..fd3fe5c8c17f 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -235,6 +235,22 @@ struct bin_attribute bin_attr_##_name = __BIN_ATTR_WO(_name, _size)
#define BIN_ATTR_RW(_name, _size) \
struct bin_attribute bin_attr_##_name = __BIN_ATTR_RW(_name, _size)
+
+#define __BIN_ATTR_ADMIN_RO(_name, _size) { \
+ .attr = { .name = __stringify(_name), .mode = 0400 }, \
+ .read = _name##_read, \
+ .size = _size, \
+}
+
+#define __BIN_ATTR_ADMIN_RW(_name, _size) \
+ __BIN_ATTR(_name, 0600, _name##_read, _name##_write, _size)
+
+#define BIN_ATTR_ADMIN_RO(_name, _size) \
+struct bin_attribute bin_attr_##_name = __BIN_ATTR_ADMIN_RO(_name, _size)
+
+#define BIN_ATTR_ADMIN_RW(_name, _size) \
+struct bin_attribute bin_attr_##_name = __BIN_ATTR_ADMIN_RW(_name, _size)
+
struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *, char *);
ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
--
2.35.3
From: Jonathan Cameron <[email protected]>
Introduced in PCIe r6.0, sec 6.30, DOE provides a config space based
mailbox with standard protocol discovery. Each mailbox is accessed
through a DOE Extended Capability.
Each DOE mailbox must support the DOE discovery protocol in addition to
any number of additional protocols.
Define core PCIe functionality to manage a single PCIe DOE mailbox at a
defined config space offset. Functionality includes iterating,
creating, querying supported protocols, and task submission. Destruction
of the mailboxes is device managed.
If interrupts are desired, the interrupt number can be queried and
passed to the create function. Passing a negative value disables
interrupts for that mailbox. It is the caller's responsibility to ensure
enough interrupt vectors are allocated.
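Putting the pieces together, a hedged end-to-end sketch of a consumer;
the vendor ID and object type values here are placeholders, not a real
protocol:

	static void my_task_done(struct pci_doe_task *task)
	{
		complete(task->private);
	}

	static int my_doe_query(struct pci_doe_mb *mb)
	{
		u32 req = 0;	/* illustrative request payload */
		u32 rsp[32];
		DECLARE_COMPLETION_ONSTACK(c);
		struct pci_doe_task task = {
			.prot.vid = 0x1234,	/* placeholder vendor ID */
			.prot.type = 2,		/* placeholder object type */
			.request_pl = &req,
			.request_pl_sz = sizeof(req),
			.response_pl = rsp,
			.response_pl_sz = sizeof(rsp),
			.complete = my_task_done,
			.private = &c,
		};
		int rc = pci_doe_submit_task(mb, &task);

		if (rc < 0)
			return rc;
		wait_for_completion(&c);
		return task.rv < 0 ? task.rv : 0;	/* rv: bytes or -errno */
	}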
Cc: "Li, Ming" <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Changes from V11
Bjorn: s/PCI/PCIe
use dev_fmt
move cap_offset in struct pci_doe_mb
use break and return from a central place
fix interrupt prints
s/PCI_DOE_CAP_IRQ/PCI_DOE_CAP_INT_MSG_NUM
s/irq_msg_num/int_msg_num
when the value is not an irq but rather the
interrupt message number for the DOE
s/irq/IRQ in comments
Clarify request and response payload size units
In addition clarify the rv units. Check for
invalid response payload size (must be at
least 1 DW).
Dan: s/EOPNOTSUPP/ENXIO/
Add pci_doe_for_each_off to .clang-format
use xarray for supported protocols
s/pci_doe_create_mb/pcim_doe_create_mb/
Remove pci_doe_destroy_mb
Dan: Convert the statemachine to process tasks as work items
Define pci_doe_write_ctrl()
Introduce pci_doe_irq_enabled()
issue a stand alone abort
Don't go through the state machine for the abort. Just
poll/irq until the response comes back.
Remove Wait Abort state
A wait abort can just be triggered from outside and stop
the state machine from whatever loop it may be in.
Let the state machine issue the abort itself and wait
for it to return or not.
Remove Wait abort on error
Issue the abort directly before returning. Abort
failure will flag the MB dead.
Remove workqueue processing from state machine
clean up function locations in the file
Move abort flag/document it
React to an abort while aborting and bail. This will
mark the mailbox dead.
Convert task to a work item
Create a workqueue in the mailbox. Remove cur_task and
locking. Set DEAD when taking mailbox down.
print error on marking mailbox dead
Introduce signal_task_abort
flatten out the state machine
Changes from V9
Lukas Wunner
Update comments
Move private doe structures and defines from pci-doe.h to doe.c
check Data Obj Ready prior to last ack
Davidlohr
make task_lock a spinlock
Lukas/Jonathan
Remove special case of error in irq handler
Fix potential race with the scheduling of a task when one is ending.
The current task can't be retired until the state
machine is idle. Otherwise a new task work item may run
and the state machine would be out of sync.
Changes from V8
Remove Bjorn's ack
Expose a function to find the irq number for a mailbox based on
offset. This is the code Jonathan proposed for finding the irq
number here:
https://lore.kernel.org/linux-cxl/[email protected]/
This removes funky bool parameter to create.
Move pci_set_master() within the pci_doe_enable_irq()
Per Bjorn
Clean up commit messages
move pci-doe.c to doe.c
Clean up PCI spec references
Ensure all messages use pci_*()
Add offset to error messages to distinguish mailboxes
use hex for DOE offsets
Print 4 nibbles for Vendor ID and 2 for type.
s/irq/IRQ in comments
Fix long lines
Fix typos
Changes from V7
Add a Kconfig for this functionality
Fix bug in pci_doe_supports_prot()
Rebased on cxl-pending
Changes from V6
Clean up signed off by lines
Make this functionality all PCI library functions
Clean up header files
s/pci_doe_irq/pci_doe_irq_handler
Use pci_{request,free}_irq
Remove irq_name (maintained by pci_request_irq)
Fix checks to use an irq
Consistently use u16 for cap_offset
Cleanup kdocs and comments
Create a helper retire_cur_task() to handle locking of the
current task pointer.
Remove devm_ calls from PCI layer.
The devm_ calls do not allow for the pci_doe_mb objects
to be tied to an auxiliary device. Leave it to the
caller to use devm_ if desired.
From Dan Williams
s/cb/end_task/; Pass pci_doe_task to end_task
Clarify exchange/task/request/response.
Merge pci_doe_task and pci_doe_exchange into
pci_doe_task which represents a single
request/response task for the state machine to
process.
Simplify submitting work to the mailbox
Replace pci_doe_exchange_sync() with
pci_doe_submit_task() Consumers of the mailbox
are now responsible for setting up callbacks
within a task object and submitting them to the
mailbox to be processed.
Remove WARN_ON when task != NULL and be sure to abort that task.
Convert abort/dead to atomic flags
s/state_lock/task_lock to better define what the lock is
protecting
Remove all the auxiliary bus code from the PCI layer
The PCI layer provides helpers to use the DOE
Mailboxes. Each subsystem can then use the
helpers as they see fit. The CXL layer in this
series uses aux devices to manage the new
pci_doe_mb objects.
From Bjorn
Clarify the fact that DOE mailboxes are capabilities of
the device.
Code clean ups
Cleanup Makefile
Update references to PCI SIG spec v6.0
Move this attribution here:
This code is based on Jonathan's V4 series here:
https://lore.kernel.org/linux-cxl/[email protected]/
Changes from V5
From Bjorn
s/pci_WARN/pci_warn
Add timeout period to print
Trim to 80 chars
Use Tabs for DOE define spacing
Use %#x for clarity
From Jonathan
Addresses concerns about the order of unwinding stuff
s/doe/doe_dev in pci_doe_exchange_sync
Correct kernel Doc comment
Move pci_doe_task_complete() down in the file.
Rework pci_doe_irq()
process STATUS_ERROR first
Return IRQ_NONE if the irq is not processed
Use PCI_DOE_STATUS_INT_STATUS explicitly to
clear the irq
Clean up goto label s/err_free_irqs/err_free_irq
use devm_kzalloc for doe struct
clean up error paths in pci_doe_probe
s/pci_doe_drv/pci_doe
remove include mutex.h
remove device name and define, move it in the next patch which uses it
use devm_kasprintf() for irq_name
use devm_request_irq()
remove pci_doe_unregister()
[get/put]_device() were unneeded and with the use of
devm_* this function can be removed completely.
refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq
make this function just a registration of the irq and
move pci_doe_abort() into pci_doe_probe()
use devm_* to allocate the protocol array
Changes from Jonathan's V4
Move the DOE MB code into the DOE auxiliary driver
Remove Task List in favor of a wait queue
Changes from Ben
remove CXL references
propagate rc from pci functions on error
---
.clang-format | 1 +
drivers/pci/Kconfig | 3 +
drivers/pci/Makefile | 1 +
drivers/pci/doe.c | 689 ++++++++++++++++++++++++++++++++++
include/linux/pci-doe.h | 81 ++++
include/uapi/linux/pci_regs.h | 29 +-
6 files changed, 803 insertions(+), 1 deletion(-)
create mode 100644 drivers/pci/doe.c
create mode 100644 include/linux/pci-doe.h
diff --git a/.clang-format b/.clang-format
index fa959436bcfd..7bebb066f2a2 100644
--- a/.clang-format
+++ b/.clang-format
@@ -420,6 +420,7 @@ ForEachMacros:
- 'of_property_for_each_string'
- 'of_property_for_each_u32'
- 'pci_bus_for_each_resource'
+ - 'pci_doe_for_each_off'
- 'pcl_for_each_chunk'
- 'pcl_for_each_segment'
- 'pcm_for_each_format'
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 133c73207782..b2f2e588a817 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -121,6 +121,9 @@ config XEN_PCIDEV_FRONTEND
config PCI_ATS
bool
+config PCI_DOE
+ bool
+
config PCI_ECAM
bool
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 0da6b1ebc694..2680e4c92f0a 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM) += ecam.o
obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
obj-$(CONFIG_VGA_ARB) += vgaarb.o
+obj-$(CONFIG_PCI_DOE) += doe.o
# Endpoint library must be initialized before its users
obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
new file mode 100644
index 000000000000..4a7a1e988124
--- /dev/null
+++ b/drivers/pci/doe.c
@@ -0,0 +1,689 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Object Exchange
+ * PCIe r6.0, sec 6.30 DOE
+ *
+ * Copyright (C) 2021 Huawei
+ * Jonathan Cameron <[email protected]>
+ *
+ * Copyright (C) 2022 Intel Corporation
+ * Ira Weiny <[email protected]>
+ */
+
+#define dev_fmt(fmt) "DOE: " fmt
+
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/jiffies.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci-doe.h>
+#include <linux/workqueue.h>
+
+#define PCI_DOE_PROTOCOL_DISCOVERY 0
+
+#define PCI_DOE_BUSY_MAX_RETRIES 16
+#define PCI_DOE_POLL_INTERVAL (HZ / 128)
+
+/* Timeout of 1 second from PCIe r6.0, sec 6.30.2 Operation */
+#define PCI_DOE_TIMEOUT HZ
+
+#define PCI_DOE_FLAG_ABORT 0
+#define PCI_DOE_FLAG_IRQ 1
+#define PCI_DOE_FLAG_DEAD 2
+
+/**
+ * struct pci_doe_mb - State for a single DOE mailbox
+ *
+ * This state is used to manage a single DOE mailbox capability. All fields
+ * should be considered opaque to the consumers and the structure passed into
+ * the helpers below after being created by pcim_doe_create_mb()
+ *
+ * @pdev: PCI device this mailbox belongs to
+ * @cap_offset: Capability offset
+ * @int_msg_num: DOE Interrupt Message Number; negative if irqs are not used
+ * @prots: Array of protocols supported (encoded as long values)
+ * @wq: Wait queue for work items awaiting irq/abort
+ * @work_queue: Queue of pci_doe_work items
+ * @flags: Bit array of PCI_DOE_FLAG_* flags
+ *
+ * Note: @prots can't be allocated with struct size because the number of
+ * protocols is not known until after this structure is in use. However, the
+ * single discovery protocol is always required to query for the number of
+ * protocols.
+ */
+struct pci_doe_mb {
+ struct pci_dev *pdev;
+ u16 cap_offset;
+ int int_msg_num;
+ struct xarray prots;
+
+ wait_queue_head_t wq;
+ struct workqueue_struct *work_queue;
+ unsigned long flags;
+};
+
+static irqreturn_t pci_doe_irq_handler(int irq, void *data)
+{
+ struct pci_doe_mb *doe_mb = data;
+ struct pci_dev *pdev = doe_mb->pdev;
+ int offset = doe_mb->cap_offset;
+ u32 val;
+
+ pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+ if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
+ pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
+ PCI_DOE_STATUS_INT_STATUS);
+ set_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags);
+ wake_up(&doe_mb->wq);
+ return IRQ_HANDLED;
+ }
+
+ return IRQ_NONE;
+}
+
+static bool pci_doe_irq_enabled(struct pci_doe_mb *doe_mb)
+{
+ return doe_mb->int_msg_num >= 0;
+}
+
+static void pci_doe_abort(struct pci_doe_mb *doe_mb)
+{
+ set_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags);
+ wake_up(&doe_mb->wq);
+}
+
+/*
+ * Returned from the wait functions to indicate that an abort has been issued
+ */
+#define DOE_WAIT_ABORT 1
+
+static int pci_doe_arm_wait(struct pci_doe_mb *doe_mb)
+{
+ if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags))
+ return DOE_WAIT_ABORT;
+ clear_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags);
+ return 0;
+}
+
+static int pci_doe_wait_poll(struct pci_doe_mb *doe_mb, unsigned long timeout)
+{
+ if (wait_event_timeout(doe_mb->wq,
+ test_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags),
+ timeout))
+ return DOE_WAIT_ABORT;
+ return 0;
+}
+
+static int pci_doe_wait_irq_or_poll(struct pci_doe_mb *doe_mb)
+{
+ if (pci_doe_irq_enabled(doe_mb)) {
+ wait_event_timeout(doe_mb->wq,
+ test_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags) ||
+ test_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags),
+ PCI_DOE_TIMEOUT);
+ if (test_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags))
+ return DOE_WAIT_ABORT;
+ return 0;
+ }
+
+ return pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL);
+}
+
+static void pci_doe_write_ctrl(struct pci_doe_mb *doe_mb, u32 val)
+{
+ struct pci_dev *pdev = doe_mb->pdev;
+ int offset = doe_mb->cap_offset;
+
+ if (pci_doe_irq_enabled(doe_mb))
+ val |= PCI_DOE_CTRL_INT_EN;
+ pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+}
+
+static int pci_doe_issue_abort(struct pci_doe_mb *doe_mb)
+{
+ struct pci_dev *pdev = doe_mb->pdev;
+ int offset = doe_mb->cap_offset;
+ unsigned long timeout_jiffies;
+
+ pci_dbg(pdev, "[%x] Issuing Abort\n", offset);
+
+ /*
+ * Abort detected while aborting; something is really broken or the
+ * mailbox is being destroyed.
+ */
+ if (pci_doe_arm_wait(doe_mb))
+ return -EIO;
+
+ timeout_jiffies = jiffies + PCI_DOE_TIMEOUT;
+ pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_ABORT);
+
+ do {
+ u32 val;
+
+ /*
+ * Abort detected while aborting; something is really broken or
+ * the mailbox is being destroyed.
+ */
+ if (pci_doe_wait_irq_or_poll(doe_mb))
+ return -EIO;
+ pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+ /* Abort success! */
+ if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
+ !FIELD_GET(PCI_DOE_STATUS_BUSY, val))
+ return 0;
+
+ } while (!time_after(jiffies, timeout_jiffies));
+
+ /* Abort has timed out and the MB is dead */
+ pci_err(pdev, "[%x] ABORT timed out\n", offset);
+ return -EIO;
+}
+
+static int pci_doe_send_req(struct pci_doe_mb *doe_mb,
+ struct pci_doe_task *task)
+{
+ struct pci_dev *pdev = doe_mb->pdev;
+ int offset = doe_mb->cap_offset;
+ u32 val;
+ int i;
+
+ /*
+ * Check the DOE busy bit is not set. If it is set, this could indicate
+ * someone other than Linux (e.g. firmware) is using the mailbox. Note
+ * it is expected that firmware and OS will negotiate access rights via
+ * an, as yet to be defined method.
+ */
+ pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+ if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
+ return -EBUSY;
+
+ if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+ return -EIO;
+
+ /* Write DOE Header */
+ val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, task->prot.vid) |
+ FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, task->prot.type);
+ pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
+ /* Length is 2 DW of header + length of payload in DW */
+ pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+ FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
+ 2 + task->request_pl_sz /
+ sizeof(u32)));
+ for (i = 0; i < task->request_pl_sz / sizeof(u32); i++)
+ pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+ task->request_pl[i]);
+
+ pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_GO);
+
+ /* Request is sent - now wait for poll or IRQ */
+ return 0;
+}
+
+static bool pci_doe_data_obj_ready(struct pci_doe_mb *doe_mb)
+{
+ struct pci_dev *pdev = doe_mb->pdev;
+ int offset = doe_mb->cap_offset;
+ u32 val;
+
+ pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+ if (FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val))
+ return true;
+ return false;
+}
+
+static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
+{
+ struct pci_dev *pdev = doe_mb->pdev;
+ int offset = doe_mb->cap_offset;
+ size_t length, payload_length;
+ u32 val;
+ int i;
+
+ /* Read the first dword to get the protocol */
+ pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+ if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) ||
+ (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) {
+ pci_err(pdev,
+ "[%x] expected [VID, Protocol] = [%04x, %02x], got [%04x, %02x]\n",
+ doe_mb->cap_offset,
+ task->prot.vid, task->prot.type,
+ FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
+ FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
+ return -EIO;
+ }
+
+ pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+ /* Read the second dword to get the length */
+ pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+ pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+
+ length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
+ if (length > SZ_1M || length < 2)
+ return -EIO;
+
+ /* First 2 dwords have already been read */
+ length -= 2;
+ payload_length = min(length, task->response_pl_sz / sizeof(u32));
+ /* Read the rest of the response payload */
+ for (i = 0; i < payload_length; i++) {
+ pci_read_config_dword(pdev, offset + PCI_DOE_READ,
+ &task->response_pl[i]);
+ /* Prior to the last ack, ensure Data Object Ready */
+ if (i == (payload_length - 1) && !pci_doe_data_obj_ready(doe_mb))
+ return -EIO;
+ pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+ }
+
+ /* Flush excess length */
+ for (; i < length; i++) {
+ pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+ pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+ }
+
+ /* Final error check to pick up on any error raised since Data Object Ready */
+ pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+ if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+ return -EIO;
+
+ return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
+}
+
+static void signal_task_complete(struct pci_doe_task *task, int rv)
+{
+ task->rv = rv;
+ task->complete(task);
+}
+
+static void signal_task_abort(struct pci_doe_task *task, int rv)
+{
+ struct pci_doe_mb *doe_mb = task->doe_mb;
+ struct pci_dev *pdev = doe_mb->pdev;
+
+ if (pci_doe_issue_abort(doe_mb)) {
+ /* On failure set the MB dead - no more submissions */
+ pci_err(pdev, "[%x] Abort failed, marking mailbox dead\n",
+ doe_mb->cap_offset);
+ set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
+ }
+ signal_task_complete(task, rv);
+}
+
+static void doe_statemachine_work(struct work_struct *work)
+{
+ struct pci_doe_task *task = container_of(work, struct pci_doe_task,
+ work);
+ struct pci_doe_mb *doe_mb = task->doe_mb;
+ struct pci_dev *pdev = doe_mb->pdev;
+ int offset = doe_mb->cap_offset;
+ unsigned int busy_retries = 0;
+ unsigned long timeout_jiffies;
+ u32 val;
+ int rc;
+
+ if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
+ signal_task_complete(task, -EIO);
+ return;
+ }
+
+ /* Send request */
+retry_req:
+ if (pci_doe_arm_wait(doe_mb)) {
+ signal_task_abort(task, -EIO);
+ return;
+ }
+
+ rc = pci_doe_send_req(doe_mb, task);
+
+ /*
+ * The specification does not provide any guidance on how long
+ * some other entity could keep the DOE busy, so try for 1
+ * second then fail. Busy handling is best effort only, because
+ * there is no way of avoiding racing against another user of
+ * the DOE.
+ */
+ if (rc == -EBUSY) {
+ busy_retries++;
+ if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
+ pci_warn(pdev,
+ "[%x] busy for too long (> 1 sec)\n",
+ offset);
+ signal_task_complete(task, rc);
+ return;
+ }
+ if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
+ signal_task_abort(task, rc);
+ return;
+ }
+ goto retry_req;
+ } else if (rc) {
+ signal_task_abort(task, rc);
+ return;
+ }
+
+ timeout_jiffies = jiffies + HZ;
+ if (pci_doe_wait_irq_or_poll(doe_mb)) {
+ signal_task_abort(task, -EIO);
+ return;
+ }
+
+ /* Poll for response */
+retry_resp:
+ if (pci_doe_arm_wait(doe_mb)) {
+ signal_task_abort(task, -EIO);
+ return;
+ }
+
+ pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+ if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+ signal_task_abort(task, -EIO);
+ return;
+ }
+
+ if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
+ if (time_after(jiffies, timeout_jiffies)) {
+ signal_task_abort(task, -ETIMEDOUT);
+ return;
+ }
+ if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) {
+ signal_task_abort(task, -EIO);
+ return;
+ }
+ goto retry_resp;
+ }
+
+ rc = pci_doe_recv_resp(doe_mb, task);
+ if (rc < 0) {
+ signal_task_abort(task, rc);
+ return;
+ }
+
+ signal_task_complete(task, rc);
+}
+
+static void pci_doe_task_complete(struct pci_doe_task *task)
+{
+ complete(task->private);
+}
+
+static int pci_doe_discovery(struct pci_doe_mb *doe_mb, u8 *index, u16 *vid,
+ u8 *protocol)
+{
+ u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
+ *index);
+ u32 response_pl;
+ DECLARE_COMPLETION_ONSTACK(c);
+ struct pci_doe_task task = {
+ .prot.vid = PCI_VENDOR_ID_PCI_SIG,
+ .prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
+ .request_pl = &request_pl,
+ .request_pl_sz = sizeof(request_pl),
+ .response_pl = &response_pl,
+ .response_pl_sz = sizeof(response_pl),
+ .complete = pci_doe_task_complete,
+ .private = &c,
+ };
+ int ret;
+
+ ret = pci_doe_submit_task(doe_mb, &task);
+ if (ret < 0)
+ return ret;
+
+ wait_for_completion(&c);
+
+ if (task.rv != sizeof(response_pl))
+ return -EIO;
+
+ *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
+ *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
+ response_pl);
+ *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
+ response_pl);
+
+ return 0;
+}
+
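+/*
+ * Encode a (vid, type) pair as a single xarray entry value. The same
+ * encoding is used when matching entries in pci_doe_supports_prot().
+ */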
+static void *pci_doe_xa_entry(u16 vid, u8 prot)
+{
+ return (void *)(((unsigned long)vid << 16) | prot);
+}
+
+static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
+{
+ u8 index = 0;
+ u8 xa_idx = 0;
+
+ do {
+ int rc;
+ u16 vid;
+ u8 prot;
+
+ rc = pci_doe_discovery(doe_mb, &index, &vid, &prot);
+ if (rc)
+ return rc;
+
+ pci_dbg(doe_mb->pdev,
+ "[%x] Found protocol %d vid: %x prot: %x\n",
+ doe_mb->cap_offset, xa_idx, vid, prot);
+
+ rc = xa_insert(&doe_mb->prots, xa_idx++,
+ pci_doe_xa_entry(vid, prot), GFP_KERNEL);
+ if (rc)
+ return -ENOMEM;
+ } while (index);
+
+ return 0;
+}
+
+static int pci_doe_enable_irq(struct pci_doe_mb *doe_mb,
+ unsigned int int_msg_num)
+{
+ struct pci_dev *pdev = doe_mb->pdev;
+ int offset = doe_mb->cap_offset;
+ int rc;
+
+ /*
+ * Enabling bus mastering is required for MSI/MSIx. It is safe to call
+ * this multiple times and thus is called here to ensure that mastering
+ * is enabled even if the driver has done so.
+ */
+ pci_set_master(pdev);
+ rc = pci_request_irq(pdev, int_msg_num, pci_doe_irq_handler, NULL,
+ doe_mb, "DOE[%x %s]", offset, pci_name(pdev));
+ if (rc)
+ return rc;
+
+ doe_mb->int_msg_num = int_msg_num;
+ pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_INT_EN);
+
+ return 0;
+}
+
+static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
+{
+ if (doe_mb->work_queue)
+ destroy_workqueue(doe_mb->work_queue);
+ if (pci_doe_irq_enabled(doe_mb))
+ pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb);
+ xa_destroy(&doe_mb->prots);
+ kfree(doe_mb);
+}
+
+/**
+ * pci_doe_get_int_msg_num() - Return the interrupt message number for the
+ * mailbox at offset
+ *
+ * @pdev: The PCI device
+ * @offset: Offset of the DOE mailbox
+ *
+ * Returns: interrupt message number on success
+ * -errno if IRQs are not supported on this mailbox
+ */
+int pci_doe_get_int_msg_num(struct pci_dev *pdev, int offset)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
+ if (!FIELD_GET(PCI_DOE_CAP_INT_SUP, val))
+ return -ENXIO;
+
+ return FIELD_GET(PCI_DOE_CAP_INT_MSG_NUM, val);
+}
+EXPORT_SYMBOL_GPL(pci_doe_get_int_msg_num);
+
+static void pci_doe_destroy_mb(void *mb)
+{
+ struct pci_doe_mb *doe_mb = mb;
+
+ /* Mark going down */
+ set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
+
+ /* Abort any in progress work items */
+ pci_doe_abort(doe_mb);
+
+ /* Flush remaining work items */
+ flush_workqueue(doe_mb->work_queue);
+
+ pci_doe_free_mb(doe_mb);
+}
+
+/**
+ * pcim_doe_create_mb() - Create a DOE mailbox object
+ *
+ * @pdev: PCI device to create the DOE mailbox for
+ * @cap_offset: Offset of the DOE mailbox
+ * @int_msg_num: Interrupt message number to use; a negative value means don't
+ * use interrupts
+ *
+ * Create a single mailbox object to manage the mailbox protocol at the
+ * cap_offset specified.
+ *
+ * Caller should allocate PCI IRQ vectors before passing a positive value for
+ * int_msg_num.
+ *
+ * RETURNS: created mailbox object on success
+ * ERR_PTR(-errno) on failure
+ */
+struct pci_doe_mb *pcim_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
+ int int_msg_num)
+{
+ struct pci_doe_mb *doe_mb;
+ int rc;
+
+ doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
+ if (!doe_mb)
+ return ERR_PTR(-ENOMEM);
+
+ doe_mb->pdev = pdev;
+ doe_mb->int_msg_num = -1;
+ doe_mb->cap_offset = cap_offset;
+
+ xa_init(&doe_mb->prots);
+ init_waitqueue_head(&doe_mb->wq);
+
+ if (int_msg_num >= 0) {
+ rc = pci_doe_enable_irq(doe_mb, int_msg_num);
+ if (rc)
+ pci_err(pdev,
+ "[%x] enable requested IRQ (%d) failed : %d\n",
+ doe_mb->cap_offset, int_msg_num, rc);
+ }
+
+ doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0,
+ doe_mb->cap_offset);
+ if (!doe_mb->work_queue) {
+ pci_err(pdev, "[%x] failed to allocate work queue\n",
+ doe_mb->cap_offset);
+ pci_doe_free_mb(doe_mb);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* Reset the mailbox by issuing an abort */
+ rc = pci_doe_issue_abort(doe_mb);
+ if (rc) {
+ pci_err(pdev, "[%x] failed to reset : %d\n",
+ doe_mb->cap_offset, rc);
+ pci_doe_free_mb(doe_mb);
+ return ERR_PTR(rc);
+ }
+
+ if (devm_add_action_or_reset(&pdev->dev, pci_doe_destroy_mb, doe_mb))
+ return ERR_PTR(-EIO);
+
+ rc = pci_doe_cache_protocols(doe_mb);
+ if (rc) {
+ pci_err(pdev, "[%x] failed to cache protocols : %d\n",
+ doe_mb->cap_offset, rc);
+ return ERR_PTR(rc);
+ }
+
+ return doe_mb;
+}
+EXPORT_SYMBOL_GPL(pcim_doe_create_mb);
+
+/**
+ * pci_doe_supports_prot() - Return whether the DOE instance supports the
+ * given protocol
+ * @doe_mb: DOE mailbox capability to query
+ * @vid: Protocol Vendor ID
+ * @type: Protocol type
+ *
+ * RETURNS: True if the DOE mailbox supports the protocol specified
+ */
+bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type)
+{
+ unsigned long index;
+ void *entry;
+
+ /* The discovery protocol must always be supported */
+ if (vid == PCI_VENDOR_ID_PCI_SIG && type == PCI_DOE_PROTOCOL_DISCOVERY)
+ return true;
+
+ xa_for_each(&doe_mb->prots, index, entry)
+ if (entry == pci_doe_xa_entry(vid, type))
+ return true;
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
+
+/**
+ * pci_doe_submit_task() - Submit a task to be processed by the state machine
+ *
+ * @doe_mb: DOE mailbox capability to submit to
+ * @task: task to be queued
+ *
+ * Submit a DOE task (request/response) to the DOE mailbox to be processed.
+ * Returns as soon as the task has been queued; tasks are then processed,
+ * in submission order, by the mailbox's ordered workqueue.
+ *
+ * task->complete will be called when the state machine is done processing this
+ * task.
+ *
+ * Excess response data (beyond the size of @task->response_pl) is read from
+ * the mailbox and discarded.
+ *
+ * RETURNS: 0 when task has been successfully queued, -ERRNO on error
+ */
+int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
+{
+ if (!pci_doe_supports_prot(doe_mb, task->prot.vid, task->prot.type))
+ return -EINVAL;
+
+ /*
+ * DOE requests must be a whole number of DW
+ * and the response needs to be big enough for at least 1 DW
+ */
+ if (task->request_pl_sz % sizeof(u32) ||
+ task->response_pl_sz < sizeof(u32))
+ return -EINVAL;
+
+ if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
+ return -EIO;
+
+ task->doe_mb = doe_mb;
+ INIT_WORK(&task->work, doe_statemachine_work);
+ queue_work(doe_mb->work_queue, &task->work);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pci_doe_submit_task);
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
new file mode 100644
index 000000000000..805b58ff4016
--- /dev/null
+++ b/include/linux/pci-doe.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data Object Exchange
+ * PCIe r6.0, sec 6.30 DOE
+ *
+ * Copyright (C) 2021 Huawei
+ * Jonathan Cameron <[email protected]>
+ *
+ * Copyright (C) 2022 Intel Corporation
+ * Ira Weiny <[email protected]>
+ */
+
+#ifndef LINUX_PCI_DOE_H
+#define LINUX_PCI_DOE_H
+
+#include <linux/completion.h>
+
+struct pci_doe_protocol {
+ u16 vid;
+ u8 type;
+};
+
+struct pci_doe_mb;
+
+/**
+ * struct pci_doe_task - represents a single query/response
+ *
+ * @prot: DOE Protocol
+ * @request_pl: The request payload
+ * @request_pl_sz: Size of the request payload (bytes)
+ * @response_pl: The response payload
+ * @response_pl_sz: Size of the response payload (bytes)
+ * @rv: Return value. Length of received response or error (bytes)
+ * @complete: Called when task is complete
+ * @private: Private data for the consumer
+ * @work: Used internally by the mailbox
+ * @doe_mb: Used internally by the mailbox
+ *
+ * The payload sizes and rv are specified in bytes with the following
+ * restrictions concerning the protocol.
+ *
+ * 1) The request_pl_sz must be a multiple of double words (4 bytes)
+ * 2) The response_pl_sz must be >= a single double word (4 bytes)
+ * 3) rv is returned as bytes but it will be a multiple of double words
+ *
+ * NOTE there is no need for the caller to initialize work or doe_mb.
+ */
+struct pci_doe_task {
+ struct pci_doe_protocol prot;
+ u32 *request_pl;
+ size_t request_pl_sz;
+ u32 *response_pl;
+ size_t response_pl_sz;
+ int rv;
+ void (*complete)(struct pci_doe_task *task);
+ void *private;
+
+ /* No need for the user to initialize these fields */
+ struct work_struct work;
+ struct pci_doe_mb *doe_mb;
+};
+
+/**
+ * pci_doe_for_each_off - Iterate each DOE capability
+ * @pdev: struct pci_dev to iterate
+ * @off: u16 of config space offset of each mailbox capability found
+ */
+#define pci_doe_for_each_off(pdev, off) \
+ for (off = pci_find_next_ext_capability(pdev, off, \
+ PCI_EXT_CAP_ID_DOE); \
+ off > 0; \
+ off = pci_find_next_ext_capability(pdev, off, \
+ PCI_EXT_CAP_ID_DOE))
+
+int pci_doe_get_int_msg_num(struct pci_dev *pdev, int offset);
+struct pci_doe_mb *pcim_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
+ int int_msg_num);
+bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
+int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task);
+
+#endif
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index bee1a9ed6e66..9d50678f3f62 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -736,7 +736,8 @@
#define PCI_EXT_CAP_ID_DVSEC 0x23 /* Designated Vendor-Specific */
#define PCI_EXT_CAP_ID_DLF 0x25 /* Data Link Feature */
#define PCI_EXT_CAP_ID_PL_16GT 0x26 /* Physical Layer 16.0 GT/s */
-#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PL_16GT
+#define PCI_EXT_CAP_ID_DOE 0x2E /* Data Object Exchange */
+#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_DOE
#define PCI_EXT_CAP_DSN_SIZEOF 12
#define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
@@ -1102,4 +1103,30 @@
#define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK 0x000000F0
#define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT 4
+/* Data Object Exchange */
+#define PCI_DOE_CAP 0x04 /* DOE Capabilities Register */
+#define PCI_DOE_CAP_INT_SUP 0x00000001 /* Interrupt Support */
+#define PCI_DOE_CAP_INT_MSG_NUM 0x00000ffe /* Interrupt Message Number */
+#define PCI_DOE_CTRL 0x08 /* DOE Control Register */
+#define PCI_DOE_CTRL_ABORT 0x00000001 /* DOE Abort */
+#define PCI_DOE_CTRL_INT_EN 0x00000002 /* DOE Interrupt Enable */
+#define PCI_DOE_CTRL_GO 0x80000000 /* DOE Go */
+#define PCI_DOE_STATUS 0x0c /* DOE Status Register */
+#define PCI_DOE_STATUS_BUSY 0x00000001 /* DOE Busy */
+#define PCI_DOE_STATUS_INT_STATUS 0x00000002 /* DOE Interrupt Status */
+#define PCI_DOE_STATUS_ERROR 0x00000004 /* DOE Error */
+#define PCI_DOE_STATUS_DATA_OBJECT_READY 0x80000000 /* Data Object Ready */
+#define PCI_DOE_WRITE 0x10 /* DOE Write Data Mailbox Register */
+#define PCI_DOE_READ 0x14 /* DOE Read Data Mailbox Register */
+
+/* DOE Data Object - note not actually registers */
+#define PCI_DOE_DATA_OBJECT_HEADER_1_VID 0x0000ffff
+#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE 0x00ff0000
+#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH 0x0003ffff
+
+#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX 0x000000ff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID 0x0000ffff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL 0x00ff0000
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000
+
#endif /* LINUX_PCI_REGS_H */
--
2.35.3
On Mon, Jun 27, 2022 at 09:15:23PM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> Many binary attributes need to limit access to CAP_SYS_ADMIN only; ie
> many binary attributes specify is_visible with 0400 or 0600.
>
> Make setting the permissions of such attributes more explicit by
> defining BIN_ATTR_ADMIN_{RO,RW}.
>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
On Mon, 27 Jun 2022 21:15:21 -0700
[email protected] wrote:
> From: Jonathan Cameron <[email protected]>
>
> Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based
> mailbox with standard protocol discovery. Each mailbox is accessed
> through a DOE Extended Capability.
>
> Each DOE mailbox must support the DOE discovery protocol in addition to
> any number of additional protocols.
>
> Define core PCIe functionality to manage a single PCIe DOE mailbox at a
> defined config space offset. Functionality includes iterating,
> creating, query of supported protocol, and task submission. Destruction
> of the mailboxes is device managed.
>
> If interrupts are desired, the interrupt number can be queried and
> passed to the create function. Passing a negative value disables
> interrupts for that mailbox. It is the caller's responsibility to ensure
> enough interrupt vectors are allocated.
>
> Cc: "Li, Ming" <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Signed-off-by: Jonathan Cameron <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
Hi Ira,
Thanks for keeping at this!
I think this has reintroduced some of the races around that annoying
interrupt source from BUSY transitioning to low that has
no explicit 'cause' flag. I think we'd hammered all those out in the
previous version but maybe there were still some there...
I 'think' it will work as is, but depending on the timing a given DOE
implementation has, the interrupt may be completely pointless as it
will be signaling the wrong event.
Jonathan
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 0da6b1ebc694..2680e4c92f0a 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM) += ecam.o
> obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
> obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
> obj-$(CONFIG_VGA_ARB) += vgaarb.o
> +obj-$(CONFIG_PCI_DOE) += doe.o
>
> # Endpoint library must be initialized before its users
> obj-$(CONFIG_PCI_ENDPOINT) += endpoint/
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> new file mode 100644
> index 000000000000..4a7a1e988124
> --- /dev/null
> +++ b/drivers/pci/doe.c
> +/**
> + * struct pci_doe_mb - State for a single DOE mailbox
> + *
> + * This state is used to manage a single DOE mailbox capability. All fields
> + * should be considered opaque to the consumers and the structure passed into
> + * the helpers below after being created by pcim_doe_create_mb()
> + *
> + * @pdev: PCI device this mailbox belongs to
> + * @cap_offset: Capability offset
> + * @int_msg_num: DOE Interrupt Message Number; negative if irqs are not used
> + * @prots: Array of protocols supported (encoded as long values)
> + * @wq: Wait queue for work items awaiting irq/abort
> + * @work_queue: Queue of pci_doe_work items
> + * @flags: Bit array of PCI_DOE_FLAG_* flags
> + *
> + * Note: @prots can't be allocated with struct size because the number of
> + * protocols is not known until after this structure is in use. However, the
> + * single discovery protocol is always required to query for the number of
> + * protocols.
> + */
> +struct pci_doe_mb {
> + struct pci_dev *pdev;
> + u16 cap_offset;
> + int int_msg_num;
> + struct xarray prots;
> +
> + wait_queue_head_t wq;
> + struct workqueue_struct *work_queue;
> + unsigned long flags;
> +};
> +
> +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> +{
> + struct pci_doe_mb *doe_mb = data;
> + struct pci_dev *pdev = doe_mb->pdev;
> + int offset = doe_mb->cap_offset;
> + u32 val;
> +
> + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> + PCI_DOE_STATUS_INT_STATUS);
> + set_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags);
> + wake_up(&doe_mb->wq);
> + return IRQ_HANDLED;
> + }
> +
> + return IRQ_NONE;
> +}
> +
> +/*
> + * Returned from the wait functions to indicate that an abort has been issued
> + */
> +#define DOE_WAIT_ABORT 1
> +
> +static int pci_doe_arm_wait(struct pci_doe_mb *doe_mb)
Feels like there should be a naming to convey the return value as
a boolean rather than pushing through a flag value.
> +{
> + if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags))
> + return DOE_WAIT_ABORT;
> + clear_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags);
> + return 0;
> +}
> +
> +static void pci_doe_write_ctrl(struct pci_doe_mb *doe_mb, u32 val)
> +{
> + struct pci_dev *pdev = doe_mb->pdev;
> + int offset = doe_mb->cap_offset;
> +
> + if (pci_doe_irq_enabled(doe_mb))
> + val |= PCI_DOE_CTRL_INT_EN;
> + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +}
> +
> +static int pci_doe_issue_abort(struct pci_doe_mb *doe_mb)
Can we rename this as it does more than simply issue the abort,
it waits for it to finish
> +{
> + struct pci_dev *pdev = doe_mb->pdev;
> + int offset = doe_mb->cap_offset;
> + unsigned long timeout_jiffies;
> +
> + pci_dbg(pdev, "[%x] Issuing Abort\n", offset);
> +
> + /*
> + * Abort detected while aborting; something is really broken or the
> + * mailbox is being destroyed.
> + */
> + if (pci_doe_arm_wait(doe_mb))
> + return -EIO;
> +
> + timeout_jiffies = jiffies + PCI_DOE_TIMEOUT;
> + pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_ABORT);
> +
> + do {
> + u32 val;
> +
> + /*
> + * Abort detected while aborting; something is really broken or
> + * the mailbox is being destroyed.
> + */
> + if (pci_doe_wait_irq_or_poll(doe_mb))
> + return -EIO;
> + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> + /* Abort success! */
> + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> + !FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> + return 0;
> +
> + } while (!time_after(jiffies, timeout_jiffies));
> +
> + /* Abort has timed out and the MB is dead */
> + pci_err(pdev, "[%x] ABORT timed out\n", offset);
Does this print mention it's a DOE somewhere?
> + return -EIO;
> +}
> +
...
> +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> +{
> + struct pci_dev *pdev = doe_mb->pdev;
> + int offset = doe_mb->cap_offset;
> + size_t length, payload_length;
> + u32 val;
> + int i;
> +
> + /* Read the first dword to get the protocol */
> + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> + if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) ||
> + (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) {
> + pci_err(pdev,
> + "[%x] expected [VID, Protocol] = [%04x, %02x], got [%04x, %02x]\n",
> + doe_mb->cap_offset,
> + task->prot.vid, task->prot.type,
> + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> + return -EIO;
> + }
> +
> + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> + /* Read the second dword to get the length */
> + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +
> + length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> + if (length > SZ_1M || length < 2)
> + return -EIO;
> +
> + /* First 2 dwords have already been read */
> + length -= 2;
> + payload_length = min(length, task->response_pl_sz / sizeof(u32));
> + /* Read the rest of the response payload */
> + for (i = 0; i < payload_length; i++) {
> + pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> + &task->response_pl[i]);
> + /* Prior to the last ack, ensure Data Object Ready */
> + if (i == (payload_length-1) && !pci_doe_data_obj_ready(doe_mb))
spaces around -
> + return -EIO;
> + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> + }
> +
> + /* Flush excess length */
> + for (; i < length; i++) {
> + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> + }
> +
> + /* Final error check to pick up on any since Data Object Ready */
> + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> + return -EIO;
> +
> + return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
> +}
> +
> +
> +static void doe_statemachine_work(struct work_struct *work)
> +{
> + struct pci_doe_task *task = container_of(work, struct pci_doe_task,
> + work);
> + struct pci_doe_mb *doe_mb = task->doe_mb;
> + struct pci_dev *pdev = doe_mb->pdev;
> + int offset = doe_mb->cap_offset;
> + unsigned int busy_retries = 0;
> + unsigned long timeout_jiffies;
> + u32 val;
> + int rc;
> +
> + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> + signal_task_complete(task, -EIO);
> + return;
> + }
> +
> + /* Send request */
> +retry_req:
> + if (pci_doe_arm_wait(doe_mb)) {
> + signal_task_abort(task, -EIO);
> + return;
> + }
Is there a race here? If Busy drops at this point we queue up
a message, but IRQ bit is already set. Hence when we hit
wait_event_timeout() the flag is already set and IIRC we'll
drop straight through.
It'll probably be fine because it will end up polling below
but doesn't look ideal.
Upshot is that you sort of have to handle "spurious interrupts"
cleanly and rewait on the interrupt if you get one whilst also handling
race conditions around RW1C of the interrupt status flag.
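Something along these lines perhaps (untested sketch, reusing the
helpers from this patch; pci_doe_wait_ready() is a made-up name) so that
every wakeup re-reads the status register and a spurious interrupt just
means we wait again rather than being treated as "response ready":

	static int pci_doe_wait_ready(struct pci_doe_mb *doe_mb,
				      unsigned long timeout_jiffies)
	{
		struct pci_dev *pdev = doe_mb->pdev;
		int offset = doe_mb->cap_offset;
		u32 val;

		do {
			/* Abort detected; the mailbox is going down */
			if (pci_doe_arm_wait(doe_mb))
				return -EIO;

			pci_read_config_dword(pdev, offset + PCI_DOE_STATUS,
					      &val);
			if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
				return -EIO;
			if (FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val))
				return 0;

			/* Spurious wakeup (e.g. the BUSY drop): wait again */
			if (pci_doe_wait_irq_or_poll(doe_mb))
				return -EIO;
		} while (!time_after(jiffies, timeout_jiffies));

		return -ETIMEDOUT;
	}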
> +
> + rc = pci_doe_send_req(doe_mb, task);
> +
> + /*
> + * The specification does not provide any guidance on how long
> + * some other entity could keep the DOE busy, so try for 1
> + * second then fail. Busy handling is best effort only, because
> + * there is no way of avoiding racing against another user of
> + * the DOE.
> + */
> + if (rc == -EBUSY) {
> + busy_retries++;
> + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> + pci_warn(pdev,
> + "[%x] busy for too long (> 1 sec)\n",
> + offset);
> + signal_task_complete(task, rc);
> + return;
> + }
> + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
> + signal_task_abort(task, rc);
> + return;
> + }
> + goto retry_req;
> + } else if (rc) {
> + signal_task_abort(task, rc);
> + return;
> + }
> +
> + timeout_jiffies = jiffies + HZ;
> + if (pci_doe_wait_irq_or_poll(doe_mb)) {
So this may well be passed as a result of a BUSY transition to 0 very soon
after the doe_send_req but well before the data is ready....
> + signal_task_abort(task, -EIO);
> + return;
> + }
> +
> + /* Poll for response */
> +retry_resp:
> + if (pci_doe_arm_wait(doe_mb)) {
I think we can get here between Busy drop and Object Ready which means
this can get another IRQ_FLAG setting just after it. Does it matter?
Don't think so, as we don't use that bit again in this run through
and it will be cleared at beginning of next one, but if so why is
this call here? I think it's only useful for detecting an abort, if
so do that explicitly.
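i.e. something explicit like (sketch):

	/* Only an abort is of interest at this point */
	if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) {
		signal_task_abort(task, -EIO);
		return;
	}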
> + signal_task_abort(task, -EIO);
> + return;
> + }
> +
> + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> + signal_task_abort(task, -EIO);
> + return;
> + }
> +
> + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> + if (time_after(jiffies, timeout_jiffies)) {
> + signal_task_abort(task, -ETIMEDOUT);
> + return;
> + }
> + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) {
Following on from above....
As a result of the interrupt having fired on the BUSY off transition,
I think we will almost always end up spinning here until Object Ready
is set. Fine, but seems a shame we don't let an interrupt do this
for us in most cases. (note in QEMU response is instantaneous so
when the interrupt for Busy drop is set, object ready also true so
by the time we get here data ready will already be sent).
> + signal_task_abort(task, -EIO);
> + return;
> + }
> + goto retry_resp;
> + }
> +
> + rc = pci_doe_recv_resp(doe_mb, task);
> + if (rc < 0) {
> + signal_task_abort(task, rc);
> + return;
> + }
> +
> + signal_task_complete(task, rc);
> +}
> +
> +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> +{
> + if (doe_mb->work_queue)
I'm not a great fan of free functions that check a bunch of conditions
because they may be called before things are set up. To my
mind that generally means we should be calling individual cleanup
in the appropriate error handlers.
Either that or just use devm handling for each item. Sure
it's a few more lines of code, but I find it a lot easier to go
"Oh look, that thing we just set up is cleaned up by this."
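e.g. registering the cleanup as each item is created, something like
(rough sketch; pci_doe_destroy_workqueue() is a made-up helper):

	static void pci_doe_destroy_workqueue(void *wq)
	{
		destroy_workqueue(wq);
	}

and then in pcim_doe_create_mb(), right after the allocation succeeds:

	rc = devm_add_action_or_reset(&pdev->dev, pci_doe_destroy_workqueue,
				      doe_mb->work_queue);
	if (rc)
		return ERR_PTR(rc);

That way the later error paths never need to know what was set up
before them.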
> + destroy_workqueue(doe_mb->work_queue);
> + if (pci_doe_irq_enabled(doe_mb))
> + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb);
> + xa_destroy(&doe_mb->prots);
> + kfree(doe_mb);
> +}
> +
...
> +
> +static void pci_doe_destroy_mb(void *mb)
> +{
> + struct pci_doe_mb *doe_mb = mb;
> +
> + /* Mark going down */
> + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> +
> + /* Abort any in progress work items */
> + pci_doe_abort(doe_mb);
Abort is getting used for two things in here. Perhaps
rename this one to
pci_doe_abort_tasks() or something like that?
> +
> + /* Flush remaining work items */
> + flush_workqueue(doe_mb->work_queue);
> +
> + pci_doe_free_mb(doe_mb);
> +}
> +
> +/**
> + * pcim_doe_create_mb() - Create a DOE mailbox object
> + *
> + * @pdev: PCI device to create the DOE mailbox for
> + * @cap_offset: Offset of the DOE mailbox
> + * @int_msg_num: Interrupt message number to use; a negative value means don't
> + * use interrupts
> + *
> + * Create a single mailbox object to manage the mailbox protocol at the
> + * cap_offset specified.
> + *
> + * Caller should allocate PCI IRQ vectors before passing a possitive value for
positive
> + * int_msg_num.
> + *
> + * RETURNS: created mailbox object on success
> + * ERR_PTR(-errno) on failure
> + */
> +struct pci_doe_mb *pcim_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> + int int_msg_num)
> +{
> + struct pci_doe_mb *doe_mb;
> + int rc;
> +
> + doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
> + if (!doe_mb)
> + return ERR_PTR(-ENOMEM);
> +
> + doe_mb->pdev = pdev;
> + doe_mb->int_msg_num = -1;
> + doe_mb->cap_offset = cap_offset;
> +
> + xa_init(&doe_mb->prots);
> + init_waitqueue_head(&doe_mb->wq);
> +
> + if (int_msg_num >= 0) {
> + rc = pci_doe_enable_irq(doe_mb, int_msg_num);
> + if (rc)
> + pci_err(pdev,
> + "[%x] enable requested IRQ (%d) failed : %d\n",
> + doe_mb->cap_offset, int_msg_num, rc);
If we are printing an error, I'd argue we should not continue.
Or at the very least we should add a comment here to say why we should do so...
> + }
> +
> + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0,
> + doe_mb->cap_offset);
> + if (!doe_mb->work_queue) {
> + pci_err(pdev, "[%x] failed to allocate work queue\n",
> + doe_mb->cap_offset);
> + pci_doe_free_mb(doe_mb);
As above, I'd rather this explicitly freed what has been set up
and only that rather than calling a free function that does a bunch of
stuff conditionally.
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + /* Reset the mailbox by issuing an abort */
> + rc = pci_doe_issue_abort(doe_mb);
> + if (rc) {
> + pci_err(pdev, "[%x] failed to reset : %d\n",
> + doe_mb->cap_offset, rc);
> + pci_doe_free_mb(doe_mb);
> + return ERR_PTR(rc);
> + }
> +
> + if (devm_add_action_or_reset(&pdev->dev, pci_doe_destroy_mb, doe_mb))
> + return ERR_PTR(-EIO);
> +
> + rc = pci_doe_cache_protocols(doe_mb);
> + if (rc) {
> + pci_err(pdev, "[%x] failed to cache protocols : %d\n",
> + doe_mb->cap_offset, rc);
> + return ERR_PTR(rc);
> + }
> +
> + return doe_mb;
> +}
> +EXPORT_SYMBOL_GPL(pcim_doe_create_mb);
> +
On Mon, 27 Jun 2022 21:15:22 -0700
[email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> DOE mailbox objects will be needed for various mailbox communications
> with each memory device.
>
> Iterate each DOE mailbox capability and create PCI DOE mailbox objects
> as found.
>
> It is not anticipated that this is the final resting place for the
> iteration of the DOE devices. The support of switch ports will drive
> this code into the PCIe side. In this imagined architecture the CXL
> port driver would then query into the PCI device for the DOE mailbox
> array.
>
> For now creating the mailboxes in the CXL port is good enough for the
> endpoints. Later PCIe ports will need to support this to support switch
> ports more generically.
>
> Cc: Dan Williams <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Lukas Wunner <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
My main comment on this is that we should not paper over any errors in
DOE setup. Those indicate a bug or hardware fault, so like anything similar
we should at the very least document why it makes sense to continue. In most
cases I'd argue it doesn't as something is very wrong.
>
> ---
> Changes from V11:
> Drop review from: Ben Widawsky <[email protected]>
> Remove irq code for now
> Adjust for pci_doe_get_int_msg_num()
> Adjust for pcim_doe_create_mb()
> (No longer need to handle the destroy.)
> Use xarray for DOE mailbox array
>
> Changes from V9:
> Bug fix: ensure DOE mailboxes are iterated before memdev add
> Ben Widawsky
> Set use_irq to false and just return on error.
> Don't return a value from devm_cxl_pci_create_doe()
> Skip allocating doe_mb array if there are no mailboxes
> Skip requesting irqs if none found.
> Ben/Jonathan Cameron
> s/num_irqs/max_irqs
>
> Changes from V8:
> Move PCI_DOE selection to CXL_BUS to support future patches
> which move queries into the port code.
> Remove Auxiliary device arch
> Squash the functionality of the auxiliary driver into this
> patch.
> Split out the irq handling a bit.
>
> Changes from V7:
> Minor code clean ups
> Rebased on cxl-pending
>
> Changes from V6:
> Move all the auxiliary device stuff to the CXL layer
>
> Changes from V5:
> Split the CXL specific stuff off from the PCI DOE create
> auxiliary device code.
> ---
> drivers/cxl/Kconfig | 1 +
> drivers/cxl/cxlmem.h | 3 +++
> drivers/cxl/pci.c | 37 +++++++++++++++++++++++++++++++++++++
> 3 files changed, 41 insertions(+)
>
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index f64e3984689f..7adaaf80b302 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -2,6 +2,7 @@
> menuconfig CXL_BUS
> tristate "CXL (Compute Express Link) Devices Support"
> depends on PCI
> + select PCI_DOE
> help
> CXL is a bus that is electrically compatible with PCI Express, but
> layers three protocols on that signalling (CXL.io, CXL.cache, and
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 60d10ee1e7fc..360f282ef80c 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -191,6 +191,7 @@ struct cxl_endpoint_dvsec_info {
> * @component_reg_phys: register base of component registers
> * @info: Cached DVSEC information about the device.
> * @serial: PCIe Device Serial Number
> + * @doe_mbs: PCI DOE mailbox array
> * @mbox_send: @dev specific transport for transmitting mailbox commands
> *
> * See section 8.2.9.5.2 Capacity Configuration and Label Storage for
> @@ -224,6 +225,8 @@ struct cxl_dev_state {
> resource_size_t component_reg_phys;
> u64 serial;
>
> + struct xarray doe_mbs;
> +
> int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
> };
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 5a0ae46d4989..5821e6c1253b 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -8,6 +8,7 @@
> #include <linux/mutex.h>
> #include <linux/list.h>
> #include <linux/pci.h>
> +#include <linux/pci-doe.h>
> #include <linux/io.h>
> #include "cxlmem.h"
> #include "cxlpci.h"
> @@ -386,6 +387,37 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
> return rc;
> }
>
> +static void cxl_pci_destroy_doe(void *mbs)
> +{
> + struct xarray *xa = mbs;
Local variable doesn't add anything...
> +
> + xa_destroy(xa);
> +}
> +
> +static void devm_cxl_pci_create_doe(struct cxl_dev_state *cxlds)
> +{
> + struct device *dev = cxlds->dev;
> + struct pci_dev *pdev = to_pci_dev(dev);
> + u16 off = 0;
> +
> + pci_doe_for_each_off(pdev, off) {
> + struct pci_doe_mb *doe_mb;
> +
> + doe_mb = pcim_doe_create_mb(pdev, off, -1);
> + if (IS_ERR(doe_mb)) {
> + pci_err(pdev,
> + "Failed to create MB object for MB @ %x\n",
> + off);
Definitely at least need a comment for why papering over this failure is
fine. My gut feeling is we shouldn't ignore it.
> + doe_mb = NULL;
> + }
> +
> + if (xa_insert(&cxlds->doe_mbs, off, doe_mb, GFP_KERNEL))
> + break;
If we hit that break something has gone horribly wrong and we shouldn't
paper over it either. We might have a partial list of DOEs and callers
after this will have no way of knowing it isn't the full list.
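i.e. let the failure propagate, something like (sketch):

	static int devm_cxl_pci_create_doe(struct cxl_dev_state *cxlds)
	{
		struct pci_dev *pdev = to_pci_dev(cxlds->dev);
		u16 off = 0;

		pci_doe_for_each_off(pdev, off) {
			struct pci_doe_mb *doe_mb;
			int rc;

			doe_mb = pcim_doe_create_mb(pdev, off, -1);
			if (IS_ERR(doe_mb))
				return PTR_ERR(doe_mb);

			rc = xa_insert(&cxlds->doe_mbs, off, doe_mb,
				       GFP_KERNEL);
			if (rc)
				return rc;

			pci_dbg(pdev, "Created DOE mailbox @%x\n", off);
		}

		return 0;
	}

with cxl_pci_probe() then failing on a non-zero return.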
> +
> + pci_dbg(pdev, "Created DOE mailbox @%x\n", off);
> + }
> +}
> +
> static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> {
> struct cxl_register_map map;
> @@ -408,6 +440,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (IS_ERR(cxlds))
> return PTR_ERR(cxlds);
>
> + xa_init(&cxlds->doe_mbs);
> + devm_add_action(&pdev->dev, cxl_pci_destroy_doe, &cxlds->doe_mbs);
_or_reset()? If the devm registration itself fails we want to bail out cleanly.
It's vanishingly unlikely to happen, but we should still handle that case.
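i.e. (sketch):

	rc = devm_add_action_or_reset(&pdev->dev, cxl_pci_destroy_doe,
				      &cxlds->doe_mbs);
	if (rc)
		return rc;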
> +
> cxlds->serial = pci_get_dsn(pdev);
> cxlds->cxl_dvsec = pci_find_dvsec_capability(
> pdev, PCI_DVSEC_VENDOR_ID_CXL, CXL_DVSEC_PCIE_DEVICE);
> @@ -434,6 +469,8 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>
> cxlds->component_reg_phys = cxl_regmap_to_base(pdev, &map);
>
> + devm_cxl_pci_create_doe(cxlds);
> +
> rc = cxl_pci_setup_mailbox(cxlds);
> if (rc)
> return rc;
On Mon, 27 Jun 2022 21:15:27 -0700
[email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> CXL Ports with memory devices attached need the information from the
> Device Scoped Memory Affinity Structure (DSMAS). This information is
> contained within the CDAT table buffer which is cached in the port
> device.
>
> If CDAT data is available, parse and cache DSMAS data from the table.
> Store this data in unmarshaled struct dsmas data structures for ease of
> use later. Ensure DSMAS headers are not malicious or ill formed so as
> to cause buffer overflow errors.
>
> Signed-off-by: Ira Weiny <[email protected]>
>
Looks good to me. Was kind of expecting yet another xarray :)
Reviewed-by: Jonathan Cameron <[email protected]>
On Mon, 27 Jun 2022 21:15:25 -0700
[email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> The CDAT data is protected by a checksum and should be the proper
> length.
>
> Introduce cxl_cdat_valid() to validate the data. While at it check and
> store the sequence number.
>
> Signed-off-by: Ira Weiny <[email protected]>
>
Minor ordering comment. With that tidied up
Reviewed-by: Jonathan Cameron <[email protected]>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 4bd479ec0253..6d775cc3dca1 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -532,6 +532,40 @@ static int cxl_cdat_get_length(struct device *dev,
> return rc;
> }
>
> +static bool cxl_cdat_valid(struct device *dev, struct cxl_cdat *cdat)
> +{
> + u32 *table = cdat->table;
> + u8 *data8 = cdat->table;
> + u32 length, seq;
> + u8 check;
> + int i;
> +
> + length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
> + if ((length < CDAT_HEADER_LENGTH_BYTES) || (length > cdat->length)) {
> + dev_err(dev, "CDAT Invalid length %u (%zu-%zu)\n", length,
> + CDAT_HEADER_LENGTH_BYTES, cdat->length);
> + return false;
> + }
> +
> + for (check = 0, i = 0; i < length; i++)
> + check += data8[i];
> +
> + dev_dbg(dev, "CDAT length %u CS %u\n", length, check);
> + if (check != 0) {
> + dev_err(dev, "CDAT Invalid checksum %u\n", check);
> + return false;
> + }
> +
> + seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
> + /* Store the sequence for now. */
> + if (cdat->seq != seq) {
> + dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
> + cdat->seq = seq;
> + }
> +
> + return true;
> +}
> +
> static int cxl_cdat_read_table(struct device *dev,
> struct pci_doe_mb *cdat_mb,
> struct cxl_cdat *cdat)
> @@ -579,6 +613,8 @@ static int cxl_cdat_read_table(struct device *dev,
>
> } while (entry_handle != CXL_DOE_TABLE_ACCESS_LAST_ENTRY);
>
> + if (!rc && !cxl_cdat_valid(dev, cdat))
> + return -EIO;
I'd prefer those handled separately as the flow is more readable if error
handling is always out of line:

	if (rc)
		return rc;

	if (!cxl_cdat_valid(dev, cdat))
		return -EIO;

	return 0;
> return rc;
> }
>
On Mon, 27 Jun 2022 21:15:26 -0700
[email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> The CDAT read may fail for a number of reasons but mainly it is possible
> to get different parts of a valid state. The checksum in the CDAT table
> protects against this.
>
> Now that the cdat data is validated, issue a retry if the CDAT read
> fails. For now 5 retries are implemented.
>
> Reviewed-by: Ben Widawsky <[email protected]>
> Reviewed-by: Alison Schofield <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from V10
> Pick up review tag and fix commit message
>
> Changes from V9
> Alison Schofield/Davidlohr Bueso
> Print debug on each iteration and error only after failure
>
> Changes from V8
> Move code to cxl/core/pci.c
>
> Changes from V6
> Move to pci.c
> Fix retries count
> Change to 5 retries
>
> Changes from V5:
> New patch -- easy to push off or drop.
> ---
> drivers/cxl/core/pci.c | 41 ++++++++++++++++++++++++++++++-----------
> 1 file changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 6d775cc3dca1..d7c2a415cc5f 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -618,36 +618,30 @@ static int cxl_cdat_read_table(struct device *dev,
> return rc;
> }
>
> -/**
> - * read_cdat_data - Read the CDAT data on this port
> - * @port: Port to read data from
> - *
> - * This call will sleep waiting for responses from the DOE mailbox.
> - */
> -void read_cdat_data(struct cxl_port *port)
> +static int __read_cdat_data(struct cxl_port *port)
> {
> static struct pci_doe_mb *cdat_mb;
> struct device *dev = &port->dev;
> struct device *uport = port->uport;
> size_t cdat_length;
> - int ret;
> + int ret = 0;
Fairly sure there isn't a path in which ret isn't set...
Mixing ret and rc is a bit inconsistent, maybe scrub patch set for
one or the other. (My fault originally I think :)
>
> cdat_mb = find_cdat_mb(uport);
> if (!cdat_mb) {
> dev_dbg(dev, "No CDAT mailbox\n");
> - return;
> + return -EIO;
> }
>
> port->cdat_sup = true;
>
> if (cxl_cdat_get_length(dev, cdat_mb, &cdat_length)) {
> dev_dbg(dev, "No CDAT length\n");
> - return;
> + return -EIO;
> }
>
> port->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
> if (!port->cdat.table)
> - return;
> + return -ENOMEM;
>
> port->cdat.length = cdat_length;
> ret = cxl_cdat_read_table(dev, cdat_mb, &port->cdat);
> @@ -658,5 +652,30 @@ void read_cdat_data(struct cxl_port *port)
> port->cdat.length = 0;
> dev_err(dev, "CDAT data read error\n");
> }
> +
> + return ret;
> +}
> +
> +/**
> + * read_cdat_data - Read the CDAT data on this port
> + * @port: Port to read data from
> + *
> + * This call will sleep waiting for responses from the DOE mailbox.
> + */
> +void read_cdat_data(struct cxl_port *port)
> +{
> + int retries = 5;
> + int rc;
> +
> + while (retries--) {
> + rc = __read_cdat_data(port);
> + if (!rc)
> + return;
> + dev_dbg(&port->dev,
> + "CDAT data read error rc=%d (retries %d)\n",
> + rc, retries);
> + }
> + dev_err(&port->dev, "CDAT data read failed after %d retries\n",
> + retries);
> }
> EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
On Mon, 27 Jun 2022 21:15:21 -0700
[email protected] wrote:
> From: Jonathan Cameron <[email protected]>
>
> Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based
> mailbox with standard protocol discovery. Each mailbox is accessed
> through a DOE Extended Capability.
>
> Each DOE mailbox must support the DOE discovery protocol in addition to
> any number of additional protocols.
>
> Define core PCIe functionality to manage a single PCIe DOE mailbox at a
> defined config space offset. Functionality includes iterating,
> creating, query of supported protocol, and task submission. Destruction
> of the mailboxes is device managed.
>
> If interrupts are desired, the interrupt number can be queried and
> passed to the create function. Passing a negative value disables
> interrupts for that mailbox. It is the caller's responsibility to ensure
> enough interrupt vectors are allocated.
>
> Cc: "Li, Ming" <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Signed-off-by: Jonathan Cameron <[email protected]>
> Co-developed-by: Ira Weiny <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
+static void *pci_doe_xa_entry(u16 vid, u8 prot)
+{
+ return (void *)(((unsigned long)vid << 16) | prot);
+}
...
> +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
> +{
> + u8 index = 0;
> + u8 xa_idx = 0;
> +
> + do {
> + int rc;
> + u16 vid;
> + u8 prot;
> +
> + rc = pci_doe_discovery(doe_mb, &index, &vid, &prot);
> + if (rc)
> + return rc;
> +
> + pci_dbg(doe_mb->pdev,
> + "[%x] Found protocol %d vid: %x prot: %x\n",
> + doe_mb->cap_offset, xa_idx, vid, prot);
> +
> + rc = xa_insert(&doe_mb->prots, xa_idx++,
> + pci_doe_xa_entry(vid, prot), GFP_KERNEL);
I'm not that familiar with xarray, but the docs suggest that you have
to use xa_mk_value() to store an integer directly into it.
> + if (rc)
> + return -ENOMEM;
> + } while (index);
> +
> + return 0;
> +}
> +
On Mon, 27 Jun 2022 21:15:24 -0700
[email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> The OS will need CDAT data from CXL devices to properly set up
> interleave sets. Currently this is supported through a DOE mailbox
> which supports CDAT.
>
> Search the DOE mailboxes available, query CDAT data, and cache the data
> for later parsing.
>
> Provide a sysfs binary attribute to allow dumping of the CDAT.
>
> Binary dumping is modeled on /sys/firmware/ACPI/tables/
>
> The ability to dump this table will be very useful for emulation of real
> devices once they become available as QEMU CXL type 3 device emulation will
> be able to load this file in.
>
> This does not support table updates at runtime. It will always provide
> whatever was there when first cached. Handling of table updates can be
> implemented later.
>
> Finally create a complete list of CDAT defines within cdat.h for code
> wishing to decode the CDAT table.
>
> Signed-off-by: Jonathan Cameron <[email protected]>
> Co-developed-by: Jonathan Cameron <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
One query inline, though I'm fine with it either way, just want to
understand your logic in keeping completion when Dan suggested using
flush_work() to achieve the same thing.
>
> ---
> Changes from V11:
> Adjust for the use of DOE mailbox xarray
> Dan Williams:
> Remove unnecessary get/put device
> Use new BIN_ATTR_ADMIN_RO macro
> Flag that CDAT was supported
> If there is a read error then the CDAT sysfs
> will return a 0 length entry
>
...
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index c4c99ff7b55e..4bd479ec0253 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -4,10 +4,12 @@
> +static int cxl_cdat_get_length(struct device *dev,
> + struct pci_doe_mb *cdat_mb,
> + size_t *length)
> +{
> + u32 cdat_request_pl = CDAT_DOE_REQ(0);
> + u32 cdat_response_pl[32];
> + DECLARE_COMPLETION_ONSTACK(c);
> + struct pci_doe_task task = {
> + .prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> + .prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> + .request_pl = &cdat_request_pl,
> + .request_pl_sz = sizeof(cdat_request_pl),
> + .response_pl = cdat_response_pl,
> + .response_pl_sz = sizeof(cdat_response_pl),
> + .complete = cxl_doe_task_complete,
> + .private = &c,
> + };
> + int rc = 0;
> +
> + rc = pci_doe_submit_task(cdat_mb, &task);
> + if (rc < 0) {
> + dev_err(dev, "DOE submit failed: %d", rc);
> + return rc;
> + }
> + wait_for_completion(&c);
Dan mentioned in his review that we could just use
flush_work() and drop the completion logic and callback.
Why didn't you go that way? Would want to wrap it up
in pci_doe_wait_task() or something like that.
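Something like this perhaps (sketch; only valid once the task has been
submitted so that task->work has been initialized):

	static void pci_doe_wait_task(struct pci_doe_task *task)
	{
		flush_work(&task->work);
	}

and then the completion/callback plumbing could go away entirely.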
> +
> + if (task.rv < 1)
> + return -EIO;
> +
> + *length = cdat_response_pl[1];
> + dev_dbg(dev, "CDAT length %zu\n", *length);
> +
> + return rc;
> +}
> +
On Tue, Jun 28, 2022 at 03:16:26PM +0100, Jonathan Cameron wrote:
> On Mon, 27 Jun 2022 21:15:21 -0700
> [email protected] wrote:
> > From: Jonathan Cameron <[email protected]>
> > + /* Abort has timed out and the MB is dead */
> > + pci_err(pdev, "[%x] ABORT timed out\n", offset);
>
> Does this print mention it's a DOE somewhere?
It does because of this earlier mention:
> > +#define dev_fmt(fmt) "DOE: " fmt
On Mon, Jun 27, 2022 at 09:15:23PM -0700, Ira wrote:
> From: Ira Weiny <[email protected]>
>
> Many binary attributes need to limit access to CAP_SYS_ADMIN only; ie
> many binary attributes specify is_visible with 0400 or 0600.
>
> Make setting the permissions of such attributes more explicit by
> defining BIN_ATTR_ADMIN_{RO,RW}.
>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
This should have had a Suggested-by tag on it. I'm hoping that Lore will pick
it up with this email.
Suggested-by: Dan Williams <[email protected]>
My apologies to Dan.
Ira
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from V11:
> New Patch
> ---
> include/linux/sysfs.h | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index e3f1e8ac1f85..fd3fe5c8c17f 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -235,6 +235,22 @@ struct bin_attribute bin_attr_##_name = __BIN_ATTR_WO(_name, _size)
> #define BIN_ATTR_RW(_name, _size) \
> struct bin_attribute bin_attr_##_name = __BIN_ATTR_RW(_name, _size)
>
> +
> +#define __BIN_ATTR_ADMIN_RO(_name, _size) { \
> + .attr = { .name = __stringify(_name), .mode = 0400 }, \
> + .read = _name##_read, \
> + .size = _size, \
> +}
> +
> +#define __BIN_ATTR_ADMIN_RW(_name, _size) \
> + __BIN_ATTR(_name, 0600, _name##_read, _name##_write, _size)
> +
> +#define BIN_ATTR_ADMIN_RO(_name, _size) \
> +struct bin_attribute bin_attr_##_name = __BIN_ATTR_ADMIN_RO(_name, _size)
> +
> +#define BIN_ATTR_ADMIN_RW(_name, _size) \
> +struct bin_attribute bin_attr_##_name = __BIN_ATTR_ADMIN_RW(_name, _size)
> +
> struct sysfs_ops {
> ssize_t (*show)(struct kobject *, struct attribute *, char *);
> ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
> --
> 2.35.3
>
On Tue, Jun 28, 2022 at 03:38:48PM +0100, Jonathan Cameron wrote:
> On Mon, 27 Jun 2022 21:15:21 -0700
> [email protected] wrote:
>
> > From: Jonathan Cameron <[email protected]>
> >
> > Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based
> > mailbox with standard protocol discovery. Each mailbox is accessed
> > through a DOE Extended Capability.
> >
> > Each DOE mailbox must support the DOE discovery protocol in addition to
> > any number of additional protocols.
> >
> > Define core PCIe functionality to manage a single PCIe DOE mailbox at a
> > defined config space offset. Functionality includes iterating,
> > creating, query of supported protocol, and task submission. Destruction
> > of the mailboxes is device managed.
> >
> > If interrupts are desired, the interrupt number can be queried and
> > passed to the create function. Passing a negative value disables
> > interrupts for that mailbox. It is the caller's responsibility to ensure
> > enough interrupt vectors are allocated.
> >
> > Cc: "Li, Ming" <[email protected]>
> > Cc: Bjorn Helgaas <[email protected]>
> > Signed-off-by: Jonathan Cameron <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
>
> +static void *pci_doe_xa_entry(u16 vid, u8 prot)
> +{
> + return (void *)(((unsigned long)vid << 16) | prot);
> +}
> ...
>
> > +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
> > +{
> > + u8 index = 0;
> > + u8 xa_idx = 0;
> > +
> > + do {
> > + int rc;
> > + u16 vid;
> > + u8 prot;
> > +
> > + rc = pci_doe_discovery(doe_mb, &index, &vid, &prot);
> > + if (rc)
> > + return rc;
> > +
> > + pci_dbg(doe_mb->pdev,
> > + "[%x] Found protocol %d vid: %x prot: %x\n",
> > + doe_mb->cap_offset, xa_idx, vid, prot);
> > +
> > + rc = xa_insert(&doe_mb->prots, xa_idx++,
> > + pci_doe_xa_entry(vid, prot), GFP_KERNEL);
>
> I'm not that familiar with xarray, but the docs suggest that you have
> to use xa_mk_value() to store an integer directly into it.
Indeed I missed that. Thanks.
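Presumably something like this then (untested):

	rc = xa_insert(&doe_mb->prots, xa_idx++,
		       xa_mk_value(((unsigned long)vid << 16) | prot),
		       GFP_KERNEL);

with pci_doe_supports_prot() comparing entries against the same
xa_mk_value() encoding.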
Ira
>
> > + if (rc)
> > + return -ENOMEM;
> > + } while (index);
> > +
> > + return 0;
> > +}
> > +
On Tue, Jun 28, 2022 at 03:16:26PM +0100, Jonathan Cameron wrote:
> On Mon, 27 Jun 2022 21:15:21 -0700
> [email protected] wrote:
>
> > From: Jonathan Cameron <[email protected]>
> >
> > Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based
> > mailbox with standard protocol discovery. Each mailbox is accessed
> > through a DOE Extended Capability.
> >
> > Each DOE mailbox must support the DOE discovery protocol in addition to
> > any number of additional protocols.
> >
> > Define core PCIe functionality to manage a single PCIe DOE mailbox at a
> > defined config space offset. Functionality includes iterating,
> > creating, query of supported protocol, and task submission. Destruction
> > of the mailboxes is device managed.
> >
> > If interrupts are desired, the interrupt number can be queried and
> > passed to the create function. Passing a negative value disables
> > interrupts for that mailbox. It is the caller's responsibility to ensure
> > enough interrupt vectors are allocated.
> >
> > Cc: "Li, Ming" <[email protected]>
> > Cc: Bjorn Helgaas <[email protected]>
> > Signed-off-by: Jonathan Cameron <[email protected]>
> > Co-developed-by: Ira Weiny <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
> >
> Hi Ira,
>
> Thanks for keeping at this!
>
> I think this has reintroduced some of the races around that annoying
> interrupt source from BUSY transitioning to low that has
> no explicit 'cause' flag. I think we'd hammered all those out in the
> previous version but maybe there were still some there...
Well I really tried hard not to introduce races which would be a problem. But
I would not be surprised.
>
> I 'think' it will work as is, but depending on the timing a given DOE
> implementation has, the interrupt may be completely pointless as it
> will be signaling the wrong event.
:-/
There is a chance that an IRQ comes in just after we timeout waiting for it. I
think that has always been the case and the IRQ will effectively be missed I
_think_.
>
> Jonathan
>
[snip]
>
> > +/*
> > + * Returned from the wait functions to indicate that an abort has been issued
> > + */
> > +#define DOE_WAIT_ABORT 1
> > +
> > +static int pci_doe_arm_wait(struct pci_doe_mb *doe_mb)
>
> Feels like there should be a naming to convey the return value as
> a boolean rather than pushing through a flag value.
Something like?
	static bool pci_doe_arm_abort_seen(struct pci_doe_mb *doe_mb)
	{
		if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags))
			return true;
		clear_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags);
		return false;
	}

	...
	if (pci_doe_arm_abort_seen(mb))
		... Process abort ...
	...
>
> > +{
> > + if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags))
> > + return DOE_WAIT_ABORT;
> > + clear_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags);
> > + return 0;
> > +}
> > +
>
> > +static void pci_doe_write_ctrl(struct pci_doe_mb *doe_mb, u32 val)
> > +{
> > + struct pci_dev *pdev = doe_mb->pdev;
> > + int offset = doe_mb->cap_offset;
> > +
> > + if (pci_doe_irq_enabled(doe_mb))
> > + val |= PCI_DOE_CTRL_INT_EN;
> > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +}
> > +
> > +static int pci_doe_issue_abort(struct pci_doe_mb *doe_mb)
> Can we rename this as it does more than simply issue the abort,
> it waits for it to finish
Sure.
How about just pci_doe_abort()? I'm probably going to open code that call now.
>
> > +{
> > + struct pci_dev *pdev = doe_mb->pdev;
> > + int offset = doe_mb->cap_offset;
> > + unsigned long timeout_jiffies;
> > +
> > + pci_dbg(pdev, "[%x] Issuing Abort\n", offset);
> > +
> > + /*
> > + * Abort detected while aborting; something is really broken or the
> > + * mailbox is being destroyed.
> > + */
> > + if (pci_doe_arm_wait(doe_mb))
> > + return -EIO;
> > +
> > + timeout_jiffies = jiffies + PCI_DOE_TIMEOUT;
> > + pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_ABORT);
> > +
> > + do {
> > + u32 val;
> > +
> > + /*
> > + * Abort detected while aborting; something is really broken or
> > + * the mailbox is being destroyed.
> > + */
> > + if (pci_doe_wait_irq_or_poll(doe_mb))
> > + return -EIO;
> > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > + /* Abort success! */
> > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> > + return 0;
> > +
> > + } while (!time_after(jiffies, timeout_jiffies));
> > +
> > + /* Abort has timed out and the MB is dead */
> > + pci_err(pdev, "[%x] ABORT timed out\n", offset);
>
> Does this print mention it's a DOE somewhere?
Yep, per Bjorn's suggestion I removed all the 'DOE ' strings and use the
dev_fmt specifier.
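I.e. doe.c now has something like this at the top (a sketch of the dev_fmt
pattern):

	/* Prefix all pci_err()/pci_dbg() output from this file */
	#define dev_fmt(fmt) "DOE: " fmt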
[snip]
> > +
> > + /* First 2 dwords have already been read */
> > + length -= 2;
> > + payload_length = min(length, task->response_pl_sz / sizeof(u32));
> > + /* Read the rest of the response payload */
> > + for (i = 0; i < payload_length; i++) {
> > + pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > + &task->response_pl[i]);
> > + /* Prior to the last ack, ensure Data Object Ready */
> > + if (i == (payload_length-1) && !pci_doe_data_obj_ready(doe_mb))
>
> spaces around -
Done.
>
> > + return -EIO;
> > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > + }
> > +
> > + /* Flush excess length */
> > + for (; i < length; i++) {
> > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > + }
> > +
> > + /* Final error check to pick up on any since Data Object Ready */
> > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > + return -EIO;
> > +
> > + return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
> > +}
> > +
>
> > +
> > +static void doe_statemachine_work(struct work_struct *work)
> > +{
> > + struct pci_doe_task *task = container_of(work, struct pci_doe_task,
> > + work);
> > + struct pci_doe_mb *doe_mb = task->doe_mb;
> > + struct pci_dev *pdev = doe_mb->pdev;
> > + int offset = doe_mb->cap_offset;
> > + unsigned int busy_retries = 0;
> > + unsigned long timeout_jiffies;
> > + u32 val;
> > + int rc;
> > +
> > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> > + signal_task_complete(task, -EIO);
> > + return;
> > + }
> > +
> > + /* Send request */
> > +retry_req:
> > + if (pci_doe_arm_wait(doe_mb)) {
> > + signal_task_abort(task, -EIO);
> > + return;
> > + }
>
> Is there a race here? If Busy drops at this point we queue up
> a message, but IRQ bit is already set. Hence when we hit
> wait_event_timeout() the flag is already set and IIRC we'll
> drop straight through.
>
I did not realize that the device would interrupt when Busy dropped? I was
thinking that V11 did not respond to IRQ but indeed it did via setting the work
item to run immediately...
However, I only see this in the spec:
1) System firmware/software checks that the DOE Busy bit is Clear to ensure
that the DOE instance is ready to receive a DOE request.
>
> It'll probably be fine because it will end up polling below
> but doesn't look ideal.
I agree it would not be ideal but I think if we are waiting for Busy to be
cleared then the pci_doe_arm_wait() should be benign.
>
> Upshot is that you sort of have to handle "spurious interrupts"
> cleanly and rewait on the interrupt if you get one whilst also handling
> race conditions around RW1C of the interrupt status flag.
Sorry I'm not sure what 'RW1C' means here?
Anyway, spurious interrupts were something I was concerned about but I don't
think there is anything we can do about an interrupt coming in when we are
expecting one but the device did not really send one. AFAIK that is virtually
impossible anyway.
If we actually 'miss' one because we timed out before the device sent it then I
think we are going to ignore the PCI_DOE_FLAG_IRQ flag on the next go around.
Actually timeout is handled by the abort call and that IRQ will, depending on
timing, cause a full PCI_DOE_TIMEOUT to expire. :-( That is indeed not
ideal. However, by that time the error and busy flags should be clear and we
can safely continue. Otherwise we are going to take the mailbox down.
It may seem better to arm wait on each iteration through the abort loop. But
this is not logically correct because the abort operation should trigger an
IRQ. So there is always a race if we missed an IRQ because we timed out early.
>
>
> > +
> > + rc = pci_doe_send_req(doe_mb, task);
> > +
> > + /*
> > + * The specification does not provide any guidance on how long
> > + * some other entity could keep the DOE busy, so try for 1
> > + * second then fail. Busy handling is best effort only, because
> > + * there is no way of avoiding racing against another user of
> > + * the DOE.
> > + */
> > + if (rc == -EBUSY) {
> > + busy_retries++;
> > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> > + pci_warn(pdev,
> > + "[%x] busy for too long (> 1 sec)\n",
> > + offset);
> > + signal_task_complete(task, rc);
> > + return;
> > + }
> > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
> > + signal_task_abort(task, rc);
> > + return;
> > + }
> > + goto retry_req;
> > + } else if (rc) {
> > + signal_task_abort(task, rc);
> > + return;
> > + }
> > +
> > + timeout_jiffies = jiffies + HZ;
> > + if (pci_doe_wait_irq_or_poll(doe_mb)) {
>
> So this may well be passed as a result of a BUSY transition to 0 very soon
> after the doe_send_req but well before the data is ready....
I think the simple fix is to make the BUSY wait on an IRQ. Something like:
21:13:53 > git di
diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
index 12f9f8045eb7..afd326320798 100644
--- a/drivers/pci/doe.c
+++ b/drivers/pci/doe.c
@@ -352,7 +352,7 @@ static void doe_statemachine_work(struct work_struct *work)
 			signal_task_complete(task, rc);
 			return;
 		}
-		if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
+		if (pci_doe_wait_irq_or_poll(...)) {
 			signal_task_abort(task, rc);
 			return;
 		}
>
>
> > + signal_task_abort(task, -EIO);
> > + return;
> > + }
> > +
> > + /* Poll for response */
> > +retry_resp:
> > + if (pci_doe_arm_wait(doe_mb)) {
> I think we can get here between Busy drop and Object Ready which means
> this can get another IRQ_FLAG setting just after it. Does it matter?
> Don't think so, as we don't use that bit again in this run through
> and it will be cleared at beginning of next one,
Yea basically I agree.
> but if so why is
> this call here?
Seemed like the right thing to do at the time... ;-) j/k
> I think it's only useful for detecting an abort, if
> so do that explicitly.
Actually it is the right thing to do... However, the wait poll below also
needs to be an IRQ or poll. I'm not sure how I missed that logic.
>
> > + signal_task_abort(task, -EIO);
> > + return;
> > + }
> > +
> > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > + signal_task_abort(task, -EIO);
> > + return;
> > + }
> > +
> > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> > + if (time_after(jiffies, timeout_jiffies)) {
> > + signal_task_abort(task, -ETIMEDOUT);
> > + return;
> > + }
> > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) {
>
> Following on from above....
> As a result of the interrupt having fired on the BUSY off transition,
> I think we will almost always end up spinning here until Object Ready
> is set. Fine, but seems a shame we don't let an interrupt do this
> for us in most cases. (note in QEMU response is instantaneous so
> when the interrupt for Busy drop is set, object ready also true so
> by the time we get here data ready will already be sent).
This needs to be pci_doe_wait_irq_or_poll() as well and the arm wait above
ensures we continue to look for that interrupt.
I'm starting to see how I got confused. The timeout values all vary and
mod_delayed_work() made it very easy for you to interrupt any of those.
I tried to bundle the poll vs interrupt modes in pci_doe_wait_irq_or_poll() and
got confused. :-(
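For reference, the combined wait I have in mind looks roughly like this (a
sketch only; the flag and helper names follow the patch, the exact timeout
plumbing is an assumption):

	static int pci_doe_wait_irq_or_poll(struct pci_doe_mb *doe_mb)
	{
		/* An abort request trumps any wait */
		if (test_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags))
			return DOE_WAIT_ABORT;
		if (pci_doe_irq_enabled(doe_mb)) {
			/* Sleep until the IRQ handler sets the flag or we time out */
			wait_event_timeout(doe_mb->wq,
					   test_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags),
					   PCI_DOE_TIMEOUT);
			return 0;
		}
		return pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL);
	}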
>
> > + signal_task_abort(task, -EIO);
> > + return;
> > + }
> > + goto retry_resp;
> > + }
> > +
> > + rc = pci_doe_recv_resp(doe_mb, task);
> > + if (rc < 0) {
> > + signal_task_abort(task, rc);
> > + return;
> > + }
> > +
> > + signal_task_complete(task, rc);
> > +}
> > +
>
>
> > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> > +{
> > + if (doe_mb->work_queue)
>
> I'm not a great fan of free functions that check a bunch of conditions
> because they may be called before things are set up.
I'll see what I can do. I do kind of like this but I think it gets muddled and
I'm not dead set on either way.
> To my
> mind that generally means we should be calling individual cleanup
> in the appropriate error handlers.
>
> Either that or just use devm handling for each item. Sure
> it's a few more lines of code, but I find it a lot easier to go
>
> Oh look that thing we just set up is cleaned up by this.
>
>
> > + destroy_workqueue(doe_mb->work_queue);
> > + if (pci_doe_irq_enabled(doe_mb))
> > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb);
> > + xa_destroy(&doe_mb->prots);
> > + kfree(doe_mb);
> > +}
> > +
>
> ...
>
> > +
> > +static void pci_doe_destroy_mb(void *mb)
> > +{
> > + struct pci_doe_mb *doe_mb = mb;
> > +
> > + /* Mark going down */
> > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> > +
> > + /* Abort any in progress work items */
> > + pci_doe_abort(doe_mb);
>
> Abort is getting used for two things in here. Perhaps
> rename this one to
> pci_doe_abort_tasks() or something like that?
What do you mean two things? Oh I think I see. You mean abort the work item
vs abort sent to the hardware?
This no longer aborts all the tasks, just the one which may be in progress.
Because the work queue is ordered only one task may be in progress. I'll clean
up the comment too.
This sets the abort flag and wakes it up if it is sleeping. If not then the
abort flag will be detected in the next arm.
FWIW I think I may just remove this call and open code it here.
>
> > +
> > + /* Flush remaining work items */
> > + flush_workqueue(doe_mb->work_queue);
> > +
> > + pci_doe_free_mb(doe_mb);
> > +}
> > +
> > +/**
> > + * pcim_doe_create_mb() - Create a DOE mailbox object
> > + *
> > + * @pdev: PCI device to create the DOE mailbox for
> > + * @cap_offset: Offset of the DOE mailbox
> > + * @int_msg_num: Interrupt message number to use; a negative value means don't
> > + * use interrupts
> > + *
> > + * Create a single mailbox object to manage the mailbox protocol at the
> > + * cap_offset specified.
> > + *
> > + * Caller should allocate PCI IRQ vectors before passing a possitive value for
>
> positive
Thanks fixed.
>
> > + * int_msg_num.
> > + *
> > + * RETURNS: created mailbox object on success
> > + * ERR_PTR(-errno) on failure
> > + */
> > +struct pci_doe_mb *pcim_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > + int int_msg_num)
> > +{
> > + struct pci_doe_mb *doe_mb;
> > + int rc;
> > +
> > + doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
> > + if (!doe_mb)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + doe_mb->pdev = pdev;
> > + doe_mb->int_msg_num = -1;
> > + doe_mb->cap_offset = cap_offset;
> > +
> > + xa_init(&doe_mb->prots);
> > + init_waitqueue_head(&doe_mb->wq);
> > +
> > + if (int_msg_num >= 0) {
> > + rc = pci_doe_enable_irq(doe_mb, int_msg_num);
> > + if (rc)
> > + pci_err(pdev,
> > + "[%x] enable requested IRQ (%d) failed : %d\n",
> > + doe_mb->cap_offset, int_msg_num, rc);
>
> If we are printing an error, I'd argue we should not continue.
> Or at very least we should a comment here to say why we should do so...
>
Not continuing seems reasonable.
>
> > + }
> > +
> > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0,
> > + doe_mb->cap_offset);
> > + if (!doe_mb->work_queue) {
> > + pci_err(pdev, "[%x] failed to allocate work queue\n",
> > + doe_mb->cap_offset);
> > + pci_doe_free_mb(doe_mb);
>
> As above, I'd rather this explicitly freed what has been set up
> and only that rather than calling a free function that does a bunch of
> stuff conditionally.
I think I can make that work. This is the only conditional in free however,
because the other conditional is the IRQ support which may not be set up.
Thanks again for the in depth review!
Ira
>
>
> > + return ERR_PTR(-ENOMEM);
> > + }
> > +
> > + /* Reset the mailbox by issuing an abort */
> > + rc = pci_doe_issue_abort(doe_mb);
> > + if (rc) {
> > + pci_err(pdev, "[%x] failed to reset : %d\n",
> > + doe_mb->cap_offset, rc);
> > + pci_doe_free_mb(doe_mb);
> > + return ERR_PTR(rc);
> > + }
> > +
> > + if (devm_add_action_or_reset(&pdev->dev, pci_doe_destroy_mb, doe_mb))
> > + return ERR_PTR(-EIO);
> > +
> > + rc = pci_doe_cache_protocols(doe_mb);
> > + if (rc) {
> > + pci_err(pdev, "[%x] failed to cache protocols : %d\n",
> > + doe_mb->cap_offset, rc);
> > + return ERR_PTR(rc);
> > + }
> > +
> > + return doe_mb;
> > +}
> > +EXPORT_SYMBOL_GPL(pcim_doe_create_mb);
> > +
>
>
>
[+cc Krzysztof]
On Mon, Jun 27, 2022 at 09:15:23PM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> Many binary attributes need to limit access to CAP_SYS_ADMIN only; ie
> many binary attributes specify is_visible with 0400 or 0600.
>
> Make setting the permissions of such attributes more explicit by
> defining BIN_ATTR_ADMIN_{RO,RW}.
>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
FWIW, this looks a lot like this previous patch:
https://lore.kernel.org/all/[email protected]/
> ---
> Changes from V11:
> New Patch
> ---
> include/linux/sysfs.h | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index e3f1e8ac1f85..fd3fe5c8c17f 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -235,6 +235,22 @@ struct bin_attribute bin_attr_##_name = __BIN_ATTR_WO(_name, _size)
> #define BIN_ATTR_RW(_name, _size) \
> struct bin_attribute bin_attr_##_name = __BIN_ATTR_RW(_name, _size)
>
> +
> +#define __BIN_ATTR_ADMIN_RO(_name, _size) { \
> + .attr = { .name = __stringify(_name), .mode = 0400 }, \
> + .read = _name##_read, \
> + .size = _size, \
> +}
> +
> +#define __BIN_ATTR_ADMIN_RW(_name, _size) \
> + __BIN_ATTR(_name, 0600, _name##_read, _name##_write, _size)
> +
> +#define BIN_ATTR_ADMIN_RO(_name, _size) \
> +struct bin_attribute bin_attr_##_name = __BIN_ATTR_ADMIN_RO(_name, _size)
> +
> +#define BIN_ATTR_ADMIN_RW(_name, _size) \
> +struct bin_attribute bin_attr_##_name = __BIN_ATTR_ADMIN_RW(_name, _size)
> +
> struct sysfs_ops {
> ssize_t (*show)(struct kobject *, struct attribute *, char *);
> ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
> --
> 2.35.3
>
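For what it's worth, usage would look something like this (a sketch with a
hypothetical attribute name; the _read handler is the standard bin_attribute
signature):

	static ssize_t cdat_read(struct file *filp, struct kobject *kobj,
				 struct bin_attribute *bin_attr, char *buf,
				 loff_t offset, size_t count)
	{
		/* copy up to count bytes of the table into buf at offset */
		return 0;
	}
	/* Expands to 'struct bin_attribute bin_attr_cdat' with mode 0400 */
	static BIN_ATTR_ADMIN_RO(cdat, 0);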
On Tue, 28 Jun 2022 11:20:32 -0700
Ira Weiny <[email protected]> wrote:
> On Tue, Jun 28, 2022 at 03:16:26PM +0100, Jonathan Cameron wrote:
> > On Mon, 27 Jun 2022 21:15:21 -0700
> > [email protected] wrote:
> >
> > > From: Jonathan Cameron <[email protected]>
> > >
> > > Introduced in PCIe r6.0, sec 6.30, DOE provides a config space based
> > > mailbox with standard protocol discovery. Each mailbox is accessed
> > > through a DOE Extended Capability.
> > >
> > > Each DOE mailbox must support the DOE discovery protocol in addition to
> > > any number of additional protocols.
> > >
> > > Define core PCIe functionality to manage a single PCIe DOE mailbox at a
> > > defined config space offset. Functionality includes iterating,
> > > creating, query of supported protocol, and task submission. Destruction
> > > of the mailboxes is device managed.
> > >
> > > If interrupts are desired, the interrupt number can be queried and
> > > passed to the create function. Passing a negative value disables
> > > interrupts for that mailbox. It is the caller's responsibility to ensure
> > > enough interrupt vectors are allocated.
> > >
> > > Cc: "Li, Ming" <[email protected]>
> > > Cc: Bjorn Helgaas <[email protected]>
> > > Signed-off-by: Jonathan Cameron <[email protected]>
> > > Co-developed-by: Ira Weiny <[email protected]>
> > > Signed-off-by: Ira Weiny <[email protected]>
> > >
> > Hi Ira,
> >
> > Thanks for keeping at this!
> >
> > I think this has reintroduced some of the races around that annoying
> > interrupt source from BUSY transitioning to low that has
> > no explicit 'cause' flag. I think we'd hammered all those out in the
> > previous version but maybe there were still some there...
>
> Well I really tried hard not to introduce races which would be a problem. But
> I would not be surprised.
>
> >
> > I 'think' it will work as is, but depending on the timing a given DOE
> > implementation has, the interrupt may be completely pointless as it
> > will be signaling the wrong event.
>
> :-/
>
> There is a chance that an IRQ comes in just after we timeout waiting for it. I
> think that has always been the case and the IRQ will effectively be missed I
> _think_.
The timeout case I'm not that worried about as it means the device
is out of spec, so whilst it might happen and we don't want to break in that
case it should be uncommon enough that a perf disadvantage doesn't matter.
...
> > > +
> > > +static void doe_statemachine_work(struct work_struct *work)
> > > +{
> > > + struct pci_doe_task *task = container_of(work, struct pci_doe_task,
> > > + work);
> > > + struct pci_doe_mb *doe_mb = task->doe_mb;
> > > + struct pci_dev *pdev = doe_mb->pdev;
> > > + int offset = doe_mb->cap_offset;
> > > + unsigned int busy_retries = 0;
> > > + unsigned long timeout_jiffies;
> > > + u32 val;
> > > + int rc;
> > > +
> > > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> > > + signal_task_complete(task, -EIO);
> > > + return;
> > > + }
> > > +
> > > + /* Send request */
> > > +retry_req:
> > > + if (pci_doe_arm_wait(doe_mb)) {
> > > + signal_task_abort(task, -EIO);
> > > + return;
> > > + }
> >
> > Is there a race here? If Busy drops at this point we queue up
> > a message, but IRQ bit is already set. Hence when we hit
> > wait_event_timeout() the flag is already set and IIRC we'll
> > drop straight through.
> >
>
> I did not realize that the device would interrupt when Busy dropped? I was
> thinking that V11 did not respond to IRQ but indeed it did via setting the work
> item to run immediately...
>
> However, I only see this in the spec:
>
> 1) System firmware/software checks that the DOE Busy bit is Clear to ensure
> that the DOE instance is ready to receive a DOE request.
I missed this particular one originally and someone else pointed it out in
review (can't remember who though). The busy drop is mentioned in the
bit definition. It's in the capability definition.
"DOE Interrupt Status - This bit must be Set when an interrupt is generated
to indicate that the Data Object Ready bit or the DOE Error bit has been Set. or
that the DOE Busy bit has been Cleared."
(the formatting is in the release spec.. hohum)
Anyhow, upshot is that the status can be set as a result of Busy Bit clearing.
6.30.3 Interrupt Generation then says that the interrupt is generated on a
transition of the logical AND of:
1. Vector unmasked
2. DOE interrupt Enable bit is 1
3. Value of the DOE interrupt Status bit is 1.
So if interrupt status bit is set to 1 due to a Busy drop and we then
clear it before Data Object Ready, we'll get 2 interrupts.
There is another vague bit of language that sort of allows other
uses of this interrupt for protocol specific stuff. Hopefully
no one falls for that, but we should safely handle that case (perf
drop as a result is fine though!) I can't remember where the exact
language is, but I've had a few 'polite discussions' to persuade
people using it that way would be a very bad idea...
>
> >
> > It'll probably be fine because it will end up polling below
> > but doesn't look ideal.
>
> I agree it would not be ideal but I think if we are waiting for Busy to be
> cleared then the pci_doe_arm_wait() should be benign.
I think in some of these paths we are waiting for Data Object Ready to be
set; the busy drop is effectively acting as a spurious interrupt if we
clear the status before the data object ready event, which could be much later
because Busy can clear really quickly.
>
> >
> > Upshot is that you sort of have to handle "spurious interrupts"
> > cleanly and rewait on the interrupt if you get one whilst also handling
> > race conditions around RW1C of the interrupt status flag.
>
> Sorry I'm not sure what 'RW1C' means here?
Read / Write 1 to clear. In this case I meant reading it and then clearing it
without looking at the other status bits.
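In code terms the racy pattern is (a sketch; PCI_DOE_STATUS_INT_STATUS being
the RW1C interrupt status bit in the DOE capability):

	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
	/*
	 * Writing 1 clears the bit. Any event that set it again between
	 * the read and the write is lost unless the other status bits
	 * (Busy, Data Object Ready, Error) from the same read are checked.
	 */
	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val))
		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
				       PCI_DOE_STATUS_INT_STATUS);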
>
> Anyway, spurious interrupts were something I was concerned about but I don't
> think there is anything we can do about an interrupt coming in when we are
> expecting one but the device did not really send one. AFAIK that is virtually
> impossible anyway.
In this case seeing 2 interrupts is highly likely.
We see the Busy drop one and the interrupt handler clears the Interrupt Status
Bit, then data object becomes ready and we go around again.
>
> If we actually 'miss' one because we timed out before the device sent it then I
> think we are going to ignore the PCI_DOE_FLAG_IRQ flag on the next go around.
>
> Actually timeout is handled by the abort call and that IRQ will, depending on
> timing, cause a full PCI_DOE_TIMEOUT to expire. :-( That is indeed not
> ideal. However, by that time the error and busy flags should be clear and we
> can safely continue. Otherwise we are going to take the mailbox down.
>
> It may seem better to arm wait on each iteration through the abort loop. But
> this is not logically correct because the abort operation should trigger an
> IRQ. So there is always a race if we missed an IRQ because we timed out early.
I probably stuck that comment in the wrong place. The initial call to clear
the flag before this should be fine (short of the 'spurious' case of people
using the interrupt for protocol specific usage).
>
> >
> >
> > > +
> > > + rc = pci_doe_send_req(doe_mb, task);
> > > +
> > > + /*
> > > + * The specification does not provide any guidance on how long
> > > + * some other entity could keep the DOE busy, so try for 1
> > > + * second then fail. Busy handling is best effort only, because
> > > + * there is no way of avoiding racing against another user of
> > > + * the DOE.
> > > + */
> > > + if (rc == -EBUSY) {
> > > + busy_retries++;
> > > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> > > + pci_warn(pdev,
> > > + "[%x] busy for too long (> 1 sec)\n",
> > > + offset);
> > > + signal_task_complete(task, rc);
> > > + return;
> > > + }
> > > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
> > > + signal_task_abort(task, rc);
> > > + return;
> > > + }
> > > + goto retry_req;
> > > + } else if (rc) {
> > > + signal_task_abort(task, rc);
> > > + return;
> > > + }
> > > +
> > > + timeout_jiffies = jiffies + HZ;
> > > + if (pci_doe_wait_irq_or_poll(doe_mb)) {
> >
> > So this may well be passed as a result of a BUSY transition to 0 very soon
> > after the doe_send_req but well before the data is ready....
>
> I think the simple fix is to make the BUSY wait on an IRQ. Something like:
>
> 21:13:53 > git di
> diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> index 12f9f8045eb7..afd326320798 100644
> --- a/drivers/pci/doe.c
> +++ b/drivers/pci/doe.c
> @@ -352,7 +352,7 @@ static void doe_statemachine_work(struct work_struct *work)
>  			signal_task_complete(task, rc);
>  			return;
>  		}
> -		if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
> +		if (pci_doe_wait_irq_or_poll(...)) {
>  			signal_task_abort(task, rc);
>  			return;
This case (which I think is the -EBUSY from pci_doe_send_req() handling)
isn't important because it's trying to paper over a weird condition. We
don't normally expect to get here.
I was concerned with the line just above my comment which may not act as
a gate at all because it's tripped by the Busy Drop, which may be
well before the data object ready that we are actually waiting for.
>  		}
>
> >
> >
> > > + signal_task_abort(task, -EIO);
> > > + return;
> > > + }
> > > +
> > > + /* Poll for response */
> > > +retry_resp:
> > > + if (pci_doe_arm_wait(doe_mb)) {
> > I think we can get here between Busy drop and Object Ready which means
> > this can get another IRQ_FLAG setting just after it. Does it matter?
> > Don't think so, as we don't use that bit again in this run through
> > and it will be cleared at beginning of next one,
>
> Yea basically I agree.
>
> > but if so why is
> > this call here?
>
> Seemed like the right thing to do at the time... ;-) j/k
>
> > I think it's only useful for detecting an abort, if
> > so do that explicitly.
>
> Actually it is the right thing to do... However, the wait poll below also
> needs to be an IRQ or poll. I'm not sure how I missed that logic.
Sounds right, though without the whole code laid out to follow through I'm
not 100% sure yet!
>
> >
> > > + signal_task_abort(task, -EIO);
> > > + return;
> > > + }
> > > +
> > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > + signal_task_abort(task, -EIO);
> > > + return;
> > > + }
> > > +
> > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> > > + if (time_after(jiffies, timeout_jiffies)) {
> > > + signal_task_abort(task, -ETIMEDOUT);
> > > + return;
> > > + }
> > > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) {
> >
> > Following on from above....
> > As a result of the interrupt having fired on the BUSY off transition,
> > I think we will almost always end up spinning here until Object Ready
> > is set. Fine, but seems a shame we don't let an interrupt do this
> > for us in most cases. (note in QEMU response is instantaneous so
> > when the interrupt for Busy drop is set, object ready also true so
> > by the time we get here data ready will already be sent).
>
> This needs to be pci_doe_wait_irq_or_poll() as well and the arm wait above
> ensures we continue to look for that interrupt.
>
> I'm starting to see how I got confused. The timeout values all vary and
> mod_delayed_work() made it very easy for you to interrupt any of those.
Yeah. That was a nice suggestion Dan made long ago but it doesn't play well
with the single workqueue.
>
> I tried to bundle the poll vs interrupt modes in pci_doe_wait_irq_or_poll() and
> got confused. :-(
>
> >
> > > + signal_task_abort(task, -EIO);
> > > + return;
> > > + }
> > > + goto retry_resp;
> > > + }
> > > +
> > > + rc = pci_doe_recv_resp(doe_mb, task);
> > > + if (rc < 0) {
> > > + signal_task_abort(task, rc);
> > > + return;
> > > + }
> > > +
> > > + signal_task_complete(task, rc);
> > > +}
> > > +
> >
> >
> > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> > > +{
> > > + if (doe_mb->work_queue)
> >
> > I'm not a great fan of free functions that check a bunch of conditions
> > because they may be called before things are set up.
>
> I'll see what I can do. I do kind of like this but I think it gets muddled and
> I'm not dead set on either way.
>
> > To my
> > mind that generally means we should be calling individual cleanup
> > in the appropriate error handlers.
> >
> > Either that or just use devm handling for each item. Sure
> > it's a few more lines of code, but I find it a lot easier to go
> >
> > Oh look that thing we just set up is cleaned up by this.
> >
> >
> > > + destroy_workqueue(doe_mb->work_queue);
> > > + if (pci_doe_irq_enabled(doe_mb))
> > > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb);
> > > + xa_destroy(&doe_mb->prots);
> > > + kfree(doe_mb);
> > > +}
> > > +
> >
> > ...
> >
> > > +
> > > +static void pci_doe_destroy_mb(void *mb)
> > > +{
> > > + struct pci_doe_mb *doe_mb = mb;
> > > +
> > > + /* Mark going down */
> > > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> > > +
> > > + /* Abort any in progress work items */
> > > + pci_doe_abort(doe_mb);
> >
> > Abort is getting used for two things in here. Perhaps
> > rename this one to
> > pci_doe_abort_tasks() or something like that?
>
> What do you mean two things? Oh I think I see. You mean abort the work item
> vs abort sent to the hardware?
yup.
>
> This no longer aborts all the tasks, just the one which may be in progress.
> Because the work queue is ordered only one task may be in progress. I'll clean
> up the comment too.
Hmm. It puts a requirement on the caller to not queue multiple requests that
might require ordering. One advantage of flushing the lot was ordering was
unaffected (though any caller that queued multiple items would have to then
requeue multiple items so would have to maintain their own retry buffer).
>
> This sets the abort flag and wakes it up if it is sleeping. If not then the
> abort flag will be detected in the next arm.
>
> FWIW I think I may just remove this call and open code it here.
Sounds good, avoid naming confusion by getting rid of the name :)
> > > +
> > > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0,
> > > + doe_mb->cap_offset);
> > > + if (!doe_mb->work_queue) {
> > > + pci_err(pdev, "[%x] failed to allocate work queue\n",
> > > + doe_mb->cap_offset);
> > > + pci_doe_free_mb(doe_mb);
> >
> > As above, I'd rather this explicitly freed what has been set up
> > and only that rather than calling a free function that does a bunch of
> > stuff conditionally.
>
> I think I can make that work. This is the only conditional in free however,
> because the other conditional is the IRQ support which may not be set up.
If you split to multiple devm_ calls you can simply not set up a teardown for the
irq if we don't have one. Or, don't use pci_request_irq() but call
devm_request_threaded_irq() directly and let that clean up for you.
>
> Thanks again for the in depth review!
No problem. I know how nasty this seemingly simple little bit of code
is, so you have my sympathies :)
Jonathan
On Tue, Jun 28, 2022 at 11:20:32AM -0700, Ira wrote:
> On Tue, Jun 28, 2022 at 03:16:26PM +0100, Jonathan Cameron wrote:
> > On Mon, 27 Jun 2022 21:15:21 -0700
> > [email protected] wrote:
> >
[snip]
> >
> > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> > > +{
> > > + if (doe_mb->work_queue)
> >
> > I'm not a great fan of free functions that check a bunch of conditions
> > because they may be called before things are set up.
>
> I'll see what I can do. I do kind of like this but I think it gets muddled and
> I'm not dead set on either way.
I've completely reworked pci_doe_create_mb() to use devm throughout.
It is much cleaner. Thanks Jonathan!
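The shape of it is roughly (a sketch; the release helper name is
illustrative):

	static void pci_doe_destroy_workqueue(void *queue)
	{
		destroy_workqueue(queue);
	}

	/* In pcim_doe_create_mb(), each resource gets its own devm teardown */
	doe_mb->work_queue = alloc_ordered_workqueue("%s doe [%x]", 0,
						     dev_name(&pdev->dev),
						     doe_mb->cap_offset);
	if (!doe_mb->work_queue)
		return ERR_PTR(-ENOMEM);
	rc = devm_add_action_or_reset(&pdev->dev, pci_doe_destroy_workqueue,
				      doe_mb->work_queue);
	if (rc)
		return ERR_PTR(rc);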
Ira
On Tue, Jun 28, 2022 at 03:47:27PM +0100, Jonathan Cameron wrote:
> On Mon, 27 Jun 2022 21:15:24 -0700
> [email protected] wrote:
>
> > From: Ira Weiny <[email protected]>
> >
> > The OS will need CDAT data from CXL devices to properly set up
> > interleave sets. Currently this is supported through a DOE mailbox
> > which supports CDAT.
> >
> > Search the DOE mailboxes available, query CDAT data, and cache the data
> > for later parsing.
> >
> > Provide a sysfs binary attribute to allow dumping of the CDAT.
> >
> > Binary dumping is modeled on /sys/firmware/ACPI/tables/
> >
> > The ability to dump this table will be very useful for emulation of real
> > devices once they become available as QEMU CXL type 3 device emulation will
> > be able to load this file in.
> >
> > This does not support table updates at runtime. It will always provide
> > whatever was there when first cached. Handling of table updates can be
> > implemented later.
> >
> > Finally create a complete list of CDAT defines within cdat.h for code
> > wishing to decode the CDAT table.
> >
> > Signed-off-by: Jonathan Cameron <[email protected]>
> > Co-developed-by: Jonathan Cameron <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
>
> One query inline, though I'm fine with it either way, just want to
> understand your logic in keeping completion when Dan suggested using
> flush_work() to achieve the same thing.
>
> >
> > ---
> > Changes from V11:
> > Adjust for the use of DOE mailbox xarray
> > Dan Williams:
> > Remove unnecessary get/put device
> > Use new BIN_ATTR_ADMIN_RO macro
> > Flag that CDAT was supported
> > If there is a read error then the CDAT sysfs
> > will return a 0 length entry
> >
> ...
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > index c4c99ff7b55e..4bd479ec0253 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -4,10 +4,12 @@
>
> > +static int cxl_cdat_get_length(struct device *dev,
> > + struct pci_doe_mb *cdat_mb,
> > + size_t *length)
> > +{
> > + u32 cdat_request_pl = CDAT_DOE_REQ(0);
> > + u32 cdat_response_pl[32];
> > + DECLARE_COMPLETION_ONSTACK(c);
> > + struct pci_doe_task task = {
> > + .prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> > + .prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> > + .request_pl = &cdat_request_pl,
> > + .request_pl_sz = sizeof(cdat_request_pl),
> > + .response_pl = cdat_response_pl,
> > + .response_pl_sz = sizeof(cdat_response_pl),
> > + .complete = cxl_doe_task_complete,
> > + .private = &c,
> > + };
> > + int rc = 0;
> > +
> > + rc = pci_doe_submit_task(cdat_mb, &task);
> > + if (rc < 0) {
> > + dev_err(dev, "DOE submit failed: %d", rc);
> > + return rc;
> > + }
> > + wait_for_completion(&c);
>
> Dan mentioned in his review that we could just use
> flush_work() and drop the completion logic and callback.
> Why didn't you go that way? Would want to wrap it up
> in pci_doe_wait_task() or something like that.
In early reviews of the Aux Bus work Dan also specified the above design
pattern.
"I would expect that the caller of this routine [pci_doe_exchange_sync]
would want to specify the task and end_task() callback and use that as
the completion signal. It may also want "no wait" behavior where it is
prepared for the DOE result to come back sometime later. With that
change the exchange fields can move into the task directly."
-- https://lore.kernel.org/linux-cxl/CAPcyv4hYAgyf-WcArGvbWHAJgc5+p=OO_6ah_dXJhNM5cXcVTw@mail.gmail.com/
I've renamed pci_doe_exchange_sync() pci_doe_submit_task() because of other
feedback. Here the callback is set to cxl_doe_task_complete() and we have to
wait because this particular query needs the length prior to the next task
being issued.
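The callback itself just completes the on-stack completion from the task
above (a sketch):

	static void cxl_doe_task_complete(struct pci_doe_task *task)
	{
		complete(task->private);
	}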
I use flush_workqueue() within the shutdown handling (or if someone destroys
the mailbox or abort fails) to first block new work from being submitted
(PCI_DOE_FLAG_DEAD), cancel the running work if needed (PCI_DOE_FLAG_CANCEL
[was ABORT]), and then flush the queue causing all the pending work to error
when seeing PCI_DOE_FLAG_DEAD.
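In sketch form the shutdown path is (the exact wake-up of a sleeping task
depends on the wait helpers):

	static void pci_doe_destroy_mb(void *mb)
	{
		struct pci_doe_mb *doe_mb = mb;

		/* Block new work: tasks submitted from here on error out */
		set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
		/* Kick the one in-progress task out of any wait */
		set_bit(PCI_DOE_FLAG_CANCEL, &doe_mb->flags);
		wake_up_all(&doe_mb->wq);
		/* Pending work now runs and fails quickly on FLAG_DEAD */
		flush_workqueue(doe_mb->work_queue);
	}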
Ira
>
> > +
> > + if (task.rv < 1)
> > + return -EIO;
> > +
> > + *length = cdat_response_pl[1];
> > + dev_dbg(dev, "CDAT length %zu\n", *length);
> > +
> > + return rc;
> > +}
> > +
On Tue, Jun 28, 2022 at 03:57:20PM +0100, Jonathan Cameron wrote:
> On Mon, 27 Jun 2022 21:15:26 -0700
> [email protected] wrote:
>
[snip]
> >
> > -/**
> > - * read_cdat_data - Read the CDAT data on this port
> > - * @port: Port to read data from
> > - *
> > - * This call will sleep waiting for responses from the DOE mailbox.
> > - */
> > -void read_cdat_data(struct cxl_port *port)
> > +static int __read_cdat_data(struct cxl_port *port)
> > {
> > static struct pci_doe_mb *cdat_mb;
> > struct device *dev = &port->dev;
> > struct device *uport = port->uport;
> > size_t cdat_length;
> > - int ret;
> > + int ret = 0;
> Fairly sure there isn't a path in which ret isn't set...
>
Yep.
>
> Mixing ret and rc is a bit inconsistent, maybe scrub patch set for
> one or the other. (My fault originally I think :)
Ok PCI uses both ret and rc. :-( But CXL seems to be consistent with rc. So
I've used rc with the new series which I think satisfies both subsystems.
Thanks again for the detail review of the series. Hopefully there will be a
new version out tomorrow.
Ira
>
>
> >
> > cdat_mb = find_cdat_mb(uport);
> > if (!cdat_mb) {
> > dev_dbg(dev, "No CDAT mailbox\n");
> > - return;
> > + return -EIO;
> > }
> >
> > port->cdat_sup = true;
> >
> > if (cxl_cdat_get_length(dev, cdat_mb, &cdat_length)) {
> > dev_dbg(dev, "No CDAT length\n");
> > - return;
> > + return -EIO;
> > }
> >
> > port->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
> > if (!port->cdat.table)
> > - return;
> > + return -ENOMEM;
> >
> > port->cdat.length = cdat_length;
> > ret = cxl_cdat_read_table(dev, cdat_mb, &port->cdat);
> > @@ -658,5 +652,30 @@ void read_cdat_data(struct cxl_port *port)
> > port->cdat.length = 0;
> > dev_err(dev, "CDAT data read error\n");
> > }
> > +
> > + return ret;
> > +}
> > +
> > +/**
> > + * read_cdat_data - Read the CDAT data on this port
> > + * @port: Port to read data from
> > + *
> > + * This call will sleep waiting for responses from the DOE mailbox.
> > + */
> > +void read_cdat_data(struct cxl_port *port)
> > +{
> > + int retries = 5;
> > + int rc;
> > +
> > + while (retries--) {
> > + rc = __read_cdat_data(port);
> > + if (!rc)
> > + return;
> > + dev_dbg(&port->dev,
> > + "CDAT data read error rc=%d (retries %d)\n",
> > + rc, retries);
> > + }
> > + dev_err(&port->dev, "CDAT data read failed after %d retries\n",
> > + retries);
> > }
> > EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>
On Wed, Jun 29, 2022 at 03:09:47PM +0100, Jonathan Cameron wrote:
> On Tue, 28 Jun 2022 11:20:32 -0700
> Ira Weiny <[email protected]> wrote:
>
[snip]
> > > >
> > > Hi Ira,
> > >
> > > Thanks for keeping at this!
> > >
> > > I think this has reintroduced some of the races around that annoying
> > > interrupt source from BUSY transitioning to low that has
> > > no explicit 'cause' flag. I think we'd hammered all those out in the
> > > previous version but maybe there were still some there...
> >
> > Well I really tried hard not to introduce races which would be a problem. But
> > I would not be surprised.
> >
> > >
> > > I 'think' it will work as is, but depending on the timing a given DOE
> > > implementation has, the interrupt may be completely pointless as it
> > > will be signaling the wrong event.
> >
> > :-/
> >
> > There is a chance that an IRQ comes in just after we timeout waiting for it. I
> > think that has always been the case and the IRQ will effectively be missed I
> > _think_.
>
> The timeout case I'm not that worried about as it means the device
> is out of spec, so whilst it might happen and we don't want to break in that
> case it should be uncommon enough that a perf disadvantage doesn't matter.
Ok I think we are agreed here.
I've removed the irq stuff for now per the call yesterday. But I'm still
interested in how to solve the problem, so see below.
>
>
> ...
>
> > > > +
> > > > +static void doe_statemachine_work(struct work_struct *work)
> > > > +{
> > > > + struct pci_doe_task *task = container_of(work, struct pci_doe_task,
> > > > + work);
> > > > + struct pci_doe_mb *doe_mb = task->doe_mb;
> > > > + struct pci_dev *pdev = doe_mb->pdev;
> > > > + int offset = doe_mb->cap_offset;
> > > > + unsigned int busy_retries = 0;
> > > > + unsigned long timeout_jiffies;
> > > > + u32 val;
> > > > + int rc;
> > > > +
> > > > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> > > > + signal_task_complete(task, -EIO);
> > > > + return;
> > > > + }
> > > > +
> > > > + /* Send request */
> > > > +retry_req:
> > > > + if (pci_doe_arm_wait(doe_mb)) {
> > > > + signal_task_abort(task, -EIO);
> > > > + return;
> > > > + }
> > >
> > > Is there a race here? If Busy drops at this point we queue up
> > > a message, but IRQ bit is already set. Hence when we hit
> > > wait_event_timeout() the flag is already set and IIRC we'll
> > > drop straight through.
> > >
> >
> > I did not realize that the device would interrupt when Busy dropped? I was
> > thinking that V11 did not respond to IRQ but indeed it did via setting the work
> > item to run immediately...
> >
> > However, I only see this in the spec:
> >
> > 1) System firmware/software checks that the DOE Busy bit is Clear to ensure
> > that the DOE instance is ready to receive a DOE request.
>
> I missed this particular one originally and someone else pointed it out in
> review (can't remember who though). The busy drop is mentioned in the
> bit definition. It's in the capability definition.
>
> "DOE Interrupt Status - This bit must be Set when an interrupt is generated
> to indicate that the Data Object Ready bit or the DOE Error bit has been Set. or
> that the DOE Busy bit has been Cleared."
> (the formatting is in the release spec.. hohum)
>
> Anyhow, upshot is that the status can be set as a result of Busy Bit clearing.
> 6.30.3 Interrupt Generation then says that the interrupt is generated on a
> transition of the logical AND of:
>
> 1. Vector unmasked
> 2. DOE interrupt Enable bit is 1
> 3. Value of the DOE interrupt Status bit is 1.
>
> So if interrupt status bit is set to 1 due to a Busy drop and we then
> clear it before Data Object Ready, we'll get 2 interrupts.
I don't understand this. I think the issue is we _don't_ clear it between
pci_doe_arm_wait() and pci_doe_send_req().
Then we interpret the Busy drop interrupt incorrectly as the Data Object Ready
interrupt and start polling for a response immediately rather than waiting for
the Data Object Ready IRQ.
I _think_ this will be ok because the response will not be read until Data
Object Ready is actually set. So we check DOR wait again see the DOR IRQ there
and goto retry_resp to check Data Object Ready again.
Effectively I'm not sure I agree with you about _when_ we get the interrupts
but I do agree that we get an extra one which I'm not checking for _why_. More
important I think that getting more IRQs is better than missing an interrupt
and incorrectly thinking we timed out.
>
> There is another vague bit of language that sort of allows other
> uses of this interrupt for protocol specific stuff. Hopefully
> no one falls for that, but we should safely handle that case (perf
> drop as a result is fine though!) I can't remember where the exact
> language is, but I've had a few 'polite discussions' to persuade
> people using it that way would be a very bad idea...
>
>
>
> >
> > >
> > > It'll probably be fine because it will end up polling below
> > > but doesn't look ideal.
> >
> > I agree it would not be ideal but I think if we are waiting for Busy to be
> > cleared then the pci_doe_arm_wait() should be benign.
>
> I think in some of these paths we are waiting for Data Object Ready to be
> set; the busy drop is effectively acting as a spurious interrupt if we
> clear the status before the data object ready event, which could be much later
> because Busy can clear really quickly.
Ok yea I think this is what I am seeing.
>
> >
> > >
> > > Upshot is that you sort of have to handle "spurious interrupts"
> > > cleanly and rewait on the interrupt if you get one whilst also handling
> > > race conditions around RW1C of the interrupt status flag.
> >
> > Sorry I'm not sure what 'RW1C' means here?
>
> Read / Write 1 to clear. In this case I meant reading it and then clearing it
> without looking at the other status bits.
Ah. Perhaps the handler should be more involved in this, setting different
flags and having the *_wait() functions be more specific about what exactly we
are waiting for. I'll have to think about that.
>
> >
> > Anyway, spurious interrupts were something I was concerned about but I don't
> > think there is anything we can do about an interrupt coming in when we are
> > expecting one but the device did not really send one. AFAIK that is virtually
> > impossible anyway.
>
> In this case seeing 2 interrupts is highly likely.
> We see the Busy drop one and the interrupt handler clears the Interrupt Status
> Bit, then data object becomes ready and we go around again.
But we are only going to see this if some other entity is using the mailbox,
right? And I don't think that is going to be common, is it?
Is this the sequence you are speaking of? If so, I think this is how it would
flow given the fix I suggested below.
Device                Other Entity        Linux CPU

Sets Busy
                                          pci_doe_arm_wait() <- clears FLAG_IRQ
Clears Busy
                                          pci_doe_irq_handler() <set FLAG_IRQ>
                                          pci_doe_send_req() <- Sees !BUSY sends query
                                          pci_doe_wait_irq() <- No waiting because of 'spurious' Busy Drop!!!
                                          pci_doe_arm_wait() <- clears FLAG_IRQ
                                          <DOR not set>
                                          pci_doe_wait_irq() <- NOW waits!!!
Set DOR
                                          pci_doe_irq_handler() <set FLAG_IRQ>
                                          <goto retry_resp>
                                          <DOR set>
                                          pci_doe_recv_resp()
What am I missing?
Ira
>
> >
> > If we actually 'miss' one because we timed out before the device sent it then I
> > think we are going to ignore the PCI_DOE_FLAG_IRQ flag on the next go around.
> >
> > Actually timeout is handled by the abort call and that IRQ will, depending on
> > timing, cause a full PCI_DOE_TIMEOUT to expire. :-( That is indeed not
> > ideal. However, by that time the error and busy flags should be clear and we
> > can safely continue. Otherwise we are going to take the mailbox down.
> >
> > It may seem better to arm wait on each iteration through the abort loop. But
> > this is not logically correct because the abort operation should trigger an
> > IRQ. So there is always a race if we missed an IRQ because we timed out early.
>
> I probably stuck that comment in the wrong place. The initial call to clear
> the flag before this should be fine (short of the 'spurious' case of people
> using the interrupt for protocol specific usage).
>
> >
> > >
> > >
> > > > +
> > > > + rc = pci_doe_send_req(doe_mb, task);
> > > > +
> > > > + /*
> > > > + * The specification does not provide any guidance on how long
> > > > + * some other entity could keep the DOE busy, so try for 1
> > > > + * second then fail. Busy handling is best effort only, because
> > > > + * there is no way of avoiding racing against another user of
> > > > + * the DOE.
> > > > + */
> > > > + if (rc == -EBUSY) {
> > > > + busy_retries++;
> > > > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> > > > + pci_warn(pdev,
> > > > + "[%x] busy for too long (> 1 sec)\n",
> > > > + offset);
> > > > + signal_task_complete(task, rc);
> > > > + return;
> > > > + }
> > > > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
> > > > + signal_task_abort(task, rc);
> > > > + return;
> > > > + }
> > > > + goto retry_req;
> > > > + } else if (rc) {
> > > > + signal_task_abort(task, rc);
> > > > + return;
> > > > + }
> > > > +
> > > > + timeout_jiffies = jiffies + HZ;
> > > > + if (pci_doe_wait_irq_or_poll(doe_mb)) {
> > >
> > > So this may well be passed as a result of a BUSY transition to 0 very soon
> > > after the doe_send_req but well before the data is ready....
> >
> > I think the simple fix is to make the BUSY wait on an IRQ. Something like:
>
>
> >
> > 21:13:53 > git di
> > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> > index 12f9f8045eb7..afd326320798 100644
> > --- a/drivers/pci/doe.c
> > +++ b/drivers/pci/doe.c
> > @@ -352,7 +352,7 @@ static void doe_statemachine_work(struct work_struct *work)
> >  			signal_task_complete(task, rc);
> >  			return;
> >  		}
> > -		if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
> > +		if (pci_doe_wait_irq_or_poll(...)) {
> >  			signal_task_abort(task, rc);
> >  			return;
>
> This case (which I think is the -EBUSY from pci_doe_send_req() handling)
> isn't important because it's trying to paper over a weird condition. We
> don't normally expect to get here.
>
> I was concerned with the line just above my comment which may not act as
> a gate at all because it's tripped by the Busy Drop, which may be
> well before the data object ready that we are actually waiting for.
>
>
> >  		}
> >
> > >
> > >
> > > > + signal_task_abort(task, -EIO);
> > > > + return;
> > > > + }
> > > > +
> > > > + /* Poll for response */
> > > > +retry_resp:
> > > > + if (pci_doe_arm_wait(doe_mb)) {
> > > I think we can get here between Busy drop and Object Ready which means
> > > this can get another IRQ_FLAG setting just after it. Does it matter?
> > > Don't think so, as we don't use that bit again in this run through
> > > and it will be cleared at beginning of next one,
> >
> > Yea basically I agree.
> >
> > > but if so why is
> > > this call here?
> >
> > Seemed like the right thing to do at the time... ;-) j/k
> >
> > > I think it's only useful for detecting an abort, if
> > > so do that explicitly.
> >
> > Actually it is the right thing to do... However, the wait poll below also
> > needs to be an IRQ or poll. I'm not sure how I missed that logic.
>
> Sounds right, though without the whole code laid out to follow through I'm
> not 100% sure yet!
>
> >
> > >
> > > > + signal_task_abort(task, -EIO);
> > > > + return;
> > > > + }
> > > > +
> > > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > + signal_task_abort(task, -EIO);
> > > > + return;
> > > > + }
> > > > +
> > > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> > > > + if (time_after(jiffies, timeout_jiffies)) {
> > > > + signal_task_abort(task, -ETIMEDOUT);
> > > > + return;
> > > > + }
> > > > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) {
> > >
> > > Following on from above....
> > > As a result of the interrupt having fired on the BUSY off transition,
> > > I think we will almost always end up spinning here until Object Ready
> > > is set. Fine, but seems a shame we don't let an interrupt do this
> > > for us in most cases. (note in QEMU response is instantaneous so
> > > when the interrupt for Busy drop is set, object ready also true so
> > > by the time we get here data ready will already be sent).
> >
> > This needs to be pci_doe_wait_irq_or_poll() as well and the arm wait above
> > ensures we continue to look for that interrupt.
> >
> > I'm starting to see how I got confused. The timeout values all vary and
> > mod_delayed_work() made it very easy for you to interrupt any of those.
>
> Yeah. That was a nice suggestion Dan made long ago but it doesn't play well
> with the single workqueue.
>
> >
> > I tried to bundle the poll vs interrupt modes in pci_doe_wait_irq_or_poll() and
> > got confused. :-(
> >
> > >
> > > > + signal_task_abort(task, -EIO);
> > > > + return;
> > > > + }
> > > > + goto retry_resp;
> > > > + }
> > > > +
> > > > + rc = pci_doe_recv_resp(doe_mb, task);
> > > > + if (rc < 0) {
> > > > + signal_task_abort(task, rc);
> > > > + return;
> > > > + }
> > > > +
> > > > + signal_task_complete(task, rc);
> > > > +}
> > > > +
> > >
> > >
> > > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> > > > +{
> > > > + if (doe_mb->work_queue)
> > >
> > > I'm not a great fan of free functions that check a bunch of conditions
> > > because they may be called before things are set up.
> >
> > I'll see what I can do. I do kind of like this but I think it gets muddled and
> > I'm not dead set on either way.
> >
> > > To my
> > > mind that generally means we should be calling individual cleanup
> > > in the appropriate error handlers.
> > >
> > > Either that or just use devm handling for each item. Sure
> > > it's a few more lines of code, but I find it a lot easier to go
> > >
> > > Oh look that thing we just set up is cleaned up by this.
> > >
> > >
> > > > + destroy_workqueue(doe_mb->work_queue);
> > > > + if (pci_doe_irq_enabled(doe_mb))
> > > > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb);
> > > > + xa_destroy(&doe_mb->prots);
> > > > + kfree(doe_mb);
> > > > +}
> > > > +
> > >
> > > ...
> > >
> > > > +
> > > > +static void pci_doe_destroy_mb(void *mb)
> > > > +{
> > > > + struct pci_doe_mb *doe_mb = mb;
> > > > +
> > > > + /* Mark going down */
> > > > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> > > > +
> > > > + /* Abort any in progress work items */
> > > > + pci_doe_abort(doe_mb);
> > >
> > > Abort is getting used for two things in here. Perhaps
> > > rename this one to
> > > pci_doe_abort_tasks() or something like that?
> >
> > What do you mean two things? Oh I think I see. You mean abort the work item
> > vs abort sent to the hardware?
>
> yup.
>
> >
> > This no longer aborts all the tasks, just the one which may be in progress.
> > Because the work queue is ordered only one task may be in progress. I'll clean
> > up the comment too.
>
> Hmm. It puts a requirement on the caller to not queue multiple requests that
> might require ordering. One advantage of flushing the lot was ordering was
> unaffected (though any caller that queued multiple items would have to then
> requeue multiple items so would have to maintain their own retry buffer).
>
> >
> > This sets the abort flag and wakes it up if it is sleeping. If not then the
> > abort flag will be detected in the next arm.
> >
> > FWIW I think I may just remove this call and open code it here.
>
> Sounds good, avoid naming confusion by getting rid of the name :)
>
>
>
> > > > +
> > > > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0,
> > > > + doe_mb->cap_offset);
> > > > + if (!doe_mb->work_queue) {
> > > > + pci_err(pdev, "[%x] failed to allocate work queue\n",
> > > > + doe_mb->cap_offset);
> > > > + pci_doe_free_mb(doe_mb);
> > >
> > > As above, I'd rather this explicitly freed what has been set up
> > > and only that rather than calling a free function that does a bunch of
> > > stuff conditionally.
> >
> > I think I can make that work. This is the only conditional in free however,
> > because the other conditional is the IRQ support which may not be set up.
>
> If you split to multiple devm_ calls you can simply not set up a teardown for the
> irq if we don't have one. Or, don't use pci_request_irq() but call
> devm_request_threaded_irq() directly and let that clean up for you.
>
>
> >
> > Thanks again for the in depth review!
>
> No problem. I know how nasty this seemingly simple little bit of code
> is, so you have my sympathies :)
>
>
> Jonathan
On Tue, Jun 28, 2022 at 03:49:42PM +0100, Jonathan Cameron wrote:
> On Mon, 27 Jun 2022 21:15:25 -0700
> [email protected] wrote:
>
> > From: Ira Weiny <[email protected]>
> >
> > The CDAT data is protected by a checksum and should be the proper
> > length.
> >
> > Introduce cxl_cdat_valid() to validate the data. While at it check and
> > store the sequence number.
> >
> > Signed-off-by: Ira Weiny <[email protected]>
> >
> Minor ordering comment. With that tidied up
> Reviewed-by: Jonathan Cameron <[email protected]>
>
>
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > index 4bd479ec0253..6d775cc3dca1 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -532,6 +532,40 @@ static int cxl_cdat_get_length(struct device *dev,
> > return rc;
> > }
> >
> > +static bool cxl_cdat_valid(struct device *dev, struct cxl_cdat *cdat)
> > +{
> > + u32 *table = cdat->table;
> > + u8 *data8 = cdat->table;
> > + u32 length, seq;
> > + u8 check;
> > + int i;
> > +
> > + length = FIELD_GET(CDAT_HEADER_DW0_LENGTH, table[0]);
> > + if ((length < CDAT_HEADER_LENGTH_BYTES) || (length > cdat->length)) {
> > + dev_err(dev, "CDAT Invalid length %u (%zu-%zu)\n", length,
> > + CDAT_HEADER_LENGTH_BYTES, cdat->length);
> > + return false;
> > + }
> > +
> > + for (check = 0, i = 0; i < length; i++)
> > + check += data8[i];
> > +
> > + dev_dbg(dev, "CDAT length %u CS %u\n", length, check);
> > + if (check != 0) {
> > + dev_err(dev, "CDAT Invalid checksum %u\n", check);
> > + return false;
> > + }
> > +
> > + seq = FIELD_GET(CDAT_HEADER_DW3_SEQUENCE, table[3]);
> > + /* Store the sequence for now. */
> > + if (cdat->seq != seq) {
> > + dev_info(dev, "CDAT seq change %x -> %x\n", cdat->seq, seq);
> > + cdat->seq = seq;
> > + }
> > +
> > + return true;
> > +}
> > +
> > static int cxl_cdat_read_table(struct device *dev,
> > struct pci_doe_mb *cdat_mb,
> > struct cxl_cdat *cdat)
> > @@ -579,6 +613,8 @@ static int cxl_cdat_read_table(struct device *dev,
> >
> > } while (entry_handle != CXL_DOE_TABLE_ACCESS_LAST_ENTRY);
> >
> > + if (!rc && !cxl_cdat_valid(dev, cdat))
> > + return -EIO;
>
> I'd prefer those handled separately as the flow is more readable if error
> handling is always out of line.
>
> 	if (rc)
> 		return rc;
Actually rc is never not 0 here. :-/
>
> 	if (!cxl_cdat_valid)
> 		return -EIO;
>
> 	return 0;
So this is all that is needed. But the previous patch needs some clean up
to make sense in the series.
Ira
On Tue, Jun 28, 2022 at 03:33:17PM +0100, Jonathan Cameron wrote:
> On Mon, 27 Jun 2022 21:15:22 -0700
> [email protected] wrote:
>
> > From: Ira Weiny <[email protected]>
> >
> > DOE mailbox objects will be needed for various mailbox communications
> > with each memory device.
> >
> > Iterate each DOE mailbox capability and create PCI DOE mailbox objects
> > as found.
> >
> > It is not anticipated that this is the final resting place for the
> > iteration of the DOE devices. The support of switch ports will drive
> > this code into the PCIe side. In this imagined architecture the CXL
> > port driver would then query into the PCI device for the DOE mailbox
> > array.
> >
> > For now creating the mailboxes in the CXL port is good enough for the
> > endpoints. Later PCIe ports will need to support this to support switch
> > ports more generically.
> >
> > Cc: Dan Williams <[email protected]>
> > Cc: Davidlohr Bueso <[email protected]>
> > Cc: Lukas Wunner <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
>
> My main comment on this is that we should not paper over any errors in
> DOE setup. Those indicate a bug or hardware fault, so like anything similar
> we should at very least document why it makes sense to continue. In most
> cases I'd argue it doesn't as something is very wrong.
>
> >
> > ---
> > Changes from V11:
> > Drop review from: Ben Widawsky <[email protected]>
> > Remove irq code for now
> > Adjust for pci_doe_get_int_msg_num()
> > Adjust for pcim_doe_create_mb()
> > (No longer need to handle the destroy.)
> > Use xarray for DOE mailbox array
> >
> > Changes from V9:
> > Bug fix: ensure DOE mailboxes are iterated before memdev add
> > Ben Widawsky
> > Set use_irq to false and just return on error.
> > Don't return a value from devm_cxl_pci_create_doe()
> > Skip allocating doe_mb array if there are no mailboxes
> > Skip requesting irqs if none found.
> > Ben/Jonathan Cameron
> > s/num_irqs/max_irqs
> >
> > Changes from V8:
> > Move PCI_DOE selection to CXL_BUS to support future patches
> > which move queries into the port code.
> > Remove Auxiliary device arch
> > Squash the functionality of the auxiliary driver into this
> > patch.
> > Split out the irq handling a bit.
> >
> > Changes from V7:
> > Minor code clean ups
> > Rebased on cxl-pending
> >
> > Changes from V6:
> > Move all the auxiliary device stuff to the CXL layer
> >
> > Changes from V5:
> > Split the CXL specific stuff off from the PCI DOE create
> > auxiliary device code.
> > ---
> > drivers/cxl/Kconfig | 1 +
> > drivers/cxl/cxlmem.h | 3 +++
> > drivers/cxl/pci.c | 37 +++++++++++++++++++++++++++++++++++++
> > 3 files changed, 41 insertions(+)
> >
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index f64e3984689f..7adaaf80b302 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -2,6 +2,7 @@
> > menuconfig CXL_BUS
> > tristate "CXL (Compute Express Link) Devices Support"
> > depends on PCI
> > + select PCI_DOE
> > help
> > CXL is a bus that is electrically compatible with PCI Express, but
> > layers three protocols on that signalling (CXL.io, CXL.cache, and
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 60d10ee1e7fc..360f282ef80c 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -191,6 +191,7 @@ struct cxl_endpoint_dvsec_info {
> > * @component_reg_phys: register base of component registers
> > * @info: Cached DVSEC information about the device.
> > * @serial: PCIe Device Serial Number
> > + * @doe_mbs: PCI DOE mailbox array
> > * @mbox_send: @dev specific transport for transmitting mailbox commands
> > *
> > * See section 8.2.9.5.2 Capacity Configuration and Label Storage for
> > @@ -224,6 +225,8 @@ struct cxl_dev_state {
> > resource_size_t component_reg_phys;
> > u64 serial;
> >
> > + struct xarray doe_mbs;
> > +
> > int (*mbox_send)(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *cmd);
> > };
> >
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 5a0ae46d4989..5821e6c1253b 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -8,6 +8,7 @@
> > #include <linux/mutex.h>
> > #include <linux/list.h>
> > #include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> > #include <linux/io.h>
> > #include "cxlmem.h"
> > #include "cxlpci.h"
> > @@ -386,6 +387,37 @@ static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
> > return rc;
> > }
> >
> > +static void cxl_pci_destroy_doe(void *mbs)
> > +{
> > + struct xarray *xa = mbs;
>
> Local variable doesn't add anything...
>
> > +
> > + xa_destroy(xa);
> > +}
> > +
> > +static void devm_cxl_pci_create_doe(struct cxl_dev_state *cxlds)
> > +{
> > + struct device *dev = cxlds->dev;
> > + struct pci_dev *pdev = to_pci_dev(dev);
> > + u16 off = 0;
> > +
> > + pci_doe_for_each_off(pdev, off) {
> > + struct pci_doe_mb *doe_mb;
> > +
> > + doe_mb = pcim_doe_create_mb(pdev, off, -1);
> > + if (IS_ERR(doe_mb)) {
> > + pci_err(pdev,
> > + "Failed to create MB object for MB @ %x\n",
> > + off);
>
> Definitely at least need a comment for why papering over this failure is
> fine. My gut feeling is we shouldn't ignore it.
I'm so confused at this point I don't really know any more. :-/
>
> > + doe_mb = NULL;
This at least needs to be a continue.
> > + }
> > +
> > + if (xa_insert(&cxlds->doe_mbs, off, doe_mb, GFP_KERNEL))
> > + break;
>
> If we hit that break something has gone horribly wrong and we shouldn't
> paper over it either. We might have a partial list of DOEs and callers
> after this will have no way of knowing it isn't the full list.
I _thought_ that we did not care if some mailboxes failed or not.
If CDAT is not supported on any of the mailboxes found then CDAT will not show
up on sysfs (as per Dan's last comment). If it was supported on a mailbox but
no data is found the sysfs will show up but be 0 length.
At this layer I thought we agreed to skip over these errors. If a protocol is
needed at a higher layer and it is not found on any of the mailboxes the errors
should show up in that layer. In this series CDAT is not 100% necessary as
devices can work without it. So the errors were mostly paper'ed over in favor
of just printing error messages and muddle on.
The xa_insert() deserves a pci_err() though.
And I'll add a comment indicating finding mailboxes is 'best effort' and higher
layers may error out the device depending on necessity. How about this?
/*
* Mailbox creation is best effort. Higher layers must determine if
* the lack of a mailbox for their protocol is a device failure or not.
*/
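With those changes (continue past a failed create, a pci_err() when
xa_insert() fails, and the comment above) the loop would look roughly like
this sketch (the xa_insert() message wording is mine):

        pci_doe_for_each_off(pdev, off) {
                struct pci_doe_mb *doe_mb;

                doe_mb = pcim_doe_create_mb(pdev, off, -1);
                if (IS_ERR(doe_mb)) {
                        pci_err(pdev,
                                "Failed to create MB object for MB @ %x\n",
                                off);
                        continue;
                }

                if (xa_insert(&cxlds->doe_mbs, off, doe_mb, GFP_KERNEL)) {
                        pci_err(pdev,
                                "Failed to insert MB @ %x into the xarray\n",
                                off);
                        break;
                }

                pci_dbg(pdev, "Created DOE mailbox @%x\n", off);
        }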
>
> > +
> > + pci_dbg(pdev, "Created DOE mailbox @%x\n", off);
> > + }
> > +}
> > +
> > static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > {
> > struct cxl_register_map map;
> > @@ -408,6 +440,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > if (IS_ERR(cxlds))
> > return PTR_ERR(cxlds);
> >
> > + xa_init(&cxlds->doe_mbs);
> > + devm_add_action(&pdev->dev, cxl_pci_destroy_doe, &cxlds->doe_mbs);
>
> _or_reset()? If the devm registration itself fails we want to bail out cleanly.
> It's vanishingly unlikely to happen, but we should still handle that case.
>
Actually no; xa_destroy does not need to be called at this point if we fail.
However, I do need to check the return of devm_add_action() and fail if it
returns an error. Thanks again!
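A minimal sketch of that check in cxl_pci_probe():

        xa_init(&cxlds->doe_mbs);
        rc = devm_add_action(&pdev->dev, cxl_pci_destroy_doe, &cxlds->doe_mbs);
        if (rc)
                return rc;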
Ira
> > +
> > cxlds->serial = pci_get_dsn(pdev);
> > cxlds->cxl_dvsec = pci_find_dvsec_capability(
> > pdev, PCI_DVSEC_VENDOR_ID_CXL, CXL_DVSEC_PCIE_DEVICE);
> > @@ -434,6 +469,8 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >
> > cxlds->component_reg_phys = cxl_regmap_to_base(pdev, &map);
> >
> > + devm_cxl_pci_create_doe(cxlds);
> > +
> > rc = cxl_pci_setup_mailbox(cxlds);
> > if (rc)
> > return rc;
>
On Wed, 29 Jun 2022 22:32:57 -0700
Ira Weiny <[email protected]> wrote:
> On Tue, Jun 28, 2022 at 03:33:17PM +0100, Jonathan Cameron wrote:
> > On Mon, 27 Jun 2022 21:15:22 -0700
> > [email protected] wrote:
> >
[snip]
> > > + pci_doe_for_each_off(pdev, off) {
> > > + struct pci_doe_mb *doe_mb;
> > > +
> > > + doe_mb = pcim_doe_create_mb(pdev, off, -1);
> > > + if (IS_ERR(doe_mb)) {
> > > + pci_err(pdev,
> > > + "Failed to create MB object for MB @ %x\n",
> > > + off);
> >
> > Definitely at least need a comment for why papering over this failure is
> > fine. My gut feeling is we shouldn't ignore it.
>
> I'm so confused at this point I don't really know any more. :-/
>
> >
> > > + doe_mb = NULL;
>
> This at least needs to be a continue.
>
> > > + }
> > > +
> > > + if (xa_insert(&cxlds->doe_mbs, off, doe_mb, GFP_KERNEL))
> > > + break;
> >
> > If we hit that break something has gone horribly wrong and we shouldn't
> > paper over it either. We might have a partial list of DOEs and callers
> > after this will have no way of knowing it isn't the full list.
>
> I _thought_ that we did not care if some mailboxes failed or not.
I have a different view to Dan on this. In my view if your hardware is
not working in any way at all scream like mad don't carry on... Dan
is keen to try to muddle onwards.
>
> If CDAT is not supported on any of the mailboxes found then CDAT will not show
> up on sysfs (as per Dan's last comment). If it was supported on a mailbox but
> no data is found the sysfs will show up but be 0 length.
>
> At this layer I thought we agreed to skip over these errors. If a protocol is
> needed at a higher layer and it is not found on any of the mailboxes the errors
> should show up in that layer. In this series CDAT is not 100% necessary as
> devices can work without it. So the errors were mostly paper'ed over in favor
> of just printing error messages and muddle on.
>
> The xa_insert() deserves a pci_err() though.
That's probably the minimum we should do. The xa_insert() failing is something
horrible going wrong in our software / host after all.
>
> And I'll add a comment indicating finding mailboxes is 'best effort' and higher
> layers may error out the device depending on necessity. How about this?
>
> /*
> * Mailbox creation is best effort. Higher layers must determine if
> * the lack of a mailbox for their protocol is a device failure or not.
> */
Good.
>
> >
> > > +
> > > + pci_dbg(pdev, "Created DOE mailbox @%x\n", off);
> > > + }
> > > +}
> > > +
> > > static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > > {
> > > struct cxl_register_map map;
> > > @@ -408,6 +440,9 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > > if (IS_ERR(cxlds))
> > > return PTR_ERR(cxlds);
> > >
> > > + xa_init(&cxlds->doe_mbs);
> > > + devm_add_action(&pdev->dev, cxl_pci_destroy_doe, &cxlds->doe_mbs);
> >
> > _or_reset()? If the devm registration itself fails we want to bail out cleanly.
> > It's vanishingly unlikely to happen, but we should still handle that case.
> >
>
> Actually no; xa_destroy does not need to be called at this point if we fail.
Ah ok, I'd missed that.
> However, I do need to check the return of devm_add_action() and fail if it
> returns an error. Thanks again!
>
> Ira
On Wed, 29 Jun 2022 21:34:18 -0700
Ira Weiny <[email protected]> wrote:
> On Wed, Jun 29, 2022 at 03:09:47PM +0100, Jonathan Cameron wrote:
> > On Tue, 28 Jun 2022 11:20:32 -0700
> > Ira Weiny <[email protected]> wrote:
> >
>
> [snip]
>
> > > > >
> > > > Hi Ira,
> > > >
> > > > Thanks for keeping at this!
> > > >
> > > > I think this has reintroduced some of the races around that annoying
> > > > interrupt source from BUSY transitioning to low that has
> > > > no explicit 'cause' flag. I think we'd hammered all those out in the
> > > > previous version but maybe there were still some there...
> > >
> > > Well I really tried hard not to introduce races which would be a problem. But
> > > I would not be surprised.
> > >
> > > >
> > > > I 'think' it will work as is, but depending on the timing a given DOE
> > > > implementation has, the interrupt may be completely pointless as it
> > > > will be signaling the wrong event.
> > >
> > > :-/
> > >
> > > There is a chance that an IRQ comes in just after we timeout waiting for it. I
> > > think that has always been the case and the IRQ will effectively be missed I
> > > _think_.
> >
> > The timeout case I'm not that worried about as it means the device
> > is out of spec, so whilst it might happen and we don't want to break in that
> > case it should be uncommon enough that a perf disadvantage doesn't matter.
>
> Ok I think we are agreed here.
>
> I've removed the irq stuff for now per the call yesterday. But I'm still
> interested in how to solve the problem so see below.
>
> >
> >
> > ...
> >
> > > > > +
> > > > > +static void doe_statemachine_work(struct work_struct *work)
> > > > > +{
> > > > > + struct pci_doe_task *task = container_of(work, struct pci_doe_task,
> > > > > + work);
> > > > > + struct pci_doe_mb *doe_mb = task->doe_mb;
> > > > > + struct pci_dev *pdev = doe_mb->pdev;
> > > > > + int offset = doe_mb->cap_offset;
> > > > > + unsigned int busy_retries = 0;
> > > > > + unsigned long timeout_jiffies;
> > > > > + u32 val;
> > > > > + int rc;
> > > > > +
> > > > > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> > > > > + signal_task_complete(task, -EIO);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + /* Send request */
> > > > > +retry_req:
> > > > > + if (pci_doe_arm_wait(doe_mb)) {
> > > > > + signal_task_abort(task, -EIO);
> > > > > + return;
> > > > > + }
> > > >
> > > > Is there a race here? If Busy drops at this point we queue up
> > > > a message, but IRQ bit is already set. Hence when we hit
> > > > wait_event_timeout() the flag is already set and IIRC we'll
> > > > drop straight through.
> > > >
> > >
> > > I did not realize that the device would interrupt when Busy dropped? I was
> > > thinking that V11 did not respond to IRQ but indeed it did via setting the work
> > > item to run immediately...
> > >
> > > However, I only see this in the spec:
> > >
> > > 1) System firmware/software checks that the DOE Busy bit is Clear to ensure
> > > that the DOE instance is ready to receive a DOE request.
> >
> > I missed this particular one originally and someone else pointed it out in
> > review (can't remember who though). The busy drop is mentioned in the
> > bit definition. It's in the capability definition.
> >
> > "DOE Interrupt Status - This bit must be Set when an interrupt is generated
> > to indicate that the Data Object Ready bit or the DOE Error bit has been Set. or
> > that the DOE Busy bit has been Cleared."
> > (the formatting is in the release spec.. hohum)
> >
> > Anyhow, upshot is that the status can be set as a result of Busy Bit clearing.
> > 6.30.3 Interrupt Generation: then says that the interrupt is generated on a
> > transition of the logical AND of
> >
> > 1. Vector unmasked
> > 2. DOE interrupt Enable bit is 1
> > 3. Value of the DOE interrupt Status bit is 1.
> >
> > So if interrupt status bit is set to 1 due to a Busy drop and we then
> > clear it before Data Object Ready, we'll get 2 interrupts.
>
> I don't understand this. I think the issue is we _don't_ clear it between
> pci_doe_arm_wait() and pci_doe_send_req().
>
> Then we interpret the Busy drop interrupt incorrectly as the Data Object Ready
> interrupt and start polling for a response immediately rather than waiting for
> the Data Object Ready IRQ.
>
> I _think_ this will be ok because the response will not be read until Data
> Object Ready is actually set. So we check DOR, wait again, see the DOR IRQ there
> and goto retry_resp to check Data Object Ready again.
>
> Effectively I'm not sure I agree with you about _when_ we get the interrupts
> but I do agree that we get an extra one which I'm not checking for _why_. More
> importantly, I think that getting more IRQs is better than missing an interrupt
> and incorrectly thinking we timed out.
>
I think the confusion here is down to where I put the comment that there
was an issue. There are two paths... See below for the waveform
for how BUSY is allowed to change on a normal send_req. May be easier to
read that before this bit...
I think you are referring to the retry_req path - I didn't focus on that
because that shouldn't ever happen in reality. It's papering over a system
doing something stupid. We never expect to enter the state machine handling
with BUSY set but try to carry on if it is.
In that path we could have had an interrupt occur as BUSY stops being true.
If that happens between pci_doe_arm_wait() and send_req() then we have
PCI_DOE_FLAG_IRQ set which means we drop straight through the
pci_doe_wait_irq_or_poll() just before the /* Poll for response */
and, as you say, end up in non-IRQ-gated polling, which is expensive
for no reason. This one I think we solve by separating the busy check
from sending data. So we wait for !BUSY, rearm the interrupt and move on
to actually send data (which can result in a BUSY -> !BUSY transition of
its own).
That normal flow BUSY -> !BUSY transition just after send_req (every time
we send anything the device may briefly set BUSY) results
in an interrupt that means we again drop straight through the wait
before /* Poll for response */ possibly well before Data Object ready
is true. So there we need another dance to ensure the interrupt flag
is cleared but we don't end up waiting on an interrupt that we missed.
That's hard to do with a single flag; you may be better off clearing the
interrupt status directly in the statemachine. As long as you clear it
before reading data_object_ready the race should be closed. If data
object ready isn't set, wait for another interrupt / timeout.
I tried drawing this in ASCII art but it didn't work out. Best bet
is probably to fix as many races as you can find with clear comments
then we'll see if anyone can find others.
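For illustration, a rough sketch of that ordering in the state machine
(PCI_DOE_STATUS_INT_STATUS is the RW1C bit; pdev, offset and val as in
doe_statemachine_work()):

        /*
         * Clear INT status _before_ checking Data Object Ready so a new
         * interrupt can still fire; if DOR is not yet set, go back to
         * waiting for an interrupt or the poll timeout.
         */
        pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
                               PCI_DOE_STATUS_INT_STATUS);
        pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
        if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
                /* wait for another interrupt / timeout */
        }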
> >
> > There is another vague bit of language that sort of allows other
> > uses of this interrupt for protocol specific stuff. Hopefully
> > no one falls for that, but we should safely handle that case (perf
> > drop as a result is fine though!) I can't remember where the exact
> > language is, but I've had a few 'polite discussions' to persuade
> > people using it that way would be a very bad idea...
> >
> >
> >
> > >
> > > >
> > > > It'll probably be fine because it will end up polling below
> > > > but doesn't look ideal.
> > >
> > > I agree it would not be ideal but I think if we are waiting for Busy to be
> > > cleared then the pci_doe_arm_wait() should be benign.
> >
> > I think in some of these paths we are waiting for Data Object Ready to be
> > set, the busy drop is effectively acting as a spurious interrupt if we
> > clear the status before the data object ready event which could be much later
> > because of Busy can clear really quickly.
>
> Ok yea I think this is what I am seeing.
>
> >
> > >
> > > >
> > > > Upshot is that you sort of have to handle "spurious interrupts"
> > > > cleanly and rewait on the interrupt if you get one whilst also handling
> > > > race conditions around RW1C of the interrupt status flag.
> > >
> > > Sorry I'm not sure what 'RW1C' means here?
> >
> > Read / Write 1 to clear. In this case I meant reading it and then clearing it
> > without looking at the other status bits.
>
> Ah. Perhaps the handler should be more involved in this setting different
> flags and having the *_wait() functions be more specific about what exactly we
> are waiting for. I'll have to think about that.
Either that or potentially don't clear the interrupt status in the handler,
but do it in the state machine. The silliness with BUSY means you are going
to get useless interrupts unfortunately.
>
> >
> > >
> > > Anyway, spurious interrupts was something I was concerned about but I don't
> > > think there is anything we can do about an interrupt coming in when we are
> > > expecting one but the device did not really send one. AFAIK that is virtually
> > > impossible anyway.
> >
> > In this case seeing 2 interrupts is highly likely.
> > We see the Busy drop one and the interrupt handler clears the Interrupt Status
> > Bit, then data object becomes ready and we go around again.
>
> But we are only going to see this if some other entity is using the mailbox
> right? And I don't think that is going to be common, is it?
BUSY on entry to doe_statemachine_work() is indeed only relevant if
some other entity is trampling on us. It's best effort only.
BUSY during normal flow is the one I care about.
In most cases it will go like (assuming we clear the int status in the handler as now)
Send Object
BUSY ________|-----___________________
PROC ________|------------------______
OBJ RDY ___________________________-------
Int Status______________-____________-_____
where I've added PROC to mean the device is processing the data.
Once it clears the input buffer on the device and hence the device can accept
another protocol request BUSY will drop. If device has some pipelining
or runs multiple protocols in different threads, you can think of that busy
period just being the time it needs to copy out the request to some protocol
thread specific storage.
You won't see this in QEMU without extra hacks because we shorten the
flow so that whole thing is instantaneous.
If those two interrupts per transfer occur well spread out they can result in
your INT flag being set too often and some of the waits dropping through early.
It will 'work' I think though because you ultimately spin on Data object
ready which won't be set until after the second interrupt.
>
> Is this the sequence you are speaking of? If so I think this is how it would
> flow given the fix I suggested below.
>
> Device Other Entity Linux CPU
> Sets Busy
> pci_doe_arm_wait() <- clears FLAG_IRQ
> Clears Busy
> pci_doe_irq_handler() <set FLAG_IRQ>
> pci_doe_send_req() <- Sees !BUSY sends query
> pci_doe_wait_irq() <- No waiting because of 'spurious' Busy Drop!!!
>
> pci_doe_arm_wait() <- clears FLAG_IRQ
> <DOR not set>
> pci_doe_wait_irq() <- NOW waits!!!
> Set DOR
> pci_doe_irq_handler() <set FLAG_IRQ>
> <goto retry_resp>
> <DOR set>
> pci_doe_recv_resp()
>
> What am I missing?
It's not some other entity causing BUSY to be set, it's us :)
One simple route might be to have your interrupt handler not set the flag
unless we have Data Object Ready or Abort, but you need to take
care not to race in the handler, probably by reading that only after clearing
the interrupt status (thus allowing a new interrupt).
Jonathan
>
> Ira
>
> >
> > >
> > > If we actually 'miss' one because we timed out before the device sent it then I
> > > think we are going to ignore the PCI_DOE_FLAG_IRQ flag on the next go around.
> > >
> > > Actually timeout is handled by the abort call and that IRQ will, depending on
> > > timing, cause a full PCI_DOE_TIMEOUT to expire. :-( That is indeed not
> > > ideal. However, by that time the error and busy flags should be clear and we
> > > can safely continue. Otherwise we are going to take the mailbox down.
> > >
> > > It may seem better to arm wait on each iteration through the abort loop. But
> > > this is not logically correct because the abort operation should trigger an
> > > IRQ. So there is always a race if we missed an IRQ because we timed out early.
> >
> > I probably stuck that comment in the wrong place. The initial call to clear
> > the flag before this should be fine (short of the 'spurious' case of people
> > using the interrupt for protocol specific usage).
> >
> > >
> > > >
> > > >
> > > > > +
> > > > > + rc = pci_doe_send_req(doe_mb, task);
> > > > > +
> > > > > + /*
> > > > > + * The specification does not provide any guidance on how long
> > > > > + * some other entity could keep the DOE busy, so try for 1
> > > > > + * second then fail. Busy handling is best effort only, because
> > > > > + * there is no way of avoiding racing against another user of
> > > > > + * the DOE.
> > > > > + */
> > > > > + if (rc == -EBUSY) {
> > > > > + busy_retries++;
> > > > > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> > > > > + pci_warn(pdev,
> > > > > + "[%x] busy for too long (> 1 sec)\n",
> > > > > + offset);
> > > > > + signal_task_complete(task, rc);
> > > > > + return;
> > > > > + }
> > > > > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
> > > > > + signal_task_abort(task, rc);
> > > > > + return;
> > > > > + }
> > > > > + goto retry_req;
> > > > > + } else if (rc) {
> > > > > + signal_task_abort(task, rc);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + timeout_jiffies = jiffies + HZ;
> > > > > + if (pci_doe_wait_irq_or_poll(doe_mb)) {
> > > >
> > > > So this may well be passed as a result of a BUSY transition to 0 very soon
> > > > after the doe_send_req but well before the data is ready....
> > >
> > > I think the simple fix is to make the BUSY wait on an IRQ. Something like:
> >
> >
> > >
> > > 21:13:53 > git di
> > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> > > index 12f9f8045eb7..afd326320798 100644
> > > --- a/drivers/pci/doe.c
> > > +++ b/drivers/pci/doe.c
> > > @@ -352,7 +352,7 @@ static void doe_statemachine_work(struct work_struct *work)
> > > signal_task_complete(task, rc);
> > > return;
> > > }
> > > - if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) {
> > > + if (pci_doe_wait_irq_or_poll(...)) {
> > > signal_task_abort(task, rc);
> > > return;
> >
> > This case (which I think is the -EBUSY from pci_doe_send_req() handling)
> > isn't important because it's trying to paper over a weird condition. We
> > don't normally expect to get here.
> >
> > I was concerned with the line just above my comment which may not act as
> > a gate at all because it's tripped by the the Busy Drop, which may be
> > well before the data object ready that we are actually waiting for.
> >
> >
> > > }
> > >
> > > >
> > > >
> > > > > + signal_task_abort(task, -EIO);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + /* Poll for response */
> > > > > +retry_resp:
> > > > > + if (pci_doe_arm_wait(doe_mb)) {
> > > > I think we can get here between Busy drop and Object Ready which means
> > > > this can get another IRQ_FLAG setting just after it. Does it matter?
> > > > Don't think so, as we don't use that bit again in this run through
> > > > and it will be cleared at beginning of next one,
> > >
> > > Yea basically I agree.
> > >
> > > > but if so why is
> > > > this call here?
> > >
> > > Seemed like the right thing to do at the time... ;-) j/k
> > >
> > > > I think it's only useful for detecting an abort, if
> > > > so do that explicitly.
> > >
> > > Actually it is the right thing to do... However, the wait poll below also
> > > needs to be an IRQ or poll. I'm not sure how I missed that logic.
> >
> > Sounds right, though without the whole code laid out to follow through I'm
> > not 100% sure yet!
> >
> > >
> > > >
> > > > > + signal_task_abort(task, -EIO);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > + signal_task_abort(task, -EIO);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> > > > > + if (time_after(jiffies, timeout_jiffies)) {
> > > > > + signal_task_abort(task, -ETIMEDOUT);
> > > > > + return;
> > > > > + }
> > > > > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) {
> > > >
> > > > Following on from above....
> > > > As a result of the interrupt having fired on the BUSY off transition,
> > > > I think we will almost always end up spinning here until Object Ready
> > > > is set. Fine, but seems a shame we don't let an interrupt do this
> > > > for us in most cases. (note in QEMU response is instantaneous so
> > > > when the interrupt for Busy drop is set, object ready also true so
> > > > by the time we get here data ready will already be set).
> > >
> > > This needs to be pci_doe_wait_irq_or_poll() as well and the arm wait above
> > > ensures we continue to look for that interrupt.
> > >
> > > I'm starting to see how I got confused. The timeout values all vary and
> > > mod_delayed_work() made it very easy for you to interrupt any of those.
> >
> > Yeah. That was a nice suggestion Dan made long ago but it doesn't play well
> > with the single workqueue.
> >
> > >
> > > I tried to bundle the poll vs interrupt modes in pci_doe_wait_irq_or_poll() and
> > > got confused. :-(
> > >
> > > >
> > > > > + signal_task_abort(task, -EIO);
> > > > > + return;
> > > > > + }
> > > > > + goto retry_resp;
> > > > > + }
> > > > > +
> > > > > + rc = pci_doe_recv_resp(doe_mb, task);
> > > > > + if (rc < 0) {
> > > > > + signal_task_abort(task, rc);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + signal_task_complete(task, rc);
> > > > > +}
> > > > > +
> > > >
> > > >
> > > > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> > > > > +{
> > > > > + if (doe_mb->work_queue)
> > > >
> > > > I'm not a great fan of free functions that check a bunch of conditions
> > > > because they may be called before things are set up.
> > >
> > > I'll see what I can do. I do kind of like this but I think it gets muddled and
> > > I'm not dead set on either way.
> > >
> > > > To my
> > > > mind that generally means we should be calling individual cleanup
> > > > in the appropriate error handlers.
> > > >
> > > > Either that or just use devm handling for each item. Sure
> > > > it's a few more lines of code, but I find it a lot easier to go
> > > >
> > > > Oh look that thing we just set up is cleaned up by this.
> > > >
> > > >
> > > > > + destroy_workqueue(doe_mb->work_queue);
> > > > > + if (pci_doe_irq_enabled(doe_mb))
> > > > > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb);
> > > > > + xa_destroy(&doe_mb->prots);
> > > > > + kfree(doe_mb);
> > > > > +}
> > > > > +
> > > >
> > > > ...
> > > >
> > > > > +
> > > > > +static void pci_doe_destroy_mb(void *mb)
> > > > > +{
> > > > > + struct pci_doe_mb *doe_mb = mb;
> > > > > +
> > > > > + /* Mark going down */
> > > > > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> > > > > +
> > > > > + /* Abort any in progress work items */
> > > > > + pci_doe_abort(doe_mb);
> > > >
> > > > Abort is getting used for two things in here. Perhaps
> > > > rename this one to
> > > > pci_doe_abort_tasks() or something like that?
> > >
> > > What do you mean two things? Oh I think I see. You mean abort the work item
> > > vs abort sent to the hardware?
> >
> > yup.
> >
> > >
> > > This no longer aborts all the tasks just the one which may be in progress.
> > > Because the work queue is ordered only one task may be in progress. I'll clean
> > > up the comment too.
> >
> > Hmm. It puts a requirement on the caller to not queue multiple requests that
> > might require ordering. One advantage of flushing the lot was ordering was
> > unaffected (though any caller that queued multiple items would have to then
> > requeue multiple items so would have to maintain their own retry buffer).
> >
> > >
> > > This sets the abort flag and wakes it up if it is sleeping. If not then the
> > > abort flag will be detected in the next arm.
> > >
> > > FWIW I think I may just remove this call and open code it here.
> >
> > Sounds good, avoid naming confusion by getting rid of the name :)
> >
> >
> >
> > > > > +
> > > > > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0,
> > > > > + doe_mb->cap_offset);
> > > > > + if (!doe_mb->work_queue) {
> > > > > + pci_err(pdev, "[%x] failed to allocate work queue\n",
> > > > > + doe_mb->cap_offset);
> > > > > + pci_doe_free_mb(doe_mb);
> > > >
> > > > As above, I'd rather this explicitly freed what has been set up
> > > > and only that rather than calling a free function that does a bunch of
> > > > stuff conditionally.
> > >
> > > I think I can make that work. This is the only conditional in free however,
> > > because the other conditional is the IRQ support which may not be set up.
> >
> > If you split to multiple devm_ calls you can skip setting up a teardown for the
> > irq if we don't have one. Or, don't use pci_request_irq() but call
> > devm_request_threaded_irq() directly and let that clean up for you.
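Something like this, say (a sketch only; the handler name and the vector
lookup are assumptions):

        irq = pci_irq_vector(pdev, doe_mb->int_msg_num);
        if (irq < 0)
                return irq;

        rc = devm_request_threaded_irq(&pdev->dev, irq, NULL,
                                       pci_doe_irq_handler, IRQF_ONESHOT,
                                       "DOE", doe_mb);
        if (rc)
                return rc;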
> >
> >
> > >
> > > Thanks again for the in depth review!
> >
> > No problem. I know how nasty this seemingly simple little bit of code
> > is, so you have my sympathies :)
> >
> >
> > Jonathan
On Wed, 29 Jun 2022 20:35:34 -0700
Ira Weiny <[email protected]> wrote:
> On Tue, Jun 28, 2022 at 03:47:27PM +0100, Jonathan Cameron wrote:
> > On Mon, 27 Jun 2022 21:15:24 -0700
> > [email protected] wrote:
> >
> > > From: Ira Weiny <[email protected]>
> > >
> > > The OS will need CDAT data from CXL devices to properly set up
> > > interleave sets. Currently this is supported through a DOE mailbox
> > > which supports CDAT.
> > >
> > > Search the DOE mailboxes available, query CDAT data, and cache the data
> > > for later parsing.
> > >
> > > Provide a sysfs binary attribute to allow dumping of the CDAT.
> > >
> > > Binary dumping is modeled on /sys/firmware/ACPI/tables/
> > >
> > > The ability to dump this table will be very useful for emulation of real
> > > devices once they become available as QEMU CXL type 3 device emulation will
> > > be able to load this file in.
> > >
> > > This does not support table updates at runtime. It will always provide
> > > whatever was there when first cached. Handling of table updates can be
> > > implemented later.
> > >
> > > Finally create a complete list of CDAT defines within cdat.h for code
> > > wishing to decode the CDAT table.
> > >
> > > Signed-off-by: Jonathan Cameron <[email protected]>
> > > Co-developed-by: Jonathan Cameron <[email protected]>
> > > Signed-off-by: Ira Weiny <[email protected]>
> >
> > One query inline, though I'm fine with it either way, just want to
> > understand your logic in keeping completion when Dan suggested using
> > flush_work() to achieve the same thing.
> >
> > >
> > > ---
> > > Changes from V11:
> > > Adjust for the use of DOE mailbox xarray
> > > Dan Williams:
> > > Remove unnecessary get/put device
> > > Use new BIN_ATTR_ADMIN_RO macro
> > > Flag that CDAT was supported
> > > If there is a read error then the CDAT sysfs
> > > will return a 0 length entry
> > >
> > ...
> > > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > > index c4c99ff7b55e..4bd479ec0253 100644
> > > --- a/drivers/cxl/core/pci.c
> > > +++ b/drivers/cxl/core/pci.c
> > > @@ -4,10 +4,12 @@
> >
> > > +static int cxl_cdat_get_length(struct device *dev,
> > > + struct pci_doe_mb *cdat_mb,
> > > + size_t *length)
> > > +{
> > > + u32 cdat_request_pl = CDAT_DOE_REQ(0);
> > > + u32 cdat_response_pl[32];
> > > + DECLARE_COMPLETION_ONSTACK(c);
> > > + struct pci_doe_task task = {
> > > + .prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> > > + .prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> > > + .request_pl = &cdat_request_pl,
> > > + .request_pl_sz = sizeof(cdat_request_pl),
> > > + .response_pl = cdat_response_pl,
> > > + .response_pl_sz = sizeof(cdat_response_pl),
> > > + .complete = cxl_doe_task_complete,
> > > + .private = &c,
> > > + };
> > > + int rc = 0;
> > > +
> > > + rc = pci_doe_submit_task(cdat_mb, &task);
> > > + if (rc < 0) {
> > > + dev_err(dev, "DOE submit failed: %d", rc);
> > > + return rc;
> > > + }
> > > + wait_for_completion(&c);
> >
> > Dan mentioned in his review that we could just use
> > flush_work() and drop the completion logic and callback.
> > Why didn't you go that way? Would want to wrap it up
> > in pci_doe_wait_task() or something like that.
>
> In early reviews of the Aux Bus work Dan also specified the above design
> pattern.
>
> "I would expect that the caller of this routine [pci_doe_exchange_sync]
> would want to specify the task and end_task() callback and use that as
> the completion signal. It may also want "no wait" behavior where it is
> prepared for the DOE result to come back sometime later. With that
> change the exchange fields can move into the task directly."
>
> -- https://lore.kernel.org/linux-cxl/CAPcyv4hYAgyf-WcArGvbWHAJgc5+p=OO_6ah_dXJhNM5cXcVTw@mail.gmail.com/
>
> I've renamed pci_doe_exchange_sync() to pci_doe_submit_task() because of other
> feedback. Here the callback is set to cxl_doe_task_complete() and we have to
> wait because this particular query needs the length prior to the next task
> being issued.
>
> I use flush_workqueue() within the shutdown handling (or if someone destroys
> the mailbox or abort fails) to first block new work from being submitted
> (PCI_DOE_FLAG_DEAD), cancel the running work if needed (PCI_DOE_FLAG_CANCEL
> [was ABORT]), and then flush the queue causing all the pending work to error
> when seeing PCI_DOE_FLAG_DEAD.
I'm lost, but I'm fine with completions anyway so no problem :)
I wasn't advocating not waiting, but potentially a different way of doing
the wait. If Dan cares about what I think he was proposing with flush_work()
I'll leave it to him to explain why.
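For reference, the flush_work() flavour might be wrapped up along these
lines (pci_doe_wait_task() is hypothetical, not part of this series):

        /* Hypothetical synchronous wrapper built on flush_work() */
        static int pci_doe_wait_task(struct pci_doe_mb *doe_mb,
                                     struct pci_doe_task *task)
        {
                int rc = pci_doe_submit_task(doe_mb, task);

                if (rc)
                        return rc;
                flush_work(&task->work);
                return task->rv;
        }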
Jonathan
>
> Ira
>
> >
> > > +
> > > + if (task.rv < 1)
> > > + return -EIO;
> > > +
> > > + *length = cdat_response_pl[1];
> > > + dev_dbg(dev, "CDAT length %zu\n", *length);
> > > +
> > > + return rc;
> > > +}
> > > +
>
On Thu, 30 Jun 2022, Jonathan Cameron wrote:
>On Wed, 29 Jun 2022 22:32:57 -0700 Ira Weiny <[email protected]> wrote:
>> I _thought_ that we did not care if some mailboxes failed or not.
>
>I have a different view to Dan on this. In my view if your hardware is
>not working in any way at all scream like mad don't carry on... Dan
>is keen to try to muddle onwards.
I am also of the view that we should not carry on upon any indication of failure.
>>
>> If CDAT is not supported on any of the mailboxes found then CDAT will not show
>> up on sysfs (as per Dan's last comment). If it was supported on a mailbox but
>> no data is found the sysfs will show up but be 0 length.
>>
>> At this layer I thought we agreed to skip over these errors. If a protocol is
>> needed at a higher layer and it is not found on any of the mailboxes the errors
>> should show up in that layer. In this series CDAT is not 100% necessary as
>> devices can work without it. So the errors were mostly paper'ed over in favor
>> of just printing error messages and muddle on.
>>
>> The xa_insert() deserves a pci_err() though.
>
>That's probably the minimum we should do. The xa_insert() failing is something
>horrible going wrong in our software / host after all.
Yes. And in addition, devm_cxl_pci_create_doe() should return any error status, and
cxl_pci_probe() can choose to ignore any errors, but it's still better to have it.
Thanks,
Davidlohr
On Thu, Jun 30, 2022 at 09:14:06AM -0700, Davidlohr Bueso wrote:
[snip]
>
> Yes. And in addition, devm_cxl_pci_create_doe() should return any error status, and
> cxl_pci_probe() can choose to ignore any errors, but it's still better to have it.
I don't think this adds anything. If devm_cxl_pci_create_doe() already prints
error messages, what is cxl_pci_probe() going to do with the error if it does
not fail the probe?
I want devm_cxl_pci_create_doe() to continue trying to create the mailbox
objects. So for now I'm going to print the errors in devm_cxl_pci_create_doe()
and keep going without returning any error.
Ira
>
> Thanks,
> Davidlohr
On Thu, Jun 30, 2022 at 04:25:40PM +0100, Jonathan Cameron wrote:
> On Wed, 29 Jun 2022 21:34:18 -0700
> Ira Weiny <[email protected]> wrote:
>
[snip]
I've dropped the IRQ support and was polishing things up. Without the IRQ I
don't think any 'arming' makes sense.
However, in working through the sequence again I think I found another problem.
I _think_... :-/
> >
> > But we are only going to see this if some other entity is using the mailbox
> > right? And I don't think that is going to be common, is it?
>
> BUSY on entry to doe_statemachine_work() is indeed only relevant if
> some other entity is trampling on us. It's best effort only.
>
> BUSY during normal flow is the one I care about.
> In most cases it will go like (assuming we clear the int status in the handler as now)
>
> Send Object
> BUSY ________|-----___________________
> PROC ________|------------------______
> OBJ RDY ___________________________-------
> Int Status______________-____________-_____
So I did not realize that BUSY could clear like this. I thought the point of
BUSY was to indicate someone else had an exchange in flight.
What happens if another entity jumps in during the PROC time? How does one
know that OBJ RDY is _our_ object ready and not someone else's?
For example 'entity' issues a send, we see busy clear and also start a
send. But the device is still processing the send from 'entity':
Send Object(entity) Send Object (Linux)
BUSY ___|----_______________|---______________________________
PROC ___|-----------------------------___|-----------------___
OBJ RDY _________________________________-------______________---
Int Status________-__________________-_____-____________________-__
^^^
This...
... is _not_ Linux's object!?!?!?!
Can that happen?
If so this is entirely broken. Even Polling OBJ RDY will break. And worse yet
we will not even see BUSY being set in any 'abnormal' way.
>
> where I've added PROC to mean the device is processing the data.
> Once it clears the input buffer on the device and hence the device can accept
> another protocol request BUSY will drop. If device has some pipelining
> or runs multiple protocols in different threads, you can think of that busy
> period just being the time it needs to copy out the request to some protocol
> thread specific storage.
BUSY was not at all doing what I thought it did. I'm now concerned that it is
completely broken WRT other entities even without IRQs. Frankly I'm
confused why pci_doe_send_req() even checks for busy because it is unlikely
that we will ever see it set. For sure we won't from our side because the
workqueue is going to process one task at a time.
If Linux wanted to have multiple objects in flight I think we would need a much
more complex state machine than we had. Maybe your original state machine
handled this. If so, I apologize for missing this subtle point.
At this point I'm debating removing the check for BUSY as well because I don't
see the point. (Other than maybe flagging some error to say that 'entity' may
be messing things up for us and bailing.)
Thoughts?
Ira
>
> You won't see this in QEMU without extra hacks because we shorten the
> flow so that whole thing is instantaneous.
>
> If those two interrupts per transfer occur well spread out they can result in
> your INT flag being set too often and some of the waits dropping through early.
>
> It will 'work' I think though because you ultimately spin on Data object
> ready which won't be set until after the second interrupt.
>
On Fri, 1 Jul 2022 15:22:38 -0700
Ira Weiny <[email protected]> wrote:
> On Thu, Jun 30, 2022 at 04:25:40PM +0100, Jonathan Cameron wrote:
> > On Wed, 29 Jun 2022 21:34:18 -0700
> > Ira Weiny <[email protected]> wrote:
> >
>
> [snip]
>
> I've dropped the IRQ support and was polishing things up. Without the IRQ I
> don't think any 'arming' makes sense.
>
> However, in working through the sequence again I think I found another problem.
> I _think_... :-/
>
> > >
> > > But we are only going to see this if some other entity is using the mailbox
> > > right? And I don't think that is going to be common, is it?
> >
> > BUSY on entry to doe_statemachine_work() is indeed only relevant if
> > some other entity is trampling on us. It's best effort only.
> >
> > BUSY during normal flow is the one I care about.
> > In most cases it will go like (assuming we clear the int status in the handler as now)
> >
> > Send Object
> > BUSY ________|-----___________________
> > PROC ________|------------------______
> > OBJ RDY ___________________________-------
> > Int Status______________-____________-_____
>
> So I did not realize that BUSY could clear like this. I thought the point of
> BUSY was to indicate someone else had an exchange in flight.
Unfortunately the spec doesn't provide any way of indicating 'who' is using
the DOE. All busy says is that right now the mailbox is not capable of receiving
a new request. Way back in one of the early postings we considered just dropping
the 'best effort' wait that is there, but I think we concluded it was harmless
and might make things a tiny bit more stable if there was something stale
from before OS load.
>
> What happens if another entity jumps in during the PROC time? How does one
> know that OBJ RDY is _our_ object ready and not someone else's?
Absolutely. The reality is that DOE isn't suitable for multi-actor use.
We need to put in some mediation. One thing being neglected on my todo
list is that we need a _DSM in ACPI or similar to negotiate access plus
potentially some firmware interfaces to allow the OS to make firmware
mediated calls. Those firmware interfaces may be at the protocol level
or even further up the stack.
Not sure if we got to it, but this problem was in the slides for
last year's Plumbers uconf talk on DOE.
>
> For example 'entity' issues a send, we see busy clear and also start a
> send. But the device is still processing the send from 'entity':
>
> Send Object(entity) Send Object (Linux)
> BUSY ___|----_______________|---______________________________
> PROC ___|-----------------------------___|-----------------___
> OBJ RDY _________________________________-------______________---
> Int Status________-__________________-_____-____________________-__
>
> ^^^
> This...
>
> ... is _not_ Linux's object!?!?!?!
>
> Can that happen?
yup.
>
> If so this is entirely broken. Even Polling OBJ RDY will break. And worse yet
> we will not even see BUSY being set in any 'abnormal' way.
>
> >
> > where I've added PROC to mean the device is processing the data.
> > Once it clears the input buffer on the device and hence the device can accept
> > another protocol request BUSY will drop. If device has some pipelining
> > or runs multiple protocols in different threads, you can think of that busy
> > period just being the time it needs to copy out the request to some protocol
> > thread specific storage.
>
> BUSY was not at all doing what I thought it did. I'm now concerned that it is
> completely broken WRT other entities even without IRQs. Frankly I'm
> confused why pci_doe_send_req() even checks for busy because it is unlikely
> that we will ever see it set. For sure we won't from our side because the
> workqueue is going to process one task at a time.
yup, we could drop it, but leave some comment in there that says the spec
suggests checking it.
>
> If Linux wanted to have multiple objects in flight I think we would need a much
> more complex state machine than we had. Maybe your original state machine
> handled this. If so, I apologize for missing this subtle point.
It didn't. I decided that it wasn't worth the effort :)
>
> At this point I'm debating removing the check for BUSY as well because I don't
> see the point. (Other than maybe flagging some error to say that 'entity' may
> be messing things up for us and bailing.)
>
> Thoughts?
I'm fine with replacing it with comments, or an error print to say that IIRC
the spec says we should wait for it, but reality is that it doesn't work.
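One possible shape for that (a sketch only; wording is mine):

        /*
         * Best effort only: the spec suggests checking DOE Busy before
         * writing a request, but provides no way to reliably share the
         * mailbox with another entity, so warn rather than retry.
         */
        pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
        if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
                pci_warn(pdev, "[%x] busy; another entity may be using this DOE\n",
                         offset);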
Guess I should get on with proposing a _DSM interface to deal with this.
It's a bit messy though as it relies on reliable matching of PCI devices against
firmware. In theory, with the right 'no reenumeration' flags that has a high
chance of working these days but requires some extra language to say that all
bets are off if you reenumerate before figuring out the ACPI to PCI device mapping.
I dropped the ball on getting that element in place.
What fun ;)
Jonathan