# Changes since RFC v3 [1]
* Added error message when payload size is too small. (Ben)
* Fix includes in UAPI for Clang (LKP)
* Reorder CXL in MAINTAINERS (Joe Perches)
* Kconfig whitespace and spelling fixes (Randy)
* Remove excess frees controlled by devm, introduced in v3 (Jonathan, Dan)
* Use 'PCI Express' instead of 'PCI-E' in Kconfig (Jonathan)
* Fail when a mailbox command's return value is an error (Jonathan)
* Add comment to mailbox protocol to explain ordering of operations
(Jonathan, Ben)
* Fail mailbox xfer when doorbell is busy. (Jonathan)
* Remove extraneous SHIFT defines. (Jonathan)
* Change kdocs for mbox_cmd size_out to output only. (Jonathan)
* Fix transient bug (ENOTTY) in CXL_MEM_QUERY_COMMANDS (Jonathan)
* Add some comments and code beautification to mbox commands (Jonathan)
* Add some comments and code beautification to user commands (Jonathan)
* Fix bogus check of memcpy return value (Ben)
* Add concept of blocking certain RAW opcodes (Dan)
* Add debugfs knob to allow all RAW opcodes (Vishal)
* Move docs to driver-api/ (Dan)
* Use bounce buffer again like in v2 (Jonathan)
* Use kvzalloc instead of memdup (Ben)
* Wordsmith some changelogs and documentation (Dan)
* Use a percpu_ref counter to protect devm allocated data in the ioctl path
(Dan)
* Rework cdev registration and lookup to use inode->i_cdev (Dan)
* Drop mutex_lock_interruptible() from ioctl path (Dan)
* Convert add_taint() to WARN_TAINT_ONCE()
* Drop ACPI coordination for pure mailbox driver milestone (Dan)
* Permit GET_LOG with CEL_UUID (Ben)
* Cover letter overhaul (Ben)
* Use info.id instead of CXL_COMMAND_INDEX (Dan)
* Add several new commands to the mailbox interface (Ben)
---
In addition to the mailing list, please feel free to use #cxl on OFTC IRC for
discussion.
---
# Summary
Introduce support for “type-3” memory devices defined in the Compute Express
Link (CXL) 2.0 specification [2]. Specifically, these are the memory devices
defined by section 8.2.8.5 of the CXL 2.0 spec. A reference implementation
emulating these devices has been submitted to the QEMU mailing list [3] and is
available on gitlab [4], but will move to a shared tree on kernel.org after
initial acceptance. “Type-3” is a CXL device that acts as a memory expander for
RAM or Persistent Memory. The device might be interleaved with other CXL devices
in a given physical address range.
In addition to the core functionality of discovering the spec defined registers
and resources, introduce a CXL device model that will be the foundation for
translating CXL capabilities into existing Linux infrastructure for Persistent
Memory and other memory devices. For now, this only includes support for the
management command mailbox and the surfacing of type-3 devices. These control
devices fill the role of “DIMMs” / nmemX memory-devices in LIBNVDIMM terms.
## Userspace Interaction
The interfaces for interacting with type-3 devices via the CXL drivers are
introduced in this patch series and are considered stable ABI. They include:
* sysfs - Documentation/ABI/testing/sysfs-bus-cxl
* IOCTL - Documentation/driver-api/cxl/memory-devices.rst
* debugfs - Documentation/ABI/testing/debugfs-cxl
Work is in progress to add support for CXL interactions to the ndctl project [5].
### Development plans
One of the unique challenges that CXL imposes on the Linux driver model is that
it requires the operating system to perform physical address space management
interleaved across devices and bridges. Whereas LIBNVDIMM handles a list of
established static persistent memory address ranges (for example from the ACPI
NFIT), CXL introduces hotplug and the concept of allocating address space to
instantiate persistent memory ranges. This is similar to PCI in the sense that
the platform establishes the MMIO range for PCI BARs to be allocated, but it is
significantly complicated by the fact that a given device can optionally be
interleaved with other devices and can participate in several interleave-sets at
once. LIBNVDIMM handled something like this with the aliasing between PMEM and
BLOCK-WINDOW mode, but CXL adds flexibility to alias DEVICE MEMORY through up to
10 decoders per device.
All of the above needs to be enabled with respect to PCI hotplug events on
Type-3 memory devices, which need hooks to determine if a given device is
contributing to a "System RAM" address range that cannot be unplugged. In
other words, CXL ties PCI hotplug to memory hotplug, and PCI hotplug needs to be
able to negotiate with memory hotplug. In the medium term the implications of
CXL hotplug vs ACPI SRAT/SLIT/HMAT need to be reconciled. One capability that
seems to be needed is either the dynamic allocation of new memory nodes, or
default initializing extra pgdat instances beyond what is enumerated in ACPI
SRAT to accommodate hot-added CXL memory.
Patches welcome, questions welcome as the development effort on the post v5.12
capabilities proceeds.
## Running in QEMU
The incantation to get CXL support in QEMU [4] is considered unstable at this
time. Future readers of this cover letter should verify if any changes are
needed. For the novice QEMU user, the following can be copy/pasted into a
working QEMU command line. It is enough to make the simplest topology possible.
The topology would consist of a single memory window, single type3 device,
single root port, and single host bridge.
+-------------+
| CXL PXB |
| |
| +-------+ |<----------+
| |CXL RP | | |
+--+-------+--+ v
| +----------+
| | "window" |
| +----------+
v ^
+-------------+ |
| CXL Type 3 | |
| Device |<----------+
+-------------+
// Memory backend
-object memory-backend-file,id=cxl-mem1,share,mem-path=cxl-type3,size=512M
// Host Bridge
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52,uid=0,len-window-base=1,window-base[0]=0x4c0000000,memdev[0]=cxl-mem1
// Single root port
-device cxl-rp,id=rp0,bus=cxl.0,addr=0.0,chassis=0,slot=0,memdev=cxl-mem1
// Single type3 device
-device cxl-type3,bus=rp0,memdev=cxl-mem1,id=cxl-pmem0,size=256M
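For reference, a hypothetical assembled invocation (the qemu-system-x86_64
binary name and the base options are assumptions; splice the options above
into whatever QEMU command line already works for you):

qemu-system-x86_64 [existing options] \
    -object memory-backend-file,id=cxl-mem1,share,mem-path=cxl-type3,size=512M \
    -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52,uid=0,len-window-base=1,window-base[0]=0x4c0000000,memdev[0]=cxl-mem1 \
    -device cxl-rp,id=rp0,bus=cxl.0,addr=0.0,chassis=0,slot=0,memdev=cxl-mem1 \
    -device cxl-type3,bus=rp0,memdev=cxl-mem1,id=cxl-pmem0,size=256M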
---
[1]: https://lore.kernel.org/linux-cxl/[email protected]/
[2]: https://www.computeexpresslink.org/
[3]: https://lore.kernel.org/qemu-devel/[email protected]/T/#t
[4]: https://gitlab.com/bwidawsk/qemu/-/tree/cxl-2.0v*
[5]: https://github.com/pmem/ndctl/tree/cxl-2.0v*
Ben Widawsky (12):
cxl/mem: Map memory device registers
cxl/mem: Find device capabilities
cxl/mem: Implement polled mode mailbox
cxl/mem: Add basic IOCTL interface
cxl/mem: Add send command
taint: add taint for direct hardware access
cxl/mem: Add a "RAW" send command
cxl/mem: Create concept of enabled commands
cxl/mem: Use CEL for enabling commands
cxl/mem: Add set of informational commands
cxl/mem: Add limited Get Log command (0401h)
MAINTAINERS: Add maintainers of the CXL driver
Dan Williams (2):
cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
cxl/mem: Register CXL memX devices
.clang-format | 1 +
Documentation/ABI/testing/debugfs-cxl | 10 +
Documentation/ABI/testing/sysfs-bus-cxl | 26 +
Documentation/admin-guide/sysctl/kernel.rst | 1 +
Documentation/admin-guide/tainted-kernels.rst | 6 +-
Documentation/driver-api/cxl/index.rst | 12 +
.../driver-api/cxl/memory-devices.rst | 46 +
Documentation/driver-api/index.rst | 1 +
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 11 +
drivers/Kconfig | 1 +
drivers/Makefile | 1 +
drivers/base/core.c | 14 +
drivers/cxl/Kconfig | 49 +
drivers/cxl/Makefile | 7 +
drivers/cxl/bus.c | 29 +
drivers/cxl/cxl.h | 140 ++
drivers/cxl/mem.c | 1603 +++++++++++++++++
drivers/cxl/pci.h | 34 +
include/linux/device.h | 1 +
include/linux/kernel.h | 3 +-
include/uapi/linux/cxl_mem.h | 180 ++
kernel/panic.c | 1 +
23 files changed, 2176 insertions(+), 2 deletions(-)
create mode 100644 Documentation/ABI/testing/debugfs-cxl
create mode 100644 Documentation/ABI/testing/sysfs-bus-cxl
create mode 100644 Documentation/driver-api/cxl/index.rst
create mode 100644 Documentation/driver-api/cxl/memory-devices.rst
create mode 100644 drivers/cxl/Kconfig
create mode 100644 drivers/cxl/Makefile
create mode 100644 drivers/cxl/bus.c
create mode 100644 drivers/cxl/cxl.h
create mode 100644 drivers/cxl/mem.c
create mode 100644 drivers/cxl/pci.h
create mode 100644 include/uapi/linux/cxl_mem.h
--
2.30.0
Provide enough functionality to utilize the mailbox of a memory device.
The mailbox is used to interact with the firmware running on the memory
device.
The CXL specification defines separate capabilities for the mailbox and
the memory device. The mailbox interface has a doorbell to indicate that
it is ready to accept commands, and the memory device has a capability
register that indicates the mailbox interface is ready. The expectation
is that doorbell-ready always comes later than the memory device's
indication that the mailbox is ready.
Create a function to handle sending a command, optionally with a
payload, to the memory device, polling on a result, and then optionally
copying out the payload. The algorithm for doing this comes straight out
of the CXL 2.0 specification.
Primary mailboxes are capable of generating an interrupt when submitting
a command in the background. That implementation is saved for a later
time.
Secondary mailboxes aren't implemented at this time.
The flow is proven with one implemented command, "identify", because the
class code has already told the driver this is a memory device, and the
identify command is mandatory.
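Abridged from the identify support added at the end of this patch, the
calling convention for the new helpers looks roughly like this (the 'id'
output buffer is caller-allocated):

        struct mbox_cmd mbox_cmd = {
                .opcode = CXL_MBOX_OP_IDENTIFY,
                .payload_out = &id,     /* caller-allocated output buffer */
                .size_in = 0,           /* identify takes no input payload */
        };
        int rc;

        rc = cxl_mem_mbox_get(cxlm);    /* take the mbox mutex, check readiness */
        if (rc)
                return rc;
        rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
        cxl_mem_mbox_put(cxlm);
        if (rc)
                return rc;              /* transport failure, e.g. timeout */

        /* 0 from send_cmd only means the transaction completed */
        if (mbox_cmd.return_code != CXL_MBOX_SUCCESS)
                return -ENXIO;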
Signed-off-by: Ben Widawsky <[email protected]>
---
drivers/cxl/Kconfig | 14 ++
drivers/cxl/cxl.h | 39 +++++
drivers/cxl/mem.c | 342 +++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 394 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 3b66b46af8a0..fe591f74af96 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -32,4 +32,18 @@ config CXL_MEM
Chapter 2.3 Type 3 CXL Device in the CXL 2.0 specification.
If unsure say 'm'.
+
+config CXL_MEM_INSECURE_DEBUG
+ bool "CXL.mem debugging"
+ depends on CXL_MEM
+ help
+ Enable debug of all CXL command payloads.
+
+ Some CXL devices and controllers support encryption and other
+ security features. The payloads for the commands that enable
+ those features may contain sensitive clear-text security
+ material. Disable debug of those command payloads by default.
+ If you are a kernel developer actively working on CXL
+ security enabling say Y, otherwise say N.
+
endif
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index a3da7f8050c4..df3d97154b63 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -31,9 +31,36 @@
#define CXLDEV_MB_CAPS_OFFSET 0x00
#define CXLDEV_MB_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
#define CXLDEV_MB_CTRL_OFFSET 0x04
+#define CXLDEV_MB_CTRL_DOORBELL BIT(0)
#define CXLDEV_MB_CMD_OFFSET 0x08
+#define CXLDEV_MB_CMD_COMMAND_OPCODE_MASK GENMASK(15, 0)
+#define CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK GENMASK(36, 16)
#define CXLDEV_MB_STATUS_OFFSET 0x10
+#define CXLDEV_MB_STATUS_RET_CODE_MASK GENMASK(47, 32)
#define CXLDEV_MB_BG_CMD_STATUS_OFFSET 0x18
+#define CXLDEV_MB_PAYLOAD_OFFSET 0x20
+
+/* Memory Device (CXL 2.0 - 8.2.8.5.1.1) */
+#define CXLMDEV_STATUS_OFFSET 0x0
+#define CXLMDEV_DEV_FATAL BIT(0)
+#define CXLMDEV_FW_HALT BIT(1)
+#define CXLMDEV_STATUS_MEDIA_STATUS_MASK GENMASK(3, 2)
+#define CXLMDEV_MS_NOT_READY 0
+#define CXLMDEV_MS_READY 1
+#define CXLMDEV_MS_ERROR 2
+#define CXLMDEV_MS_DISABLED 3
+#define CXLMDEV_READY(status) \
+ (CXL_GET_FIELD(status, CXLMDEV_STATUS_MEDIA_STATUS) == CXLMDEV_MS_READY)
+#define CXLMDEV_MBOX_IF_READY BIT(4)
+#define CXLMDEV_RESET_NEEDED_SHIFT 5
+#define CXLMDEV_RESET_NEEDED_MASK GENMASK(7, 5)
+#define CXLMDEV_RESET_NEEDED_NOT 0
+#define CXLMDEV_RESET_NEEDED_COLD 1
+#define CXLMDEV_RESET_NEEDED_WARM 2
+#define CXLMDEV_RESET_NEEDED_HOT 3
+#define CXLMDEV_RESET_NEEDED_CXL 4
+#define CXLMDEV_RESET_NEEDED(status) \
+ (CXL_GET_FIELD(status, CXLMDEV_RESET_NEEDED) != CXLMDEV_RESET_NEEDED_NOT)
/**
* struct cxl_mem - A CXL memory device
@@ -44,6 +71,16 @@ struct cxl_mem {
struct pci_dev *pdev;
void __iomem *regs;
+ struct {
+ struct range range;
+ } pmem;
+
+ struct {
+ struct range range;
+ } ram;
+
+ char firmware_version[0x10];
+
/* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
struct {
void __iomem *regs;
@@ -51,6 +88,7 @@ struct cxl_mem {
/* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
struct {
+ struct mutex mutex; /* Protects device mailbox and firmware */
void __iomem *regs;
size_t payload_size;
} mbox;
@@ -89,5 +127,6 @@ struct cxl_mem {
cxl_reg(status);
cxl_reg(mbox);
+cxl_reg(mem);
#endif /* __CXL_H__ */
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index fa14d51243ee..69ed15bfa5d4 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -6,6 +6,270 @@
#include "pci.h"
#include "cxl.h"
+#define cxl_doorbell_busy(cxlm) \
+ (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
+ CXLDEV_MB_CTRL_DOORBELL)
+
+#define CXL_MAILBOX_TIMEOUT_MS 2000
+
+enum opcode {
+ CXL_MBOX_OP_IDENTIFY = 0x4000,
+ CXL_MBOX_OP_MAX = 0x10000
+};
+
+/**
+ * struct mbox_cmd - A command to be submitted to hardware.
+ * @opcode: (input) The command set and command submitted to hardware.
+ * @payload_in: (input) Pointer to the input payload.
+ * @payload_out: (output) Pointer to the output payload. Must be allocated by
+ * the caller.
+ * @size_in: (input) Number of bytes to load from @payload_in.
+ * @size_out: (output) Number of bytes loaded into @payload_out.
+ * @return_code: (output) Error code returned from hardware.
+ *
+ * This is the primary mechanism used to send commands to the hardware.
+ * All the fields except @payload_* correspond exactly to the fields described in
+ * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
+ * @payload_out are written to, and read from the Command Payload Registers
+ * defined in (8.2.8.4.8).
+ */
+struct mbox_cmd {
+ u16 opcode;
+ void *payload_in;
+ void *payload_out;
+ size_t size_in;
+ size_t size_out;
+ u16 return_code;
+#define CXL_MBOX_SUCCESS 0
+};
+
+static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
+{
+ const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_MS);
+ const unsigned long start = jiffies;
+ unsigned long end = start;
+
+ while (cxl_doorbell_busy(cxlm)) {
+ end = jiffies;
+
+ if (time_after(end, start + timeout)) {
+ /* Check again in case preempted before timeout test */
+ if (!cxl_doorbell_busy(cxlm))
+ break;
+ return -ETIMEDOUT;
+ }
+ cpu_relax();
+ }
+
+ dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
+ jiffies_to_msecs(end) - jiffies_to_msecs(start));
+ return 0;
+}
+
+static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
+ struct mbox_cmd *mbox_cmd)
+{
+ dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
+ dev_info(&cxlm->pdev->dev,
+ "\topcode: 0x%04x\n"
+ "\tpayload size: %zub\n",
+ mbox_cmd->opcode, mbox_cmd->size_in);
+
+ if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
+ print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
+ mbox_cmd->payload_in, mbox_cmd->size_in,
+ true);
+ }
+
+ /* Here's a good place to figure out if a device reset is needed */
+}
+
+/**
+ * cxl_mem_mbox_send_cmd() - Send a mailbox command to a memory device.
+ * @cxlm: The CXL memory device to communicate with.
+ * @mbox_cmd: Command to send to the memory device.
+ *
+ * Context: Any context. Expects mbox_lock to be held.
+ * Return: -ETIMEDOUT if timeout occurred waiting for completion. 0 on success.
+ * Caller should check the return code in @mbox_cmd to make sure it
+ * succeeded.
+ *
+ * This is a generic form of the CXL mailbox send command, thus the only I/O
+ * operations used are cxl_read_mbox_reg(). Memory devices, and perhaps other
+ * types of CXL devices may have further information available upon error
+ * conditions.
+ *
+ * The CXL spec allows for up to two mailboxes. The intention is for the primary
+ * mailbox to be OS controlled and the secondary mailbox to be used by system
+ * firmware. This allows the OS and firmware to communicate with the device and
+ * not need to coordinate with each other. The driver only uses the primary
+ * mailbox.
+ */
+static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
+ struct mbox_cmd *mbox_cmd)
+{
+ void __iomem *payload = cxlm->mbox.regs + CXLDEV_MB_PAYLOAD_OFFSET;
+ u64 cmd_reg, status_reg;
+ size_t out_len;
+ int rc;
+
+ lockdep_assert_held(&cxlm->mbox.mutex);
+
+ /*
+ * Here are the steps from 8.2.8.4 of the CXL 2.0 spec.
+ * 1. Caller reads MB Control Register to verify doorbell is clear
+ * 2. Caller writes Command Register
+ * 3. Caller writes Command Payload Registers if input payload is non-empty
+ * 4. Caller writes MB Control Register to set doorbell
+ * 5. Caller either polls for doorbell to be clear or waits for interrupt if configured
+ * 6. Caller reads MB Status Register to fetch Return code
+ * 7. If command successful, Caller reads Command Register to get Payload Length
+ * 8. If output payload is non-empty, host reads Command Payload Registers
+ *
+ * Hardware is free to do whatever it wants before the doorbell is
+ * rung, and isn't allowed to change anything after it clears the
+ * doorbell. As such, steps 2 and 3 can happen in any order, and steps
+ * 6, 7, 8 can also happen in any order (though some orders might not
+ * make sense).
+ */
+
+ /* #1 */
+ if (cxl_doorbell_busy(cxlm)) {
+ dev_err_ratelimited(&cxlm->pdev->dev,
+ "Mailbox re-busy after acquiring\n");
+ return -EBUSY;
+ }
+
+ cmd_reg = CXL_SET_FIELD(mbox_cmd->opcode, CXLDEV_MB_CMD_COMMAND_OPCODE);
+ if (mbox_cmd->size_in) {
+ if (WARN_ON(!mbox_cmd->payload_in))
+ return -EINVAL;
+
+ cmd_reg |= CXL_SET_FIELD(mbox_cmd->size_in,
+ CXLDEV_MB_CMD_PAYLOAD_LENGTH);
+ memcpy_toio(payload, mbox_cmd->payload_in, mbox_cmd->size_in);
+ }
+
+ /* #2, #3 */
+ cxl_write_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET, cmd_reg);
+
+ /* #4 */
+ dev_dbg(&cxlm->pdev->dev, "Sending command\n");
+ cxl_write_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET,
+ CXLDEV_MB_CTRL_DOORBELL);
+
+ /* #5 */
+ rc = cxl_mem_wait_for_doorbell(cxlm);
+ if (rc == -ETIMEDOUT) {
+ cxl_mem_mbox_timeout(cxlm, mbox_cmd);
+ return rc;
+ }
+
+ /* #6 */
+ status_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_STATUS_OFFSET);
+ mbox_cmd->return_code =
+ CXL_GET_FIELD(status_reg, CXLDEV_MB_STATUS_RET_CODE);
+
+ if (mbox_cmd->return_code != 0) {
+ dev_dbg(&cxlm->pdev->dev, "Mailbox operation had an error\n");
+ return 0;
+ }
+
+ /* #7 */
+ cmd_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET);
+ out_len = CXL_GET_FIELD(cmd_reg, CXLDEV_MB_CMD_PAYLOAD_LENGTH);
+
+ /* #8 */
+ if (out_len && mbox_cmd->payload_out)
+ memcpy_fromio(mbox_cmd->payload_out, payload, out_len);
+
+ mbox_cmd->size_out = out_len;
+
+ return 0;
+}
+
+/**
+ * cxl_mem_mbox_get() - Acquire exclusive access to the mailbox.
+ * @cxlm: The memory device to gain access to.
+ *
+ * Context: Any context. Takes the mbox_lock.
+ * Return: 0 if exclusive access was acquired.
+ */
+static int cxl_mem_mbox_get(struct cxl_mem *cxlm)
+{
+ struct device *dev = &cxlm->pdev->dev;
+ int rc = -EBUSY;
+ u64 md_status;
+
+ mutex_lock_io(&cxlm->mbox.mutex);
+
+ /*
+ * XXX: There is some amount of ambiguity in the 2.0 version of the spec
+ * around the mailbox interface ready (8.2.8.5.1.1). The purpose of the
+ * bit is to allow firmware running on the device to notify the driver
+ * that it's ready to receive commands. It is unclear if the bit needs
+ * to be read for each mailbox transaction, i.e. the firmware can switch
+ * it on and off as needed. Second, there is no defined timeout for
+ * mailbox ready, like there is for the doorbell interface.
+ *
+ * Assumptions:
+ * 1. The firmware might toggle the Mailbox Interface Ready bit, check
+ * it for every command.
+ *
+ * 2. If the doorbell is clear, the firmware should have first set the
+ * Mailbox Interface Ready bit. Therefore, waiting for the doorbell
+ * to be ready is sufficient.
+ */
+ rc = cxl_mem_wait_for_doorbell(cxlm);
+ if (rc) {
+ dev_warn(dev, "Mailbox interface not ready\n");
+ goto out;
+ }
+
+ md_status = cxl_read_mem_reg64(cxlm, CXLMDEV_STATUS_OFFSET);
+ if (!(md_status & CXLMDEV_MBOX_IF_READY && CXLMDEV_READY(md_status))) {
+ dev_err(dev,
+ "mbox: reported doorbell ready, but not mbox ready\n");
+ goto out;
+ }
+
+ /*
+ * Hardware shouldn't allow a ready status but also have failure bits
+ * set. Spit out an error, this should be a bug report
+ */
+ rc = -EFAULT;
+ if (md_status & CXLMDEV_DEV_FATAL) {
+ dev_err(dev, "mbox: reported ready, but fatal\n");
+ goto out;
+ }
+ if (md_status & CXLMDEV_FW_HALT) {
+ dev_err(dev, "mbox: reported ready, but halted\n");
+ goto out;
+ }
+ if (CXLMDEV_RESET_NEEDED(md_status)) {
+ dev_err(dev, "mbox: reported ready, but reset needed\n");
+ goto out;
+ }
+
+ /* with lock held */
+ return 0;
+
+out:
+ mutex_unlock(&cxlm->mbox.mutex);
+ return rc;
+}
+
+/**
+ * cxl_mem_mbox_put() - Release exclusive access to the mailbox.
+ * @cxlm: The CXL memory device to communicate with.
+ *
+ * Context: Any context. Expects mbox_lock to be held.
+ */
+static void cxl_mem_mbox_put(struct cxl_mem *cxlm)
+{
+ mutex_unlock(&cxlm->mbox.mutex);
+}
+
/**
* cxl_mem_setup_regs() - Setup necessary MMIO.
* @cxlm: The CXL memory device to communicate with.
@@ -142,6 +406,8 @@ static struct cxl_mem *cxl_mem_create(struct pci_dev *pdev, u32 reg_lo,
return NULL;
}
+ mutex_init(&cxlm->mbox.mutex);
+
regs = pcim_iomap_table(pdev)[bar];
cxlm->pdev = pdev;
cxlm->regs = regs + offset;
@@ -174,6 +440,76 @@ static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
return 0;
}
+/**
+ * cxl_mem_identify() - Send the IDENTIFY command to the device.
+ * @cxlm: The device to identify.
+ *
+ * Return: 0 if identify was executed successfully.
+ *
+ * This will dispatch the identify command to the device and on success populate
+ * structures to be exported to sysfs.
+ */
+static int cxl_mem_identify(struct cxl_mem *cxlm)
+{
+ struct cxl_mbox_identify {
+ char fw_revision[0x10];
+ __le64 total_capacity;
+ __le64 volatile_capacity;
+ __le64 persistent_capacity;
+ __le64 partition_align;
+ __le16 info_event_log_size;
+ __le16 warning_event_log_size;
+ __le16 failure_event_log_size;
+ __le16 fatal_event_log_size;
+ __le32 lsa_size;
+ u8 poison_list_max_mer[3];
+ __le16 inject_poison_limit;
+ u8 poison_caps;
+ u8 qos_telemetry_caps;
+ } __packed id;
+ struct mbox_cmd mbox_cmd;
+ int rc;
+
+ /* Retrieve initial device memory map */
+ rc = cxl_mem_mbox_get(cxlm);
+ if (rc)
+ return rc;
+
+ mbox_cmd = (struct mbox_cmd){
+ .opcode = CXL_MBOX_OP_IDENTIFY,
+ .payload_out = &id,
+ .size_in = 0,
+ };
+ rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
+ cxl_mem_mbox_put(cxlm);
+ if (rc)
+ return rc;
+
+ /* TODO: Handle retry or reset responses from firmware. */
+ if (mbox_cmd.return_code != CXL_MBOX_SUCCESS) {
+ dev_err(&cxlm->pdev->dev, "Mailbox command failed (%d)\n",
+ mbox_cmd.return_code);
+ return -ENXIO;
+ }
+
+ if (mbox_cmd.size_out != sizeof(id))
+ return -ENXIO;
+
+ /*
+ * TODO: enumerate DPA map, as 'ram' and 'pmem' do not alias.
+ * For now, only the capacity is exported in sysfs
+ */
+ cxlm->ram.range.start = 0;
+ cxlm->ram.range.end = le64_to_cpu(id.volatile_capacity) - 1;
+
+ cxlm->pmem.range.start = 0;
+ cxlm->pmem.range.end = le64_to_cpu(id.persistent_capacity) - 1;
+
+ memcpy(cxlm->firmware_version, id.fw_revision, sizeof(id.fw_revision));
+
+ return rc;
+}
+
static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
struct device *dev = &pdev->dev;
@@ -219,7 +555,11 @@ static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (rc)
return rc;
- return cxl_mem_setup_mailbox(cxlm);
+ rc = cxl_mem_setup_mailbox(cxlm);
+ if (rc)
+ return rc;
+
+ return cxl_mem_identify(cxlm);
}
static const struct pci_device_id cxl_mem_pci_tbl[] = {
--
2.30.0
The Get Log command returns the actual log entries that are advertised
via the Get Supported Logs command (0400h). CXL device logs are selected
by UUID, which is part of the CXL spec. Because the driver tries to
sanitize what is sent to hardware, there is a need to restrict the
types of logs which can be accessed by userspace. For example, the
vendor specific log might only be consumable by proprietary or offline
applications, and is therefore a good candidate for userspace.
The current driver infrastructure does allow basic validation for all
commands, but doesn't inspect any of the payload data. Along with Get
Log support comes new infrastructure to add a hook for payload
validation. This infrastructure is used to check the log UUID being
requested: the spec-defined CEL and vendor debug log UUIDs pass through
unmodified, while any other UUID taints the kernel before being sent to
hardware.
Signed-off-by: Ben Widawsky <[email protected]>
---
drivers/cxl/mem.c | 42 +++++++++++++++++++++++++++++++++++-
include/uapi/linux/cxl_mem.h | 1 +
2 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index b8ca6dff37b5..086268f1dd6c 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -119,6 +119,8 @@ static const uuid_t log_uuid[] = {
0x07, 0x19, 0x40, 0x3d, 0x86)
};
+static int validate_log_uuid(void __user *payload, size_t size);
+
/**
* struct cxl_mem_command - Driver representation of a memory device command
* @info: Command information as it exists for the UAPI
@@ -132,6 +134,10 @@ static const uuid_t log_uuid[] = {
* * %CXL_CMD_INTERNAL_FLAG_PSEUDO: This is a pseudo command which doesn't have
* a direct mapping to hardware. They are implicitly always enabled.
*
+ * @validate_payload: A function called after the command is validated but
+ * before it's sent to the hardware. The primary purpose is to validate, or
+ * fixup the actual payload.
+ *
* The cxl_mem_command is the driver's internal representation of commands that
* are supported by the driver. Some of these commands may not be supported by
* the hardware. The driver will use @info to validate the fields passed in by
@@ -147,9 +153,11 @@ struct cxl_mem_command {
#define CXL_CMD_INTERNAL_FLAG_HIDDEN BIT(0)
#define CXL_CMD_INTERNAL_FLAG_MANDATORY BIT(1)
#define CXL_CMD_INTERNAL_FLAG_PSEUDO BIT(2)
+
+ int (*validate_payload)(void __user *payload, size_t size);
};
-#define CXL_CMD(_id, _flags, sin, sout, f) \
+#define CXL_CMD_VALIDATE(_id, _flags, sin, sout, f, v) \
[CXL_MEM_COMMAND_ID_##_id] = { \
.info = { \
.id = CXL_MEM_COMMAND_ID_##_id, \
@@ -159,8 +167,12 @@ struct cxl_mem_command {
}, \
.flags = CXL_CMD_INTERNAL_FLAG_##f, \
.opcode = CXL_MBOX_OP_##_id, \
+ .validate_payload = v, \
}
+#define CXL_CMD(_id, _flags, sin, sout, f) \
+ CXL_CMD_VALIDATE(_id, _flags, sin, sout, f, NULL)
+
/*
* This table defines the supported mailbox commands for the driver. This table
* is made up of a UAPI structure. Non-negative values as parameters in the
@@ -176,6 +188,8 @@ static struct cxl_mem_command mem_commands[] = {
CXL_CMD(GET_PARTITION_INFO, NONE, 0, 0x20, NONE),
CXL_CMD(GET_LSA, NONE, 0x8, ~0, MANDATORY),
CXL_CMD(GET_HEALTH_INFO, NONE, 0, 0x12, MANDATORY),
+ CXL_CMD_VALIDATE(GET_LOG, MUTEX, 0x18, ~0, MANDATORY,
+ validate_log_uuid),
};
/*
@@ -563,6 +577,13 @@ static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
kvzalloc(cxlm->mbox.payload_size, GFP_KERNEL);
if (cmd->info.size_in) {
+ if (cmd->validate_payload) {
+ rc = cmd->validate_payload(u64_to_user_ptr(in_payload),
+ cmd->info.size_in);
+ if (rc)
+ goto out;
+ }
+
mbox_cmd.payload_in = kvzalloc(cmd->info.size_in, GFP_KERNEL);
if (!mbox_cmd.payload_in) {
rc = -ENOMEM;
@@ -1205,6 +1226,25 @@ struct cxl_mbox_get_log {
__le32 length;
} __packed;
+static int validate_log_uuid(void __user *input, size_t size)
+{
+ struct cxl_mbox_get_log __user *get_log = input;
+ uuid_t payload_uuid;
+
+ if (copy_from_user(&payload_uuid, &get_log->uuid, sizeof(uuid_t)))
+ return -EFAULT;
+
+ /* All unspec'd logs shall taint */
+ if (uuid_equal(&payload_uuid, &log_uuid[CEL_UUID]))
+ return 0;
+ if (uuid_equal(&payload_uuid, &log_uuid[DEBUG_UUID]))
+ return 0;
+
+ add_taint(TAINT_RAW_PASSTHROUGH, LOCKDEP_STILL_OK);
+
+ return 0;
+}
+
static int cxl_xfer_log(struct cxl_mem *cxlm, uuid_t *uuid, u32 size, u8 *out)
{
u32 remaining = size;
diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
index 766c231d6150..7cdc7f7ce7ec 100644
--- a/include/uapi/linux/cxl_mem.h
+++ b/include/uapi/linux/cxl_mem.h
@@ -39,6 +39,7 @@ extern "C" {
___C(GET_PARTITION_INFO, "Get Partition Information"), \
___C(GET_LSA, "Get Label Storage Area"), \
___C(GET_HEALTH_INFO, "Get Health Info"), \
+ ___C(GET_LOG, "Get Log"), \
___C(MAX, "Last command")
#define ___C(a, b) CXL_MEM_COMMAND_ID_##a
--
2.30.0
The Command Effects Log (CEL) is specified in the CXL 2.0 specification.
The CEL is one of two types of logs, the other being vendor specific.
They are distinguished in hardware/spec via UUID. The CEL is immediately
useful for 2 things:
1. Determine which optional commands are supported by the CXL device.
2. Enumerate any vendor specific commands
The CEL can be used by the driver to determine which commands are
available in the hardware (though it isn't, yet). That set of commands
might itself be a subset of commands which are available to be used via
CXL_MEM_SEND_COMMAND IOCTL.
Prior to this, all commands that the driver exposed were explicitly
enabled. After this, only those commands that are found in the CEL are
enabled.
Signed-off-by: Ben Widawsky <[email protected]>
---
drivers/cxl/mem.c | 186 ++++++++++++++++++++++++++++++++++-
include/uapi/linux/cxl_mem.h | 1 +
2 files changed, 182 insertions(+), 5 deletions(-)
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index d01c6ee32a6b..787417c4d5dc 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -43,6 +43,8 @@ enum opcode {
CXL_MBOX_OP_INVALID = 0x0000,
#define CXL_MBOX_OP_RAW CXL_MBOX_OP_INVALID
CXL_MBOX_OP_ACTIVATE_FW = 0x0202,
+ CXL_MBOX_OP_GET_SUPPORTED_LOGS = 0x0400,
+ CXL_MBOX_OP_GET_LOG = 0x0401,
CXL_MBOX_OP_IDENTIFY = 0x4000,
CXL_MBOX_OP_SET_PARTITION_INFO = 0x4101,
CXL_MBOX_OP_SET_LSA = 0x4103,
@@ -101,6 +103,18 @@ static DEFINE_IDA(cxl_memdev_ida);
static struct dentry *cxl_debugfs;
static bool raw_allow_all;
+enum {
+ CEL_UUID,
+ DEBUG_UUID
+};
+
+static const uuid_t log_uuid[] = {
+ [CEL_UUID] = UUID_INIT(0xda9c0b5, 0xbf41, 0x4b78, 0x8f, 0x79, 0x96,
+ 0xb1, 0x62, 0x3b, 0x3f, 0x17),
+ [DEBUG_UUID] = UUID_INIT(0xe1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6,
+ 0x07, 0x19, 0x40, 0x3d, 0x86)
+};
+
/**
* struct cxl_mem_command - Driver representation of a memory device command
* @info: Command information as it exists for the UAPI
@@ -153,6 +167,7 @@ static struct cxl_mem_command mem_commands[] = {
CXL_CMD(INVALID, KERNEL, 0, 0, HIDDEN),
CXL_CMD(IDENTIFY, NONE, 0, 0x43, MANDATORY),
CXL_CMD(RAW, NONE, ~0, ~0, PSEUDO),
+ CXL_CMD(GET_SUPPORTED_LOGS, NONE, 0, ~0, MANDATORY),
};
/*
@@ -1168,6 +1183,101 @@ static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
return rc;
}
+struct cxl_mbox_get_supported_logs {
+ __le16 entries;
+ u8 rsvd[6];
+ struct gsl_entry {
+ uuid_t uuid;
+ __le32 size;
+ } __packed entry[2];
+} __packed;
+struct cxl_mbox_get_log {
+ uuid_t uuid;
+ __le32 offset;
+ __le32 length;
+} __packed;
+
+static int cxl_xfer_log(struct cxl_mem *cxlm, uuid_t *uuid, u32 size, u8 *out)
+{
+ u32 remaining = size;
+ u32 offset = 0;
+
+ while (remaining) {
+ u32 xfer_size = min_t(u32, remaining, cxlm->mbox.payload_size);
+ struct mbox_cmd mbox_cmd;
+ int rc;
+ struct cxl_mbox_get_log log = {
+ .uuid = *uuid,
+ .offset = cpu_to_le32(offset),
+ .length = cpu_to_le32(xfer_size)
+ };
+
+ mbox_cmd = (struct mbox_cmd) {
+ .opcode = CXL_MBOX_OP_GET_LOG,
+ .payload_in = &log,
+ .payload_out = out,
+ .size_in = sizeof(log),
+ };
+
+ rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
+ if (rc)
+ return rc;
+
+ WARN_ON(mbox_cmd.size_out != xfer_size);
+
+ out += xfer_size;
+ remaining -= xfer_size;
+ offset += xfer_size;
+ }
+
+ return 0;
+}
+
+static void cxl_enable_cmd(struct cxl_mem *cxlm,
+ const struct cxl_mem_command *cmd)
+{
+ if (test_and_set_bit(cmd->info.id, cxlm->enabled_cmds))
+ dev_warn(&cxlm->pdev->dev, "Command enabled twice\n");
+
+ dev_info(&cxlm->pdev->dev, "%s enabled",
+ cxl_command_names[cmd->info.id].name);
+}
+
+/**
+ * cxl_walk_cel() - Walk through the Command Effects Log.
+ * @cxlm: Device.
+ * @size: Length of the Command Effects Log.
+ * @cel: CEL
+ *
+ * Iterate over each entry in the CEL and determine if the driver supports the
+ * command. If so, the command is enabled for the device and can be used later.
+ */
+static void cxl_walk_cel(struct cxl_mem *cxlm, size_t size, u8 *cel)
+{
+ struct cel_entry {
+ __le16 opcode;
+ __le16 effect;
+ } *cel_entry;
+ const int cel_entries = size / sizeof(*cel_entry);
+ int i;
+
+ cel_entry = (struct cel_entry *)cel;
+
+ for (i = 0; i < cel_entries; i++) {
+ const struct cel_entry *ce = &cel_entry[i];
+ const struct cxl_mem_command *cmd =
+ cxl_mem_find_command(le16_to_cpu(ce->opcode));
+
+ if (!cmd) {
+ dev_dbg(&cxlm->pdev->dev, "Unsupported opcode 0x%04x",
+ le16_to_cpu(ce->opcode));
+ continue;
+ }
+
+ cxl_enable_cmd(cxlm, cmd);
+ }
+}
+
/**
* cxl_mem_enumerate_cmds() - Enumerate commands for a device.
* @cxlm: The device.
@@ -1180,19 +1290,85 @@ static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
*/
static int cxl_mem_enumerate_cmds(struct cxl_mem *cxlm)
{
- struct cxl_mem_command *c;
+ struct cxl_mbox_get_supported_logs gsl;
+ const struct cxl_mem_command *c;
+ struct mbox_cmd mbox_cmd;
+ int i, rc;
BUILD_BUG_ON(ARRAY_SIZE(mem_commands) >= CXL_MAX_COMMANDS);
- /* All commands are considered enabled for now (except INVALID). */
+ /* Pseudo commands are always enabled */
cxl_for_each_cmd(c) {
- if (c->flags & CXL_CMD_INTERNAL_FLAG_HIDDEN)
+ if (c->flags & CXL_CMD_INTERNAL_FLAG_PSEUDO)
+ cxl_enable_cmd(cxlm, c);
+ }
+
+ mbox_cmd = (struct mbox_cmd){
+ .opcode = CXL_MBOX_OP_GET_SUPPORTED_LOGS,
+ .payload_out = &gsl,
+ .size_in = 0,
+ };
+
+ rc = cxl_mem_mbox_get(cxlm);
+ if (rc)
+ return rc;
+
+ rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
+ if (rc)
+ goto out;
+
+ if (mbox_cmd.return_code != CXL_MBOX_SUCCESS) {
+ rc = -ENXIO;
+ goto out;
+ }
+
+ if (mbox_cmd.size_out > sizeof(gsl)) {
+ dev_warn(&cxlm->pdev->dev, "%zu excess logs\n",
+ (mbox_cmd.size_out - sizeof(gsl)) /
+ sizeof(struct gsl_entry));
+ }
+
+ for (i = 0; i < le16_to_cpu(gsl.entries); i++) {
+ u32 size = le32_to_cpu(gsl.entry[i].size);
+ uuid_t uuid = gsl.entry[i].uuid;
+ u8 *log;
+
+ dev_dbg(&cxlm->pdev->dev, "Found LOG type %pU of size %d",
+ &uuid, size);
+
+ if (!uuid_equal(&uuid, &log_uuid[CEL_UUID]))
continue;
- set_bit(c->info.id, cxlm->enabled_cmds);
+ /*
+ * It's a hardware bug if the log size is less than the input
+ * payload size because there are many mandatory commands.
+ */
+ if (sizeof(struct cxl_mbox_get_log) > size) {
+ dev_err(&cxlm->pdev->dev,
+ "CEL log size reported was too small (%d)",
+ size);
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ log = kvmalloc(size, GFP_KERNEL);
+ if (!log) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ rc = cxl_xfer_log(cxlm, &uuid, size, log);
+ if (rc)
+ goto out;
+
+ cxl_walk_cel(cxlm, size, log);
+
+ kvfree(log);
}
- return 0;
+out:
+ cxl_mem_mbox_put(cxlm);
+ return rc;
}
/**
diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
index 25bfcb071c1f..64cb9753a077 100644
--- a/include/uapi/linux/cxl_mem.h
+++ b/include/uapi/linux/cxl_mem.h
@@ -34,6 +34,7 @@ extern "C" {
___C(INVALID, "Invalid Command"), \
___C(IDENTIFY, "Identify Command"), \
___C(RAW, "Raw device command"), \
+ ___C(GET_SUPPORTED_LOGS, "Get Supported Logs"), \
___C(MAX, "Last command")
#define ___C(a, b) CXL_MEM_COMMAND_ID_##a
--
2.30.0
In order to solidify support for a reasonable set of commands, a set of
relatively safe commands is added, removing the need to use raw
operations to access them.
Signed-off-by: Ben Widawsky <[email protected]>
---
drivers/cxl/mem.c | 8 ++++++++
include/uapi/linux/cxl_mem.h | 4 ++++
2 files changed, 12 insertions(+)
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 787417c4d5dc..b8ca6dff37b5 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -42,12 +42,16 @@
enum opcode {
CXL_MBOX_OP_INVALID = 0x0000,
#define CXL_MBOX_OP_RAW CXL_MBOX_OP_INVALID
+ CXL_MBOX_OP_GET_FW_INFO = 0x0200,
CXL_MBOX_OP_ACTIVATE_FW = 0x0202,
CXL_MBOX_OP_GET_SUPPORTED_LOGS = 0x0400,
CXL_MBOX_OP_GET_LOG = 0x0401,
CXL_MBOX_OP_IDENTIFY = 0x4000,
+ CXL_MBOX_OP_GET_PARTITION_INFO = 0x4100,
CXL_MBOX_OP_SET_PARTITION_INFO = 0x4101,
+ CXL_MBOX_OP_GET_LSA = 0x4102,
CXL_MBOX_OP_SET_LSA = 0x4103,
+ CXL_MBOX_OP_GET_HEALTH_INFO = 0x4200,
CXL_MBOX_OP_SET_SHUTDOWN_STATE = 0x4204,
CXL_MBOX_OP_SCAN_MEDIA = 0x4304,
CXL_MBOX_OP_GET_SCAN_MEDIA = 0x4305,
@@ -168,6 +172,10 @@ static struct cxl_mem_command mem_commands[] = {
CXL_CMD(IDENTIFY, NONE, 0, 0x43, MANDATORY),
CXL_CMD(RAW, NONE, ~0, ~0, PSEUDO),
CXL_CMD(GET_SUPPORTED_LOGS, NONE, 0, ~0, MANDATORY),
+ CXL_CMD(GET_FW_INFO, NONE, 0, 0x50, NONE),
+ CXL_CMD(GET_PARTITION_INFO, NONE, 0, 0x20, NONE),
+ CXL_CMD(GET_LSA, NONE, 0x8, ~0, MANDATORY),
+ CXL_CMD(GET_HEALTH_INFO, NONE, 0, 0x12, MANDATORY),
};
/*
diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
index 64cb9753a077..766c231d6150 100644
--- a/include/uapi/linux/cxl_mem.h
+++ b/include/uapi/linux/cxl_mem.h
@@ -35,6 +35,10 @@ extern "C" {
___C(IDENTIFY, "Identify Command"), \
___C(RAW, "Raw device command"), \
___C(GET_SUPPORTED_LOGS, "Get Supported Logs"), \
+ ___C(GET_FW_INFO, "Get FW Info"), \
+ ___C(GET_PARTITION_INFO, "Get Partition Information"), \
+ ___C(GET_LSA, "Get Label Storage Area"), \
+ ___C(GET_HEALTH_INFO, "Get Health Info"), \
___C(MAX, "Last command")
#define ___C(a, b) CXL_MEM_COMMAND_ID_##a
--
2.30.0
For drivers that moderate access to the underlying hardware, it is
sometimes desirable to allow userspace to bypass restrictions. Once
userspace has done this, the driver can no longer guarantee the sanctity
of either the OS or the hardware. When in this state, it is helpful for
kernel developers to be made aware (via this taint flag) of this fact
for subsequent bug reports.
Example usage:
- Hardware xyzzy accepts 2 commands, waldo and fred.
- The xyzzy driver provides an interface for using waldo, but not fred.
- quux is convinced they really need the fred command.
- xyzzy driver allows quux to frob hardware to initiate fred.
- kernel gets tainted.
- turns out fred command is borked, and scribbles over memory.
- developers laugh while closing quux's subsequent bug report.
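In this series the flag is raised from the CXL driver when userspace goes
around the driver's command validation. For example (taken from the RAW
command patch later in this series):

        WARN_TAINT_ONCE(cmd->info.id == CXL_MEM_COMMAND_ID_RAW,
                        TAINT_RAW_PASSTHROUGH, "%s %s: raw command path used\n",
                        dev_driver_string(dev), dev_name(dev));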
Signed-off-by: Ben Widawsky <[email protected]>
---
Documentation/admin-guide/sysctl/kernel.rst | 1 +
Documentation/admin-guide/tainted-kernels.rst | 6 +++++-
include/linux/kernel.h | 3 ++-
kernel/panic.c | 1 +
4 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 1d56a6b73a4e..3e1eada53504 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1352,6 +1352,7 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
32768 `(K)` kernel has been live patched
65536 `(X)` Auxiliary taint, defined for and used by distros
131072 `(T)` The kernel was built with the struct randomization plugin
+262144 `(H)` The kernel has allowed vendor shenanigans
====== ===== ==============================================================
See :doc:`/admin-guide/tainted-kernels` for more information.
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index ceeed7b0798d..ee2913316344 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -74,7 +74,7 @@ a particular type of taint. It's best to leave that to the aforementioned
script, but if you need something quick you can use this shell command to check
which bits are set::
- $ for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
+ $ for i in $(seq 19); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
Table for decoding tainted state
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
15 _/K 32768 kernel has been live patched
16 _/X 65536 auxiliary taint, defined for and used by distros
17 _/T 131072 kernel was built with the struct randomization plugin
+ 18 _/H 262144 kernel has allowed vendor shenanigans
=== === ====== ========================================================
Note: The character ``_`` is representing a blank in this table to make reading
@@ -175,3 +176,6 @@ More detailed explanation for tainting
produce extremely unusual kernel structure layouts (even performance
pathological ones), which is important to know when debugging. Set at
build time.
+
+ 18) ``H`` Kernel has allowed direct access to hardware and can no longer make
+ any guarantees about the stability of the device or driver.
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index f7902d8c1048..bc95486f817e 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -443,7 +443,8 @@ extern enum system_states {
#define TAINT_LIVEPATCH 15
#define TAINT_AUX 16
#define TAINT_RANDSTRUCT 17
-#define TAINT_FLAGS_COUNT 18
+#define TAINT_RAW_PASSTHROUGH 18
+#define TAINT_FLAGS_COUNT 19
#define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1)
struct taint_flag {
diff --git a/kernel/panic.c b/kernel/panic.c
index 332736a72a58..dff22bd80eaf 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -386,6 +386,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
[ TAINT_LIVEPATCH ] = { 'K', ' ', true },
[ TAINT_AUX ] = { 'X', ' ', true },
[ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
+ [ TAINT_RAW_PASSTHROUGH ] = { 'H', ' ', true },
};
/**
--
2.30.0
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Alison Schofield <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
---
MAINTAINERS | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 6eff4f720c72..93c8694a8f04 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4444,6 +4444,17 @@ M: Miguel Ojeda <[email protected]>
S: Maintained
F: include/linux/compiler_attributes.h
+COMPUTE EXPRESS LINK (CXL)
+M: Alison Schofield <[email protected]>
+M: Vishal Verma <[email protected]>
+M: Ira Weiny <[email protected]>
+M: Ben Widawsky <[email protected]>
+M: Dan Williams <[email protected]>
+L: [email protected]
+S: Maintained
+F: drivers/cxl/
+F: include/uapi/linux/cxl_mem.h
+
CONEXANT ACCESSRUNNER USB DRIVER
L: [email protected]
S: Orphan
--
2.30.0
The send command allows userspace to issue mailbox commands directly to
the hardware. The driver will verify basic properties of the command and
possibly inspect the input (or output) payload to determine whether or
not the command is allowed (or might taint the kernel).
The list of allowed commands and their properties can be determined by
using the QUERY IOCTL for CXL memory devices.
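As a rough sketch of the userspace side of the send path (the /dev/cxl/mem0
node name is an assumption here, the UAPI header from this series is assumed
to be installed, and the 0x43-byte IDENTIFY output size comes from the
driver's command table):

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/types.h>
        #include <linux/cxl_mem.h>

        int main(void)
        {
                unsigned char id_out[0x43] = { 0 };     /* IDENTIFY output payload */
                struct cxl_send_command cmd;
                int fd, rc;

                fd = open("/dev/cxl/mem0", O_RDWR);
                if (fd < 0)
                        return 1;

                memset(&cmd, 0, sizeof(cmd));
                cmd.id = CXL_MEM_COMMAND_ID_IDENTIFY;
                cmd.size_out = sizeof(id_out);
                cmd.out_payload = (__u64)(unsigned long)id_out;

                rc = ioctl(fd, CXL_MEM_SEND_COMMAND, &cmd);
                if (rc == 0)
                        printf("retval=%u size_out=%d\n", cmd.retval, cmd.size_out);

                close(fd);
                return rc ? 1 : 0;
        }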
Signed-off-by: Ben Widawsky <[email protected]>
---
drivers/cxl/mem.c | 201 ++++++++++++++++++++++++++++++++++-
include/uapi/linux/cxl_mem.h | 45 ++++++++
2 files changed, 244 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 3c3ff45f01c0..c646f0a1cf66 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -126,8 +126,8 @@ struct cxl_mem_command {
.size_in = sin, \
.size_out = sout, \
}, \
- .flags = CXL_CMD_INTERNAL_FLAG_##f, \
- .opcode = CXL_MBOX_OP_##_id, \
+ .flags = CXL_CMD_INTERNAL_FLAG_##f, \
+ .opcode = CXL_MBOX_OP_##_id, \
}
/*
@@ -427,6 +427,174 @@ static int cxl_mem_count_commands(void)
}
return n;
+};
+
+/**
+ * handle_mailbox_cmd_from_user() - Dispatch a mailbox command.
+ * @cxlmd: The CXL memory device to communicate with.
+ * @cmd: The validated command.
+ * @in_payload: Pointer to userspace's input payload.
+ * @out_payload: Pointer to userspace's output payload.
+ * @u: The command submitted by userspace. Has output fields.
+ *
+ * Return:
+ * * %0 - Mailbox transaction succeeded.
+ * * %-EFAULT - Something happened with copy_to/from_user.
+ * * %-ENOMEM - Couldn't allocate a bounce buffer.
+ * * %-EINTR - Mailbox acquisition interrupted.
+ * * %-E2BIG - Output payload would overrun user's buffer.
+ *
+ * Creates the appropriate mailbox command on behalf of a userspace request.
+ * Return value, size, and output payload are all copied out to @u. The
+ * parameters for the command must be validated before calling this function.
+ *
+ * A 0 return code only indicates that the mailbox transaction completed, not
+ * that the command itself succeeded. IOW, cmd->retval should always be checked
+ * to determine the actual result.
+ */
+static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
+ const struct cxl_mem_command *cmd,
+ u64 in_payload, u64 out_payload,
+ struct cxl_send_command __user *u)
+{
+ struct cxl_mem *cxlm = cxlmd->cxlm;
+ struct mbox_cmd mbox_cmd = {
+ .opcode = cmd->opcode,
+ .payload_in = NULL, /* Populated with copy_from_user() */
+ .payload_out = NULL, /* Read out by copy_to_user() */
+ .size_in = cmd->info.size_in,
+ };
+ s32 user_size_out;
+ int rc;
+
+ if (get_user(user_size_out, &u->size_out))
+ return -EFAULT;
+
+ if (cmd->info.size_out > 0) /* fixed size command */
+ mbox_cmd.payload_out = kvzalloc(cmd->info.size_out, GFP_KERNEL);
+ else if (cmd->info.size_out < 0) /* variable */
+ mbox_cmd.payload_out =
+ kvzalloc(cxlm->mbox.payload_size, GFP_KERNEL);
+
+ if (cmd->info.size_in) {
+ mbox_cmd.payload_in = kvzalloc(cmd->info.size_in, GFP_KERNEL);
+ if (!mbox_cmd.payload_in) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ if (copy_from_user(mbox_cmd.payload_in,
+ u64_to_user_ptr(in_payload),
+ cmd->info.size_in)) {
+ rc = -EFAULT;
+ goto out;
+ }
+ }
+
+ rc = cxl_mem_mbox_get(cxlm);
+ if (rc)
+ goto out;
+
+ dev_dbg(&cxlmd->dev,
+ "Submitting %s command for user\n"
+ "\topcode: %x\n"
+ "\tsize: %ub\n",
+ cxl_command_names[cmd->info.id].name, mbox_cmd.opcode,
+ cmd->info.size_in);
+
+ rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
+ cxl_mem_mbox_put(cxlm);
+ if (rc)
+ goto out;
+
+ rc = put_user(mbox_cmd.return_code, &u->retval);
+ if (rc)
+ goto out;
+
+ if (user_size_out < mbox_cmd.size_out) {
+ rc = -E2BIG;
+ goto out;
+ }
+
+ if (mbox_cmd.size_out) {
+ if (copy_to_user(u64_to_user_ptr(out_payload),
+ mbox_cmd.payload_out, mbox_cmd.size_out)) {
+ rc = -EFAULT;
+ goto out;
+ }
+ }
+
+ rc = put_user(mbox_cmd.size_out, &u->size_out);
+
+out:
+ kvfree(mbox_cmd.payload_in);
+ kvfree(mbox_cmd.payload_out);
+ return rc;
+}
+
+/**
+ * cxl_validate_cmd_from_user() - Check fields for CXL_MEM_SEND_COMMAND.
+ * @cxlm: &struct cxl_mem device whose mailbox will be used.
+ * @send_cmd: &struct cxl_send_command copied in from userspace.
+ * @out_cmd: Sanitized and populated &struct cxl_mem_command.
+ *
+ * Return:
+ * * %0 - @out_cmd is ready to send.
+ * * %-ENOTTY - Invalid command specified.
+ * * %-EINVAL - Reserved fields or invalid values were used.
+ * * %-EPERM - Attempted to use a protected command.
+ * * %-ENOMEM - Input or output buffer wasn't sized properly.
+ *
+ * The result of this command is a fully validated command in @out_cmd that is
+ * safe to send to the hardware.
+ *
+ * See handle_mailbox_cmd_from_user()
+ */
+static int cxl_validate_cmd_from_user(struct cxl_mem *cxlm,
+ const struct cxl_send_command *send_cmd,
+ struct cxl_mem_command *out_cmd)
+{
+ const struct cxl_command_info *info;
+ struct cxl_mem_command *c;
+
+ if (send_cmd->id == 0 || send_cmd->id >= CXL_MEM_COMMAND_ID_MAX)
+ return -ENOTTY;
+
+ /*
+ * The user can never specify an input payload larger than
+ * hardware supports, but output can be arbitrarily large;
+ * simply write out as much data as the hardware provides.
+ */
+ if (send_cmd->size_in > cxlm->mbox.payload_size)
+ return -EINVAL;
+
+ if (send_cmd->flags & ~CXL_MEM_COMMAND_FLAG_MASK)
+ return -EINVAL;
+
+ if (send_cmd->rsvd)
+ return -EINVAL;
+
+ /* Convert user's command into the internal representation */
+ c = &mem_commands[send_cmd->id];
+ info = &c->info;
+
+ if (info->flags & CXL_MEM_COMMAND_FLAG_KERNEL)
+ return -EPERM;
+
+ /* Check the input buffer is the expected size */
+ if (info->size_in >= 0 && info->size_in != send_cmd->size_in)
+ return -ENOMEM;
+
+ /* Check the output buffer is at least large enough */
+ if (info->size_out >= 0 && send_cmd->size_out < info->size_out)
+ return -ENOMEM;
+
+ /* Setting a few const fields here... */
+ memcpy(out_cmd, c, sizeof(*c));
+ *(s32 *)&out_cmd->info.size_in = send_cmd->size_in;
+ *(s32 *)&out_cmd->info.size_out = send_cmd->size_out;
+
+ return 0;
}
static long __cxl_memdev_ioctl(struct cxl_memdev *cxlmd, unsigned int cmd,
@@ -469,6 +637,35 @@ static long __cxl_memdev_ioctl(struct cxl_memdev *cxlmd, unsigned int cmd,
}
return 0;
+ } else if (cmd == CXL_MEM_SEND_COMMAND) {
+ struct cxl_send_command send, __user *u = (void __user *)arg;
+ struct cxl_mem_command c;
+ int rc;
+
+ dev_dbg(dev, "Send IOCTL\n");
+
+ if (copy_from_user(&send, u, sizeof(send)))
+ return -EFAULT;
+
+ rc = device_lock_interruptible(dev);
+ if (rc)
+ return rc;
+
+ if (!get_live_device(dev)) {
+ device_unlock(dev);
+ return -ENXIO;
+ }
+
+ rc = cxl_validate_cmd_from_user(cxlmd->cxlm, &send, &c);
+ if (!rc)
+ rc = handle_mailbox_cmd_from_user(cxlmd, &c,
+ send.in_payload,
+ send.out_payload, u);
+
+ put_device(dev);
+ device_unlock(dev);
+
+ return rc;
}
return -ENOTTY;
diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
index 70e3ba2fa008..9d865794a420 100644
--- a/include/uapi/linux/cxl_mem.h
+++ b/include/uapi/linux/cxl_mem.h
@@ -28,6 +28,7 @@ extern "C" {
*/
#define CXL_MEM_QUERY_COMMANDS _IOR(0xCE, 1, struct cxl_mem_query_commands)
+#define CXL_MEM_SEND_COMMAND _IOWR(0xCE, 2, struct cxl_send_command)
#define CXL_CMDS \
___C(INVALID, "Invalid Command"), \
@@ -37,6 +38,11 @@ extern "C" {
#define ___C(a, b) CXL_MEM_COMMAND_ID_##a
enum { CXL_CMDS };
+#undef ___C
+#define ___C(a, b) { b }
+static const struct {
+ const char *name;
+} cxl_command_names[] = { CXL_CMDS };
#undef ___C
/**
@@ -71,6 +77,7 @@ struct cxl_command_info {
#define CXL_MEM_COMMAND_FLAG_NONE 0
#define CXL_MEM_COMMAND_FLAG_KERNEL BIT(0)
#define CXL_MEM_COMMAND_FLAG_MUTEX BIT(1)
+#define CXL_MEM_COMMAND_FLAG_MASK GENMASK(1, 0)
__s32 size_in;
__s32 size_out;
@@ -112,6 +119,44 @@ struct cxl_mem_query_commands {
struct cxl_command_info __user commands[]; /* out: supported commands */
};
+/**
+ * struct cxl_send_command - Send a command to a memory device.
+ * @id: The command to send to the memory device. This must be one of the
+ * commands returned by the query command.
+ * @flags: Flags for the command (input).
+ * @rsvd: Must be zero.
+ * @retval: Return value from the memory device (output).
+ * @size_in: Size of the payload to provide to the device (input).
+ * @size_out: Size of the payload received from the device (input/output). This
+ * field is filled in by userspace to let the driver know how much
+ * space was allocated for output. It is populated by the driver to
+ * let userspace know how large the output payload actually was.
+ * @in_payload: Pointer to memory for payload input (little endian order).
+ * @out_payload: Pointer to memory for payload output (little endian order).
+ *
+ * Mechanism for userspace to send a command to the hardware for processing. The
+ * driver will do basic validation on the command sizes. In some cases even the
+ * payload may be introspected. Userspace is required to allocate large
+ * enough buffers for size_out which can be variable length in certain
+ * situations.
+ */
+struct cxl_send_command {
+ __u32 id;
+ __u32 flags;
+ __u32 rsvd;
+ __u32 retval;
+
+ struct {
+ __s32 size_in;
+ __u64 in_payload;
+ };
+
+ struct {
+ __s32 size_out;
+ __u64 out_payload;
+ };
+};
+
#if defined(__cplusplus)
}
#endif
--
2.30.0
The CXL memory device send interface will have a number of supported
commands. The raw command is not such a command. Raw commands allow
userspace to send a specified opcode to the underlying hardware and
bypass all driver checks on the command. This is useful for a couple of
use cases, mainly:
1. Undocumented vendor specific hardware commands
2. Prototyping new hardware commands not yet supported by the driver
While this all sounds very powerful, it comes with a couple of caveats:
1. Bug reports using raw commands will not get the same level of
attention as bug reports using supported commands (via taint).
2. Supported commands will be rejected by the RAW command.
With this comes a new debugfs knob to allow full access to your toes with
your weapon of choice.
Signed-off-by: Ben Widawsky <[email protected]>
---
Documentation/ABI/testing/debugfs-cxl | 10 ++
drivers/cxl/mem.c | 130 ++++++++++++++++++++++++--
include/uapi/linux/cxl_mem.h | 12 ++-
3 files changed, 142 insertions(+), 10 deletions(-)
create mode 100644 Documentation/ABI/testing/debugfs-cxl
diff --git a/Documentation/ABI/testing/debugfs-cxl b/Documentation/ABI/testing/debugfs-cxl
new file mode 100644
index 000000000000..37e89aaac296
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-cxl
@@ -0,0 +1,10 @@
+What: /sys/kernel/debug/cxl/mbox/raw_allow_all
+Date: January 2021
+KernelVersion: 5.12
+Description:
+ Permits "RAW" mailbox commands to be passed through to hardware
+ without driver intervention. Many such commands require
+ coordination and therefore should only be used for debugging or
+ testing.
+
+ Valid values are boolean.
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index c646f0a1cf66..2942730dc967 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright(c) 2020 Intel Corporation. All rights reserved. */
#include <uapi/linux/cxl_mem.h>
+#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/cdev.h>
@@ -40,7 +41,14 @@
enum opcode {
CXL_MBOX_OP_INVALID = 0x0000,
+#define CXL_MBOX_OP_RAW CXL_MBOX_OP_INVALID
+ CXL_MBOX_OP_ACTIVATE_FW = 0x0202,
CXL_MBOX_OP_IDENTIFY = 0x4000,
+ CXL_MBOX_OP_SET_PARTITION_INFO = 0x4101,
+ CXL_MBOX_OP_SET_LSA = 0x4103,
+ CXL_MBOX_OP_SET_SHUTDOWN_STATE = 0x4204,
+ CXL_MBOX_OP_SCAN_MEDIA = 0x4304,
+ CXL_MBOX_OP_GET_SCAN_MEDIA = 0x4305,
CXL_MBOX_OP_MAX = 0x10000
};
@@ -90,6 +98,8 @@ struct cxl_memdev {
static int cxl_mem_major;
static DEFINE_IDA(cxl_memdev_ida);
+static struct dentry *cxl_debugfs;
+static bool raw_allow_all;
/**
* struct cxl_mem_command - Driver representation of a memory device command
@@ -139,6 +149,47 @@ struct cxl_mem_command {
static struct cxl_mem_command mem_commands[] = {
CXL_CMD(INVALID, KERNEL, 0, 0, HIDDEN),
CXL_CMD(IDENTIFY, NONE, 0, 0x43, MANDATORY),
+ CXL_CMD(RAW, NONE, ~0, ~0, MANDATORY),
+};
+
+/*
+ * Commands that RAW doesn't permit. The rationale for each:
+ *
+ * CXL_MBOX_OP_ACTIVATE_FW: Firmware activation requires adjustment /
+ * coordination of transaction timeout values at the root bridge level.
+ *
+ * CXL_MBOX_OP_SET_PARTITION_INFO: The device memory map may change live
+ * and needs to be coordinated with HDM updates.
+ *
+ * CXL_MBOX_OP_SET_LSA: The label storage area may be cached by the
+ * driver and any writes from userspace invalidates those contents.
+ *
+ * CXL_MBOX_OP_SET_SHUTDOWN_STATE: Set shutdown state assumes no writes
+ * to the device after it is marked clean, userspace can not make that
+ * assertion.
+ *
+ * CXL_MBOX_OP_[GET_]SCAN_MEDIA: The kernel provides a native error list that
+ * is kept up to date with patrol notifications and error management.
+ */
+static u16 disabled_raw_commands[] = {
+ CXL_MBOX_OP_ACTIVATE_FW,
+ CXL_MBOX_OP_SET_PARTITION_INFO,
+ CXL_MBOX_OP_SET_LSA,
+ CXL_MBOX_OP_SET_SHUTDOWN_STATE,
+ CXL_MBOX_OP_SCAN_MEDIA,
+ CXL_MBOX_OP_GET_SCAN_MEDIA,
+};
+
+/*
+ * Command sets that RAW doesn't permit. All opcodes in this set are
+ * disabled because they pass plain text security payloads over the
+ * user/kernel boundary. This functionality is intended to be wrapped
+ * behind the keys ABI which allows for encrypted payloads in the UAPI
+ */
+static u8 security_command_sets[] = {
+ 0x44, /* Sanitize */
+ 0x45, /* Persistent Memory Data-at-rest Security */
+ 0x46, /* Security Passthrough */
};
#define cxl_for_each_cmd(cmd) \
@@ -180,22 +231,30 @@ static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
return 0;
}
+static bool is_security_command(u16 opcode)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(security_command_sets); i++)
+ if (security_command_sets[i] == (opcode >> 8))
+ return true;
+ return false;
+}
+
static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
struct mbox_cmd *mbox_cmd)
{
- dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
- dev_info(&cxlm->pdev->dev,
- "\topcode: 0x%04x\n"
- "\tpayload size: %zub\n",
- mbox_cmd->opcode, mbox_cmd->size_in);
+ struct device *dev = &cxlm->pdev->dev;
+
+ dev_dbg(dev, "Mailbox command (opcode: %#x size: %zub) timed out\n",
+ mbox_cmd->opcode, mbox_cmd->size_in);
- if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
+ if (!is_security_command(mbox_cmd->opcode) ||
+ IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
mbox_cmd->payload_in, mbox_cmd->size_in,
true);
}
-
- /* Here's a good place to figure out if a device reset is needed */
}
/**
@@ -458,6 +517,7 @@ static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
struct cxl_send_command __user *u)
{
struct cxl_mem *cxlm = cxlmd->cxlm;
+ struct device *dev = &cxlmd->dev;
struct mbox_cmd mbox_cmd = {
.opcode = cmd->opcode,
.payload_in = NULL, /* Populated with copy_from_user() */
@@ -495,13 +555,17 @@ static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
if (rc)
goto out;
- dev_dbg(&cxlmd->dev,
+ dev_dbg(dev,
"Submitting %s command for user\n"
"\topcode: %x\n"
"\tsize: %ub\n",
cxl_command_names[cmd->info.id].name, mbox_cmd.opcode,
cmd->info.size_in);
+ WARN_TAINT_ONCE(cmd->info.id == CXL_MEM_COMMAND_ID_RAW,
+ TAINT_RAW_PASSTHROUGH, "%s %s: raw command path used\n",
+ dev_driver_string(dev), dev_name(dev));
+
rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
cxl_mem_mbox_put(cxlm);
if (rc)
@@ -532,6 +596,23 @@ static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
return rc;
}
+static bool cxl_mem_raw_command_allowed(u16 opcode)
+{
+ int i;
+
+ if (raw_allow_all)
+ return true;
+
+ if (is_security_command(opcode))
+ return false;
+
+ for (i = 0; i < ARRAY_SIZE(disabled_raw_commands); i++)
+ if (disabled_raw_commands[i] == opcode)
+ return false;
+
+ return true;
+}
+
/**
* cxl_validate_cmd_from_user() - Check fields for CXL_MEM_SEND_COMMAND.
* @cxlm: &struct cxl_mem device whose mailbox will be used.
@@ -568,6 +649,30 @@ static int cxl_validate_cmd_from_user(struct cxl_mem *cxlm,
if (send_cmd->size_in > cxlm->mbox.payload_size)
return -EINVAL;
+ /* Checks are bypassed for raw commands but along comes the taint! */
+ if (send_cmd->id == CXL_MEM_COMMAND_ID_RAW) {
+ const struct cxl_mem_command temp = {
+ .info = {
+ .id = CXL_MEM_COMMAND_ID_RAW,
+ .flags = CXL_MEM_COMMAND_FLAG_NONE,
+ .size_in = send_cmd->size_in,
+ .size_out = send_cmd->size_out,
+ },
+ .flags = 0,
+ .opcode = send_cmd->raw.opcode
+ };
+
+ if (send_cmd->raw.rsvd)
+ return -EINVAL;
+
+ if (!cxl_mem_raw_command_allowed(send_cmd->raw.opcode))
+ return -EPERM;
+
+ memcpy(out_cmd, &temp, sizeof(temp));
+
+ return 0;
+ }
+
if (send_cmd->flags & ~CXL_MEM_COMMAND_FLAG_MASK)
return -EINVAL;
@@ -1200,6 +1305,7 @@ static __init int cxl_mem_init(void)
{
int rc;
dev_t devt;
+ struct dentry *mbox_debugfs;
rc = alloc_chrdev_region(&devt, 0, CXL_MEM_MAX_DEVS, "cxl");
if (rc)
@@ -1214,11 +1320,17 @@ static __init int cxl_mem_init(void)
return rc;
}
+ cxl_debugfs = debugfs_create_dir("cxl", NULL);
+ mbox_debugfs = debugfs_create_dir("mbox", cxl_debugfs);
+ debugfs_create_bool("raw_allow_all", 0600, mbox_debugfs,
+ &raw_allow_all);
+
return 0;
}
static __exit void cxl_mem_exit(void)
{
+ debugfs_remove_recursive(cxl_debugfs);
pci_unregister_driver(&cxl_mem_driver);
unregister_chrdev_region(MKDEV(cxl_mem_major, 0), CXL_MEM_MAX_DEVS);
}
diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
index 9d865794a420..25bfcb071c1f 100644
--- a/include/uapi/linux/cxl_mem.h
+++ b/include/uapi/linux/cxl_mem.h
@@ -33,6 +33,7 @@ extern "C" {
#define CXL_CMDS \
___C(INVALID, "Invalid Command"), \
___C(IDENTIFY, "Identify Command"), \
+ ___C(RAW, "Raw device command"), \
___C(MAX, "Last command")
#define ___C(a, b) CXL_MEM_COMMAND_ID_##a
@@ -124,6 +125,9 @@ struct cxl_mem_query_commands {
* @id: The command to send to the memory device. This must be one of the
* commands returned by the query command.
* @flags: Flags for the command (input).
+ * @raw: Special fields for raw commands
+ * @raw.opcode: Opcode passed to hardware when using the RAW command.
+ * @raw.rsvd: Must be zero.
* @rsvd: Must be zero.
* @retval: Return value from the memory device (output).
* @size_in: Size of the payload to provide to the device (input).
@@ -143,7 +147,13 @@ struct cxl_mem_query_commands {
struct cxl_send_command {
__u32 id;
__u32 flags;
- __u32 rsvd;
+ union {
+ struct {
+ __u16 opcode;
+ __u16 rsvd;
+ } raw;
+ __u32 rsvd;
+ };
__u32 retval;
struct {
--
2.30.0
CXL devices must implement the Device Command Interface (described in
8.2.9 of the CXL 2.0 spec). While the driver already maintains a list of
commands it supports, there is still a need to distinguish commands that
the driver knows about from commands that the hardware may not support.
No such commands are currently defined in the driver.
The implementation keeps the statically defined table of commands and
supplements it with a per-device bitmap that records which commands are
enabled. Multiple approaches could be taken, but this one is attractive
for a few reasons.
The main alternative considered was a per-instance table containing only
the supported commands. The static table was preferred because:
1. A fixed command id -> command mapping is much easier to manage for
development and debugging.
2. Dynamic memory allocation for a per-instance table adds unnecessary
complexity.
3. The tables for most device types are likely to be quite similar.
4. A per-instance table makes it difficult to implement helper macros
like cxl_for_each_cmd().
If the per-instance table did preserve ids, #1 above could be addressed.
However, as "enable" is currently the only mutable state for the
commands, it would add a lot of overhead for not much gain, and the
other issues would remain.
If "enable" remains the only mutable state, I believe this to be the
best solution. Once the number of mutable elements in a command grows,
it probably makes sense to move to per device instance state with a
fixed command ID mapping.
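
To make the shape of this concrete, here is a minimal sketch of how a
lookup combines the static table with the per-device bitmap added below.
This helper does not exist in the patch; the name and exact flow are
made up for illustration, but the structures it touches are the ones
introduced here.

/*
 * Illustrative sketch only: mem_commands[] preserves the fixed
 * id -> command mapping, and the per-device enabled_cmds bitmap holds
 * the only mutable state ("is this command enabled for this hardware").
 */
static int cxl_cmd_available(struct cxl_mem *cxlm, u32 id)
{
        const struct cxl_mem_command *c;

        if (id >= ARRAY_SIZE(mem_commands))
                return -ENOTTY;

        c = &mem_commands[id];
        if (c->flags & CXL_CMD_INTERNAL_FLAG_HIDDEN)
                return -ENOTTY;

        /* Set by cxl_mem_enumerate_cmds() when the hardware supports it */
        if (!test_bit(c->info.id, cxlm->enabled_cmds))
                return -ENOTTY;

        return 0;
}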
Signed-off-by: Ben Widawsky <[email protected]>
---
drivers/cxl/cxl.h | 4 ++++
drivers/cxl/mem.c | 40 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b042eee7ee25..2d2f25065b81 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -17,6 +17,9 @@
#define CXL_GET_FIELD(word, field) FIELD_GET(field##_MASK, word)
+/* XXX: Arbitrary max */
+#define CXL_MAX_COMMANDS 32
+
/* Device Capabilities (CXL 2.0 - 8.2.8.1) */
#define CXLDEV_CAP_ARRAY_OFFSET 0x0
#define CXLDEV_CAP_ARRAY_CAP_ID 0
@@ -83,6 +86,7 @@ struct cxl_mem {
} ram;
char firmware_version[0x10];
+ DECLARE_BITMAP(enabled_cmds, CXL_MAX_COMMANDS);
/* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
struct {
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 2942730dc967..d01c6ee32a6b 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -111,6 +111,8 @@ static bool raw_allow_all;
* would typically be used for deprecated commands.
* * %CXL_CMD_FLAG_MANDATORY: Hardware must support this command. This flag is
* only used internally by the driver for sanity checking.
+ * * %CXL_CMD_INTERNAL_FLAG_PSEUDO: This is a pseudo command which doesn't have
+ * a direct mapping to hardware. They are implicitly always enabled.
*
* The cxl_mem_command is the driver's internal representation of commands that
* are supported by the driver. Some of these commands may not be supported by
@@ -126,6 +128,7 @@ struct cxl_mem_command {
#define CXL_CMD_INTERNAL_FLAG_NONE 0
#define CXL_CMD_INTERNAL_FLAG_HIDDEN BIT(0)
#define CXL_CMD_INTERNAL_FLAG_MANDATORY BIT(1)
+#define CXL_CMD_INTERNAL_FLAG_PSEUDO BIT(2)
};
#define CXL_CMD(_id, _flags, sin, sout, f) \
@@ -149,7 +152,7 @@ struct cxl_mem_command {
static struct cxl_mem_command mem_commands[] = {
CXL_CMD(INVALID, KERNEL, 0, 0, HIDDEN),
CXL_CMD(IDENTIFY, NONE, 0, 0x43, MANDATORY),
- CXL_CMD(RAW, NONE, ~0, ~0, MANDATORY),
+ CXL_CMD(RAW, NONE, ~0, ~0, PSEUDO),
};
/*
@@ -683,6 +686,10 @@ static int cxl_validate_cmd_from_user(struct cxl_mem *cxlm,
c = &mem_commands[send_cmd->id];
info = &c->info;
+ /* Check that the command is enabled for hardware */
+ if (!test_bit(info->id, cxlm->enabled_cmds))
+ return -ENOTTY;
+
if (info->flags & CXL_MEM_COMMAND_FLAG_KERNEL)
return -EPERM;
@@ -1161,6 +1168,33 @@ static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
return rc;
}
+/**
+ * cxl_mem_enumerate_cmds() - Enumerate commands for a device.
+ * @cxlm: The device.
+ *
+ * Returns 0 if enumerate completed successfully.
+ *
+ * CXL devices have optional support for certain commands. This function will
+ * determine the set of supported commands for the hardware and update the
+ * enabled_cmds bitmap in the @cxlm.
+ */
+static int cxl_mem_enumerate_cmds(struct cxl_mem *cxlm)
+{
+ struct cxl_mem_command *c;
+
+ BUILD_BUG_ON(ARRAY_SIZE(mem_commands) >= CXL_MAX_COMMANDS);
+
+ /* All commands are considered enabled for now (except INVALID). */
+ cxl_for_each_cmd(c) {
+ if (c->flags & CXL_CMD_INTERNAL_FLAG_HIDDEN)
+ continue;
+
+ set_bit(c->info.id, cxlm->enabled_cmds);
+ }
+
+ return 0;
+}
+
/**
* cxl_mem_identify() - Send the IDENTIFY command to the device.
* @cxlm: The device to identify.
@@ -1280,6 +1314,10 @@ static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (rc)
return rc;
+ rc = cxl_mem_enumerate_cmds(cxlm);
+ if (rc)
+ return rc;
+
rc = cxl_mem_identify(cxlm);
if (rc)
return rc;
--
2.30.0
CXL devices contain an array of capabilities that describe the
interactions software can have with the device or the firmware running
on the device. A CXL-compliant device must implement the device status
and mailbox capabilities, and a CXL-compliant memory device must
additionally implement the memory device capability.
Each capability provides an offset within the MMIO region at which the
registers for interacting with that aspect of the CXL device are found.
For more details see 8.2.8 of the CXL 2.0 specification (see Link).
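
For orientation, the layout being enumerated looks roughly like the
sketch below. This is only an illustration based on CXL 2.0 sections
8.2.8.1 and 8.2.8.2; the driver does not define these structs, it reads
the fields with readl()/readq() and GENMASK-based accessors as in the
hunks that follow.

/* Capabilities Array Register, at offset 0x0 of the device registers */
struct cxl_device_cap_array {
        __le64 hdr;             /* Capability ID [15:0] (0x0000 for the array),
                                 * Version [23:16], Capabilities Count [47:32] */
};

/* One Capability Header per capability, each 0x10 bytes apart */
struct cxl_device_cap_header {
        __le32 id_version;      /* Capability ID [15:0], Version [23:16] */
        __le32 offset;          /* Offset of this capability's registers */
        __le32 length;          /* Length of that register block */
        __le32 reserved;
};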
Link: https://www.computeexpresslink.org/download-the-specification
Signed-off-by: Ben Widawsky <[email protected]>
---
drivers/cxl/cxl.h | 78 ++++++++++++++++++++++++++++++++++-
drivers/cxl/mem.c | 102 +++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 178 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d81d0ba4617c..a3da7f8050c4 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -4,6 +4,37 @@
#ifndef __CXL_H__
#define __CXL_H__
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/io.h>
+
+#define CXL_SET_FIELD(value, field) \
+ ({ \
+ WARN_ON(!FIELD_FIT(field##_MASK, value)); \
+ FIELD_PREP(field##_MASK, value); \
+ })
+
+#define CXL_GET_FIELD(word, field) FIELD_GET(field##_MASK, word)
+
+/* Device Capabilities (CXL 2.0 - 8.2.8.1) */
+#define CXLDEV_CAP_ARRAY_OFFSET 0x0
+#define CXLDEV_CAP_ARRAY_CAP_ID 0
+#define CXLDEV_CAP_ARRAY_ID_MASK GENMASK(15, 0)
+#define CXLDEV_CAP_ARRAY_COUNT_MASK GENMASK(47, 32)
+/* (CXL 2.0 - 8.2.8.2.1) */
+#define CXLDEV_CAP_CAP_ID_DEVICE_STATUS 0x1
+#define CXLDEV_CAP_CAP_ID_PRIMARY_MAILBOX 0x2
+#define CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX 0x3
+#define CXLDEV_CAP_CAP_ID_MEMDEV 0x4000
+
+/* CXL Device Mailbox (CXL 2.0 - 8.2.8.4) */
+#define CXLDEV_MB_CAPS_OFFSET 0x00
+#define CXLDEV_MB_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
+#define CXLDEV_MB_CTRL_OFFSET 0x04
+#define CXLDEV_MB_CMD_OFFSET 0x08
+#define CXLDEV_MB_STATUS_OFFSET 0x10
+#define CXLDEV_MB_BG_CMD_STATUS_OFFSET 0x18
+
/**
* struct cxl_mem - A CXL memory device
* @pdev: The PCI device associated with this CXL device.
@@ -12,6 +43,51 @@
struct cxl_mem {
struct pci_dev *pdev;
void __iomem *regs;
+
+ /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
+ struct {
+ void __iomem *regs;
+ } status;
+
+ /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
+ struct {
+ void __iomem *regs;
+ size_t payload_size;
+ } mbox;
+
+ /* Cap 4000h - CXL_CAP_CAP_ID_MEMDEV */
+ struct {
+ void __iomem *regs;
+ } mem;
};
-#endif
+#define cxl_reg(type) \
+ static inline void cxl_write_##type##_reg32(struct cxl_mem *cxlm, \
+ u32 reg, u32 value) \
+ { \
+ void __iomem *reg_addr = cxlm->type.regs; \
+ writel(value, reg_addr + reg); \
+ } \
+ static inline void cxl_write_##type##_reg64(struct cxl_mem *cxlm, \
+ u32 reg, u64 value) \
+ { \
+ void __iomem *reg_addr = cxlm->type.regs; \
+ writeq(value, reg_addr + reg); \
+ } \
+ static inline u32 cxl_read_##type##_reg32(struct cxl_mem *cxlm, \
+ u32 reg) \
+ { \
+ void __iomem *reg_addr = cxlm->type.regs; \
+ return readl(reg_addr + reg); \
+ } \
+ static inline u64 cxl_read_##type##_reg64(struct cxl_mem *cxlm, \
+ u32 reg) \
+ { \
+ void __iomem *reg_addr = cxlm->type.regs; \
+ return readq(reg_addr + reg); \
+ }
+
+cxl_reg(status);
+cxl_reg(mbox);
+
+#endif /* __CXL_H__ */
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index a869c8dc24cc..fa14d51243ee 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -6,6 +6,99 @@
#include "pci.h"
#include "cxl.h"
+/**
+ * cxl_mem_setup_regs() - Setup necessary MMIO.
+ * @cxlm: The CXL memory device to communicate with.
+ *
+ * Return: 0 if all necessary registers mapped.
+ *
+ * A memory device is required by spec to implement a certain set of MMIO
+ * regions. The purpose of this function is to enumerate and map those
+ * registers.
+ *
+ * XXX: Register accessors need the mappings set up by this function, so
+ * any reads or writes must be read(b|w|l|q) or write(b|w|l|q)
+ */
+static int cxl_mem_setup_regs(struct cxl_mem *cxlm)
+{
+ struct device *dev = &cxlm->pdev->dev;
+ int cap, cap_count;
+ u64 cap_array;
+
+ cap_array = readq(cxlm->regs + CXLDEV_CAP_ARRAY_OFFSET);
+ if (CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_ID) != CXLDEV_CAP_ARRAY_CAP_ID)
+ return -ENODEV;
+
+ cap_count = CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_COUNT);
+
+ for (cap = 1; cap <= cap_count; cap++) {
+ void __iomem *register_block;
+ u32 offset;
+ u16 cap_id;
+
+ cap_id = readl(cxlm->regs + cap * 0x10) & 0xffff;
+ offset = readl(cxlm->regs + cap * 0x10 + 0x4);
+ register_block = cxlm->regs + offset;
+
+ switch (cap_id) {
+ case CXLDEV_CAP_CAP_ID_DEVICE_STATUS:
+ dev_dbg(dev, "found Status capability (0x%x)\n",
+ offset);
+ cxlm->status.regs = register_block;
+ break;
+ case CXLDEV_CAP_CAP_ID_PRIMARY_MAILBOX:
+ dev_dbg(dev, "found Mailbox capability (0x%x)\n",
+ offset);
+ cxlm->mbox.regs = register_block;
+ break;
+ case CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX:
+ dev_dbg(dev,
+ "found Secondary Mailbox capability (0x%x)\n",
+ offset);
+ break;
+ case CXLDEV_CAP_CAP_ID_MEMDEV:
+ dev_dbg(dev, "found Memory Device capability (0x%x)\n",
+ offset);
+ cxlm->mem.regs = register_block;
+ break;
+ default:
+ dev_warn(dev, "Unknown cap ID: %d (0x%x)\n", cap_id,
+ offset);
+ break;
+ }
+ }
+
+ if (!cxlm->status.regs || !cxlm->mbox.regs || !cxlm->mem.regs) {
+ dev_err(dev, "registers not found: %s%s%s\n",
+ !cxlm->status.regs ? "status " : "",
+ !cxlm->mbox.regs ? "mbox " : "",
+ !cxlm->mem.regs ? "mem" : "");
+ return -ENXIO;
+ }
+
+ return 0;
+}
+
+static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
+{
+ const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
+
+ cxlm->mbox.payload_size =
+ 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
+
+ /* 8.2.8.4.3 */
+ if (cxlm->mbox.payload_size < 256) {
+ dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
+ cxlm->mbox.payload_size);
+ return -ENXIO;
+ }
+
+ dev_dbg(&cxlm->pdev->dev, "Mailbox payload sized %zu",
+ cxlm->mbox.payload_size);
+
+ return 0;
+}
+
/**
* cxl_mem_create() - Create a new &struct cxl_mem.
* @pdev: The pci device associated with the new &struct cxl_mem.
@@ -119,7 +212,14 @@ static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
}
}
- return rc;
+ if (rc)
+ return rc;
+
+ rc = cxl_mem_setup_regs(cxlm);
+ if (rc)
+ return rc;
+
+ return cxl_mem_setup_mailbox(cxlm);
}
static const struct pci_device_id cxl_mem_pci_tbl[] = {
--
2.30.0
From: Dan Williams <[email protected]>
Create the /sys/bus/cxl hierarchy to enumerate:
* Memory Devices (per-endpoint control devices)
* Memory Address Space Devices (platform address ranges with
interleaving, performance, and persistence attributes)
* Memory Regions (active provisioned memory from an address space device
that is in use as System RAM or delegated to libnvdimm as Persistent
Memory regions).
For now, only the per-endpoint control devices are registered on the
'cxl' bus. However, going forward it will provide a mechanism to
coordinate cross-device interleave.
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-cxl | 26 ++
.../driver-api/cxl/memory-devices.rst | 17 +
drivers/base/core.c | 14 +
drivers/cxl/Makefile | 3 +
drivers/cxl/bus.c | 29 ++
drivers/cxl/cxl.h | 4 +
drivers/cxl/mem.c | 308 +++++++++++++++++-
include/linux/device.h | 1 +
8 files changed, 400 insertions(+), 2 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-cxl
create mode 100644 drivers/cxl/bus.c
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
new file mode 100644
index 000000000000..fe7b87eba988
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -0,0 +1,26 @@
+What: /sys/bus/cxl/devices/memX/firmware_version
+Date: December, 2020
+KernelVersion: v5.12
+Contact: [email protected]
+Description:
+ (RO) "FW Revision" string as reported by the Identify
+ Memory Device Output Payload in the CXL-2.0
+ specification.
+
+What: /sys/bus/cxl/devices/memX/ram/size
+Date: December, 2020
+KernelVersion: v5.12
+Contact: [email protected]
+Description:
+ (RO) "Volatile Only Capacity" as reported by the
+ Identify Memory Device Output Payload in the CXL-2.0
+ specification.
+
+What: /sys/bus/cxl/devices/memX/pmem/size
+Date: December, 2020
+KernelVersion: v5.12
+Contact: [email protected]
+Description:
+ (RO) "Persistent Only Capacity" as reported by the
+ Identify Memory Device Output Payload in the CXL-2.0
+ specification.
diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/memory-devices.rst
index 43177e700d62..1bad466f9167 100644
--- a/Documentation/driver-api/cxl/memory-devices.rst
+++ b/Documentation/driver-api/cxl/memory-devices.rst
@@ -27,3 +27,20 @@ CXL Memory Device
.. kernel-doc:: drivers/cxl/mem.c
:internal:
+
+CXL Bus
+-------
+.. kernel-doc:: drivers/cxl/bus.c
+ :doc: cxl bus
+
+External Interfaces
+===================
+
+CXL IOCTL Interface
+-------------------
+
+.. kernel-doc:: include/uapi/linux/cxl_mem.h
+ :doc: UAPI
+
+.. kernel-doc:: include/uapi/linux/cxl_mem.h
+ :internal:
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 25e08e5f40bd..33432a4cbe23 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3179,6 +3179,20 @@ struct device *get_device(struct device *dev)
}
EXPORT_SYMBOL_GPL(get_device);
+/**
+ * get_live_device() - increment reference count for device iff !dead
+ * @dev: device.
+ *
+ * Forward the call to get_device() if the device is still alive. If
+ * this is called with the device_lock() held then the device is
+ * guaranteed to not die until the device_lock() is dropped.
+ */
+struct device *get_live_device(struct device *dev)
+{
+ return dev && !dev->p->dead ? get_device(dev) : NULL;
+}
+EXPORT_SYMBOL_GPL(get_live_device);
+
/**
* put_device - decrement reference count.
* @dev: device in question.
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index 4a30f7c3fc4a..a314a1891f4d 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -1,4 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_CXL_BUS) += cxl_bus.o
obj-$(CONFIG_CXL_MEM) += cxl_mem.o
+ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=CXL
+cxl_bus-y := bus.o
cxl_mem-y := mem.o
diff --git a/drivers/cxl/bus.c b/drivers/cxl/bus.c
new file mode 100644
index 000000000000..58f74796d525
--- /dev/null
+++ b/drivers/cxl/bus.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2020 Intel Corporation. All rights reserved. */
+#include <linux/device.h>
+#include <linux/module.h>
+
+/**
+ * DOC: cxl bus
+ *
+ * The CXL bus provides namespace for control devices and a rendezvous
+ * point for cross-device interleave coordination.
+ */
+struct bus_type cxl_bus_type = {
+ .name = "cxl",
+};
+EXPORT_SYMBOL_GPL(cxl_bus_type);
+
+static __init int cxl_bus_init(void)
+{
+ return bus_register(&cxl_bus_type);
+}
+
+static void cxl_bus_exit(void)
+{
+ bus_unregister(&cxl_bus_type);
+}
+
+module_init(cxl_bus_init);
+module_exit(cxl_bus_exit);
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index df3d97154b63..b042eee7ee25 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -3,6 +3,7 @@
#ifndef __CXL_H__
#define __CXL_H__
+#include <linux/range.h>
#include <linux/bitfield.h>
#include <linux/bitops.h>
@@ -62,6 +63,7 @@
#define CXLMDEV_RESET_NEEDED(status) \
(CXL_GET_FIELD(status, CXLMDEV_RESET_NEEDED) != CXLMDEV_RESET_NEEDED_NOT)
+struct cxl_memdev;
/**
* struct cxl_mem - A CXL memory device
* @pdev: The PCI device associated with this CXL device.
@@ -70,6 +72,7 @@
struct cxl_mem {
struct pci_dev *pdev;
void __iomem *regs;
+ struct cxl_memdev *cxlmd;
struct {
struct range range;
@@ -129,4 +132,5 @@ cxl_reg(status);
cxl_reg(mbox);
cxl_reg(mem);
+extern struct bus_type cxl_bus_type;
#endif /* __CXL_H__ */
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 69ed15bfa5d4..f1f5c765623f 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -1,11 +1,36 @@
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright(c) 2020 Intel Corporation. All rights reserved. */
#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/cdev.h>
+#include <linux/idr.h>
#include <linux/pci.h>
#include <linux/io.h>
#include "pci.h"
#include "cxl.h"
+/**
+ * DOC: cxl mem
+ *
+ * This implements a CXL memory device ("type-3") as it is defined by the
+ * Compute Express Link specification.
+ *
+ * The driver has several responsibilities, mainly:
+ * - Create the memX device and register on the CXL bus.
+ * - Enumerate device's register interface and map them.
+ * - Probe the device attributes to establish sysfs interface.
+ * - Provide an IOCTL interface to userspace to communicate with the device for
+ * things like firmware update.
+ * - Support management of interleave sets.
+ * - Handle and manage error conditions.
+ */
+
+/*
+ * An entire PCI topology full of devices should be enough for any
+ * config
+ */
+#define CXL_MEM_MAX_DEVS 65536
+
#define cxl_doorbell_busy(cxlm) \
(cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
CXLDEV_MB_CTRL_DOORBELL)
@@ -43,6 +68,27 @@ struct mbox_cmd {
#define CXL_MBOX_SUCCESS 0
};
+/**
+ * struct cxl_memdev - CXL bus object representing a Type-3 Memory Device
+ * @dev: driver core device object
+ * @cdev: char dev core object for ioctl operations
+ * @cxlm: pointer to the parent device driver data
+ * @ops_active: active user of @cxlm in ops handlers
+ * @ops_dead: completion when all @cxlm ops users have exited
+ * @id: id number of this memdev instance.
+ */
+struct cxl_memdev {
+ struct device dev;
+ struct cdev cdev;
+ struct cxl_mem *cxlm;
+ struct percpu_ref ops_active;
+ struct completion ops_dead;
+ int id;
+};
+
+static int cxl_mem_major;
+static DEFINE_IDA(cxl_memdev_ida);
+
static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
{
const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
@@ -270,6 +316,40 @@ static void cxl_mem_mbox_put(struct cxl_mem *cxlm)
mutex_unlock(&cxlm->mbox.mutex);
}
+static int cxl_memdev_open(struct inode *inode, struct file *file)
+{
+ struct cxl_memdev *cxlmd =
+ container_of(inode->i_cdev, typeof(*cxlmd), cdev);
+
+ file->private_data = cxlmd;
+
+ return 0;
+}
+
+static long cxl_memdev_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ struct cxl_memdev *cxlmd = file->private_data;
+ int rc = -ENOTTY;
+
+ if (!percpu_ref_tryget_live(&cxlmd->ops_active))
+ return -ENXIO;
+
+ /* TODO: ioctl body */
+
+ percpu_ref_put(&cxlmd->ops_active);
+
+ return rc;
+}
+
+static const struct file_operations cxl_memdev_fops = {
+ .owner = THIS_MODULE,
+ .open = cxl_memdev_open,
+ .unlocked_ioctl = cxl_memdev_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
+ .llseek = noop_llseek,
+};
+
/**
* cxl_mem_setup_regs() - Setup necessary MMIO.
* @cxlm: The CXL memory device to communicate with.
@@ -440,6 +520,197 @@ static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
return 0;
}
+static struct cxl_memdev *to_cxl_memdev(struct device *dev)
+{
+ return container_of(dev, struct cxl_memdev, dev);
+}
+
+static void cxl_memdev_release(struct device *dev)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+
+ percpu_ref_exit(&cxlmd->ops_active);
+ ida_free(&cxl_memdev_ida, cxlmd->id);
+ kfree(cxlmd);
+}
+
+static char *cxl_memdev_devnode(struct device *dev, umode_t *mode, kuid_t *uid,
+ kgid_t *gid)
+{
+ return kasprintf(GFP_KERNEL, "cxl/%s", dev_name(dev));
+}
+
+static ssize_t firmware_version_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_mem *cxlm = cxlmd->cxlm;
+
+ return sprintf(buf, "%.16s\n", cxlm->firmware_version);
+}
+static DEVICE_ATTR_RO(firmware_version);
+
+static ssize_t payload_max_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_mem *cxlm = cxlmd->cxlm;
+
+ return sprintf(buf, "%zu\n", cxlm->mbox.payload_size);
+}
+static DEVICE_ATTR_RO(payload_max);
+
+static ssize_t ram_size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_mem *cxlm = cxlmd->cxlm;
+ unsigned long long len = range_len(&cxlm->ram.range);
+
+ return sprintf(buf, "%#llx\n", len);
+}
+
+static struct device_attribute dev_attr_ram_size =
+ __ATTR(size, 0444, ram_size_show, NULL);
+
+static ssize_t pmem_size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+ struct cxl_mem *cxlm = cxlmd->cxlm;
+ unsigned long long len = range_len(&cxlm->pmem.range);
+
+ return sprintf(buf, "%#llx\n", len);
+}
+
+static struct device_attribute dev_attr_pmem_size =
+ __ATTR(size, 0444, pmem_size_show, NULL);
+
+static struct attribute *cxl_memdev_attributes[] = {
+ &dev_attr_firmware_version.attr,
+ &dev_attr_payload_max.attr,
+ NULL,
+};
+
+static struct attribute *cxl_memdev_pmem_attributes[] = {
+ &dev_attr_pmem_size.attr,
+ NULL,
+};
+
+static struct attribute *cxl_memdev_ram_attributes[] = {
+ &dev_attr_ram_size.attr,
+ NULL,
+};
+
+static struct attribute_group cxl_memdev_attribute_group = {
+ .attrs = cxl_memdev_attributes,
+};
+
+static struct attribute_group cxl_memdev_ram_attribute_group = {
+ .name = "ram",
+ .attrs = cxl_memdev_ram_attributes,
+};
+
+static struct attribute_group cxl_memdev_pmem_attribute_group = {
+ .name = "pmem",
+ .attrs = cxl_memdev_pmem_attributes,
+};
+
+static const struct attribute_group *cxl_memdev_attribute_groups[] = {
+ &cxl_memdev_attribute_group,
+ &cxl_memdev_ram_attribute_group,
+ &cxl_memdev_pmem_attribute_group,
+ NULL,
+};
+
+static const struct device_type cxl_memdev_type = {
+ .name = "cxl_memdev",
+ .release = cxl_memdev_release,
+ .devnode = cxl_memdev_devnode,
+ .groups = cxl_memdev_attribute_groups,
+};
+
+static void cxlmdev_unregister(void *_cxlmd)
+{
+ struct cxl_memdev *cxlmd = _cxlmd;
+ struct device *dev = &cxlmd->dev;
+
+ percpu_ref_kill(&cxlmd->ops_active);
+ cdev_device_del(&cxlmd->cdev, dev);
+ wait_for_completion(&cxlmd->ops_dead);
+ cxlmd->cxlm = NULL;
+ put_device(dev);
+}
+
+static void cxlmdev_ops_active_release(struct percpu_ref *ref)
+{
+ struct cxl_memdev *cxlmd =
+ container_of(ref, typeof(*cxlmd), ops_active);
+
+ complete(&cxlmd->ops_dead);
+}
+
+static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
+{
+ struct pci_dev *pdev = cxlm->pdev;
+ struct cxl_memdev *cxlmd;
+ struct device *dev;
+ struct cdev *cdev;
+ int rc;
+
+ cxlmd = kzalloc(sizeof(*cxlmd), GFP_KERNEL);
+ if (!cxlmd)
+ return -ENOMEM;
+ init_completion(&cxlmd->ops_dead);
+
+ /*
+ * @cxlm is deallocated when the driver unbinds so operations
+ * that are using it need to hold a live reference.
+ */
+ cxlmd->cxlm = cxlm;
+ rc = percpu_ref_init(&cxlmd->ops_active, cxlmdev_ops_active_release, 0,
+ GFP_KERNEL);
+ if (rc)
+ goto err_ref;
+
+ rc = ida_alloc_range(&cxl_memdev_ida, 0, CXL_MEM_MAX_DEVS, GFP_KERNEL);
+ if (rc < 0)
+ goto err_id;
+ cxlmd->id = rc;
+
+ dev = &cxlmd->dev;
+ device_initialize(dev);
+ dev->parent = &pdev->dev;
+ dev->bus = &cxl_bus_type;
+ dev->devt = MKDEV(cxl_mem_major, cxlmd->id);
+ dev->type = &cxl_memdev_type;
+ dev_set_name(dev, "mem%d", cxlmd->id);
+
+ cdev = &cxlmd->cdev;
+ cdev_init(cdev, &cxl_memdev_fops);
+
+ rc = cdev_device_add(cdev, dev);
+ if (rc)
+ goto err_add;
+
+ return devm_add_action_or_reset(dev->parent, cxlmdev_unregister, cxlmd);
+
+err_add:
+ ida_free(&cxl_memdev_ida, cxlmd->id);
+err_id:
+ /*
+ * Theoretically userspace could have already entered the fops,
+ * so flush ops_active.
+ */
+ percpu_ref_kill(&cxlmd->ops_active);
+ wait_for_completion(&cxlmd->ops_dead);
+ percpu_ref_exit(&cxlmd->ops_active);
+err_ref:
+ kfree(cxlmd);
+
+ return rc;
+}
+
/**
* cxl_mem_identify() - Send the IDENTIFY command to the device.
* @cxlm: The device to identify.
@@ -559,7 +830,11 @@ static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (rc)
return rc;
- return cxl_mem_identify(cxlm);
+ rc = cxl_mem_identify(cxlm);
+ if (rc)
+ return rc;
+
+ return cxl_mem_add_memdev(cxlm);
}
static const struct pci_device_id cxl_mem_pci_tbl[] = {
@@ -576,5 +851,34 @@ static struct pci_driver cxl_mem_driver = {
.probe = cxl_mem_probe,
};
+static __init int cxl_mem_init(void)
+{
+ int rc;
+ dev_t devt;
+
+ rc = alloc_chrdev_region(&devt, 0, CXL_MEM_MAX_DEVS, "cxl");
+ if (rc)
+ return rc;
+
+ cxl_mem_major = MAJOR(devt);
+
+ rc = pci_register_driver(&cxl_mem_driver);
+ if (rc) {
+ unregister_chrdev_region(MKDEV(cxl_mem_major, 0),
+ CXL_MEM_MAX_DEVS);
+ return rc;
+ }
+
+ return 0;
+}
+
+static __exit void cxl_mem_exit(void)
+{
+ pci_unregister_driver(&cxl_mem_driver);
+ unregister_chrdev_region(MKDEV(cxl_mem_major, 0), CXL_MEM_MAX_DEVS);
+}
+
MODULE_LICENSE("GPL v2");
-module_pci_driver(cxl_mem_driver);
+module_init(cxl_mem_init);
+module_exit(cxl_mem_exit);
+MODULE_IMPORT_NS(CXL);
diff --git a/include/linux/device.h b/include/linux/device.h
index 89bb8b84173e..8659deee8ae6 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -895,6 +895,7 @@ extern int (*platform_notify_remove)(struct device *dev);
*
*/
struct device *get_device(struct device *dev);
+struct device *get_live_device(struct device *dev);
void put_device(struct device *dev);
bool kill_device(struct device *dev);
--
2.30.0
Add a straightforward IOCTL that provides a mechanism for userspace to
query the supported memory device commands. CXL commands, as they appear
to userspace, are described as part of the UAPI kerneldoc. The command
list returned via this IOCTL will contain the full set of commands that
the driver supports; however, some of those commands may not be
available for use by userspace.
Memory device commands are specified in 8.2.9 of the CXL 2.0
specification. They are submitted through a mailbox mechanism specified
in 8.2.8.4.
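
A minimal userspace sketch of the intended two-pass usage follows. This
is a hypothetical example, not part of the patch: it assumes the UAPI
header below is installed as <linux/cxl_mem.h> and usable from
userspace, and that the device node appears as /dev/cxl/mem0.

#include <linux/cxl_mem.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        struct cxl_mem_query_commands probe = { .n_commands = 0 };
        struct cxl_mem_query_commands *q;
        __u32 i;
        int fd = open("/dev/cxl/mem0", O_RDWR);

        if (fd < 0)
                return 1;

        /* Pass 1: n_commands == 0 only asks how many commands exist */
        if (ioctl(fd, CXL_MEM_QUERY_COMMANDS, &probe) < 0)
                return 1;

        /* Pass 2: allocate room for that many entries and fetch them */
        q = calloc(1, sizeof(*q) + probe.n_commands * sizeof(q->commands[0]));
        if (!q)
                return 1;
        q->n_commands = probe.n_commands;
        if (ioctl(fd, CXL_MEM_QUERY_COMMANDS, q) == 0)
                for (i = 0; i < q->n_commands; i++)
                        printf("id %u: size_in %d size_out %d\n",
                               q->commands[i].id, q->commands[i].size_in,
                               q->commands[i].size_out);
        free(q);
        return 0;
}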
Reported-by: kernel test robot <[email protected]> # bug in earlier revision
Signed-off-by: Ben Widawsky <[email protected]>
---
.clang-format | 1 +
.../userspace-api/ioctl/ioctl-number.rst | 1 +
drivers/cxl/mem.c | 152 +++++++++++++++++-
include/uapi/linux/cxl_mem.h | 119 ++++++++++++++
4 files changed, 271 insertions(+), 2 deletions(-)
create mode 100644 include/uapi/linux/cxl_mem.h
diff --git a/.clang-format b/.clang-format
index 10dc5a9a61b3..3f11c8901b43 100644
--- a/.clang-format
+++ b/.clang-format
@@ -109,6 +109,7 @@ ForEachMacros:
- 'css_for_each_child'
- 'css_for_each_descendant_post'
- 'css_for_each_descendant_pre'
+ - 'cxl_for_each_cmd'
- 'device_for_each_child_node'
- 'dma_fence_chain_for_each'
- 'do_for_each_ftrace_op'
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a4c75a28c839..6eb8e634664d 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -352,6 +352,7 @@ Code Seq# Include File Comments
<mailto:[email protected]>
0xCC 00-0F drivers/misc/ibmvmc.h pseries VMC driver
0xCD 01 linux/reiserfs_fs.h
+0xCE 01-02 uapi/linux/cxl_mem.h Compute Express Link Memory Devices
0xCF 02 fs/cifs/ioctl.c
0xDB 00-0F drivers/char/mwave/mwavepub.h
0xDD 00-3F ZFCP device driver see drivers/s390/scsi/
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index f1f5c765623f..3c3ff45f01c0 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright(c) 2020 Intel Corporation. All rights reserved. */
+#include <uapi/linux/cxl_mem.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/cdev.h>
@@ -38,6 +39,7 @@
#define CXL_MAILBOX_TIMEOUT_US 2000
enum opcode {
+ CXL_MBOX_OP_INVALID = 0x0000,
CXL_MBOX_OP_IDENTIFY = 0x4000,
CXL_MBOX_OP_MAX = 0x10000
};
@@ -89,6 +91,72 @@ struct cxl_memdev {
static int cxl_mem_major;
static DEFINE_IDA(cxl_memdev_ida);
+/**
+ * struct cxl_mem_command - Driver representation of a memory device command
+ * @info: Command information as it exists for the UAPI
+ * @opcode: The actual bits used for the mailbox protocol
+ * @flags: Set of flags reflecting the state of the command.
+ *
+ * * %CXL_CMD_INTERNAL_FLAG_HIDDEN: Command is hidden from userspace. This
+ * would typically be used for deprecated commands.
+ * * %CXL_CMD_FLAG_MANDATORY: Hardware must support this command. This flag is
+ * only used internally by the driver for sanity checking.
+ *
+ * The cxl_mem_command is the driver's internal representation of commands that
+ * are supported by the driver. Some of these commands may not be supported by
+ * the hardware. The driver will use @info to validate the fields passed in by
+ * the user then submit the @opcode to the hardware.
+ *
+ * See struct cxl_command_info.
+ */
+struct cxl_mem_command {
+ const struct cxl_command_info info;
+ enum opcode opcode;
+ u32 flags;
+#define CXL_CMD_INTERNAL_FLAG_NONE 0
+#define CXL_CMD_INTERNAL_FLAG_HIDDEN BIT(0)
+#define CXL_CMD_INTERNAL_FLAG_MANDATORY BIT(1)
+};
+
+#define CXL_CMD(_id, _flags, sin, sout, f) \
+ [CXL_MEM_COMMAND_ID_##_id] = { \
+ .info = { \
+ .id = CXL_MEM_COMMAND_ID_##_id, \
+ .flags = CXL_MEM_COMMAND_FLAG_##_flags, \
+ .size_in = sin, \
+ .size_out = sout, \
+ }, \
+ .flags = CXL_CMD_INTERNAL_FLAG_##f, \
+ .opcode = CXL_MBOX_OP_##_id, \
+ }
+
+/*
+ * This table defines the supported mailbox commands for the driver. This table
+ * is made up of a UAPI structure. Non-negative values as parameters in the
+ * table will be validated against the user's input. For example, if size_in is
+ * 0, and the user passed in 1, it is an error.
+ */
+static struct cxl_mem_command mem_commands[] = {
+ CXL_CMD(INVALID, KERNEL, 0, 0, HIDDEN),
+ CXL_CMD(IDENTIFY, NONE, 0, 0x43, MANDATORY),
+};
+
+#define cxl_for_each_cmd(cmd) \
+ for ((cmd) = &mem_commands[0]; \
+ ((cmd) - mem_commands) < ARRAY_SIZE(mem_commands); (cmd)++)
+
+static inline struct cxl_mem_command *cxl_mem_find_command(u16 opcode)
+{
+ struct cxl_mem_command *c;
+
+ cxl_for_each_cmd(c) {
+ if (c->opcode == opcode)
+ return c;
+ }
+
+ return NULL;
+}
+
static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
{
const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
@@ -155,6 +223,7 @@ static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
struct mbox_cmd *mbox_cmd)
{
void __iomem *payload = cxlm->mbox.regs + CXLDEV_MB_PAYLOAD_OFFSET;
+ const struct cxl_mem_command *cmd;
u64 cmd_reg, status_reg;
size_t out_len;
int rc;
@@ -179,6 +248,13 @@ static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
* make sense).
*/
+ cmd = cxl_mem_find_command(mbox_cmd->opcode);
+ if (!cmd) {
+ dev_info(&cxlm->pdev->dev,
+ "Unknown opcode 0x%04x being sent to hardware\n",
+ mbox_cmd->opcode);
+ }
+
/* #1 */
if (cxl_doorbell_busy(cxlm)) {
dev_err_ratelimited(&cxlm->pdev->dev,
@@ -225,6 +301,19 @@ static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
cmd_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET);
out_len = CXL_GET_FIELD(cmd_reg, CXLDEV_MB_CMD_PAYLOAD_LENGTH);
+ /*
+ * If the command had a fixed size output, but the hardware did
+ * something unexpected, just print an error and move on. It would be
+ * worth sending a bug report.
+ */
+ if (cmd && cmd->info.size_out >= 0 && out_len != cmd->info.size_out) {
+ bool too_big = out_len > cmd->info.size_out;
+
+ dev_err(&cxlm->pdev->dev,
+ "payload was %s than driver expectations\n",
+ too_big ? "larger" : "smaller");
+ }
+
/* #8 */
if (out_len && mbox_cmd->payload_out)
memcpy_fromio(mbox_cmd->payload_out, payload, out_len);
@@ -326,16 +415,75 @@ static int cxl_memdev_open(struct inode *inode, struct file *file)
return 0;
}
+static int cxl_mem_count_commands(void)
+{
+ struct cxl_mem_command *c;
+ int n = 0;
+
+ cxl_for_each_cmd(c) {
+ if (c->flags & CXL_CMD_INTERNAL_FLAG_HIDDEN)
+ continue;
+ n++;
+ }
+
+ return n;
+}
+
+static long __cxl_memdev_ioctl(struct cxl_memdev *cxlmd, unsigned int cmd,
+ unsigned long arg)
+{
+ struct device *dev = &cxlmd->dev;
+
+ if (cmd == CXL_MEM_QUERY_COMMANDS) {
+ struct cxl_mem_query_commands __user *q = (void __user *)arg;
+ struct cxl_mem_command *cmd;
+ u32 n_commands;
+ int j = 0;
+
+ dev_dbg(dev, "Query IOCTL\n");
+
+ if (get_user(n_commands, &q->n_commands))
+ return -EFAULT;
+
+ /* returns the total number if 0 elements are requested. */
+ if (n_commands == 0)
+ return put_user(cxl_mem_count_commands(),
+ &q->n_commands);
+
+ /*
+ * otherwise, return min(n_commands, total commands)
+ * cxl_command_info structures.
+ */
+ cxl_for_each_cmd(cmd) {
+ const struct cxl_command_info *info = &cmd->info;
+
+ if (cmd->flags & CXL_CMD_INTERNAL_FLAG_HIDDEN)
+ continue;
+
+ if (copy_to_user(&q->commands[j++], info,
+ sizeof(*info)))
+ return -EFAULT;
+
+ if (j == n_commands)
+ break;
+ }
+
+ return 0;
+ }
+
+ return -ENOTTY;
+}
+
static long cxl_memdev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
struct cxl_memdev *cxlmd = file->private_data;
- int rc = -ENOTTY;
+ int rc;
if (!percpu_ref_tryget_live(&cxlmd->ops_active))
return -ENXIO;
- /* TODO: ioctl body */
+ rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
percpu_ref_put(&cxlmd->ops_active);
diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
new file mode 100644
index 000000000000..70e3ba2fa008
--- /dev/null
+++ b/include/uapi/linux/cxl_mem.h
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * CXL IOCTLs for Memory Devices
+ */
+
+#ifndef _UAPI_CXL_MEM_H_
+#define _UAPI_CXL_MEM_H_
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+#include <linux/types.h>
+
+/**
+ * DOC: UAPI
+ *
+ * CXL memory devices expose UAPI to have a standard user interface.
+ * Userspace can refer to these structure definitions and UAPI formats
+ * to communicate with the driver. The commands themselves are somewhat obfuscated
+ * with macro magic. They have the form CXL_MEM_COMMAND_ID_<name>.
+ *
+ * For example "CXL_MEM_COMMAND_ID_INVALID"
+ *
+ * Not all of the commands that the driver supports are always available for use
+ * by userspace. Userspace must check the results from the QUERY command in
+ * order to determine the live set of commands.
+ */
+
+#define CXL_MEM_QUERY_COMMANDS _IOR(0xCE, 1, struct cxl_mem_query_commands)
+
+#define CXL_CMDS \
+ ___C(INVALID, "Invalid Command"), \
+ ___C(IDENTIFY, "Identify Command"), \
+ ___C(MAX, "Last command")
+
+#define ___C(a, b) CXL_MEM_COMMAND_ID_##a
+enum { CXL_CMDS };
+
+#undef ___C
+
+/**
+ * struct cxl_command_info - Command information returned from a query.
+ * @id: ID number for the command.
+ * @flags: Flags that specify command behavior.
+ *
+ * * %CXL_MEM_COMMAND_FLAG_KERNEL: This command is reserved for exclusive
+ * kernel use.
+ * * %CXL_MEM_COMMAND_FLAG_MUTEX: This command may require coordination with
+ * the kernel in order to complete successfully.
+ *
+ * @size_in: Expected input size, or -1 if variable length.
+ * @size_out: Expected output size, or -1 if variable length.
+ *
+ * Represents a single command that is supported by both the driver and the
+ * hardware. This is returned as part of an array from the query ioctl. The
+ * following would be a command named "foobar" that takes a variable length
+ * input and returns 0 bytes of output.
+ *
+ * - @id = 10
+ * - @flags = CXL_MEM_COMMAND_FLAG_MUTEX
+ * - @size_in = -1
+ * - @size_out = 0
+ *
+ * See struct cxl_mem_query_commands.
+ */
+struct cxl_command_info {
+ __u32 id;
+
+ __u32 flags;
+#define CXL_MEM_COMMAND_FLAG_NONE 0
+#define CXL_MEM_COMMAND_FLAG_KERNEL BIT(0)
+#define CXL_MEM_COMMAND_FLAG_MUTEX BIT(1)
+
+ __s32 size_in;
+ __s32 size_out;
+};
+
+/**
+ * struct cxl_mem_query_commands - Query supported commands.
+ * @n_commands: In/out parameter. When @n_commands is > 0, the driver will
+ * return min(num_support_commands, n_commands). When @n_commands
+ * is 0, driver will return the number of total supported commands.
+ * @rsvd: Reserved for future use.
+ * @commands: Output array of supported commands. This array must be allocated
+ * by userspace to be at least min(num_support_commands, @n_commands)
+ *
+ * Allow userspace to query the available commands supported by both the driver
+ * and the hardware. Commands that aren't supported by either the driver or the
+ * hardware are not returned in the query.
+ *
+ * Examples:
+ *
+ * - { .n_commands = 0 } // Get number of supported commands
+ * - { .n_commands = 15, .commands = buf } // Return first 15 (or less)
+ * supported commands
+ *
+ * See struct cxl_command_info.
+ */
+struct cxl_mem_query_commands {
+ /*
+ * Input: Number of commands to return (space allocated by user)
+ * Output: Number of commands supported by the driver/hardware
+ *
+ * If n_commands is 0, kernel will only return number of commands and
+ * not try to populate commands[], thus allowing userspace to know how
+ * much space to allocate
+ */
+ __u32 n_commands;
+ __u32 rsvd;
+
+ struct cxl_command_info __user commands[]; /* out: supported commands */
+};
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif
--
2.30.0
On Fri, 29 Jan 2021, Ben Widawsky wrote:
> Provide enough functionality to utilize the mailbox of a memory device.
> The mailbox is used to interact with the firmware running on the memory
> device.
>
> The CXL specification defines separate capabilities for the mailbox and
> the memory device. The mailbox interface has a doorbell to indicate
> ready to accept commands and the memory device has a capability register
> that indicates the mailbox interface is ready. The expectation is that
> the doorbell-ready is always later than the memory-device-indication
> that the mailbox is ready.
>
> Create a function to handle sending a command, optionally with a
> payload, to the memory device, polling on a result, and then optionally
> copying out the payload. The algorithm for doing this comes straight out
> of the CXL 2.0 specification.
>
> Primary mailboxes are capable of generating an interrupt when submitting
> a command in the background. That implementation is saved for a later
> time.
>
> Secondary mailboxes aren't implemented at this time.
>
> The flow is proven with one implemented command, "identify". Because the
> class code has already told the driver this is a memory device and the
> identify command is mandatory.
>
> Signed-off-by: Ben Widawsky <[email protected]>
> ---
> drivers/cxl/Kconfig | 14 ++
> drivers/cxl/cxl.h | 39 +++++
> drivers/cxl/mem.c | 342 +++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 394 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 3b66b46af8a0..fe591f74af96 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -32,4 +32,18 @@ config CXL_MEM
> Chapter 2.3 Type 3 CXL Device in the CXL 2.0 specification.
>
> If unsure say 'm'.
> +
> +config CXL_MEM_INSECURE_DEBUG
> + bool "CXL.mem debugging"
> + depends on CXL_MEM
> + help
> + Enable debug of all CXL command payloads.
> +
> + Some CXL devices and controllers support encryption and other
> + security features. The payloads for the commands that enable
> + those features may contain sensitive clear-text security
> + material. Disable debug of those command payloads by default.
> + If you are a kernel developer actively working on CXL
> + security enabling say Y, otherwise say N.
Not specific to this patch, but the reference to encryption made me
curious about integrity: are all CXL.mem devices compatible with DIMP?
Some? None?
> +
> endif
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index a3da7f8050c4..df3d97154b63 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -31,9 +31,36 @@
> #define CXLDEV_MB_CAPS_OFFSET 0x00
> #define CXLDEV_MB_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
> #define CXLDEV_MB_CTRL_OFFSET 0x04
> +#define CXLDEV_MB_CTRL_DOORBELL BIT(0)
> #define CXLDEV_MB_CMD_OFFSET 0x08
> +#define CXLDEV_MB_CMD_COMMAND_OPCODE_MASK GENMASK(15, 0)
> +#define CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK GENMASK(36, 16)
> #define CXLDEV_MB_STATUS_OFFSET 0x10
> +#define CXLDEV_MB_STATUS_RET_CODE_MASK GENMASK(47, 32)
> #define CXLDEV_MB_BG_CMD_STATUS_OFFSET 0x18
> +#define CXLDEV_MB_PAYLOAD_OFFSET 0x20
> +
> +/* Memory Device (CXL 2.0 - 8.2.8.5.1.1) */
> +#define CXLMDEV_STATUS_OFFSET 0x0
> +#define CXLMDEV_DEV_FATAL BIT(0)
> +#define CXLMDEV_FW_HALT BIT(1)
> +#define CXLMDEV_STATUS_MEDIA_STATUS_MASK GENMASK(3, 2)
> +#define CXLMDEV_MS_NOT_READY 0
> +#define CXLMDEV_MS_READY 1
> +#define CXLMDEV_MS_ERROR 2
> +#define CXLMDEV_MS_DISABLED 3
> +#define CXLMDEV_READY(status) \
> + (CXL_GET_FIELD(status, CXLMDEV_STATUS_MEDIA_STATUS) == CXLMDEV_MS_READY)
> +#define CXLMDEV_MBOX_IF_READY BIT(4)
> +#define CXLMDEV_RESET_NEEDED_SHIFT 5
> +#define CXLMDEV_RESET_NEEDED_MASK GENMASK(7, 5)
> +#define CXLMDEV_RESET_NEEDED_NOT 0
> +#define CXLMDEV_RESET_NEEDED_COLD 1
> +#define CXLMDEV_RESET_NEEDED_WARM 2
> +#define CXLMDEV_RESET_NEEDED_HOT 3
> +#define CXLMDEV_RESET_NEEDED_CXL 4
> +#define CXLMDEV_RESET_NEEDED(status) \
> + (CXL_GET_FIELD(status, CXLMDEV_RESET_NEEDED) != CXLMDEV_RESET_NEEDED_NOT)
>
> /**
> * struct cxl_mem - A CXL memory device
> @@ -44,6 +71,16 @@ struct cxl_mem {
> struct pci_dev *pdev;
> void __iomem *regs;
>
> + struct {
> + struct range range;
> + } pmem;
> +
> + struct {
> + struct range range;
> + } ram;
> +
> + char firmware_version[0x10];
> +
> /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
> struct {
> void __iomem *regs;
> @@ -51,6 +88,7 @@ struct cxl_mem {
>
> /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
> struct {
> + struct mutex mutex; /* Protects device mailbox and firmware */
> void __iomem *regs;
> size_t payload_size;
> } mbox;
> @@ -89,5 +127,6 @@ struct cxl_mem {
>
> cxl_reg(status);
> cxl_reg(mbox);
> +cxl_reg(mem);
>
> #endif /* __CXL_H__ */
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index fa14d51243ee..69ed15bfa5d4 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -6,6 +6,270 @@
> #include "pci.h"
> #include "cxl.h"
>
> +#define cxl_doorbell_busy(cxlm) \
> + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
> + CXLDEV_MB_CTRL_DOORBELL)
> +
> +#define CXL_MAILBOX_TIMEOUT_US 2000
This should be _MS?
> +
> +enum opcode {
> + CXL_MBOX_OP_IDENTIFY = 0x4000,
> + CXL_MBOX_OP_MAX = 0x10000
> +};
> +
> +/**
> + * struct mbox_cmd - A command to be submitted to hardware.
> + * @opcode: (input) The command set and command submitted to hardware.
> + * @payload_in: (input) Pointer to the input payload.
> + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> + * the caller.
> + * @size_in: (input) Number of bytes to load from @payload.
> + * @size_out: (output) Number of bytes loaded into @payload.
> + * @return_code: (output) Error code returned from hardware.
> + *
> + * This is the primary mechanism used to send commands to the hardware.
> + * All the fields except @payload_* correspond exactly to the fields described in
> + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> + * @payload_out are written to, and read from the Command Payload Registers
> + * defined in (8.2.8.4.8).
> + */
> +struct mbox_cmd {
> + u16 opcode;
> + void *payload_in;
> + void *payload_out;
> + size_t size_in;
> + size_t size_out;
> + u16 return_code;
> +#define CXL_MBOX_SUCCESS 0
> +};
> +
> +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> +{
> + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> + const unsigned long start = jiffies;
> + unsigned long end = start;
> +
> + while (cxl_doorbell_busy(cxlm)) {
> + end = jiffies;
> +
> + if (time_after(end, start + timeout)) {
> + /* Check again in case preempted before timeout test */
> + if (!cxl_doorbell_busy(cxlm))
> + break;
> + return -ETIMEDOUT;
> + }
> + cpu_relax();
> + }
> +
> + dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
> + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> + return 0;
> +}
> +
> +static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
> + struct mbox_cmd *mbox_cmd)
> +{
> + dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
> + dev_info(&cxlm->pdev->dev,
> + "\topcode: 0x%04x\n"
> + "\tpayload size: %zub\n",
> + mbox_cmd->opcode, mbox_cmd->size_in);
> +
> + if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
> + print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
> + mbox_cmd->payload_in, mbox_cmd->size_in,
> + true);
> + }
> +
> + /* Here's a good place to figure out if a device reset is needed */
What are the implications if we don't do a reset, as this implementation
does not? IOW, does a timeout require a device to be recovered through a
reset before it can receive additional commands, or is it safe to simply
drop the command that timed out on the floor and proceed?
> +}
> +
> +/**
> + * cxl_mem_mbox_send_cmd() - Send a mailbox command to a memory device.
> + * @cxlm: The CXL memory device to communicate with.
> + * @mbox_cmd: Command to send to the memory device.
> + *
> + * Context: Any context. Expects mbox_lock to be held.
> + * Return: -ETIMEDOUT if timeout occurred waiting for completion. 0 on success.
> + * Caller should check the return code in @mbox_cmd to make sure it
> + * succeeded.
> + *
> + * This is a generic form of the CXL mailbox send command, thus the only I/O
> + * operations used are cxl_read_mbox_reg(). Memory devices, and perhaps other
> + * types of CXL devices may have further information available upon error
> + * conditions.
> + *
> + * The CXL spec allows for up to two mailboxes. The intention is for the primary
> + * mailbox to be OS controlled and the secondary mailbox to be used by system
> + * firmware. This allows the OS and firmware to communicate with the device and
> + * not need to coordinate with each other. The driver only uses the primary
> + * mailbox.
> + */
> +static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
> + struct mbox_cmd *mbox_cmd)
> +{
> + void __iomem *payload = cxlm->mbox.regs + CXLDEV_MB_PAYLOAD_OFFSET;
Do you need to verify the payload is non-empty per 8.2.8.4?
> + u64 cmd_reg, status_reg;
> + size_t out_len;
> + int rc;
> +
> + lockdep_assert_held(&cxlm->mbox.mutex);
> +
> + /*
> + * Here are the steps from 8.2.8.4 of the CXL 2.0 spec.
> + * 1. Caller reads MB Control Register to verify doorbell is clear
> + * 2. Caller writes Command Register
> + * 3. Caller writes Command Payload Registers if input payload is non-empty
> + * 4. Caller writes MB Control Register to set doorbell
> + * 5. Caller either polls for doorbell to be clear or waits for interrupt if configured
> + * 6. Caller reads MB Status Register to fetch Return code
> + * 7. If command successful, Caller reads Command Register to get Payload Length
> + * 8. If output payload is non-empty, host reads Command Payload Registers
> + *
> + * Hardware is free to do whatever it wants before the doorbell is
> + * rung, and isn't allowed to change anything after it clears the
> + * doorbell. As such, steps 2 and 3 can happen in any order, and steps
> + * 6, 7, 8 can also happen in any order (though some orders might not
> + * make sense).
I'd remove the indent from this paragraph since it's not part of the spec,
it's an inference based on the constraints imposed by the spec.
> + */
> +
> + /* #1 */
> + if (cxl_doorbell_busy(cxlm)) {
> + dev_err_ratelimited(&cxlm->pdev->dev,
> + "Mailbox re-busy after acquiring\n");
> + return -EBUSY;
> + }
> +
> + cmd_reg = CXL_SET_FIELD(mbox_cmd->opcode, CXLDEV_MB_CMD_COMMAND_OPCODE);
> + if (mbox_cmd->size_in) {
> + if (WARN_ON(!mbox_cmd->payload_in))
> + return -EINVAL;
> +
> + cmd_reg |= CXL_SET_FIELD(mbox_cmd->size_in,
> + CXLDEV_MB_CMD_PAYLOAD_LENGTH);
> + memcpy_toio(payload, mbox_cmd->payload_in, mbox_cmd->size_in);
> + }
> +
> + /* #2, #3 */
> + cxl_write_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET, cmd_reg);
> +
> + /* #4 */
> + dev_dbg(&cxlm->pdev->dev, "Sending command\n");
> + cxl_write_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET,
> + CXLDEV_MB_CTRL_DOORBELL);
> +
> + /* #5 */
> + rc = cxl_mem_wait_for_doorbell(cxlm);
> + if (rc == -ETIMEDOUT) {
> + cxl_mem_mbox_timeout(cxlm, mbox_cmd);
> + return rc;
> + }
> +
> + /* #6 */
> + status_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_STATUS_OFFSET);
> + mbox_cmd->return_code =
> + CXL_GET_FIELD(status_reg, CXLDEV_MB_STATUS_RET_CODE);
> +
> + if (mbox_cmd->return_code != 0) {
> + dev_dbg(&cxlm->pdev->dev, "Mailbox operation had an error\n");
> + return 0;
> + }
> +
> + /* #7 */
> + cmd_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET);
> + out_len = CXL_GET_FIELD(cmd_reg, CXLDEV_MB_CMD_PAYLOAD_LENGTH);
> +
> + /* #8 */
> + if (out_len && mbox_cmd->payload_out)
> + memcpy_fromio(mbox_cmd->payload_out, payload, out_len);
> +
> + mbox_cmd->size_out = out_len;
> +
> + return 0;
> +}
> +
> +/**
> + * cxl_mem_mbox_get() - Acquire exclusive access to the mailbox.
> + * @cxlm: The memory device to gain access to.
> + *
> + * Context: Any context. Takes the mbox_lock.
> + * Return: 0 if exclusive access was acquired.
> + */
> +static int cxl_mem_mbox_get(struct cxl_mem *cxlm)
> +{
> + struct device *dev = &cxlm->pdev->dev;
> + int rc = -EBUSY;
> + u64 md_status;
> +
> + mutex_lock_io(&cxlm->mbox.mutex);
> +
> + /*
> + * XXX: There is some amount of ambiguity in the 2.0 version of the spec
> + * around the mailbox interface ready (8.2.8.5.1.1). The purpose of the
> + * bit is to allow firmware running on the device to notify the driver
> + * that it's ready to receive commands. It is unclear if the bit needs
> + * to be read for each transaction mailbox, ie. the firmware can switch
> + * it on and off as needed. Second, there is no defined timeout for
> + * mailbox ready, like there is for the doorbell interface.
> + *
> + * Assumptions:
> + * 1. The firmware might toggle the Mailbox Interface Ready bit, check
> + * it for every command.
> + *
> + * 2. If the doorbell is clear, the firmware should have first set the
> + * Mailbox Interface Ready bit. Therefore, waiting for the doorbell
> + * to be ready is sufficient.
> + */
> + rc = cxl_mem_wait_for_doorbell(cxlm);
> + if (rc) {
> + dev_warn(dev, "Mailbox interface not ready\n");
> + goto out;
> + }
> +
> + md_status = cxl_read_mem_reg64(cxlm, CXLMDEV_STATUS_OFFSET);
> + if (!(md_status & CXLMDEV_MBOX_IF_READY && CXLMDEV_READY(md_status))) {
> + dev_err(dev,
> + "mbox: reported doorbell ready, but not mbox ready\n");
> + goto out;
> + }
> +
> + /*
> + * Hardware shouldn't allow a ready status but also have failure bits
> + * set. Spit out an error, this should be a bug report
> + */
> + rc = -EFAULT;
> + if (md_status & CXLMDEV_DEV_FATAL) {
> + dev_err(dev, "mbox: reported ready, but fatal\n");
> + goto out;
> + }
> + if (md_status & CXLMDEV_FW_HALT) {
> + dev_err(dev, "mbox: reported ready, but halted\n");
> + goto out;
> + }
> + if (CXLMDEV_RESET_NEEDED(md_status)) {
> + dev_err(dev, "mbox: reported ready, but reset needed\n");
> + goto out;
> + }
> +
> + /* with lock held */
> + return 0;
> +
> +out:
> + mutex_unlock(&cxlm->mbox.mutex);
> + return rc;
> +}
> +
> +/**
> + * cxl_mem_mbox_put() - Release exclusive access to the mailbox.
> + * @cxlm: The CXL memory device to communicate with.
> + *
> + * Context: Any context. Expects mbox_lock to be held.
> + */
> +static void cxl_mem_mbox_put(struct cxl_mem *cxlm)
> +{
> + mutex_unlock(&cxlm->mbox.mutex);
> +}
> +
> /**
> * cxl_mem_setup_regs() - Setup necessary MMIO.
> * @cxlm: The CXL memory device to communicate with.
> @@ -142,6 +406,8 @@ static struct cxl_mem *cxl_mem_create(struct pci_dev *pdev, u32 reg_lo,
> return NULL;
> }
>
> + mutex_init(&cxlm->mbox.mutex);
> +
> regs = pcim_iomap_table(pdev)[bar];
> cxlm->pdev = pdev;
> cxlm->regs = regs + offset;
> @@ -174,6 +440,76 @@ static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> return 0;
> }
>
> +/**
> + * cxl_mem_identify() - Send the IDENTIFY command to the device.
> + * @cxlm: The device to identify.
> + *
> + * Return: 0 if identify was executed successfully.
> + *
> + * This will dispatch the identify command to the device and on success populate
> + * structures to be exported to sysfs.
> + */
> +static int cxl_mem_identify(struct cxl_mem *cxlm)
> +{
> + struct cxl_mbox_identify {
> + char fw_revision[0x10];
> + __le64 total_capacity;
> + __le64 volatile_capacity;
> + __le64 persistent_capacity;
> + __le64 partition_align;
> + __le16 info_event_log_size;
> + __le16 warning_event_log_size;
> + __le16 failure_event_log_size;
> + __le16 fatal_event_log_size;
> + __le32 lsa_size;
> + u8 poison_list_max_mer[3];
> + __le16 inject_poison_limit;
> + u8 poison_caps;
> + u8 qos_telemetry_caps;
> + } __packed id;
> + struct mbox_cmd mbox_cmd;
Just initialize mbox_cmd here?
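i.e. something like this (just a sketch, folding the compound literal into the
declaration since id is already declared above):

	struct mbox_cmd mbox_cmd = {
		.opcode = CXL_MBOX_OP_IDENTIFY,
		.payload_out = &id,
		.size_in = 0,
	};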
> + int rc;
> +
> + /* Retrieve initial device memory map */
> + rc = cxl_mem_mbox_get(cxlm);
> + if (rc)
> + return rc;
> +
> + mbox_cmd = (struct mbox_cmd){
> + .opcode = CXL_MBOX_OP_IDENTIFY,
> + .payload_out = &id,
> + .size_in = 0,
> + };
> + rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
> + cxl_mem_mbox_put(cxlm);
> + if (rc)
> + return rc;
> +
> + /* TODO: Handle retry or reset responses from firmware. */
> + if (mbox_cmd.return_code != CXL_MBOX_SUCCESS) {
> + dev_err(&cxlm->pdev->dev, "Mailbox command failed (%d)\n",
> + mbox_cmd.return_code);
> + return -ENXIO;
> + }
> +
> + if (mbox_cmd.size_out != sizeof(id))
> + return -ENXIO;
> +
> + /*
> + * TODO: enumerate DPA map, as 'ram' and 'pmem' do not alias.
> + * For now, only the capacity is exported in sysfs
> + */
> + cxlm->ram.range.start = 0;
> + cxlm->ram.range.end = le64_to_cpu(id.volatile_capacity) - 1;
> +
> + cxlm->pmem.range.start = 0;
> + cxlm->pmem.range.end = le64_to_cpu(id.persistent_capacity) - 1;
> +
> + memcpy(cxlm->firmware_version, id.fw_revision, sizeof(id.fw_revision));
> +
> + return rc;
> +}
> +
> static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> {
> struct device *dev = &pdev->dev;
> @@ -219,7 +555,11 @@ static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> - return cxl_mem_setup_mailbox(cxlm);
> + rc = cxl_mem_setup_mailbox(cxlm);
> + if (rc)
> + return rc;
> +
> + return cxl_mem_identify(cxlm);
> }
>
> static const struct pci_device_id cxl_mem_pci_tbl[] = {
> --
> 2.30.0
>
>
On Fri, 29 Jan 2021, Ben Widawsky wrote:
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> new file mode 100644
> index 000000000000..fe7b87eba988
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -0,0 +1,26 @@
> +What: /sys/bus/cxl/devices/memX/firmware_version
> +Date: December, 2020
> +KernelVersion: v5.12
> +Contact: [email protected]
> +Description:
> + (RO) "FW Revision" string as reported by the Identify
> + Memory Device Output Payload in the CXL-2.0
> + specification.
> +
> +What: /sys/bus/cxl/devices/memX/ram/size
> +Date: December, 2020
> +KernelVersion: v5.12
> +Contact: [email protected]
> +Description:
> + (RO) "Volatile Only Capacity" as reported by the
> + Identify Memory Device Output Payload in the CXL-2.0
> + specification.
> +
> +What: /sys/bus/cxl/devices/memX/pmem/size
> +Date: December, 2020
> +KernelVersion: v5.12
> +Contact: [email protected]
> +Description:
> + (RO) "Persistent Only Capacity" as reported by the
> + Identify Memory Device Output Payload in the CXL-2.0
> + specification.
Aren't volatile and persistent capacities expressed in multiples of 256MB?
On Fri, 29 Jan 2021, Ben Widawsky wrote:
> +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> +{
> + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> +
> + cxlm->mbox.payload_size =
> + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> +
> + /* 8.2.8.4.3 */
> + if (cxlm->mbox.payload_size < 256) {
> + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> + cxlm->mbox.payload_size);
> + return -ENXIO;
> + }
Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
return ENXIO if true?
On 21-01-30 15:51:49, David Rientjes wrote:
> On Fri, 29 Jan 2021, Ben Widawsky wrote:
>
> > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > +{
> > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > +
> > + cxlm->mbox.payload_size =
> > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > +
> > + /* 8.2.8.4.3 */
> > + if (cxlm->mbox.payload_size < 256) {
> > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > + cxlm->mbox.payload_size);
> > + return -ENXIO;
> > + }
>
> Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> return ENXIO if true?
If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
driver not allow it?
I'm open to changing it, it just seems like a larger mailbox wouldn't be fatal.
On 21-01-30 15:52:01, David Rientjes wrote:
> On Fri, 29 Jan 2021, Ben Widawsky wrote:
>
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > new file mode 100644
> > index 000000000000..fe7b87eba988
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -0,0 +1,26 @@
> > +What: /sys/bus/cxl/devices/memX/firmware_version
> > +Date: December, 2020
> > +KernelVersion: v5.12
> > +Contact: [email protected]
> > +Description:
> > + (RO) "FW Revision" string as reported by the Identify
> > + Memory Device Output Payload in the CXL-2.0
> > + specification.
> > +
> > +What: /sys/bus/cxl/devices/memX/ram/size
> > +Date: December, 2020
> > +KernelVersion: v5.12
> > +Contact: [email protected]
> > +Description:
> > + (RO) "Volatile Only Capacity" as reported by the
> > + Identify Memory Device Output Payload in the CXL-2.0
> > + specification.
> > +
> > +What: /sys/bus/cxl/devices/memX/pmem/size
> > +Date: December, 2020
> > +KernelVersion: v5.12
> > +Contact: [email protected]
> > +Description:
> > + (RO) "Persistent Only Capacity" as reported by the
> > + Identify Memory Device Output Payload in the CXL-2.0
> > + specification.
>
> Aren't volatile and persistent capacities expressed in multiples of 256MB?
As of the spec today, volatile and persistent capacities are required to be
in multiples of 256MB; however, future specs may not have such a requirement, and
I think keeping the sysfs ABI easily forward-portable makes sense.
> +static int cxl_mem_setup_regs(struct cxl_mem *cxlm)
> +{
> + struct device *dev = &cxlm->pdev->dev;
> + int cap, cap_count;
> + u64 cap_array;
> +
> + cap_array = readq(cxlm->regs + CXLDEV_CAP_ARRAY_OFFSET);
> + if (CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_ID) != CXLDEV_CAP_ARRAY_CAP_ID)
> + return -ENODEV;
> +
> + cap_count = CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_COUNT);
> +
> + for (cap = 1; cap <= cap_count; cap++) {
> + void __iomem *register_block;
> + u32 offset;
> + u16 cap_id;
> +
> + cap_id = readl(cxlm->regs + cap * 0x10) & 0xffff;
> + offset = readl(cxlm->regs + cap * 0x10 + 0x4);
> + register_block = cxlm->regs + offset;
> +
> + switch (cap_id) {
> + case CXLDEV_CAP_CAP_ID_DEVICE_STATUS:
> + dev_dbg(dev, "found Status capability (0x%x)\n",
> + offset);
That 80 character limit is no longer a requirement. Can you just make
this one line? And perhaps change 'found' to 'Found' ?
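E.g. (just as illustration):

	dev_dbg(dev, "Found Status capability (0x%x)\n", offset);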
> + cxlm->status.regs = register_block;
> + break;
> + case CXLDEV_CAP_CAP_ID_PRIMARY_MAILBOX:
> + dev_dbg(dev, "found Mailbox capability (0x%x)\n",
> + offset);
> + cxlm->mbox.regs = register_block;
> + break;
> + case CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX:
> + dev_dbg(dev,
> + "found Secondary Mailbox capability (0x%x)\n",
> + offset);
> + break;
> + case CXLDEV_CAP_CAP_ID_MEMDEV:
> + dev_dbg(dev, "found Memory Device capability (0x%x)\n",
> + offset);
> + cxlm->mem.regs = register_block;
> + break;
> + default:
> + dev_warn(dev, "Unknown cap ID: %d (0x%x)\n", cap_id,
> + offset);
> + break;
> + }
> + }
> +
> + if (!cxlm->status.regs || !cxlm->mbox.regs || !cxlm->mem.regs) {
> + dev_err(dev, "registers not found: %s%s%s\n",
> + !cxlm->status.regs ? "status " : "",
> + !cxlm->mbox.regs ? "mbox " : "",
> + !cxlm->mem.regs ? "mem" : "");
> + return -ENXIO;
> + }
> +
> + return 0;
> +}
> +
> +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> +{
> + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> +
> + cxlm->mbox.payload_size =
> + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> +
I think the static analyzers are not going to be happy that you are not
checking the value of `cap` before using it.
Perhaps you should check that first before doing the manipulations?
> + /* 8.2.8.4.3 */
> + if (cxlm->mbox.payload_size < 256) {
#define for 256?
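Something along these lines perhaps (the name is just a suggestion):

	/* Minimum mailbox payload size, per CXL 2.0 8.2.8.4.3 */
	#define CXL_MBOX_MIN_PAYLOAD_SIZE 256
	...
	if (cxlm->mbox.payload_size < CXL_MBOX_MIN_PAYLOAD_SIZE)
		return -ENXIO;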
On 21-02-01 12:41:36, Konrad Rzeszutek Wilk wrote:
> > +static int cxl_mem_setup_regs(struct cxl_mem *cxlm)
> > +{
> > + struct device *dev = &cxlm->pdev->dev;
> > + int cap, cap_count;
> > + u64 cap_array;
> > +
> > + cap_array = readq(cxlm->regs + CXLDEV_CAP_ARRAY_OFFSET);
> > + if (CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_ID) != CXLDEV_CAP_ARRAY_CAP_ID)
> > + return -ENODEV;
> > +
> > + cap_count = CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_COUNT);
> > +
> > + for (cap = 1; cap <= cap_count; cap++) {
> > + void __iomem *register_block;
> > + u32 offset;
> > + u16 cap_id;
> > +
> > + cap_id = readl(cxlm->regs + cap * 0x10) & 0xffff;
> > + offset = readl(cxlm->regs + cap * 0x10 + 0x4);
> > + register_block = cxlm->regs + offset;
> > +
> > + switch (cap_id) {
> > + case CXLDEV_CAP_CAP_ID_DEVICE_STATUS:
> > + dev_dbg(dev, "found Status capability (0x%x)\n",
> > + offset);
>
> That 80 character limit is no longer a requirement. Can you just make
> this one line? And perhaps change 'found' to 'Found' ?
>
Funny that.
https://lore.kernel.org/linux-cxl/[email protected]/
> > + cxlm->status.regs = register_block;
> > + break;
> > + case CXLDEV_CAP_CAP_ID_PRIMARY_MAILBOX:
> > + dev_dbg(dev, "found Mailbox capability (0x%x)\n",
> > + offset);
> > + cxlm->mbox.regs = register_block;
> > + break;
> > + case CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX:
> > + dev_dbg(dev,
> > + "found Secondary Mailbox capability (0x%x)\n",
> > + offset);
> > + break;
> > + case CXLDEV_CAP_CAP_ID_MEMDEV:
> > + dev_dbg(dev, "found Memory Device capability (0x%x)\n",
> > + offset);
> > + cxlm->mem.regs = register_block;
> > + break;
> > + default:
> > + dev_warn(dev, "Unknown cap ID: %d (0x%x)\n", cap_id,
> > + offset);
> > + break;
> > + }
> > + }
> > +
> > + if (!cxlm->status.regs || !cxlm->mbox.regs || !cxlm->mem.regs) {
> > + dev_err(dev, "registers not found: %s%s%s\n",
> > + !cxlm->status.regs ? "status " : "",
> > + !cxlm->mbox.regs ? "mbox " : "",
> > + !cxlm->mem.regs ? "mem" : "");
> > + return -ENXIO;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > +{
> > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > +
> > + cxlm->mbox.payload_size =
> > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > +
>
> I think the static analyzers are not going to be happy that you are not
> checking the value of `cap` before using it.
>
> Perhaps you should check that first before doing the manipulations?
>
I'm not following the request. CXL_GET_FIELD is just doing the shift and mask on
cap.
Can you explain what you're hoping to see?
> > + /* 8.2.8.4.3 */
> > + if (cxlm->mbox.payload_size < 256) {
>
> #define for 256?
> +#define cxl_doorbell_busy(cxlm) \
> + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
> + CXLDEV_MB_CTRL_DOORBELL)
> +
> +#define CXL_MAILBOX_TIMEOUT_US 2000
You've been using the spec for the values. Is that number also from it?
> +
> +enum opcode {
> + CXL_MBOX_OP_IDENTIFY = 0x4000,
> + CXL_MBOX_OP_MAX = 0x10000
> +};
> +
> +/**
> + * struct mbox_cmd - A command to be submitted to hardware.
> + * @opcode: (input) The command set and command submitted to hardware.
> + * @payload_in: (input) Pointer to the input payload.
> + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> + * the caller.
> + * @size_in: (input) Number of bytes to load from @payload.
> + * @size_out: (output) Number of bytes loaded into @payload.
> + * @return_code: (output) Error code returned from hardware.
> + *
> + * This is the primary mechanism used to send commands to the hardware.
> + * All the fields except @payload_* correspond exactly to the fields described in
> + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> + * @payload_out are written to, and read from the Command Payload Registers
> + * defined in (8.2.8.4.8).
> + */
> +struct mbox_cmd {
> + u16 opcode;
> + void *payload_in;
> + void *payload_out;
On a 32-bit OS (not that we use those much anymore, but let's assume
someone really wants to), a void pointer is 4 bytes, while on 64-bit it is
8 bytes.
`pahole` is your friend as I think there is a gap between opcode and
payload_in in the structure.
> + size_t size_in;
> + size_t size_out;
And those can also change depending on 32-bit/64-bit.
> + u16 return_code;
> +#define CXL_MBOX_SUCCESS 0
> +};
Do you want to use __packed to match with the spec?
Ah, reading later you don't care about it.
In that case may I recommend you move 'return_code' (or perhaps just
call it rc?) to be right after opcode? Fewer gaps in that structure.
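Roughly what I mean (sketch only; on 64-bit this leaves one 4-byte hole after
the two u16s instead of two larger ones):

	struct mbox_cmd {
		u16 opcode;
		u16 return_code;
	#define CXL_MBOX_SUCCESS 0
		void *payload_in;
		void *payload_out;
		size_t size_in;
		size_t size_out;
	};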
> +
> +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> +{
> + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> + const unsigned long start = jiffies;
> + unsigned long end = start;
> +
> + while (cxl_doorbell_busy(cxlm)) {
> + end = jiffies;
> +
> + if (time_after(end, start + timeout)) {
> + /* Check again in case preempted before timeout test */
> + if (!cxl_doorbell_busy(cxlm))
> + break;
> + return -ETIMEDOUT;
> + }
> + cpu_relax();
> + }
Hm, that is not very scheduler friendly. I mean we are sitting here for
2000us (2 ms) - that is quite the amount of time spinning.
Should this perhaps be put in a workqueue?
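Even just sleeping instead of cpu_relax() would be friendlier, say (sketch only,
the usleep_range() values are picked out of thin air):

	while (cxl_doorbell_busy(cxlm)) {
		end = jiffies;

		if (time_after(end, start + timeout)) {
			/* Check again in case we slept past the timeout */
			if (!cxl_doorbell_busy(cxlm))
				break;
			return -ETIMEDOUT;
		}
		usleep_range(100, 200);
	}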
> +
> + dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
> + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> + return 0;
> +}
> +
> +static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
> + struct mbox_cmd *mbox_cmd)
> +{
> + dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
> + dev_info(&cxlm->pdev->dev,
> + "\topcode: 0x%04x\n"
> + "\tpayload size: %zub\n",
> + mbox_cmd->opcode, mbox_cmd->size_in);
> +
> + if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
> + print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
> + mbox_cmd->payload_in, mbox_cmd->size_in,
> + true);
> + }
> +
> + /* Here's a good place to figure out if a device reset is needed */
> +}
> +
> +/**
> + * cxl_mem_mbox_send_cmd() - Send a mailbox command to a memory device.
> + * @cxlm: The CXL memory device to communicate with.
> + * @mbox_cmd: Command to send to the memory device.
> + *
> + * Context: Any context. Expects mbox_lock to be held.
> + * Return: -ETIMEDOUT if timeout occurred waiting for completion. 0 on success.
> + * Caller should check the return code in @mbox_cmd to make sure it
> + * succeeded.
> + *
> + * This is a generic form of the CXL mailbox send command, thus the only I/O
> + * operations used are cxl_read_mbox_reg(). Memory devices, and perhaps other
> + * types of CXL devices may have further information available upon error
> + * conditions.
> + *
> + * The CXL spec allows for up to two mailboxes. The intention is for the primary
> + * mailbox to be OS controlled and the secondary mailbox to be used by system
> + * firmware. This allows the OS and firmware to communicate with the device and
> + * not need to coordinate with each other. The driver only uses the primary
> + * mailbox.
> + */
> +static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
> + struct mbox_cmd *mbox_cmd)
> +{
> + void __iomem *payload = cxlm->mbox.regs + CXLDEV_MB_PAYLOAD_OFFSET;
> + u64 cmd_reg, status_reg;
> + size_t out_len;
> + int rc;
> +
> + lockdep_assert_held(&cxlm->mbox.mutex);
> +
> + /*
> + * Here are the steps from 8.2.8.4 of the CXL 2.0 spec.
> + * 1. Caller reads MB Control Register to verify doorbell is clear
> + * 2. Caller writes Command Register
> + * 3. Caller writes Command Payload Registers if input payload is non-empty
> + * 4. Caller writes MB Control Register to set doorbell
> + * 5. Caller either polls for doorbell to be clear or waits for interrupt if configured
> + * 6. Caller reads MB Status Register to fetch Return code
> + * 7. If command successful, Caller reads Command Register to get Payload Length
> + * 8. If output payload is non-empty, host reads Command Payload Registers
> + *
> + * Hardware is free to do whatever it wants before the doorbell is
> + * rung, and isn't allowed to change anything after it clears the
> + * doorbell. As such, steps 2 and 3 can happen in any order, and steps
> + * 6, 7, 8 can also happen in any order (though some orders might not
> + * make sense).
> + */
> +
> + /* #1 */
> + if (cxl_doorbell_busy(cxlm)) {
> + dev_err_ratelimited(&cxlm->pdev->dev,
> + "Mailbox re-busy after acquiring\n");
> + return -EBUSY;
> + }
> +
> + cmd_reg = CXL_SET_FIELD(mbox_cmd->opcode, CXLDEV_MB_CMD_COMMAND_OPCODE);
> + if (mbox_cmd->size_in) {
> + if (WARN_ON(!mbox_cmd->payload_in))
> + return -EINVAL;
> +
> + cmd_reg |= CXL_SET_FIELD(mbox_cmd->size_in,
> + CXLDEV_MB_CMD_PAYLOAD_LENGTH);
> + memcpy_toio(payload, mbox_cmd->payload_in, mbox_cmd->size_in);
> + }
> +
> + /* #2, #3 */
> + cxl_write_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET, cmd_reg);
> +
> + /* #4 */
> + dev_dbg(&cxlm->pdev->dev, "Sending command\n");
> + cxl_write_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET,
> + CXLDEV_MB_CTRL_DOORBELL);
> +
> + /* #5 */
> + rc = cxl_mem_wait_for_doorbell(cxlm);
> + if (rc == -ETIMEDOUT) {
> + cxl_mem_mbox_timeout(cxlm, mbox_cmd);
> + return rc;
> + }
> +
> + /* #6 */
> + status_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_STATUS_OFFSET);
> + mbox_cmd->return_code =
> + CXL_GET_FIELD(status_reg, CXLDEV_MB_STATUS_RET_CODE);
> +
> + if (mbox_cmd->return_code != 0) {
> + dev_dbg(&cxlm->pdev->dev, "Mailbox operation had an error\n");
> + return 0;
> + }
> +
> + /* #7 */
> + cmd_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET);
> + out_len = CXL_GET_FIELD(cmd_reg, CXLDEV_MB_CMD_PAYLOAD_LENGTH);
> +
> + /* #8 */
> + if (out_len && mbox_cmd->payload_out)
> + memcpy_fromio(mbox_cmd->payload_out, payload, out_len);
> +
> + mbox_cmd->size_out = out_len;
> +
> + return 0;
> +}
> +
> +/**
> + * cxl_mem_mbox_get() - Acquire exclusive access to the mailbox.
> + * @cxlm: The memory device to gain access to.
> + *
> + * Context: Any context. Takes the mbox_lock.
> + * Return: 0 if exclusive access was acquired.
> + */
> +static int cxl_mem_mbox_get(struct cxl_mem *cxlm)
> +{
> + struct device *dev = &cxlm->pdev->dev;
> + int rc = -EBUSY;
> + u64 md_status;
> +
> + mutex_lock_io(&cxlm->mbox.mutex);
> +
> + /*
> + * XXX: There is some amount of ambiguity in the 2.0 version of the spec
> + * around the mailbox interface ready (8.2.8.5.1.1). The purpose of the
> + * bit is to allow firmware running on the device to notify the driver
> + * that it's ready to receive commands. It is unclear if the bit needs
> + * to be read for each transaction mailbox, ie. the firmware can switch
> + * it on and off as needed. Second, there is no defined timeout for
> + * mailbox ready, like there is for the doorbell interface.
> + *
> + * Assumptions:
> + * 1. The firmware might toggle the Mailbox Interface Ready bit, check
> + * it for every command.
> + *
> + * 2. If the doorbell is clear, the firmware should have first set the
> + * Mailbox Interface Ready bit. Therefore, waiting for the doorbell
> + * to be ready is sufficient.
> + */
> + rc = cxl_mem_wait_for_doorbell(cxlm);
> + if (rc) {
> + dev_warn(dev, "Mailbox interface not ready\n");
> + goto out;
> + }
> +
> + md_status = cxl_read_mem_reg64(cxlm, CXLMDEV_STATUS_OFFSET);
> + if (!(md_status & CXLMDEV_MBOX_IF_READY && CXLMDEV_READY(md_status))) {
> + dev_err(dev,
> + "mbox: reported doorbell ready, but not mbox ready\n");
You can make that one line.
> + goto out;
> + }
> +
> + /*
> + * Hardware shouldn't allow a ready status but also have failure bits
> + * set. Spit out an error, this should be a bug report
> + */
> + rc = -EFAULT;
Should these include more details? As in a dump of other registers to
help debug in the field why the device is busted?
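E.g. folding the status register into the message (sketch):

	dev_err(dev, "mbox: reported ready, but fatal (status %#llx)\n",
		md_status);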
> + if (md_status & CXLMDEV_DEV_FATAL) {
> + dev_err(dev, "mbox: reported ready, but fatal\n");
> + goto out;
> + }
> + if (md_status & CXLMDEV_FW_HALT) {
> + dev_err(dev, "mbox: reported ready, but halted\n");
> + goto out;
> + }
> + if (CXLMDEV_RESET_NEEDED(md_status)) {
> + dev_err(dev, "mbox: reported ready, but reset needed\n");
> + goto out;
> + }
> +
> + /* with lock held */
> + return 0;
> +
> +out:
> + mutex_unlock(&cxlm->mbox.mutex);
> + return rc;
> +}
> +
> +/**
> + * cxl_mem_mbox_put() - Release exclusive access to the mailbox.
> + * @cxlm: The CXL memory device to communicate with.
> + *
> + * Context: Any context. Expects mbox_lock to be held.
> + */
> +static void cxl_mem_mbox_put(struct cxl_mem *cxlm)
> +{
> + mutex_unlock(&cxlm->mbox.mutex);
> +}
> +
> /**
> * cxl_mem_setup_regs() - Setup necessary MMIO.
> * @cxlm: The CXL memory device to communicate with.
> @@ -142,6 +406,8 @@ static struct cxl_mem *cxl_mem_create(struct pci_dev *pdev, u32 reg_lo,
> return NULL;
> }
>
> + mutex_init(&cxlm->mbox.mutex);
> +
> regs = pcim_iomap_table(pdev)[bar];
> cxlm->pdev = pdev;
> cxlm->regs = regs + offset;
> @@ -174,6 +440,76 @@ static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> return 0;
> }
>
> +/**
> + * cxl_mem_identify() - Send the IDENTIFY command to the device.
> + * @cxlm: The device to identify.
> + *
> + * Return: 0 if identify was executed successfully.
> + *
> + * This will dispatch the identify command to the device and on success populate
> + * structures to be exported to sysfs.
> + */
> +static int cxl_mem_identify(struct cxl_mem *cxlm)
> +{
> + struct cxl_mbox_identify {
> + char fw_revision[0x10];
> + __le64 total_capacity;
> + __le64 volatile_capacity;
> + __le64 persistent_capacity;
> + __le64 partition_align;
> + __le16 info_event_log_size;
> + __le16 warning_event_log_size;
> + __le16 failure_event_log_size;
> + __le16 fatal_event_log_size;
> + __le32 lsa_size;
> + u8 poison_list_max_mer[3];
> + __le16 inject_poison_limit;
> + u8 poison_caps;
> + u8 qos_telemetry_caps;
> + } __packed id;
> + struct mbox_cmd mbox_cmd;
> + int rc;
> +
> + /* Retrieve initial device memory map */
> + rc = cxl_mem_mbox_get(cxlm);
> + if (rc)
> + return rc;
> +
> + mbox_cmd = (struct mbox_cmd){
> + .opcode = CXL_MBOX_OP_IDENTIFY,
> + .payload_out = &id,
> + .size_in = 0,
> + };
> + rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
> + cxl_mem_mbox_put(cxlm);
> + if (rc)
> + return rc;
> +
> + /* TODO: Handle retry or reset responses from firmware. */
> + if (mbox_cmd.return_code != CXL_MBOX_SUCCESS) {
> + dev_err(&cxlm->pdev->dev, "Mailbox command failed (%d)\n",
> + mbox_cmd.return_code);
> + return -ENXIO;
> + }
> +
> + if (mbox_cmd.size_out != sizeof(id))
> + return -ENXIO;
> +
> + /*
> + * TODO: enumerate DPA map, as 'ram' and 'pmem' do not alias.
??? Not sure I understand.
> + * For now, only the capacity is exported in sysfs
> + */
> + cxlm->ram.range.start = 0;
> + cxlm->ram.range.end = le64_to_cpu(id.volatile_capacity) - 1;
> +
> + cxlm->pmem.range.start = 0;
> + cxlm->pmem.range.end = le64_to_cpu(id.persistent_capacity) - 1;
> +
> + memcpy(cxlm->firmware_version, id.fw_revision, sizeof(id.fw_revision));
> +
> + return rc;
> +}
> +
> static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> {
> struct device *dev = &pdev->dev;
> @@ -219,7 +555,11 @@ static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> - return cxl_mem_setup_mailbox(cxlm);
> + rc = cxl_mem_setup_mailbox(cxlm);
> + if (rc)
> + return rc;
> +
> + return cxl_mem_identify(cxlm);
> }
>
> static const struct pci_device_id cxl_mem_pci_tbl[] = {
> --
> 2.30.0
>
On Mon, Feb 01, 2021 at 09:50:41AM -0800, Ben Widawsky wrote:
> On 21-02-01 12:41:36, Konrad Rzeszutek Wilk wrote:
> > > +static int cxl_mem_setup_regs(struct cxl_mem *cxlm)
> > > +{
> > > + struct device *dev = &cxlm->pdev->dev;
> > > + int cap, cap_count;
> > > + u64 cap_array;
> > > +
> > > + cap_array = readq(cxlm->regs + CXLDEV_CAP_ARRAY_OFFSET);
> > > + if (CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_ID) != CXLDEV_CAP_ARRAY_CAP_ID)
> > > + return -ENODEV;
> > > +
> > > + cap_count = CXL_GET_FIELD(cap_array, CXLDEV_CAP_ARRAY_COUNT);
> > > +
> > > + for (cap = 1; cap <= cap_count; cap++) {
> > > + void __iomem *register_block;
> > > + u32 offset;
> > > + u16 cap_id;
> > > +
> > > + cap_id = readl(cxlm->regs + cap * 0x10) & 0xffff;
> > > + offset = readl(cxlm->regs + cap * 0x10 + 0x4);
> > > + register_block = cxlm->regs + offset;
> > > +
> > > + switch (cap_id) {
> > > + case CXLDEV_CAP_CAP_ID_DEVICE_STATUS:
> > > + dev_dbg(dev, "found Status capability (0x%x)\n",
> > > + offset);
> >
> > That 80 character limit is no longer a requirement. Can you just make
> > this one line? And perhaps change 'found' to 'Found' ?
> >
>
> Funny that.
> https://lore.kernel.org/linux-cxl/[email protected]/
"If there is a good reason to go against the
style (a line which becomes far less readable if split to fit within the
80-column limit, for example), just do it.
"
I would say that having an offset on its own line is kind of silly.
>
> > > + cxlm->status.regs = register_block;
> > > + break;
> > > + case CXLDEV_CAP_CAP_ID_PRIMARY_MAILBOX:
> > > + dev_dbg(dev, "found Mailbox capability (0x%x)\n",
> > > + offset);
> > > + cxlm->mbox.regs = register_block;
> > > + break;
> > > + case CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX:
> > > + dev_dbg(dev,
> > > + "found Secondary Mailbox capability (0x%x)\n",
> > > + offset);
> > > + break;
> > > + case CXLDEV_CAP_CAP_ID_MEMDEV:
> > > + dev_dbg(dev, "found Memory Device capability (0x%x)\n",
> > > + offset);
> > > + cxlm->mem.regs = register_block;
> > > + break;
> > > + default:
> > > + dev_warn(dev, "Unknown cap ID: %d (0x%x)\n", cap_id,
> > > + offset);
> > > + break;
> > > + }
> > > + }
> > > +
> > > + if (!cxlm->status.regs || !cxlm->mbox.regs || !cxlm->mem.regs) {
> > > + dev_err(dev, "registers not found: %s%s%s\n",
> > > + !cxlm->status.regs ? "status " : "",
> > > + !cxlm->mbox.regs ? "mbox " : "",
> > > + !cxlm->mem.regs ? "mem" : "");
> > > + return -ENXIO;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > +{
> > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > +
> > > + cxlm->mbox.payload_size =
> > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > +
> >
> > I think the static analyzers are not going to be happy that you are not
> > checking the value of `cap` before using it.
> >
> > Perhaps you should check that first before doing the manipulations?
> >
>
> I'm not following the request. CXL_GET_FIELD is just doing the shift and mask on
> cap.
>
> Can you explain what you're hoping to see?
My thought was that if cxl_read_mbox_reg32 gave you -1 (which would be wacky,
but it is healthy to assume devices won't always give sane values), then your
payload_size bit shifting gets bent out of shape: cap is effectively treated as
unsigned, so it ends up as 0xffffffff.., and the shift leaves you with a bunch
of zeros at the end.
>
> > > + /* 8.2.8.4.3 */
> > > + if (cxlm->mbox.payload_size < 256) {
If this ends up being cast back to signed, this conditional will
catch it (-4096 < 256, true).
But if it does not (so 0xffffffff < 256, false), then this wacky
number will pass this check and you may reference a payload_size that is
larger than reality and copy the wrong set of values (buffer
overflow).
So what I was thinking is that you want to check `cap` to make sure it
is neither negative nor too large?
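So something like (sketch; the upper bound is whatever the spec allows, 1MB if
I read 8.2.8.4.3 right):

	if (cxlm->mbox.payload_size < 256 ||
	    cxlm->mbox.payload_size > SZ_1M) {
		dev_err(&cxlm->pdev->dev, "Invalid payload size (%zub)",
			cxlm->mbox.payload_size);
		return -ENXIO;
	}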
> >
> > #define for 256?
On Fri, Jan 29, 2021 at 04:24:32PM -0800, Ben Widawsky wrote:
> For drivers that moderate access to the underlying hardware it is
> sometimes desirable to allow userspace to bypass restrictions. Once
> userspace has done this, the driver can no longer guarantee the sanctity
> of either the OS or the hardware. When in this state, it is helpful for
> kernel developers to be made aware (via this taint flag) of this fact
> for subsequent bug reports.
>
> Example usage:
> - Hardware xyzzy accepts 2 commands, waldo and fred.
> - The xyzzy driver provides an interface for using waldo, but not fred.
> - quux is convinced they really need the fred command.
> - xyzzy driver allows quux to frob hardware to initiate fred.
Would it not be easier to _not_ frob the hardware for fred-operation?
Aka not implement it or just disallow in the first place?
> - kernel gets tainted.
> - turns out fred command is borked, and scribbles over memory.
> - developers laugh while closing quux's subsequent bug report.
Yeah, good luck with that theory in the field. The customer won't
care about this and will demand a solution for doing fred-operation.
Just easier to not do fred-operation in the first place, no?
> +/**
> + * struct cxl_send_command - Send a command to a memory device.
> + * @id: The command to send to the memory device. This must be one of the
> + * commands returned by the query command.
> + * @flags: Flags for the command (input).
> + * @rsvd: Must be zero.
> + * @retval: Return value from the memory device (output).
> + * @size_in: Size of the payload to provide to the device (input).
> + * @size_out: Size of the payload received from the device (input/output). This
> + * field is filled in by userspace to let the driver know how much
> + * space was allocated for output. It is populated by the driver to
> + * let userspace know how large the output payload actually was.
> + * @in_payload: Pointer to memory for payload input (little endian order).
> + * @out_payload: Pointer to memory for payload output (little endian order).
> + *
> + * Mechanism for userspace to send a command to the hardware for processing. The
> + * driver will do basic validation on the command sizes. In some cases even the
> + * payload may be introspected. Userspace is required to allocate large
> + * enough buffers for size_out which can be variable length in certain
> + * situations.
> + */
I think this structure (and running `pahole` on it would help confirm) has
some gaps in it:
> +struct cxl_send_command {
> + __u32 id;
> + __u32 flags;
> + __u32 rsvd;
> + __u32 retval;
> +
> + struct {
> + __s32 size_in;
Here.. Maybe just add:
__u32 rsv_2;
> + __u64 in_payload;
> + };
> +
> + struct {
> + __s32 size_out;
And here. Maybe just add:
__u32 rsv_2;
> + __u64 out_payload;
> + };
> +};
Perhaps to prepare for the future where this may need to be expanded, you
could add a size at the start of the structure, and
maybe which version of the structure it is?
Maybe do that for all the new structs you are adding?
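Roughly what I have in mind (field names made up; the reserved __u32s close the
gaps in front of the __u64s):

	struct cxl_send_command {
		__u32 size;	/* sizeof() of the structure userspace passes in */
		__u32 version;	/* bump when the layout changes */
		__u32 id;
		__u32 flags;
		__u32 rsvd;
		__u32 retval;

		struct {
			__s32 size_in;
			__u32 rsvd2;
			__u64 in_payload;
		};

		struct {
			__s32 size_out;
			__u32 rsvd3;
			__u64 out_payload;
		};
	};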
On Fri, Jan 29, 2021 at 04:24:33PM -0800, Ben Widawsky wrote:
> The CXL memory device send interface will have a number of supported
> commands. The raw command is not such a command. Raw commands allow
> userspace to send a specified opcode to the underlying hardware and
> bypass all driver checks on the command. This is useful for a couple of
> usecases, mainly:
> 1. Undocumented vendor specific hardware commands
> 2. Prototyping new hardware commands not yet supported by the driver
This sounds like a recipe for ..
In case you really, really want this, may I recommend you do a few things:
- Wrap this whole thing with #ifdef
CONFIG_CXL_DEBUG_THIS_WILL_DESTROY_YOUR_LIFE
(or something equivalent to make it clear this should never be
enabled in production kernels).
- Add a nice big fat printk in dmesg telling the user that they
are creating an unstable parallel universe that will lead to their
blood pressure going sky-high, or perhaps something more professional
sounding.
- Rethink this. Do you really really want to encourage vendors
to use this raw API instead of them using the proper APIs?
>
> While this all sounds very powerful it comes with a couple of caveats:
> 1. Bug reports using raw commands will not get the same level of
> attention as bug reports using supported commands (via taint).
> 2. Supported commands will be rejected by the RAW command.
>
> With this comes new debugfs knob to allow full access to your toes with
> your weapon of choice.
The problem is that debugfs is no longer "debug" but is enabled in
production kernels.
On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> The Get Log command returns the actual log entries that are advertised
> via the Get Supported Logs command (0400h). CXL device logs are selected
> by UUID which is part of the CXL spec. Because the driver tries to
> sanitize what is sent to hardware, there becomes a need to restrict the
> types of logs which can be accessed by userspace. For example, the
> vendor specific log might only be consumable by proprietary, or offline
> applications, and therefore a good candidate for userspace.
>
> The current driver infrastructure does allow basic validation for all
> commands, but doesn't inspect any of the payload data. Along with Get
> Log support comes new infrastructure to add a hook for payload
> validation. This infrastructure is used to filter out the CEL UUID,
> which the userspace driver doesn't have business knowing, and taints on
> invalid UUIDs being sent to hardware.
Perhaps a better option is to reject invalid UUIDs?
And if you really really want to use invalid UUIDs then:
1) Wrap that code in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?
2) Wrap it with lockdown code so that you can't do this at all
when in LOCKDOWN_INTEGRITY or such?
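For 2), something along these lines (LOCKDOWN_PCI_ACCESS is just a stand-in; a
more specific reason may fit better):

	if (security_locked_down(LOCKDOWN_PCI_ACCESS))
		return -EPERM;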
>
> Signed-off-by: Ben Widawsky <[email protected]>
> ---
> drivers/cxl/mem.c | 42 +++++++++++++++++++++++++++++++++++-
> include/uapi/linux/cxl_mem.h | 1 +
> 2 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index b8ca6dff37b5..086268f1dd6c 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -119,6 +119,8 @@ static const uuid_t log_uuid[] = {
> 0x07, 0x19, 0x40, 0x3d, 0x86)
> };
>
> +static int validate_log_uuid(void __user *payload, size_t size);
> +
> /**
> * struct cxl_mem_command - Driver representation of a memory device command
> * @info: Command information as it exists for the UAPI
> @@ -132,6 +134,10 @@ static const uuid_t log_uuid[] = {
> * * %CXL_CMD_INTERNAL_FLAG_PSEUDO: This is a pseudo command which doesn't have
> * a direct mapping to hardware. They are implicitly always enabled.
> *
> + * @validate_payload: A function called after the command is validated but
> + * before it's sent to the hardware. The primary purpose is to validate, or
> + * fixup the actual payload.
> + *
> * The cxl_mem_command is the driver's internal representation of commands that
> * are supported by the driver. Some of these commands may not be supported by
> * the hardware. The driver will use @info to validate the fields passed in by
> @@ -147,9 +153,11 @@ struct cxl_mem_command {
> #define CXL_CMD_INTERNAL_FLAG_HIDDEN BIT(0)
> #define CXL_CMD_INTERNAL_FLAG_MANDATORY BIT(1)
> #define CXL_CMD_INTERNAL_FLAG_PSEUDO BIT(2)
> +
> + int (*validate_payload)(void __user *payload, size_t size);
> };
>
> -#define CXL_CMD(_id, _flags, sin, sout, f) \
> +#define CXL_CMD_VALIDATE(_id, _flags, sin, sout, f, v) \
> [CXL_MEM_COMMAND_ID_##_id] = { \
> .info = { \
> .id = CXL_MEM_COMMAND_ID_##_id, \
> @@ -159,8 +167,12 @@ struct cxl_mem_command {
> }, \
> .flags = CXL_CMD_INTERNAL_FLAG_##f, \
> .opcode = CXL_MBOX_OP_##_id, \
> + .validate_payload = v, \
> }
>
> +#define CXL_CMD(_id, _flags, sin, sout, f) \
> + CXL_CMD_VALIDATE(_id, _flags, sin, sout, f, NULL)
> +
> /*
> * This table defines the supported mailbox commands for the driver. This table
> * is made up of a UAPI structure. Non-negative values as parameters in the
> @@ -176,6 +188,8 @@ static struct cxl_mem_command mem_commands[] = {
> CXL_CMD(GET_PARTITION_INFO, NONE, 0, 0x20, NONE),
> CXL_CMD(GET_LSA, NONE, 0x8, ~0, MANDATORY),
> CXL_CMD(GET_HEALTH_INFO, NONE, 0, 0x12, MANDATORY),
> + CXL_CMD_VALIDATE(GET_LOG, MUTEX, 0x18, ~0, MANDATORY,
> + validate_log_uuid),
> };
>
> /*
> @@ -563,6 +577,13 @@ static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
> kvzalloc(cxlm->mbox.payload_size, GFP_KERNEL);
>
> if (cmd->info.size_in) {
> + if (cmd->validate_payload) {
> + rc = cmd->validate_payload(u64_to_user_ptr(in_payload),
> + cmd->info.size_in);
> + if (rc)
> + goto out;
> + }
> +
> mbox_cmd.payload_in = kvzalloc(cmd->info.size_in, GFP_KERNEL);
> if (!mbox_cmd.payload_in) {
> rc = -ENOMEM;
> @@ -1205,6 +1226,25 @@ struct cxl_mbox_get_log {
> __le32 length;
> } __packed;
>
> +static int validate_log_uuid(void __user *input, size_t size)
> +{
> + struct cxl_mbox_get_log __user *get_log = input;
> + uuid_t payload_uuid;
> +
> + if (copy_from_user(&payload_uuid, &get_log->uuid, sizeof(uuid_t)))
> + return -EFAULT;
> +
> + /* All unspec'd logs shall taint */
> + if (uuid_equal(&payload_uuid, &log_uuid[CEL_UUID]))
> + return 0;
> + if (uuid_equal(&payload_uuid, &log_uuid[DEBUG_UUID]))
> + return 0;
> +
> + add_taint(TAINT_RAW_PASSTHROUGH, LOCKDEP_STILL_OK);
> +
> + return 0;
> +}
> +
> static int cxl_xfer_log(struct cxl_mem *cxlm, uuid_t *uuid, u32 size, u8 *out)
> {
> u32 remaining = size;
> diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
> index 766c231d6150..7cdc7f7ce7ec 100644
> --- a/include/uapi/linux/cxl_mem.h
> +++ b/include/uapi/linux/cxl_mem.h
> @@ -39,6 +39,7 @@ extern "C" {
> ___C(GET_PARTITION_INFO, "Get Partition Information"), \
> ___C(GET_LSA, "Get Label Storage Area"), \
> ___C(GET_HEALTH_INFO, "Get Health Info"), \
> + ___C(GET_LOG, "Get Log"), \
> ___C(MAX, "Last command")
>
> #define ___C(a, b) CXL_MEM_COMMAND_ID_##a
> --
> 2.30.0
>
On 21-02-01 13:18:45, Konrad Rzeszutek Wilk wrote:
> On Fri, Jan 29, 2021 at 04:24:32PM -0800, Ben Widawsky wrote:
> > For drivers that moderate access to the underlying hardware it is
> > sometimes desirable to allow userspace to bypass restrictions. Once
> > userspace has done this, the driver can no longer guarantee the sanctity
> > of either the OS or the hardware. When in this state, it is helpful for
> > kernel developers to be made aware (via this taint flag) of this fact
> > for subsequent bug reports.
> >
> > Example usage:
> > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > - The xyzzy driver provides an interface for using waldo, but not fred.
> > - quux is convinced they really need the fred command.
> > - xyzzy driver allows quux to frob hardware to initiate fred.
>
> Would it not be easier to _not_ frob the hardware for fred-operation?
> Aka not implement it or just disallow in the first place?
Yeah. So the idea is you either are in a transient phase of the command and some
future kernel will have real support for fred - or a vendor is being short
sighted and not adding support for fred.
>
>
> > - kernel gets tainted.
> > - turns out fred command is borked, and scribbles over memory.
> > - developers laugh while closing quux's subsequent bug report.
>
> Yeah, good luck with that theory in the field. The customer won't
> care about this and will demand a solution for doing fred-operation.
>
> Just easier to not do fred-operation in the first place, no?
The short answer is, in an ideal world you are correct. See nvdimm as an example
of the real world.
The longer answer. Unless we want to wait until we have all the hardware we're
ever going to see, it's impossible to have a fully baked, and validated
interface. The RAW interface is my admission that I make no guarantees about
being able to provide the perfect interface and giving the power back to the
hardware vendors and their driver writers.
As an example, suppose a vendor shipped a device with their special vendor
opcode. They can enable their customers to use that opcode on any driver
version. That seems pretty powerful and worthwhile to me.
Or a more realistic example, we ship a driver that adds a command which is
totally broken. Customers can utilize the RAW interface until it gets fixed in a
subsequent release which might be quite a ways out.
I'll say the RAW interface isn't an encouraged usage, but it's one that I expect
to be needed, and if it's not we can always try to kill it later. If nobody is
actually using it, nobody will complain, right :D
On Mon, Feb 1, 2021 at 10:35 AM Ben Widawsky <[email protected]> wrote:
>
> On 21-02-01 13:18:45, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jan 29, 2021 at 04:24:32PM -0800, Ben Widawsky wrote:
> > > For drivers that moderate access to the underlying hardware it is
> > > sometimes desirable to allow userspace to bypass restrictions. Once
> > > userspace has done this, the driver can no longer guarantee the sanctity
> > > of either the OS or the hardware. When in this state, it is helpful for
> > > kernel developers to be made aware (via this taint flag) of this fact
> > > for subsequent bug reports.
> > >
> > > Example usage:
> > > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > > - The xyzzy driver provides an interface for using waldo, but not fred.
> > > - quux is convinced they really need the fred command.
> > > - xyzzy driver allows quux to frob hardware to initiate fred.
> >
> > Would it not be easier to _not_ frob the hardware for fred-operation?
> > Aka not implement it or just disallow in the first place?
>
> Yeah. So the idea is you either are in a transient phase of the command and some
> future kernel will have real support for fred - or a vendor is being short
> sighted and not adding support for fred.
>
> >
> >
> > > - kernel gets tainted.
> > > - turns out fred command is borked, and scribbles over memory.
> > > - developers laugh while closing quux's subsequent bug report.
> >
> > Yeah, good luck with that theory in the field. The customer won't
> > care about this and will demand a solution for doing fred-operation.
> >
> > Just easier to not do fred-operation in the first place, no?
>
> The short answer is, in an ideal world you are correct. See nvdimm as an example
> of the real world.
>
> The longer answer. Unless we want to wait until we have all the hardware we're
> ever going to see, it's impossible to have a fully baked, and validated
> interface. The RAW interface is my admission that I make no guarantees about
> being able to provide the perfect interface and giving the power back to the
> hardware vendors and their driver writers.
>
> As an example, suppose a vendor shipped a device with their special vendor
> opcode. They can enable their customers to use that opcode on any driver
> version. That seems pretty powerful and worthwhile to me.
>
Powerful, frightening, and questionably worthwhile when there are
already examples of commands that need extra coordination for whatever
reason. However, I still think the decision tilts towards allowing
this given ongoing spec work.
NVDIMM ended up allowing unfettered vendor passthrough given the lack
of an organizing body to unify vendors. CXL on the other hand appears
to have more gravity to keep vendors honest. A WARN splat with a
taint, and a debugfs knob for the truly problematic commands seems
sufficient protection of system integrity while still following the
Linux ethos of giving system owners enough rope to make their own
decisions.
> Or a more realistic example, we ship a driver that adds a command which is
> totally broken. Customers can utilize the RAW interface until it gets fixed in a
> subsequent release which might be quite a ways out.
>
> I'll say the RAW interface isn't an encouraged usage, but it's one that I expect
> to be needed, and if it's not we can always try to kill it later. If nobody is
> actually using it, nobody will complain, right :D
It might be worthwhile to make RAW support a compile time decision so
that Linux distros can only ship support for the commands the CXL
driver-dev community has blessed, but I'll leave it to a distro
developer to second that approach.
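Something like this, perhaps (config symbol name invented, just a sketch):

	static bool cxl_mem_raw_command_allowed(u16 opcode)
	{
		if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS))
			return false;

		/* ... per-opcode deny-list checks would go here ... */
		return true;
	}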
On 21-02-01 12:54:00, Konrad Rzeszutek Wilk wrote:
> > +#define cxl_doorbell_busy(cxlm) \
> > + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
> > + CXLDEV_MB_CTRL_DOORBELL)
> > +
> > +#define CXL_MAILBOX_TIMEOUT_US 2000
>
> You've been using the spec for the values. Is that number also from it?
>
Yes it is. I'll add a comment with the spec reference.
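Something like this (renaming it _MS while at it, since with msecs_to_jiffies()
it is really 2 seconds, not 2000us):

	/* CXL 2.0 - 8.2.8.4 */
	#define CXL_MAILBOX_TIMEOUT_MS 2000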
> > +
> > +enum opcode {
> > + CXL_MBOX_OP_IDENTIFY = 0x4000,
> > + CXL_MBOX_OP_MAX = 0x10000
> > +};
> > +
> > +/**
> > + * struct mbox_cmd - A command to be submitted to hardware.
> > + * @opcode: (input) The command set and command submitted to hardware.
> > + * @payload_in: (input) Pointer to the input payload.
> > + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> > + * the caller.
> > + * @size_in: (input) Number of bytes to load from @payload.
> > + * @size_out: (output) Number of bytes loaded into @payload.
> > + * @return_code: (output) Error code returned from hardware.
> > + *
> > + * This is the primary mechanism used to send commands to the hardware.
> > + * All the fields except @payload_* correspond exactly to the fields described in
> > + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> > + * @payload_out are written to, and read from the Command Payload Registers
> > + * defined in (8.2.8.4.8).
> > + */
> > +struct mbox_cmd {
> > + u16 opcode;
> > + void *payload_in;
> > + void *payload_out;
>
> On a 32-bit OS (not that we use those much anymore, but let's assume
> someone really wants to), a void pointer is 4 bytes, while on 64-bit it is
> 8 bytes.
>
> `pahole` is your friend as I think there is a gap between opcode and
> payload_in in the structure.
>
> > + size_t size_in;
> > + size_t size_out;
>
> And those can also change depending on 32-bit/64-bit.
>
> > + u16 return_code;
> > +#define CXL_MBOX_SUCCESS 0
> > +};
>
> Do you want to use __packed to match with the spec?
>
> Ah, reading later you don't care about it.
>
> In that case may I recommend you move 'return_code' (or perhaps just
> call it rc?) to be right after opcode? Fewer gaps in that structure.
>
I guess I hadn't realized we're supposed to try to fully pack structs by
default.
> > +
> > +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> > +{
> > + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> > + const unsigned long start = jiffies;
> > + unsigned long end = start;
> > +
> > + while (cxl_doorbell_busy(cxlm)) {
> > + end = jiffies;
> > +
> > + if (time_after(end, start + timeout)) {
> > + /* Check again in case preempted before timeout test */
> > + if (!cxl_doorbell_busy(cxlm))
> > + break;
> > + return -ETIMEDOUT;
> > + }
> > + cpu_relax();
> > + }
>
> Hm, that is not very scheduler friendly. I mean we are sitting here for
> 2000us (2 ms) - that is quite the amount of time spinning.
>
> Should this perhaps be put in a workqueue?
So let me first point you to the friendlier version which was shot down:
https://lore.kernel.org/linux-cxl/[email protected]/
I'm not opposed to this being moved to a workqueue at some point, but I think
that's unnecessary complexity currently. The reality is that it's expected that
commands will finish way sooner than this or be implemented as background
commands. I've heard a person who makes a lot of the spec decisions say, "if
it's 2 seconds, nobody will use these things".
I think adding the summary of this back and forth as a comment to the existing
code makes a lot of sense.
> > +
> > + dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
> > + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> > + return 0;
> > +}
> > +
> > +static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
> > + struct mbox_cmd *mbox_cmd)
> > +{
> > + dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
> > + dev_info(&cxlm->pdev->dev,
> > + "\topcode: 0x%04x\n"
> > + "\tpayload size: %zub\n",
> > + mbox_cmd->opcode, mbox_cmd->size_in);
> > +
> > + if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
> > + print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
> > + mbox_cmd->payload_in, mbox_cmd->size_in,
> > + true);
> > + }
> > +
> > + /* Here's a good place to figure out if a device reset is needed */
> > +}
> > +
> > +/**
> > + * cxl_mem_mbox_send_cmd() - Send a mailbox command to a memory device.
> > + * @cxlm: The CXL memory device to communicate with.
> > + * @mbox_cmd: Command to send to the memory device.
> > + *
> > + * Context: Any context. Expects mbox_lock to be held.
> > + * Return: -ETIMEDOUT if timeout occurred waiting for completion. 0 on success.
> > + * Caller should check the return code in @mbox_cmd to make sure it
> > + * succeeded.
> > + *
> > + * This is a generic form of the CXL mailbox send command, thus the only I/O
> > + * operations used are cxl_read_mbox_reg(). Memory devices, and perhaps other
> > + * types of CXL devices may have further information available upon error
> > + * conditions.
> > + *
> > + * The CXL spec allows for up to two mailboxes. The intention is for the primary
> > + * mailbox to be OS controlled and the secondary mailbox to be used by system
> > + * firmware. This allows the OS and firmware to communicate with the device and
> > + * not need to coordinate with each other. The driver only uses the primary
> > + * mailbox.
> > + */
> > +static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
> > + struct mbox_cmd *mbox_cmd)
> > +{
> > + void __iomem *payload = cxlm->mbox.regs + CXLDEV_MB_PAYLOAD_OFFSET;
> > + u64 cmd_reg, status_reg;
> > + size_t out_len;
> > + int rc;
> > +
> > + lockdep_assert_held(&cxlm->mbox.mutex);
> > +
> > + /*
> > + * Here are the steps from 8.2.8.4 of the CXL 2.0 spec.
> > + * 1. Caller reads MB Control Register to verify doorbell is clear
> > + * 2. Caller writes Command Register
> > + * 3. Caller writes Command Payload Registers if input payload is non-empty
> > + * 4. Caller writes MB Control Register to set doorbell
> > + * 5. Caller either polls for doorbell to be clear or waits for interrupt if configured
> > + * 6. Caller reads MB Status Register to fetch Return code
> > + * 7. If command successful, Caller reads Command Register to get Payload Length
> > + * 8. If output payload is non-empty, host reads Command Payload Registers
> > + *
> > + * Hardware is free to do whatever it wants before the doorbell is
> > + * rung, and isn't allowed to change anything after it clears the
> > + * doorbell. As such, steps 2 and 3 can happen in any order, and steps
> > + * 6, 7, 8 can also happen in any order (though some orders might not
> > + * make sense).
> > + */
> > +
> > + /* #1 */
> > + if (cxl_doorbell_busy(cxlm)) {
> > + dev_err_ratelimited(&cxlm->pdev->dev,
> > + "Mailbox re-busy after acquiring\n");
> > + return -EBUSY;
> > + }
> > +
> > + cmd_reg = CXL_SET_FIELD(mbox_cmd->opcode, CXLDEV_MB_CMD_COMMAND_OPCODE);
> > + if (mbox_cmd->size_in) {
> > + if (WARN_ON(!mbox_cmd->payload_in))
> > + return -EINVAL;
> > +
> > + cmd_reg |= CXL_SET_FIELD(mbox_cmd->size_in,
> > + CXLDEV_MB_CMD_PAYLOAD_LENGTH);
> > + memcpy_toio(payload, mbox_cmd->payload_in, mbox_cmd->size_in);
> > + }
> > +
> > + /* #2, #3 */
> > + cxl_write_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET, cmd_reg);
> > +
> > + /* #4 */
> > + dev_dbg(&cxlm->pdev->dev, "Sending command\n");
> > + cxl_write_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET,
> > + CXLDEV_MB_CTRL_DOORBELL);
> > +
> > + /* #5 */
> > + rc = cxl_mem_wait_for_doorbell(cxlm);
> > + if (rc == -ETIMEDOUT) {
> > + cxl_mem_mbox_timeout(cxlm, mbox_cmd);
> > + return rc;
> > + }
> > +
> > + /* #6 */
> > + status_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_STATUS_OFFSET);
> > + mbox_cmd->return_code =
> > + CXL_GET_FIELD(status_reg, CXLDEV_MB_STATUS_RET_CODE);
> > +
> > + if (mbox_cmd->return_code != 0) {
> > + dev_dbg(&cxlm->pdev->dev, "Mailbox operation had an error\n");
> > + return 0;
> > + }
> > +
> > + /* #7 */
> > + cmd_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET);
> > + out_len = CXL_GET_FIELD(cmd_reg, CXLDEV_MB_CMD_PAYLOAD_LENGTH);
> > +
> > + /* #8 */
> > + if (out_len && mbox_cmd->payload_out)
> > + memcpy_fromio(mbox_cmd->payload_out, payload, out_len);
> > +
> > + mbox_cmd->size_out = out_len;
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * cxl_mem_mbox_get() - Acquire exclusive access to the mailbox.
> > + * @cxlm: The memory device to gain access to.
> > + *
> > + * Context: Any context. Takes the mbox_lock.
> > + * Return: 0 if exclusive access was acquired.
> > + */
> > +static int cxl_mem_mbox_get(struct cxl_mem *cxlm)
> > +{
> > + struct device *dev = &cxlm->pdev->dev;
> > + int rc = -EBUSY;
> > + u64 md_status;
> > +
> > + mutex_lock_io(&cxlm->mbox.mutex);
> > +
> > + /*
> > + * XXX: There is some amount of ambiguity in the 2.0 version of the spec
> > + * around the mailbox interface ready (8.2.8.5.1.1). The purpose of the
> > + * bit is to allow firmware running on the device to notify the driver
> > + * that it's ready to receive commands. It is unclear if the bit needs
> > + * to be read for each transaction mailbox, ie. the firmware can switch
> > + * it on and off as needed. Second, there is no defined timeout for
> > + * mailbox ready, like there is for the doorbell interface.
> > + *
> > + * Assumptions:
> > + * 1. The firmware might toggle the Mailbox Interface Ready bit, check
> > + * it for every command.
> > + *
> > + * 2. If the doorbell is clear, the firmware should have first set the
> > + * Mailbox Interface Ready bit. Therefore, waiting for the doorbell
> > + * to be ready is sufficient.
> > + */
> > + rc = cxl_mem_wait_for_doorbell(cxlm);
> > + if (rc) {
> > + dev_warn(dev, "Mailbox interface not ready\n");
> > + goto out;
> > + }
> > +
> > + md_status = cxl_read_mem_reg64(cxlm, CXLMDEV_STATUS_OFFSET);
> > + if (!(md_status & CXLMDEV_MBOX_IF_READY && CXLMDEV_READY(md_status))) {
> > + dev_err(dev,
> > + "mbox: reported doorbell ready, but not mbox ready\n");
>
> You can make that one line.
> > + goto out;
> > + }
> > +
> > + /*
> > + * Hardware shouldn't allow a ready status but also have failure bits
> > + * set. Spit out an error, this should be a bug report
> > + */
> > + rc = -EFAULT;
>
> Should these include more details? As in a dump of other registers to
> help debug in the field why the device is busted?
>
We've discussed a bit the idea of having a general error state (my driver
background is in i915, where we did such a thing). I'm not opposed, but as the
error handling at this point is very minimal, I think it can wait until the next
round of patches for further enabling.
> > + if (md_status & CXLMDEV_DEV_FATAL) {
> > + dev_err(dev, "mbox: reported ready, but fatal\n");
> > + goto out;
> > + }
> > + if (md_status & CXLMDEV_FW_HALT) {
> > + dev_err(dev, "mbox: reported ready, but halted\n");
> > + goto out;
> > + }
> > + if (CXLMDEV_RESET_NEEDED(md_status)) {
> > + dev_err(dev, "mbox: reported ready, but reset needed\n");
> > + goto out;
> > + }
> > +
> > + /* with lock held */
> > + return 0;
> > +
> > +out:
> > + mutex_unlock(&cxlm->mbox.mutex);
> > + return rc;
> > +}
> > +
> > +/**
> > + * cxl_mem_mbox_put() - Release exclusive access to the mailbox.
> > + * @cxlm: The CXL memory device to communicate with.
> > + *
> > + * Context: Any context. Expects mbox_lock to be held.
> > + */
> > +static void cxl_mem_mbox_put(struct cxl_mem *cxlm)
> > +{
> > + mutex_unlock(&cxlm->mbox.mutex);
> > +}
> > +
> > /**
> > * cxl_mem_setup_regs() - Setup necessary MMIO.
> > * @cxlm: The CXL memory device to communicate with.
> > @@ -142,6 +406,8 @@ static struct cxl_mem *cxl_mem_create(struct pci_dev *pdev, u32 reg_lo,
> > return NULL;
> > }
> >
> > + mutex_init(&cxlm->mbox.mutex);
> > +
> > regs = pcim_iomap_table(pdev)[bar];
> > cxlm->pdev = pdev;
> > cxlm->regs = regs + offset;
> > @@ -174,6 +440,76 @@ static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> > return 0;
> > }
> >
> > +/**
> > + * cxl_mem_identify() - Send the IDENTIFY command to the device.
> > + * @cxlm: The device to identify.
> > + *
> > + * Return: 0 if identify was executed successfully.
> > + *
> > + * This will dispatch the identify command to the device and on success populate
> > + * structures to be exported to sysfs.
> > + */
> > +static int cxl_mem_identify(struct cxl_mem *cxlm)
> > +{
> > + struct cxl_mbox_identify {
> > + char fw_revision[0x10];
> > + __le64 total_capacity;
> > + __le64 volatile_capacity;
> > + __le64 persistent_capacity;
> > + __le64 partition_align;
> > + __le16 info_event_log_size;
> > + __le16 warning_event_log_size;
> > + __le16 failure_event_log_size;
> > + __le16 fatal_event_log_size;
> > + __le32 lsa_size;
> > + u8 poison_list_max_mer[3];
> > + __le16 inject_poison_limit;
> > + u8 poison_caps;
> > + u8 qos_telemetry_caps;
> > + } __packed id;
> > + struct mbox_cmd mbox_cmd;
> > + int rc;
> > +
> > + /* Retrieve initial device memory map */
> > + rc = cxl_mem_mbox_get(cxlm);
> > + if (rc)
> > + return rc;
> > +
> > + mbox_cmd = (struct mbox_cmd){
> > + .opcode = CXL_MBOX_OP_IDENTIFY,
> > + .payload_out = &id,
> > + .size_in = 0,
> > + };
> > + rc = cxl_mem_mbox_send_cmd(cxlm, &mbox_cmd);
> > + cxl_mem_mbox_put(cxlm);
> > + if (rc)
> > + return rc;
> > +
> > + /* TODO: Handle retry or reset responses from firmware. */
> > + if (mbox_cmd.return_code != CXL_MBOX_SUCCESS) {
> > + dev_err(&cxlm->pdev->dev, "Mailbox command failed (%d)\n",
> > + mbox_cmd.return_code);
> > + return -ENXIO;
> > + }
> > +
> > + if (mbox_cmd.size_out != sizeof(id))
> > + return -ENXIO;
> > +
> > + /*
> > + * TODO: enumerate DPA map, as 'ram' and 'pmem' do not alias.
>
> ??? Not sure I understand.
>
The current code is showing two aliased/overlapping ranges.
[0, id.volatile_capacity)
[0, id.persistent_capacity)
This is not allowed by spec.
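For illustration only, a sketch of where that TODO might head (this is not code
from the patch): a non-aliased layout would place the persistent range after the
volatile range in device physical address space, e.g.:

	/*
	 * Sketch only, reusing the cxlm->ram/pmem fields from this patch:
	 * lay the partitions out back to back instead of both at 0.
	 */
	cxlm->ram.range.start  = 0;
	cxlm->ram.range.end    = le64_to_cpu(id.volatile_capacity) - 1;
	cxlm->pmem.range.start = le64_to_cpu(id.volatile_capacity);
	cxlm->pmem.range.end   = le64_to_cpu(id.volatile_capacity) +
				 le64_to_cpu(id.persistent_capacity) - 1;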
> > + * For now, only the capacity is exported in sysfs
> > + */
> > + cxlm->ram.range.start = 0;
> > + cxlm->ram.range.end = le64_to_cpu(id.volatile_capacity) - 1;
> > +
> > + cxlm->pmem.range.start = 0;
> > + cxlm->pmem.range.end = le64_to_cpu(id.persistent_capacity) - 1;
> > +
> > + memcpy(cxlm->firmware_version, id.fw_revision, sizeof(id.fw_revision));
> > +
> > + return rc;
> > +}
> > +
> > static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > {
> > struct device *dev = &pdev->dev;
> > @@ -219,7 +555,11 @@ static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > if (rc)
> > return rc;
> >
> > - return cxl_mem_setup_mailbox(cxlm);
> > + rc = cxl_mem_setup_mailbox(cxlm);
> > + if (rc)
> > + return rc;
> > +
> > + return cxl_mem_identify(cxlm);
> > }
> >
> > static const struct pci_device_id cxl_mem_pci_tbl[] = {
> > --
> > 2.30.0
> >
On 21-02-01 13:24:00, Konrad Rzeszutek Wilk wrote:
> On Fri, Jan 29, 2021 at 04:24:33PM -0800, Ben Widawsky wrote:
> > The CXL memory device send interface will have a number of supported
> > commands. The raw command is not such a command. Raw commands allow
> > userspace to send a specified opcode to the underlying hardware and
> > bypass all driver checks on the command. This is useful for a couple of
> > usecases, mainly:
> > 1. Undocumented vendor specific hardware commands
> > 2. Prototyping new hardware commands not yet supported by the driver
>
> This sounds like a recipe for ..
>
> In case you really really want this may I recommend you do two things:
>
> - Wrap this whole thing with #ifdef
> CONFIG_CXL_DEBUG_THIS_WILL_DESTROY_YOUR_LIFE
>
> (or something equivalant to make it clear this should never be
> enabled in production kernels).
>
> - Add a nice big fat printk in dmesg telling the user that they
> are creating a unstable parallel universe that will lead to their
> blood pressure going sky-high, or perhaps something more professional
> sounding.
>
> - Rethink this. Do you really really want to encourage vendors
> to use this raw API instead of them using the proper APIs?
Again, the ideal is proper APIs. Barring that they get a WARN, and a taint if
they use the raw commands.
>
> >
> > While this all sounds very powerful it comes with a couple of caveats:
> > 1. Bug reports using raw commands will not get the same level of
> > attention as bug reports using supported commands (via taint).
> > 2. Supported commands will be rejected by the RAW command.
> >
> > With this comes new debugfs knob to allow full access to your toes with
> > your weapon of choice.
>
> Problem is that debugfs is no longer "debug" but is enabled in
> production kernel.
I don't see this as my problem. Again, they've been WARNed and tainted. If they
want to do this, that's their business. They will be asked to reproduce without
RAW if they file a bug report.
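For reference, the WARN-plus-taint behavior being described is roughly the
following (a sketch only; the check site and the CXL_MEM_COMMAND_ID_RAW name are
illustrative, and the taint flag is the one added earlier in this series):

	/* Sketch: scream and taint whenever the raw path is exercised. */
	if (cmd->info.id == CXL_MEM_COMMAND_ID_RAW)
		WARN_TAINT_ONCE(true, TAINT_RAW_PASSTHROUGH,
				"raw command path used\n");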
On Mon, Feb 1, 2021 at 11:13 AM Ben Widawsky <[email protected]> wrote:
>
> On 21-02-01 12:54:00, Konrad Rzeszutek Wilk wrote:
> > > +#define cxl_doorbell_busy(cxlm) \
> > > + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
> > > + CXLDEV_MB_CTRL_DOORBELL)
> > > +
> > > +#define CXL_MAILBOX_TIMEOUT_US 2000
> >
> > You been using the spec for the values. Is that number also from it ?
> >
>
> Yes it is. I'll add a comment with the spec reference.
>
> > > +
> > > +enum opcode {
> > > + CXL_MBOX_OP_IDENTIFY = 0x4000,
> > > + CXL_MBOX_OP_MAX = 0x10000
> > > +};
> > > +
> > > +/**
> > > + * struct mbox_cmd - A command to be submitted to hardware.
> > > + * @opcode: (input) The command set and command submitted to hardware.
> > > + * @payload_in: (input) Pointer to the input payload.
> > > + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> > > + * the caller.
> > > + * @size_in: (input) Number of bytes to load from @payload.
> > > + * @size_out: (output) Number of bytes loaded into @payload.
> > > + * @return_code: (output) Error code returned from hardware.
> > > + *
> > > + * This is the primary mechanism used to send commands to the hardware.
> > > + * All the fields except @payload_* correspond exactly to the fields described in
> > > + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> > > + * @payload_out are written to, and read from the Command Payload Registers
> > > + * defined in (8.2.8.4.8).
> > > + */
> > > +struct mbox_cmd {
> > > + u16 opcode;
> > > + void *payload_in;
> > > + void *payload_out;
> >
> > On a 32-bit OS (not that we use those that more, but lets assume
> > someone really wants to), the void is 4-bytes, while on 64-bit it is
> > 8-bytes.
> >
> > `pahole` is your friend as I think there is a gap between opcode and
> > payload_in in the structure.
> >
> > > + size_t size_in;
> > > + size_t size_out;
> >
> > And those can also change depending on 32-bit/64-bit.
> >
> > > + u16 return_code;
> > > +#define CXL_MBOX_SUCCESS 0
> > > +};
> >
> > Do you want to use __packed to match with the spec?
> >
> > Ah, reading later you don't care about it.
> >
> > In that case may I recommend you move 'return_code' (or perhaps just
> > call it rc?) to be right after opcode? Less of gaps in that structure.
> >
>
> I guess I hadn't realized we're supposed to try to fully pack structs by
> default.
This is just the internal parsed context of a command; I can't imagine
packing is relevant here. pahole optimization feels premature as well.
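For reference, the reordering Konrad is describing looks like this (illustrative
only; sizes assume a 64-bit build):

	struct mbox_cmd {
		u16 opcode;
		u16 return_code;	/* moved up next to opcode */
		void *payload_in;
		void *payload_out;
		size_t size_in;
		size_t size_out;
	};

which turns the 6-byte hole before @payload_in plus the tail padding into a
single 4-byte hole, though as noted above it is not clear the saving matters for
this structure.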
>
> > > +
> > > +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> > > +{
> > > + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> > > + const unsigned long start = jiffies;
> > > + unsigned long end = start;
> > > +
> > > + while (cxl_doorbell_busy(cxlm)) {
> > > + end = jiffies;
> > > +
> > > + if (time_after(end, start + timeout)) {
> > > + /* Check again in case preempted before timeout test */
> > > + if (!cxl_doorbell_busy(cxlm))
> > > + break;
> > > + return -ETIMEDOUT;
> > > + }
> > > + cpu_relax();
> > > + }
> >
> > Hm, that is not very scheduler friendly. I mean we are sitting here for
> > 2000us (2 ms) - that is quite the amount of time spinning.
> >
> > Should this perhaps be put in a workqueue?
>
> So let me first point you to the friendlier version which was shot down:
> https://lore.kernel.org/linux-cxl/[email protected]/
>
> I'm not opposed to this being moved to a workqueue at some point, but I think
> that's unnecessary complexity currently. The reality is that it's expected that
> commands will finish way sooner than this or be implemented as background
> commands. I've heard a person who makes a lot of the spec decisions say, "if
> it's 2 seconds, nobody will use these things".
That said, asynchronous probe needs to be enabled for the next driver update.
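For comparison, a sleep-based variant of the wait would look roughly like this
(a sketch only, reusing the helpers and timeout constant from the patch; it is
not the submitted code):

	static int cxl_mem_wait_for_doorbell_sleepy(struct cxl_mem *cxlm)
	{
		/* Same timeout budget as the patch, but yield between polls
		 * instead of spinning with cpu_relax(). */
		unsigned long timeout =
			jiffies + msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);

		while (cxl_doorbell_busy(cxlm)) {
			if (time_after(jiffies, timeout)) {
				/* Check again in case we were preempted */
				if (!cxl_doorbell_busy(cxlm))
					break;
				return -ETIMEDOUT;
			}
			usleep_range(100, 200);
		}
		return 0;
	}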
On Mon, Feb 01, 2021 at 11:27:08AM -0800, Ben Widawsky wrote:
> On 21-02-01 13:24:00, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jan 29, 2021 at 04:24:33PM -0800, Ben Widawsky wrote:
> > > The CXL memory device send interface will have a number of supported
> > > commands. The raw command is not such a command. Raw commands allow
> > > userspace to send a specified opcode to the underlying hardware and
> > > bypass all driver checks on the command. This is useful for a couple of
> > > usecases, mainly:
> > > 1. Undocumented vendor specific hardware commands
> > > 2. Prototyping new hardware commands not yet supported by the driver
> >
> > This sounds like a recipe for ..
> >
> > In case you really really want this may I recommend you do two things:
> >
> > - Wrap this whole thing with #ifdef
> > CONFIG_CXL_DEBUG_THIS_WILL_DESTROY_YOUR_LIFE
> >
> > (or something equivalant to make it clear this should never be
> > enabled in production kernels).
> >
> > - Add a nice big fat printk in dmesg telling the user that they
> > are creating a unstable parallel universe that will lead to their
> > blood pressure going sky-high, or perhaps something more professional
> > sounding.
> >
> > - Rethink this. Do you really really want to encourage vendors
> > to use this raw API instead of them using the proper APIs?
>
> Again, the ideal is proper APIs. Barring that they get a WARN, and a taint if
> they use the raw commands.
Linux upstream is all about proper APIs. Just don't do this.
>
> >
> > >
> > > While this all sounds very powerful it comes with a couple of caveats:
> > > 1. Bug reports using raw commands will not get the same level of
> > > attention as bug reports using supported commands (via taint).
> > > 2. Supported commands will be rejected by the RAW command.
> > >
> > > With this comes new debugfs knob to allow full access to your toes with
> > > your weapon of choice.
> >
> > Problem is that debugfs is no longer "debug" but is enabled in
> > production kernel.
>
> I don't see this as my problem. Again, they've been WARNed and tainted. If they
Right not your problem, nice.
But it is going to be the problem of vendor kernel engineers who don't have this luxury.
> want to do this, that's their business. They will be asked to reproduce without
> RAW if they file a bug report.
This is not how customers see the world. "If it is there, then it is
there to be used, right? Why else would someone give me the keys to this?"
Just kill this. Or better yet, make it a separate set of patches for
folks developing code but not have it as part of this patchset.
>
On Sat, Jan 30, 2021 at 3:52 PM David Rientjes <[email protected]> wrote:
>
> On Fri, 29 Jan 2021, Ben Widawsky wrote:
>
> > Provide enough functionality to utilize the mailbox of a memory device.
> > The mailbox is used to interact with the firmware running on the memory
> > device.
> >
> > The CXL specification defines separate capabilities for the mailbox and
> > the memory device. The mailbox interface has a doorbell to indicate
> > ready to accept commands and the memory device has a capability register
> > that indicates the mailbox interface is ready. The expectation is that
> > the doorbell-ready is always later than the memory-device-indication
> > that the mailbox is ready.
> >
> > Create a function to handle sending a command, optionally with a
> > payload, to the memory device, polling on a result, and then optionally
> > copying out the payload. The algorithm for doing this comes straight out
> > of the CXL 2.0 specification.
> >
> > Primary mailboxes are capable of generating an interrupt when submitting
> > a command in the background. That implementation is saved for a later
> > time.
> >
> > Secondary mailboxes aren't implemented at this time.
> >
> > The flow is proven with one implemented command, "identify". Because the
> > class code has already told the driver this is a memory device and the
> > identify command is mandatory.
> >
> > Signed-off-by: Ben Widawsky <[email protected]>
> > ---
> > drivers/cxl/Kconfig | 14 ++
> > drivers/cxl/cxl.h | 39 +++++
> > drivers/cxl/mem.c | 342 +++++++++++++++++++++++++++++++++++++++++++-
> > 3 files changed, 394 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index 3b66b46af8a0..fe591f74af96 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -32,4 +32,18 @@ config CXL_MEM
> > Chapter 2.3 Type 3 CXL Device in the CXL 2.0 specification.
> >
> > If unsure say 'm'.
> > +
> > +config CXL_MEM_INSECURE_DEBUG
> > + bool "CXL.mem debugging"
> > + depends on CXL_MEM
> > + help
> > + Enable debug of all CXL command payloads.
> > +
> > + Some CXL devices and controllers support encryption and other
> > + security features. The payloads for the commands that enable
> > + those features may contain sensitive clear-text security
> > + material. Disable debug of those command payloads by default.
> > + If you are a kernel developer actively working on CXL
> > + security enabling say Y, otherwise say N.
>
> Not specific to this patch, but the reference to encryption made me
> curious about integrity: are all CXL.mem devices compatible with DIMP?
> Some? None?
The encryption here is "device passphrase" similar to the NVDIMM
Security Management described here:
https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf
The LIBNVDIMM enabling wrapped this support with the Linux keys
interface which among other things enforces wrapping the clear text
passphrase with a Linux "trusted/encrypted" key.
Additionally, the CXL.io interface optionally supports PCI IDE:
https://www.intel.com/content/dam/www/public/us/en/documents/reference-guides/pcie-device-security-enhancements.pdf
I'm otherwise not familiar with the DIMP acronym?
> > +
> > endif
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index a3da7f8050c4..df3d97154b63 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -31,9 +31,36 @@
> > #define CXLDEV_MB_CAPS_OFFSET 0x00
> > #define CXLDEV_MB_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
> > #define CXLDEV_MB_CTRL_OFFSET 0x04
> > +#define CXLDEV_MB_CTRL_DOORBELL BIT(0)
> > #define CXLDEV_MB_CMD_OFFSET 0x08
> > +#define CXLDEV_MB_CMD_COMMAND_OPCODE_MASK GENMASK(15, 0)
> > +#define CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK GENMASK(36, 16)
> > #define CXLDEV_MB_STATUS_OFFSET 0x10
> > +#define CXLDEV_MB_STATUS_RET_CODE_MASK GENMASK(47, 32)
> > #define CXLDEV_MB_BG_CMD_STATUS_OFFSET 0x18
> > +#define CXLDEV_MB_PAYLOAD_OFFSET 0x20
> > +
> > +/* Memory Device (CXL 2.0 - 8.2.8.5.1.1) */
> > +#define CXLMDEV_STATUS_OFFSET 0x0
> > +#define CXLMDEV_DEV_FATAL BIT(0)
> > +#define CXLMDEV_FW_HALT BIT(1)
> > +#define CXLMDEV_STATUS_MEDIA_STATUS_MASK GENMASK(3, 2)
> > +#define CXLMDEV_MS_NOT_READY 0
> > +#define CXLMDEV_MS_READY 1
> > +#define CXLMDEV_MS_ERROR 2
> > +#define CXLMDEV_MS_DISABLED 3
> > +#define CXLMDEV_READY(status) \
> > + (CXL_GET_FIELD(status, CXLMDEV_STATUS_MEDIA_STATUS) == CXLMDEV_MS_READY)
> > +#define CXLMDEV_MBOX_IF_READY BIT(4)
> > +#define CXLMDEV_RESET_NEEDED_SHIFT 5
> > +#define CXLMDEV_RESET_NEEDED_MASK GENMASK(7, 5)
> > +#define CXLMDEV_RESET_NEEDED_NOT 0
> > +#define CXLMDEV_RESET_NEEDED_COLD 1
> > +#define CXLMDEV_RESET_NEEDED_WARM 2
> > +#define CXLMDEV_RESET_NEEDED_HOT 3
> > +#define CXLMDEV_RESET_NEEDED_CXL 4
> > +#define CXLMDEV_RESET_NEEDED(status) \
> > + (CXL_GET_FIELD(status, CXLMDEV_RESET_NEEDED) != CXLMDEV_RESET_NEEDED_NOT)
> >
> > /**
> > * struct cxl_mem - A CXL memory device
> > @@ -44,6 +71,16 @@ struct cxl_mem {
> > struct pci_dev *pdev;
> > void __iomem *regs;
> >
> > + struct {
> > + struct range range;
> > + } pmem;
> > +
> > + struct {
> > + struct range range;
> > + } ram;
> > +
> > + char firmware_version[0x10];
> > +
> > /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
> > struct {
> > void __iomem *regs;
> > @@ -51,6 +88,7 @@ struct cxl_mem {
> >
> > /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
> > struct {
> > + struct mutex mutex; /* Protects device mailbox and firmware */
> > void __iomem *regs;
> > size_t payload_size;
> > } mbox;
> > @@ -89,5 +127,6 @@ struct cxl_mem {
> >
> > cxl_reg(status);
> > cxl_reg(mbox);
> > +cxl_reg(mem);
> >
> > #endif /* __CXL_H__ */
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index fa14d51243ee..69ed15bfa5d4 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -6,6 +6,270 @@
> > #include "pci.h"
> > #include "cxl.h"
> >
> > +#define cxl_doorbell_busy(cxlm) \
> > + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
> > + CXLDEV_MB_CTRL_DOORBELL)
> > +
> > +#define CXL_MAILBOX_TIMEOUT_US 2000
>
> This should be _MS?
>
> > +
> > +enum opcode {
> > + CXL_MBOX_OP_IDENTIFY = 0x4000,
> > + CXL_MBOX_OP_MAX = 0x10000
> > +};
> > +
> > +/**
> > + * struct mbox_cmd - A command to be submitted to hardware.
> > + * @opcode: (input) The command set and command submitted to hardware.
> > + * @payload_in: (input) Pointer to the input payload.
> > + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> > + * the caller.
> > + * @size_in: (input) Number of bytes to load from @payload.
> > + * @size_out: (output) Number of bytes loaded into @payload.
> > + * @return_code: (output) Error code returned from hardware.
> > + *
> > + * This is the primary mechanism used to send commands to the hardware.
> > + * All the fields except @payload_* correspond exactly to the fields described in
> > + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> > + * @payload_out are written to, and read from the Command Payload Registers
> > + * defined in (8.2.8.4.8).
> > + */
> > +struct mbox_cmd {
> > + u16 opcode;
> > + void *payload_in;
> > + void *payload_out;
> > + size_t size_in;
> > + size_t size_out;
> > + u16 return_code;
> > +#define CXL_MBOX_SUCCESS 0
> > +};
> > +
> > +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> > +{
> > + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> > + const unsigned long start = jiffies;
> > + unsigned long end = start;
> > +
> > + while (cxl_doorbell_busy(cxlm)) {
> > + end = jiffies;
> > +
> > + if (time_after(end, start + timeout)) {
> > + /* Check again in case preempted before timeout test */
> > + if (!cxl_doorbell_busy(cxlm))
> > + break;
> > + return -ETIMEDOUT;
> > + }
> > + cpu_relax();
> > + }
> > +
> > + dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
> > + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> > + return 0;
> > +}
> > +
> > +static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
> > + struct mbox_cmd *mbox_cmd)
> > +{
> > + dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
> > + dev_info(&cxlm->pdev->dev,
> > + "\topcode: 0x%04x\n"
> > + "\tpayload size: %zub\n",
> > + mbox_cmd->opcode, mbox_cmd->size_in);
> > +
> > + if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
> > + print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
> > + mbox_cmd->payload_in, mbox_cmd->size_in,
> > + true);
> > + }
> > +
> > + /* Here's a good place to figure out if a device reset is needed */
>
> What are the implications if we don't do a reset, as this implementation
> does not? IOW, does a timeout require a device to be recovered through a
> reset before it can receive additional commands, or is it safe to simply
> drop the command that timed out on the floor and proceed?
Not a satisfying answer, but "it depends". It's also complicated by
the fact that a reset may need to be coordinated with other devices in
the interleave-set as the HDM decoders may bounce.
For comparison, to date there have been no problems with the "drop on
the floor" policy of LIBNVDIMM command timeouts. At the same time
there simply was not a software visible reset mechanism for those
devices so this problem never came out. This mailbox isn't a fast
path, so the device is likely completely dead if this timeout is ever
violated, and the firmware reporting a timeout might as well assume
that the OS gives up on the device.
I'll let Ben chime in on the rest...
On Mon, Feb 1, 2021 at 11:36 AM Konrad Rzeszutek Wilk
<[email protected]> wrote:
>
> On Mon, Feb 01, 2021 at 11:27:08AM -0800, Ben Widawsky wrote:
> > On 21-02-01 13:24:00, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Jan 29, 2021 at 04:24:33PM -0800, Ben Widawsky wrote:
> > > > The CXL memory device send interface will have a number of supported
> > > > commands. The raw command is not such a command. Raw commands allow
> > > > userspace to send a specified opcode to the underlying hardware and
> > > > bypass all driver checks on the command. This is useful for a couple of
> > > > usecases, mainly:
> > > > 1. Undocumented vendor specific hardware commands
> > > > 2. Prototyping new hardware commands not yet supported by the driver
> > >
> > > This sounds like a recipe for ..
> > >
> > > In case you really really want this may I recommend you do two things:
> > >
> > > - Wrap this whole thing with #ifdef
> > > CONFIG_CXL_DEBUG_THIS_WILL_DESTROY_YOUR_LIFE
> > >
> > > (or something equivalant to make it clear this should never be
> > > enabled in production kernels).
> > >
> > > - Add a nice big fat printk in dmesg telling the user that they
> > > are creating a unstable parallel universe that will lead to their
> > > blood pressure going sky-high, or perhaps something more professional
> > > sounding.
> > >
> > > - Rethink this. Do you really really want to encourage vendors
> > > to use this raw API instead of them using the proper APIs?
> >
> > Again, the ideal is proper APIs. Barring that they get a WARN, and a taint if
> > they use the raw commands.
>
> Linux upstream is all about proper APIs. Just don't do this.
> >
> > >
> > > >
> > > > While this all sounds very powerful it comes with a couple of caveats:
> > > > 1. Bug reports using raw commands will not get the same level of
> > > > attention as bug reports using supported commands (via taint).
> > > > 2. Supported commands will be rejected by the RAW command.
> > > >
> > > > With this comes new debugfs knob to allow full access to your toes with
> > > > your weapon of choice.
> > >
> > > Problem is that debugfs is no longer "debug" but is enabled in
> > > production kernel.
> >
> > I don't see this as my problem. Again, they've been WARNed and tainted. If they
>
> Right not your problem, nice.
>
> But it is going to be the problem of vendor kernel engineers who don't have this luxury.
>
> > want to do this, that's their business. They will be asked to reproduce without
> > RAW if they file a bug report.
>
>
> This is not how customers see the world. "If it is there, then it is
> there to used right? Why else would someone give me the keys to this?"
>
> Just kill this. Or better yet, make it a seperate set of patches for
> folks developing code but not have it as part of this patchset.
In the ACPI NFIT driver, the only protection against vendor
shenanigans is the requirement that any and all DSM functions be
described in a public specification, so there is no unfettered access
to the DSM interface. However, multiple vendors just went ahead and
included a "vendor passthrough" as a DSM sub-command in their
implementation. The driver does have the "disable_vendor_specific"
module parameter, however that does not amount to much more than a
stern look from the kernel at vendors shipping functionality through
that path rather than proper functions. It has been a source of bugs.
The RAW command proposal Ben has here is a significant improvement on
that status quo. It's built on the observation that customers pick up
the phone whenever their kernel backtraces, and makes it easy to
spot broken tooling. That said, I think it is reasonable to place the
RAW interface behind a configuration option and let distribution
policy decide the availability.
On Mon, 1 Feb 2021, Ben Widawsky wrote:
> On 21-01-30 15:51:49, David Rientjes wrote:
> > On Fri, 29 Jan 2021, Ben Widawsky wrote:
> >
> > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > +{
> > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > +
> > > + cxlm->mbox.payload_size =
> > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > +
> > > + /* 8.2.8.4.3 */
> > > + if (cxlm->mbox.payload_size < 256) {
> > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > + cxlm->mbox.payload_size);
> > > + return -ENXIO;
> > > + }
> >
> > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> > return ENXIO if true?
>
> If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> driver not allow it?
>
Because the spec disallows it :)
On Mon, 1 Feb 2021, Ben Widawsky wrote:
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > new file mode 100644
> > > index 000000000000..fe7b87eba988
> > > --- /dev/null
> > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > @@ -0,0 +1,26 @@
> > > +What: /sys/bus/cxl/devices/memX/firmware_version
> > > +Date: December, 2020
> > > +KernelVersion: v5.12
> > > +Contact: [email protected]
> > > +Description:
> > > + (RO) "FW Revision" string as reported by the Identify
> > > + Memory Device Output Payload in the CXL-2.0
> > > + specification.
> > > +
> > > +What: /sys/bus/cxl/devices/memX/ram/size
> > > +Date: December, 2020
> > > +KernelVersion: v5.12
> > > +Contact: [email protected]
> > > +Description:
> > > + (RO) "Volatile Only Capacity" as reported by the
> > > + Identify Memory Device Output Payload in the CXL-2.0
> > > + specification.
> > > +
> > > +What: /sys/bus/cxl/devices/memX/pmem/size
> > > +Date: December, 2020
> > > +KernelVersion: v5.12
> > > +Contact: [email protected]
> > > +Description:
> > > + (RO) "Persistent Only Capacity" as reported by the
> > > + Identify Memory Device Output Payload in the CXL-2.0
> > > + specification.
> >
> > Aren't volatile and persistent capacities expressed in multiples of 256MB?
>
> As of the spec today, volatile and persistent capacities are required to be
> in multiples of 256MB, however, future specs may not have such a requirement and
> I think keeping sysfs ABI easily forward portable makes sense.
>
Makes sense, can we add that these are expressed in bytes or is that
already implied?
On Mon, Feb 1, 2021 at 1:53 PM David Rientjes <[email protected]> wrote:
>
> On Mon, 1 Feb 2021, Ben Widawsky wrote:
>
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > new file mode 100644
> > > > index 000000000000..fe7b87eba988
> > > > --- /dev/null
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > > > @@ -0,0 +1,26 @@
> > > > +What: /sys/bus/cxl/devices/memX/firmware_version
> > > > +Date: December, 2020
> > > > +KernelVersion: v5.12
> > > > +Contact: [email protected]
> > > > +Description:
> > > > + (RO) "FW Revision" string as reported by the Identify
> > > > + Memory Device Output Payload in the CXL-2.0
> > > > + specification.
> > > > +
> > > > +What: /sys/bus/cxl/devices/memX/ram/size
> > > > +Date: December, 2020
> > > > +KernelVersion: v5.12
> > > > +Contact: [email protected]
> > > > +Description:
> > > > + (RO) "Volatile Only Capacity" as reported by the
> > > > + Identify Memory Device Output Payload in the CXL-2.0
> > > > + specification.
> > > > +
> > > > +What: /sys/bus/cxl/devices/memX/pmem/size
> > > > +Date: December, 2020
> > > > +KernelVersion: v5.12
> > > > +Contact: [email protected]
> > > > +Description:
> > > > + (RO) "Persistent Only Capacity" as reported by the
> > > > + Identify Memory Device Output Payload in the CXL-2.0
> > > > + specification.
> > >
> > > Aren't volatile and persistent capacities expressed in multiples of 256MB?
> >
> > As of the spec today, volatile and persistent capacities are required to be
> > in multiples of 256MB, however, future specs may not have such a requirement and
> > I think keeping sysfs ABI easily forward portable makes sense.
> >
>
> Makes sense, can we add that these are expressed in bytes or is that
> already implied?
Makes sense to declare units here.
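For example, the description could simply state the unit (the wording here is
illustrative only):

	What:		/sys/bus/cxl/devices/memX/ram/size
	Date:		December, 2020
	KernelVersion:	v5.12
	Contact:	[email protected]
	Description:
			(RO) "Volatile Only Capacity", in bytes, as reported by
			the Identify Memory Device Output Payload in the CXL-2.0
			specification.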
On 21-02-01 13:51:16, David Rientjes wrote:
> On Mon, 1 Feb 2021, Ben Widawsky wrote:
>
> > On 21-01-30 15:51:49, David Rientjes wrote:
> > > On Fri, 29 Jan 2021, Ben Widawsky wrote:
> > >
> > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > +{
> > > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > > +
> > > > + cxlm->mbox.payload_size =
> > > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > +
> > > > + /* 8.2.8.4.3 */
> > > > + if (cxlm->mbox.payload_size < 256) {
> > > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > > + cxlm->mbox.payload_size);
> > > > + return -ENXIO;
> > > > + }
> > >
> > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> > > return ENXIO if true?
> >
> > If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> > driver not allow it?
> >
>
> Because the spec disallows it :)
I don't see it being the driver's responsibility to enforce spec correctness
though. In certain cases, I need to use the spec, like I have to pick /some/
mailbox timeout. For other cases...
I'm not too familiar with what other similar drivers may or may not do in
situations like this. The current 256 limit is mostly a reflection of that being
too small to even support advertised mandatory commands. So things can't work in
that scenario, but things can work if they have a larger register size (so long
as the BAR advertises enough space).
On Mon, Feb 1, 2021 at 1:51 PM David Rientjes <[email protected]> wrote:
>
> On Mon, 1 Feb 2021, Ben Widawsky wrote:
>
> > On 21-01-30 15:51:49, David Rientjes wrote:
> > > On Fri, 29 Jan 2021, Ben Widawsky wrote:
> > >
> > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > +{
> > > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > > +
> > > > + cxlm->mbox.payload_size =
> > > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > +
> > > > + /* 8.2.8.4.3 */
> > > > + if (cxlm->mbox.payload_size < 256) {
> > > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > > + cxlm->mbox.payload_size);
> > > > + return -ENXIO;
> > > > + }
> > >
> > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> > > return ENXIO if true?
> >
> > If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> > driver not allow it?
> >
>
> Because the spec disallows it :)
Unless it causes an operational failure in practice I'd go with the
Robustness Principle and be liberal in accepting hardware geometries.
On Mon, 1 Feb 2021, Ben Widawsky wrote:
> > > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > > +{
> > > > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > > > +
> > > > > + cxlm->mbox.payload_size =
> > > > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > > +
> > > > > + /* 8.2.8.4.3 */
> > > > > + if (cxlm->mbox.payload_size < 256) {
> > > > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > > > + cxlm->mbox.payload_size);
> > > > > + return -ENXIO;
> > > > > + }
> > > >
> > > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> > > > return ENXIO if true?
> > >
> > > If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> > > driver not allow it?
> > >
> >
> > Because the spec disallows it :)
>
> I don't see it being the driver's responsibility to enforce spec correctness
> though. In certain cases, I need to use the spec, like I have to pick /some/
> mailbox timeout. For other cases...
>
> I'm not too familiar with what other similar drivers may or may not do in
> situations like this. The current 256 limit is mostly a reflection of that being
> too small to even support advertised mandatory commands. So things can't work in
> that scenario, but things can work if they have a larger register size (so long
> as the BAR advertises enough space).
>
I don't think things can work above 1MB, either, though. Section
8.2.8.4.5 specifies 20 bits to define the payload length, if this is
larger than cxlm->mbox.payload_size it would venture into the reserved
bits of the command register.
So is the idea to allow cxl_mem_setup_mailbox() to succeed with a payload
size > 1MB and then only check 20 bits for the command register?
On 21-02-01 14:23:47, David Rientjes wrote:
> On Mon, 1 Feb 2021, Ben Widawsky wrote:
>
> > > > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > > > +{
> > > > > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > > > > +
> > > > > > + cxlm->mbox.payload_size =
> > > > > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > > > +
> > > > > > + /* 8.2.8.4.3 */
> > > > > > + if (cxlm->mbox.payload_size < 256) {
> > > > > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > > > > + cxlm->mbox.payload_size);
> > > > > > + return -ENXIO;
> > > > > > + }
> > > > >
> > > > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> > > > > return ENXIO if true?
> > > >
> > > > If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> > > > driver not allow it?
> > > >
> > >
> > > Because the spec disallows it :)
> >
> > I don't see it being the driver's responsibility to enforce spec correctness
> > though. In certain cases, I need to use the spec, like I have to pick /some/
> > mailbox timeout. For other cases...
> >
> > I'm not too familiar with what other similar drivers may or may not do in
> > situations like this. The current 256 limit is mostly a reflection of that being
> > too small to even support advertised mandatory commands. So things can't work in
> > that scenario, but things can work if they have a larger register size (so long
> > as the BAR advertises enough space).
> >
>
> I don't think things can work above 1MB, either, though. Section
> 8.2.8.4.5 specifies 20 bits to define the payload length, if this is
> larger than cxlm->mbox.payload_size it would venture into the reserved
> bits of the command register.
>
> So is the idea to allow cxl_mem_setup_mailbox() to succeed with a payload
> size > 1MB and then only check 20 bits for the command register?
So it's probably a spec bug, but actually the payload size is 21 bits... I'll
check if that was a mistake.
On 21-02-01 14:28:59, Ben Widawsky wrote:
> On 21-02-01 14:23:47, David Rientjes wrote:
> > On Mon, 1 Feb 2021, Ben Widawsky wrote:
> >
> > > > > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > > > > +{
> > > > > > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > > > > > +
> > > > > > > + cxlm->mbox.payload_size =
> > > > > > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > > > > +
> > > > > > > + /* 8.2.8.4.3 */
> > > > > > > + if (cxlm->mbox.payload_size < 256) {
> > > > > > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > > > > > + cxlm->mbox.payload_size);
> > > > > > > + return -ENXIO;
> > > > > > > + }
> > > > > >
> > > > > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> > > > > > return ENXIO if true?
> > > > >
> > > > > If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> > > > > driver not allow it?
> > > > >
> > > >
> > > > Because the spec disallows it :)
> > >
> > > I don't see it being the driver's responsibility to enforce spec correctness
> > > though. In certain cases, I need to use the spec, like I have to pick /some/
> > > mailbox timeout. For other cases...
> > >
> > > I'm not too familiar with what other similar drivers may or may not do in
> > > situations like this. The current 256 limit is mostly a reflection of that being
> > > too small to even support advertised mandatory commands. So things can't work in
> > > that scenario, but things can work if they have a larger register size (so long
> > > as the BAR advertises enough space).
> > >
> >
> > I don't think things can work above 1MB, either, though. Section
> > 8.2.8.4.5 specifies 20 bits to define the payload length, if this is
> > larger than cxlm->mbox.payload_size it would venture into the reserved
> > bits of the command register.
> >
> > So is the idea to allow cxl_mem_setup_mailbox() to succeed with a payload
> > size > 1MB and then only check 20 bits for the command register?
>
> So it's probably a spec bug, but actually the payload size is 21 bits... I'll
> check if that was a mistake.
Well I guess they wanted to be able to specify 1M exactly... Spec should
probably say you can't go over 1M
On Mon, 1 Feb 2021, Ben Widawsky wrote:
> > > > > > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > > > > > +{
> > > > > > > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > > > > > > +
> > > > > > > > + cxlm->mbox.payload_size =
> > > > > > > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > > > > > +
> > > > > > > > + /* 8.2.8.4.3 */
> > > > > > > > + if (cxlm->mbox.payload_size < 256) {
> > > > > > > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > > > > > > + cxlm->mbox.payload_size);
> > > > > > > > + return -ENXIO;
> > > > > > > > + }
> > > > > > >
> > > > > > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> > > > > > > return ENXIO if true?
> > > > > >
> > > > > > If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> > > > > > driver not allow it?
> > > > > >
> > > > >
> > > > > Because the spec disallows it :)
> > > >
> > > > I don't see it being the driver's responsibility to enforce spec correctness
> > > > though. In certain cases, I need to use the spec, like I have to pick /some/
> > > > mailbox timeout. For other cases...
> > > >
> > > > I'm not too familiar with what other similar drivers may or may not do in
> > > > situations like this. The current 256 limit is mostly a reflection of that being
> > > > too small to even support advertised mandatory commands. So things can't work in
> > > > that scenario, but things can work if they have a larger register size (so long
> > > > as the BAR advertises enough space).
> > > >
> > >
> > > I don't think things can work above 1MB, either, though. Section
> > > 8.2.8.4.5 specifies 20 bits to define the payload length, if this is
> > > larger than cxlm->mbox.payload_size it would venture into the reserved
> > > bits of the command register.
> > >
> > > So is the idea to allow cxl_mem_setup_mailbox() to succeed with a payload
> > > size > 1MB and then only check 20 bits for the command register?
> >
> > So it's probably a spec bug, but actually the payload size is 21 bits... I'll
> > check if that was a mistake.
>
> Well I guess they wanted to be able to specify 1M exactly... Spec should
> probably say you can't go over 1M
>
I think that's what 8.2.8.4.3 says, no? And then 8.2.8.4.5 says you
can use up to Payload Size. That's why my recommendation was to enforce
this in cxl_mem_setup_mailbox() up front.
On 21-02-01 14:45:00, David Rientjes wrote:
> On Mon, 1 Feb 2021, Ben Widawsky wrote:
>
> > > > > > > > > +static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
> > > > > > > > > +{
> > > > > > > > > + const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > > > > > > > > +
> > > > > > > > > + cxlm->mbox.payload_size =
> > > > > > > > > + 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE);
> > > > > > > > > +
> > > > > > > > > + /* 8.2.8.4.3 */
> > > > > > > > > + if (cxlm->mbox.payload_size < 256) {
> > > > > > > > > + dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
> > > > > > > > > + cxlm->mbox.payload_size);
> > > > > > > > > + return -ENXIO;
> > > > > > > > > + }
> > > > > > > >
> > > > > > > > Any reason not to check cxlm->mbox.payload_size > (1 << 20) as well and
> > > > > > > > return ENXIO if true?
> > > > > > >
> > > > > > > If some crazy vendor wanted to ship a mailbox larger than 1M, why should the
> > > > > > > driver not allow it?
> > > > > > >
> > > > > >
> > > > > > Because the spec disallows it :)
> > > > >
> > > > > I don't see it being the driver's responsibility to enforce spec correctness
> > > > > though. In certain cases, I need to use the spec, like I have to pick /some/
> > > > > mailbox timeout. For other cases...
> > > > >
> > > > > I'm not too familiar with what other similar drivers may or may not do in
> > > > > situations like this. The current 256 limit is mostly a reflection of that being
> > > > > too small to even support advertised mandatory commands. So things can't work in
> > > > > that scenario, but things can work if they have a larger register size (so long
> > > > > as the BAR advertises enough space).
> > > > >
> > > >
> > > > I don't think things can work above 1MB, either, though. Section
> > > > 8.2.8.4.5 specifies 20 bits to define the payload length, if this is
> > > > larger than cxlm->mbox.payload_size it would venture into the reserved
> > > > bits of the command register.
> > > >
> > > > So is the idea to allow cxl_mem_setup_mailbox() to succeed with a payload
> > > > size > 1MB and then only check 20 bits for the command register?
> > >
> > > So it's probably a spec bug, but actually the payload size is 21 bits... I'll
> > > check if that was a mistake.
> >
> > Well I guess they wanted to be able to specify 1M exactly... Spec should
> > probably say you can't go over 1M
> >
>
> I think that's what 8.2.8.4.3 says, no? And then 8.2.8.4.5 says you
> can use up to Payload Size. That's why my recommendation was to enforce
> this in cxl_mem_setup_mailbox() up front.
Yeah. I asked our spec people to update 8.2.8.4.5 to make it clearer. I'd argue
the intent is how you describe it, but the implementation isn't.
My argument was silly anyway because if you specify greater than 1M as your
payload, you will get EINVAL at the ioctl.
The value of how it works today is the driver will at least bind and allow you
to interact with it.
How strongly do you feel about this?
On Mon, 1 Feb 2021, Ben Widawsky wrote:
> > I think that's what 8.2.8.4.3 says, no? And then 8.2.8.4.5 says you
> > can use up to Payload Size. That's why my recommendation was to enforce
> > this in cxl_mem_setup_mailbox() up front.
>
> Yeah. I asked our spec people to update 8.2.8.4.5 to make it clearer. I'd argue
> the intent is how you describe it, but the implementation isn't.
>
> My argument was silly anyway because if you specify greater than 1M as your
> payload, you will get EINVAL at the ioctl.
>
> The value of how it works today is the driver will at least bind and allow you
> to interact with it.
>
> How strongly do you feel about this?
>
I haven't seen the update to 8.2.8.4.5 to know yet :)
You make a good point of at least being able to interact with the driver.
I think you could argue that if the driver binds, then the payload size is
accepted, in which case it would be strange to get an EINVAL when using
the ioctl with anything >1MB.
Concern was that if we mask off the reserved bits from the command
register that we could be masking part of the payload size that is being
passed if the accepted max is >1MB. Idea was to avoid any possibility of
this inconsistency. If this is being checked for ioctl, seems like it's
checking reserved bits.
But maybe I should just wait for the spec update.
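To make the masking concern concrete, here is roughly how the payload length
lands in the command register using the defines from cxl.h in this patch
(illustration only; FIELD_PREP() is from <linux/bitfield.h>):

	static u64 cxl_sketch_cmd_register(u16 opcode, size_t size_in)
	{
		/* CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK is GENMASK(36, 16), i.e.
		 * 21 bits; a size larger than the field can hold simply does
		 * not fit (FIELD_PREP() would truncate it), which is the
		 * inconsistency being discussed above. */
		u64 cmd = FIELD_PREP(CXLDEV_MB_CMD_COMMAND_OPCODE_MASK, opcode);

		if (size_in)
			cmd |= FIELD_PREP(CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK,
					  size_in);
		return cmd;
	}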
On 21-02-01 15:09:45, David Rientjes wrote:
> On Mon, 1 Feb 2021, Ben Widawsky wrote:
>
> > > I think that's what 8.2.8.4.3 says, no? And then 8.2.8.4.5 says you
> > > can use up to Payload Size. That's why my recommendation was to enforce
> > > this in cxl_mem_setup_mailbox() up front.
> >
> > Yeah. I asked our spec people to update 8.2.8.4.5 to make it clearer. I'd argue
> > the intent is how you describe it, but the implementation isn't.
> >
> > My argument was silly anyway because if you specify greater than 1M as your
> > payload, you will get EINVAL at the ioctl.
> >
> > The value of how it works today is the driver will at least bind and allow you
> > to interact with it.
> >
> > How strongly do you feel about this?
> >
>
> I haven't seen the update to 8.2.8.4.5 to know yet :)
>
> You make a good point of at least being able to interact with the driver.
> I think you could argue that if the driver binds, then the payload size is
> accepted, in which case it would be strange to get an EINVAL when using
> the ioctl with anything >1MB.
>
> Concern was that if we mask off the reserved bits from the command
> register that we could be masking part of the payload size that is being
> passed if the accepted max is >1MB. Idea was to avoid any possibility of
> this inconsistency. If this is being checked for ioctl, seems like it's
> checking reserved bits.
>
> But maybe I should just wait for the spec update.
Well, I wouldn't hold your breath (it would be an errata in this case anyway).
My preference would be to just allow mailbox payload size to be 2^31 and
not deal with this.
My question was how strongly do you feel it's an error that should prevent
binding.
On Mon, 1 Feb 2021, Ben Widawsky wrote:
> > I haven't seen the update to 8.2.8.4.5 to know yet :)
> >
> > You make a good point of at least being able to interact with the driver.
> > I think you could argue that if the driver binds, then the payload size is
> > accepted, in which case it would be strange to get an EINVAL when using
> > the ioctl with anything >1MB.
> >
> > Concern was that if we mask off the reserved bits from the command
> > register that we could be masking part of the payload size that is being
> > passed if the accepted max is >1MB. Idea was to avoid any possibility of
> > this inconsistency. If this is being checked for ioctl, seems like it's
> > checking reserved bits.
> >
> > But maybe I should just wait for the spec update.
>
> Well, I wouldn't hold your breath (it would be an errata in this case anyway).
> My preference would be to just allow allow mailbox payload size to be 2^31 and
> not deal with this.
>
> My question was how strongly do you feel it's an error that should prevent
> binding.
>
I don't have an objection to binding, but doesn't this require that the
check in cxl_validate_cmd_from_user() guarantees send_cmd->size_in cannot
be greater than 1MB?
On 21-02-01 15:58:09, David Rientjes wrote:
> On Mon, 1 Feb 2021, Ben Widawsky wrote:
>
> > > I haven't seen the update to 8.2.8.4.5 to know yet :)
> > >
> > > You make a good point of at least being able to interact with the driver.
> > > I think you could argue that if the driver binds, then the payload size is
> > > accepted, in which case it would be strange to get an EINVAL when using
> > > the ioctl with anything >1MB.
> > >
> > > Concern was that if we mask off the reserved bits from the command
> > > register that we could be masking part of the payload size that is being
> > > passed if the accepted max is >1MB. Idea was to avoid any possibility of
> > > this inconsistency. If this is being checked for ioctl, seems like it's
> > > checking reserved bits.
> > >
> > > But maybe I should just wait for the spec update.
> >
> > Well, I wouldn't hold your breath (it would be an errata in this case anyway).
> > My preference would be to just allow allow mailbox payload size to be 2^31 and
> > not deal with this.
> >
> > My question was how strongly do you feel it's an error that should prevent
> > binding.
> >
>
> I don't have an objection to binding, but doesn't this require that the
> check in cxl_validate_cmd_from_user() guarantees send_cmd->size_in cannot
> be greater than 1MB?
You're correct. I'd need to add:
cxlm->mbox.payload_size =
min_t(size_t, 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE), 1<<20)
On Mon, Feb 1, 2021 at 4:11 PM Ben Widawsky <[email protected]> wrote:
>
> On 21-02-01 15:58:09, David Rientjes wrote:
> > On Mon, 1 Feb 2021, Ben Widawsky wrote:
> >
> > > > I haven't seen the update to 8.2.8.4.5 to know yet :)
> > > >
> > > > You make a good point of at least being able to interact with the driver.
> > > > I think you could argue that if the driver binds, then the payload size is
> > > > accepted, in which case it would be strange to get an EINVAL when using
> > > > the ioctl with anything >1MB.
> > > >
> > > > Concern was that if we mask off the reserved bits from the command
> > > > register that we could be masking part of the payload size that is being
> > > > passed if the accepted max is >1MB. Idea was to avoid any possibility of
> > > > this inconsistency. If this is being checked for ioctl, seems like it's
> > > > checking reserved bits.
> > > >
> > > > But maybe I should just wait for the spec update.
> > >
> > > Well, I wouldn't hold your breath (it would be an errata in this case anyway).
> > > My preference would be to just allow allow mailbox payload size to be 2^31 and
> > > not deal with this.
> > >
> > > My question was how strongly do you feel it's an error that should prevent
> > > binding.
> > >
> >
> > I don't have an objection to binding, but doesn't this require that the
> > check in cxl_validate_cmd_from_user() guarantees send_cmd->size_in cannot
> > be greater than 1MB?
>
> You're correct. I'd need to add:
> cxlm->mbox.payload_size =
> min_t(size_t, 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE), 1<<20)
nit, use the existing SZ_1M define.
On Mon, 1 Feb 2021, Dan Williams wrote:
> > > I don't have an objection to binding, but doesn't this require that the
> > > check in cxl_validate_cmd_from_user() guarantees send_cmd->size_in cannot
> > > be greater than 1MB?
> >
> > You're correct. I'd need to add:
> > cxlm->mbox.payload_size =
> > min_t(size_t, 1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE), 1<<20)
>
> nit, use the existing SZ_1M define.
>
Sounds good, thanks both! I'll assume you'll follow-up on this in the
next revision for patch 7 ("cxl/mem: Add send command") and we can
consider this resolved :)
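For reference, the agreed-on adjustment would look something like this in
cxl_mem_setup_mailbox() (a sketch of the planned follow-up, not the final patch;
SZ_1M is from <linux/sizes.h>):

	static int cxl_mem_setup_mailbox(struct cxl_mem *cxlm)
	{
		const int cap = cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);

		/* Clamp to the 1M architectural maximum (8.2.8.4.3) while
		 * still binding to devices that advertise more. */
		cxlm->mbox.payload_size =
			min_t(size_t,
			      1 << CXL_GET_FIELD(cap, CXLDEV_MB_CAP_PAYLOAD_SIZE),
			      SZ_1M);

		/* 8.2.8.4.3: 256 bytes is the minimum required payload size */
		if (cxlm->mbox.payload_size < 256) {
			dev_err(&cxlm->pdev->dev, "Mailbox is too small (%zub)",
				cxlm->mbox.payload_size);
			return -ENXIO;
		}

		return 0;
	}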
On Mon, Feb 01, 2021 at 11:01:11AM -0800, Dan Williams wrote:
> On Mon, Feb 1, 2021 at 10:35 AM Ben Widawsky <[email protected]> wrote:
> >
> > On 21-02-01 13:18:45, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Jan 29, 2021 at 04:24:32PM -0800, Ben Widawsky wrote:
> > > > For drivers that moderate access to the underlying hardware it is
> > > > sometimes desirable to allow userspace to bypass restrictions. Once
> > > > userspace has done this, the driver can no longer guarantee the sanctity
> > > > of either the OS or the hardware. When in this state, it is helpful for
> > > > kernel developers to be made aware (via this taint flag) of this fact
> > > > for subsequent bug reports.
> > > >
> > > > Example usage:
> > > > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > > > - The xyzzy driver provides an interface for using waldo, but not fred.
> > > > - quux is convinced they really need the fred command.
> > > > - xyzzy driver allows quux to frob hardware to initiate fred.
> > >
> > > Would it not be easier to _not_ frob the hardware for fred-operation?
> > > Aka not implement it or just disallow in the first place?
> >
> > Yeah. So the idea is you either are in a transient phase of the command and some
> > future kernel will have real support for fred - or a vendor is being short
> > sighted and not adding support for fred.
> >
> > >
> > >
> > > > - kernel gets tainted.
> > > > - turns out fred command is borked, and scribbles over memory.
> > > > - developers laugh while closing quux's subsequent bug report.
> > >
> > > Yeah good luck with that theory in-the-field. The customer won't
> > > care about this and will demand a solution for doing fred-operation.
> > >
> > > Just easier to not do fred-operation in the first place,no?
> >
> > The short answer is, in an ideal world you are correct. See nvdimm as an example
> > of the real world.
> >
> > The longer answer. Unless we want to wait until we have all the hardware we're
> > ever going to see, it's impossible to have a fully baked, and validated
> > interface. The RAW interface is my admission that I make no guarantees about
> > being able to provide the perfect interface and giving the power back to the
> > hardware vendors and their driver writers.
> >
> > As an example, suppose a vendor shipped a device with their special vendor
> > opcode. They can enable their customers to use that opcode on any driver
> > version. That seems pretty powerful and worthwhile to me.
> >
>
> Powerful, frightening, and questionably worthwhile when there are
> already examples of commands that need extra coordination for whatever
> reason. However, I still think the decision tilts towards allowing
> this given ongoing spec work.
>
> NVDIMM ended up allowing unfettered vendor passthrough given the lack
> of an organizing body to unify vendors. CXL on the other hand appears
> to have more gravity to keep vendors honest. A WARN splat with a
> taint, and a debugfs knob for the truly problematic commands seems
> sufficient protection of system integrity while still following the
> Linux ethos of giving system owners enough rope to make their own
> decisions.
>
> > Or a more realistic example, we ship a driver that adds a command which is
> > totally broken. Customers can utilize the RAW interface until it gets fixed in a
> > subsequent release which might be quite a ways out.
> >
> > I'll say the RAW interface isn't an encouraged usage, but it's one that I expect
> > to be needed, and if it's not we can always try to kill it later. If nobody is
> > actually using it, nobody will complain, right :D
>
> It might be worthwhile to make RAW support a compile time decision so
> that Linux distros can only ship support for the commands the CXL
> driver-dev community has blessed, but I'll leave it to a distro
> developer to second that approach.
Couple of thoughts here:
- As distro developer (well, actually middle manager of distro
developers) this approach of raw interface is a headache.
Customers will pick it and use it since it is there and the poor
support folks will have to go through layers of different devices to
say (for example) to finally find out that some OEM firmware opcode
X is a debug facility for inserting corrupted data, while for another vendor
the same X opcode makes it go super-fast.
Not that anybody would do that, right? Ha!
- I would imagine that some of the more vocal folks in the community
will make it difficult to integrate these patches with these two
(especially this taint one). This will make the acceptance of these
patches more difficult than it should be. If you really want them,
perhaps make them part of another patchset, or follow-up ones.
- I still don't get why, as a brand new community, hacks are coming up
(even when the hardware is not yet there) instead of pushing back at
the vendors to have a clean interface. I could see these things in, say,
two or three years... but from the start? I get your point about
flexibility, but it seems to me that the right way is not to give an open
RAW interface (big barn door) but rather to maintain the driver and grow
it (properly constructed doors) as more functionality comes about, and
then add it in the driver.
On Mon, Feb 1, 2021 at 6:50 PM Konrad Rzeszutek Wilk
<[email protected]> wrote:
>
> On Mon, Feb 01, 2021 at 11:01:11AM -0800, Dan Williams wrote:
> > On Mon, Feb 1, 2021 at 10:35 AM Ben Widawsky <[email protected]> wrote:
> > >
> > > On 21-02-01 13:18:45, Konrad Rzeszutek Wilk wrote:
> > > > On Fri, Jan 29, 2021 at 04:24:32PM -0800, Ben Widawsky wrote:
> > > > > For drivers that moderate access to the underlying hardware it is
> > > > > sometimes desirable to allow userspace to bypass restrictions. Once
> > > > > userspace has done this, the driver can no longer guarantee the sanctity
> > > > > of either the OS or the hardware. When in this state, it is helpful for
> > > > > kernel developers to be made aware (via this taint flag) of this fact
> > > > > for subsequent bug reports.
> > > > >
> > > > > Example usage:
> > > > > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > > > > - The xyzzy driver provides an interface for using waldo, but not fred.
> > > > > - quux is convinced they really need the fred command.
> > > > > - xyzzy driver allows quux to frob hardware to initiate fred.
> > > >
> > > > Would it not be easier to _not_ frob the hardware for fred-operation?
> > > > Aka not implement it or just disallow in the first place?
> > >
> > > Yeah. So the idea is you either are in a transient phase of the command and some
> > > future kernel will have real support for fred - or a vendor is being short
> > > sighted and not adding support for fred.
> > >
> > > >
> > > >
> > > > > - kernel gets tainted.
> > > > > - turns out fred command is borked, and scribbles over memory.
> > > > > - developers laugh while closing quux's subsequent bug report.
> > > >
> > > > Yeah good luck with that theory in-the-field. The customer won't
> > > > care about this and will demand a solution for doing fred-operation.
> > > >
> > > > Just easier to not do fred-operation in the first place,no?
> > >
> > > The short answer is, in an ideal world you are correct. See nvdimm as an example
> > > of the real world.
> > >
> > > The longer answer. Unless we want to wait until we have all the hardware we're
> > > ever going to see, it's impossible to have a fully baked, and validated
> > > interface. The RAW interface is my admission that I make no guarantees about
> > > being able to provide the perfect interface and giving the power back to the
> > > hardware vendors and their driver writers.
> > >
> > > As an example, suppose a vendor shipped a device with their special vendor
> > > opcode. They can enable their customers to use that opcode on any driver
> > > version. That seems pretty powerful and worthwhile to me.
> > >
> >
> > Powerful, frightening, and questionably worthwhile when there are
> > already examples of commands that need extra coordination for whatever
> > reason. However, I still think the decision tilts towards allowing
> > this given ongoing spec work.
> >
> > NVDIMM ended up allowing unfettered vendor passthrough given the lack
> > of an organizing body to unify vendors. CXL on the other hand appears
> > to have more gravity to keep vendors honest. A WARN splat with a
> > taint, and a debugfs knob for the truly problematic commands seems
> > sufficient protection of system integrity while still following the
> > Linux ethos of giving system owners enough rope to make their own
> > decisions.
> >
> > > Or a more realistic example, we ship a driver that adds a command which is
> > > totally broken. Customers can utilize the RAW interface until it gets fixed in a
> > > subsequent release which might be quite a ways out.
> > >
> > > I'll say the RAW interface isn't an encouraged usage, but it's one that I expect
> > > to be needed, and if it's not we can always try to kill it later. If nobody is
> > > actually using it, nobody will complain, right :D
> >
> > It might be worthwhile to make RAW support a compile time decision so
> > that Linux distros can only ship support for the commands the CXL
> > driver-dev community has blessed, but I'll leave it to a distro
> > developer to second that approach.
>
> Couple of thoughts here:
I am compelled to challenge these assertions because this set is
*more* conservative than the current libnvdimm situation which is
silent by default about the vendor passthrough. How can this be worse
when the same scope of possible damage is now loudly reported rather
than silent?
>
> - As distro developer (well, actually middle manager of distro
> developers) this approach of raw interface is a headache.
You've convinced me that this needs a compile-time disable; is that
not sufficient?
>
> Customers will pick it
Why will they pick it when the kernel screams bloody murder at them
when it gets used?
> and use it since it is there and the poor
> support folks will have to go through layers of different devices
What layers will support folks need to dig through when the WARN
splat is clearly present in the log and the taint flag is visible in
any future crash?
> to finally find out that (for example) some OEM firmware opcode
> X is a debug facility for inserting corrupted data, while for another vendor
> the same opcode X makes it go super-fast.
None of these commands are in the fast path.
>
> Not that anybody would do that, right? Ha!
We can look to libnvdimm + ndctl to see the trend. I have not
encountered new competing manageability tool efforts, and the existing
collisions between ndctl and ipmctl are resolving to defeature ipmctl
where ndctl and the standard / native command path can do the job.
>
> - I will imagine that some of the more vocal folks in the community
> will make it difficult to integrate these patches with these two
> (especially this taint one). This will make the acceptance of these
> patches more difficult than it should be. If you really want them,
> perhaps make them part of another patchset, or a follow up ones.
The patches are out now and no such pushback from others has arisen. I
believe your proposal for a compile-time disable is reasonable and would
satisfy those concerns.
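
To sketch what I mean (the CONFIG_CXL_MEM_RAW_COMMANDS symbol, the command id,
and the send_cmd local are only placeholders for whatever a follow-up patch
actually names them), the ioctl validation path would simply refuse the command
when the option is off:

	/* Sketch: with the option compiled out, RAW is not in the command
	 * set at all, so a distro kernel never exposes the bypass. */
	if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS) &&
	    send_cmd->id == CXL_MEM_COMMAND_ID_RAW)
		return -ENOTTY;
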
>
> - I still don't get why, as a brand new community, hacks are coming up
> (even when the hardware is not yet there) instead of pushing back at
> the vendors to have a cleaned-up interface. I could see these things in
> say two or three years .. but from the start? I get your point about
> flexibility, but it seems to me that the right way is not to give an open
> RAW interface (big barn door) but rather to maintain the driver and grow
> it (properly constructed doors) as more functionality comes about,
> adding it in the driver.
>
Again, WARN_TAINT, and now the threat of vendor tools not being
generally distributable across distro kernels that turn this off, is a
stricter stance than libnvdimm, where the worst fears have yet to
come to fruition. In the meantime this enabling is useful for a
validation test bench kernel for new hardware bring-up while the
upstream API formalization work continues.
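
For reference, the loud reporting is just the stock taint machinery at the
point of bypass, roughly (the message text is illustrative):

	/* Warn once and taint so the bypass shows up in any later oops. */
	WARN_TAINT_ONCE(true, TAINT_RAW_PASSTHROUGH,
			"raw command path used by userspace\n");
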
On Fri, Jan 29, 2021 at 04:24:27PM -0800, Ben Widawsky wrote:
> #ifndef __CXL_H__
> #define __CXL_H__
>
> +#include <linux/bitfield.h>
> +#include <linux/bitops.h>
> +#include <linux/io.h>
> +
> +#define CXL_SET_FIELD(value, field) \
> + ({ \
> + WARN_ON(!FIELD_FIT(field##_MASK, value)); \
> + FIELD_PREP(field##_MASK, value); \
> + })
> +
> +#define CXL_GET_FIELD(word, field) FIELD_GET(field##_MASK, word)
This looks like some massive obfuscation. What is the intent
here?
> + /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
> + struct {
> + void __iomem *regs;
> + } status;
> +
> + /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
> + struct {
> + void __iomem *regs;
> + size_t payload_size;
> + } mbox;
> +
> + /* Cap 4000h - CXL_CAP_CAP_ID_MEMDEV */
> + struct {
> + void __iomem *regs;
> + } mem;
This style looks massively obfuscated. For one the comments look like
absolute gibberish, but also what is the point of all these anonymous
structures?
> +#define cxl_reg(type) \
> + static inline void cxl_write_##type##_reg32(struct cxl_mem *cxlm, \
> + u32 reg, u32 value) \
> + { \
> + void __iomem *reg_addr = cxlm->type.regs; \
> + writel(value, reg_addr + reg); \
> + } \
> + static inline void cxl_write_##type##_reg64(struct cxl_mem *cxlm, \
> + u32 reg, u64 value) \
> + { \
> + void __iomem *reg_addr = cxlm->type.regs; \
> + writeq(value, reg_addr + reg); \
> + } \
What is the value add of all this obfuscation over the trivial
calls to the write*/read* functions, possible with a locally
declared "void __iomem *" variable in the callers like in all
normal drivers? Except for making the life of the poor soul trying
to debug this code some time in the future really hard, of course.
> + /* 8.2.8.4.3 */
????
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 25e08e5f40bd..33432a4cbe23 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -3179,6 +3179,20 @@ struct device *get_device(struct device *dev)
> }
> EXPORT_SYMBOL_GPL(get_device);
>
> +/**
> + * get_live_device() - increment reference count for device iff !dead
> + * @dev: device.
> + *
> + * Forward the call to get_device() if the device is still alive. If
> + * this is called with the device_lock() held then the device is
> + * guaranteed to not die until the device_lock() is dropped.
> + */
> +struct device *get_live_device(struct device *dev)
> +{
> + return dev && !dev->p->dead ? get_device(dev) : NULL;
> +}
> +EXPORT_SYMBOL_GPL(get_live_device);
Err, if you want to add new core functionality that needs to be in a
separate well documented prep patch, and also CCed to the relevant
maintainers.
> mutex_unlock(&cxlm->mbox.mutex);
> }
>
> +static int cxl_memdev_open(struct inode *inode, struct file *file)
> +{
> + struct cxl_memdev *cxlmd =
> + container_of(inode->i_cdev, typeof(*cxlmd), cdev);
> +
> + file->private_data = cxlmd;
There is no good reason to ever mirror stuff from the inode into
file->private_data, as you can just trivially get at the original
location using file_inode(file).
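
A rough sketch of that shape for the ioctl handler (the function name here is
just illustrative):

	static long cxl_memdev_ioctl(struct file *file, unsigned int cmd,
				     unsigned long arg)
	{
		struct cxl_memdev *cxlmd =
			container_of(file_inode(file)->i_cdev, typeof(*cxlmd), cdev);

		/* ... dispatch on cmd using cxlmd, as the private_data path did ... */
		return -ENOTTY;
	}
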
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
This has no business in a kernel header.
On 21-02-02 18:10:16, Christoph Hellwig wrote:
> On Fri, Jan 29, 2021 at 04:24:27PM -0800, Ben Widawsky wrote:
> > #ifndef __CXL_H__
> > #define __CXL_H__
> >
> > +#include <linux/bitfield.h>
> > +#include <linux/bitops.h>
> > +#include <linux/io.h>
> > +
> > +#define CXL_SET_FIELD(value, field) \
> > + ({ \
> > + WARN_ON(!FIELD_FIT(field##_MASK, value)); \
> > + FIELD_PREP(field##_MASK, value); \
> > + })
> > +
> > +#define CXL_GET_FIELD(word, field) FIELD_GET(field##_MASK, word)
>
> This looks like some massive obfuscation. What is the intent
> here?
>
I will drop these. I don't recall why I did this to be honest.
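
Concretely, dropping them means the call sites use the <linux/bitfield.h>
helpers directly against the existing *_MASK defines; a rough sketch of the
mailbox path (assuming read helpers that mirror the quoted write ones):

	u64 cmd_reg, status_reg;

	cmd_reg = FIELD_PREP(CXLDEV_MB_CMD_COMMAND_OPCODE_MASK, mbox_cmd->opcode);
	cmd_reg |= FIELD_PREP(CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK, mbox_cmd->size_in);
	cxl_write_mbox_reg64(cxlm, CXLDEV_MB_CMD_OFFSET, cmd_reg);

	/* ... doorbell ring and wait elided ... */

	status_reg = cxl_read_mbox_reg64(cxlm, CXLDEV_MB_STATUS_OFFSET);
	mbox_cmd->return_code =
		FIELD_GET(CXLDEV_MB_STATUS_RET_CODE_MASK, status_reg);
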
> > + /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
> > + struct {
> > + void __iomem *regs;
> > + } status;
> > +
> > + /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
> > + struct {
> > + void __iomem *regs;
> > + size_t payload_size;
> > + } mbox;
> > +
> > + /* Cap 4000h - CXL_CAP_CAP_ID_MEMDEV */
> > + struct {
> > + void __iomem *regs;
> > + } mem;
>
> This style looks massively obfuscated. For one the comments look like
> absolute gibberish, but also what is the point of all these anonymous
> structures?
They're not anonymous, and their names are used by the register functions below. The
comments connect the spec reference 'Cap XXXXh' to the definitions in cxl.h. I can
articulate that better if it helps.
>
> > +#define cxl_reg(type) \
> > + static inline void cxl_write_##type##_reg32(struct cxl_mem *cxlm, \
> > + u32 reg, u32 value) \
> > + { \
> > + void __iomem *reg_addr = cxlm->type.regs; \
> > + writel(value, reg_addr + reg); \
> > + } \
> > + static inline void cxl_write_##type##_reg64(struct cxl_mem *cxlm, \
> > + u32 reg, u64 value) \
> > + { \
> > + void __iomem *reg_addr = cxlm->type.regs; \
> > + writeq(value, reg_addr + reg); \
> > + } \
>
> What is the value add of all this obfuscation over the trivial
> calls to the write*/read* functions, possible with a locally
> declared "void __iomem *" variable in the callers like in all
> normal drivers? Except for making the life of the poor soul trying
> to debug this code some time in the future really hard, of course.
>
The register space for CXL devices is a bit weird since it's all subdivided
under 1 BAR for now. To clearly distinguish between the different subregions, these
helpers exist. It's really easy to mess this up as the developer, and I'd actually
disagree: I think it makes debugging quite a bit easier. It also gets more
convoluted when you add the other 2 BARs, which also each have their own
subregions.
For example, if my mailbox function does:
cxl_read_status_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
instead of:
cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
It's easier to spot than:
readl(cxlm->regs + cxlm->status_offset + CXLDEV_MB_CAPS_OFFSET)
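
(For reference, the read side that cxl_reg() presumably generates mirrors the
quoted write helpers; for the mailbox it boils down to roughly:)

	static inline u32 cxl_read_mbox_reg32(struct cxl_mem *cxlm, u32 reg)
	{
		void __iomem *reg_addr = cxlm->mbox.regs;

		return readl(reg_addr + reg);
	}
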
> > + /* 8.2.8.4.3 */
>
> ????
>
I had been trying to be consistent with 'CXL2.0 - ' in front of all spec
references. I obviously missed this one.
On 21-02-02 18:15:05, Christoph Hellwig wrote:
> > +#if defined(__cplusplus)
> > +extern "C" {
> > +#endif
>
> This has no business in a kernel header.
This was copypasta from the DRM headers (which, as you're probably aware, weren't
always part of the kernel)... It's my mistake and I will get rid of it.
On 21-01-30 15:51:57, David Rientjes wrote:
> On Fri, 29 Jan 2021, Ben Widawsky wrote:
>
[snip]
> > +/**
> > + * cxl_mem_mbox_send_cmd() - Send a mailbox command to a memory device.
> > + * @cxlm: The CXL memory device to communicate with.
> > + * @mbox_cmd: Command to send to the memory device.
> > + *
> > + * Context: Any context. Expects mbox_lock to be held.
> > + * Return: -ETIMEDOUT if timeout occurred waiting for completion. 0 on success.
> > + * Caller should check the return code in @mbox_cmd to make sure it
> > + * succeeded.
> > + *
> > + * This is a generic form of the CXL mailbox send command, thus the only I/O
> > + * operations used are cxl_read_mbox_reg(). Memory devices, and perhaps other
> > + * types of CXL devices may have further information available upon error
> > + * conditions.
> > + *
> > + * The CXL spec allows for up to two mailboxes. The intention is for the primary
> > + * mailbox to be OS controlled and the secondary mailbox to be used by system
> > + * firmware. This allows the OS and firmware to communicate with the device and
> > + * not need to coordinate with each other. The driver only uses the primary
> > + * mailbox.
> > + */
> > +static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
> > + struct mbox_cmd *mbox_cmd)
> > +{
> > + void __iomem *payload = cxlm->mbox.regs + CXLDEV_MB_PAYLOAD_OFFSET;
>
> Do you need to verify the payload is non-empty per 8.2.8.4?
>
What do you mean exactly? Emptiness or lack thereof is what determines the size
parameter of the mailbox interface (if we want to input data, we need to write the
size; if we get output data, we have to read size bytes out of the payload
registers).
Perhaps I'm missing the point though, could you elaborate?
[snip]
On 21-02-01 13:15:35, Konrad Rzeszutek Wilk wrote:
> > +/**
> > + * struct cxl_send_command - Send a command to a memory device.
> > + * @id: The command to send to the memory device. This must be one of the
> > + * commands returned by the query command.
> > + * @flags: Flags for the command (input).
> > + * @rsvd: Must be zero.
> > + * @retval: Return value from the memory device (output).
> > + * @size_in: Size of the payload to provide to the device (input).
> > + * @size_out: Size of the payload received from the device (input/output). This
> > + * field is filled in by userspace to let the driver know how much
> > + * space was allocated for output. It is populated by the driver to
> > + * let userspace know how large the output payload actually was.
> > + * @in_payload: Pointer to memory for payload input (little endian order).
> > + * @out_payload: Pointer to memory for payload output (little endian order).
> > + *
> > + * Mechanism for userspace to send a command to the hardware for processing. The
> > + * driver will do basic validation on the command sizes. In some cases even the
> > + * payload may be introspected. Userspace is required to allocate large
> > + * enough buffers for size_out which can be variable length in certain
> > + * situations.
> > + */
> I think this structure (and it would help if you ran `pahole` on it) has
> some gaps in it:
>
> > +struct cxl_send_command {
> > + __u32 id;
> > + __u32 flags;
> > + __u32 rsvd;
> > + __u32 retval;
> > +
> > + struct {
> > + __s32 size_in;
>
> Here..Maybe just add:
>
> __u32 rsv_2;
> > + __u64 in_payload;
> > + };
> > +
> > + struct {
> > + __s32 size_out;
>
> And here. Maybe just add:
> __u32 rsv_2;
> > + __u64 out_payload;
> > + };
> > +};
>
> Perhaps to prepare for the future where this may need to be expanded, you
> could add a size at the start of the structure, and
> maybe what version of structure it is?
>
> Maybe for all the new structs you are adding?
Thanks for catching the holes. It broke somewhere in the earlier RFC changes.
I don't think we need a size or version. Reserved fields are good enough near
term future proofing, and if we get to a point where the command is woefully
inadequate, I think it'd be time to just make cxl_send_command2.
Generally, I think cxl_send_command is fairly future proof because it's so
simple. As you get more complex, you might need better mechanisms, like deferred
command completion for example. It's unclear to me whether we'll get to that
point though, and if we do, I think a new command is warranted.
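
For completeness, closing the holes the way Konrad suggests would make the
layout explicit, something like (the reserved field names are illustrative):

	struct cxl_send_command {
		__u32 id;
		__u32 flags;
		__u32 rsvd;
		__u32 retval;

		struct {
			__s32 size_in;
			__u32 rsvd2;	/* keeps in_payload naturally 8-byte aligned */
			__u64 in_payload;
		};

		struct {
			__s32 size_out;
			__u32 rsvd3;	/* same for out_payload */
			__u64 out_payload;
		};
	};
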
On 21-02-01 13:28:48, Konrad Rzeszutek Wilk wrote:
> On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> > The Get Log command returns the actual log entries that are advertised
> > via the Get Supported Logs command (0400h). CXL device logs are selected
> > by UUID which is part of the CXL spec. Because the driver tries to
> > sanitize what is sent to hardware, there becomes a need to restrict the
> > types of logs which can be accessed by userspace. For example, the
> > vendor specific log might only be consumable by proprietary, or offline
> > applications, and therefore a good candidate for userspace.
> >
> > The current driver infrastructure does allow basic validation for all
> > commands, but doesn't inspect any of the payload data. Along with Get
> > Log support comes new infrastructure to add a hook for payload
> > validation. This infrastructure is used to filter out the CEL UUID,
> > which the userspace driver doesn't have business knowing, and taints on
> > invalid UUIDs being sent to hardware.
>
> Perhaps a better option is to reject invalid UUIDs?
>
> And if you really really want to use invalid UUIDs then:
>
> 1) Make that code wrapped in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?
>
> 2) Wrap it with lockdown code so that you can't do this at all
> when in LOCKDOWN_INTEGRITY or such?
>
The commit message needs an update btw as CEL is allowed in the latest rev of the
patches.
We could potentially combine this with the now added (in a branch) CONFIG_RAW
config option. Indeed I think that makes sense. Dan, thoughts?
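
Concretely, I'm thinking the tail of validate_log_uuid() would end up looking
something like this (the config symbol name is whatever that branch settles on):

	/* Known logs are always fine. */
	if (uuid_equal(&payload_uuid, &log_uuid[CEL_UUID]))
		return 0;
	if (uuid_equal(&payload_uuid, &log_uuid[DEBUG_UUID]))
		return 0;

	/* Unknown UUIDs: reject outright unless the RAW escape hatch is
	 * compiled in, and taint if it is used. */
	if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS))
		return -EPERM;

	add_taint(TAINT_RAW_PASSTHROUGH, LOCKDEP_STILL_OK);
	return 0;
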
> >
> > Signed-off-by: Ben Widawsky <[email protected]>
> > ---
> > drivers/cxl/mem.c | 42 +++++++++++++++++++++++++++++++++++-
> > include/uapi/linux/cxl_mem.h | 1 +
> > 2 files changed, 42 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index b8ca6dff37b5..086268f1dd6c 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -119,6 +119,8 @@ static const uuid_t log_uuid[] = {
> > 0x07, 0x19, 0x40, 0x3d, 0x86)
> > };
> >
> > +static int validate_log_uuid(void __user *payload, size_t size);
> > +
> > /**
> > * struct cxl_mem_command - Driver representation of a memory device command
> > * @info: Command information as it exists for the UAPI
> > @@ -132,6 +134,10 @@ static const uuid_t log_uuid[] = {
> > * * %CXL_CMD_INTERNAL_FLAG_PSEUDO: This is a pseudo command which doesn't have
> > * a direct mapping to hardware. They are implicitly always enabled.
> > *
> > + * @validate_payload: A function called after the command is validated but
> > + * before it's sent to the hardware. The primary purpose is to validate, or
> > + * fixup the actual payload.
> > + *
> > * The cxl_mem_command is the driver's internal representation of commands that
> > * are supported by the driver. Some of these commands may not be supported by
> > * the hardware. The driver will use @info to validate the fields passed in by
> > @@ -147,9 +153,11 @@ struct cxl_mem_command {
> > #define CXL_CMD_INTERNAL_FLAG_HIDDEN BIT(0)
> > #define CXL_CMD_INTERNAL_FLAG_MANDATORY BIT(1)
> > #define CXL_CMD_INTERNAL_FLAG_PSEUDO BIT(2)
> > +
> > + int (*validate_payload)(void __user *payload, size_t size);
> > };
> >
> > -#define CXL_CMD(_id, _flags, sin, sout, f) \
> > +#define CXL_CMD_VALIDATE(_id, _flags, sin, sout, f, v) \
> > [CXL_MEM_COMMAND_ID_##_id] = { \
> > .info = { \
> > .id = CXL_MEM_COMMAND_ID_##_id, \
> > @@ -159,8 +167,12 @@ struct cxl_mem_command {
> > }, \
> > .flags = CXL_CMD_INTERNAL_FLAG_##f, \
> > .opcode = CXL_MBOX_OP_##_id, \
> > + .validate_payload = v, \
> > }
> >
> > +#define CXL_CMD(_id, _flags, sin, sout, f) \
> > + CXL_CMD_VALIDATE(_id, _flags, sin, sout, f, NULL)
> > +
> > /*
> > * This table defines the supported mailbox commands for the driver. This table
> > * is made up of a UAPI structure. Non-negative values as parameters in the
> > @@ -176,6 +188,8 @@ static struct cxl_mem_command mem_commands[] = {
> > CXL_CMD(GET_PARTITION_INFO, NONE, 0, 0x20, NONE),
> > CXL_CMD(GET_LSA, NONE, 0x8, ~0, MANDATORY),
> > CXL_CMD(GET_HEALTH_INFO, NONE, 0, 0x12, MANDATORY),
> > + CXL_CMD_VALIDATE(GET_LOG, MUTEX, 0x18, ~0, MANDATORY,
> > + validate_log_uuid),
> > };
> >
> > /*
> > @@ -563,6 +577,13 @@ static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
> > kvzalloc(cxlm->mbox.payload_size, GFP_KERNEL);
> >
> > if (cmd->info.size_in) {
> > + if (cmd->validate_payload) {
> > + rc = cmd->validate_payload(u64_to_user_ptr(in_payload),
> > + cmd->info.size_in);
> > + if (rc)
> > + goto out;
> > + }
> > +
> > mbox_cmd.payload_in = kvzalloc(cmd->info.size_in, GFP_KERNEL);
> > if (!mbox_cmd.payload_in) {
> > rc = -ENOMEM;
> > @@ -1205,6 +1226,25 @@ struct cxl_mbox_get_log {
> > __le32 length;
> > } __packed;
> >
> > +static int validate_log_uuid(void __user *input, size_t size)
> > +{
> > + struct cxl_mbox_get_log __user *get_log = input;
> > + uuid_t payload_uuid;
> > +
> > + if (copy_from_user(&payload_uuid, &get_log->uuid, sizeof(uuid_t)))
> > + return -EFAULT;
> > +
> > + /* All unspec'd logs shall taint */
> > + if (uuid_equal(&payload_uuid, &log_uuid[CEL_UUID]))
> > + return 0;
> > + if (uuid_equal(&payload_uuid, &log_uuid[DEBUG_UUID]))
> > + return 0;
> > +
> > + add_taint(TAINT_RAW_PASSTHROUGH, LOCKDEP_STILL_OK);
> > +
> > + return 0;
> > +}
> > +
> > static int cxl_xfer_log(struct cxl_mem *cxlm, uuid_t *uuid, u32 size, u8 *out)
> > {
> > u32 remaining = size;
> > diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
> > index 766c231d6150..7cdc7f7ce7ec 100644
> > --- a/include/uapi/linux/cxl_mem.h
> > +++ b/include/uapi/linux/cxl_mem.h
> > @@ -39,6 +39,7 @@ extern "C" {
> > ___C(GET_PARTITION_INFO, "Get Partition Information"), \
> > ___C(GET_LSA, "Get Label Storage Area"), \
> > ___C(GET_HEALTH_INFO, "Get Health Info"), \
> > + ___C(GET_LOG, "Get Log"), \
> > ___C(MAX, "Last command")
> >
> > #define ___C(a, b) CXL_MEM_COMMAND_ID_##a
> > --
> > 2.30.0
> >
On Tue, Feb 2, 2021 at 2:57 PM Ben Widawsky <[email protected]> wrote:
>
> On 21-02-01 12:00:18, Dan Williams wrote:
> > On Sat, Jan 30, 2021 at 3:52 PM David Rientjes <[email protected]> wrote:
> > >
> > > On Fri, 29 Jan 2021, Ben Widawsky wrote:
> > >
> > > > Provide enough functionality to utilize the mailbox of a memory device.
> > > > The mailbox is used to interact with the firmware running on the memory
> > > > device.
> > > >
> > > > The CXL specification defines separate capabilities for the mailbox and
> > > > the memory device. The mailbox interface has a doorbell to indicate
> > > > ready to accept commands and the memory device has a capability register
> > > > that indicates the mailbox interface is ready. The expectation is that
> > > > the doorbell-ready is always later than the memory-device-indication
> > > > that the mailbox is ready.
> > > >
> > > > Create a function to handle sending a command, optionally with a
> > > > payload, to the memory device, polling on a result, and then optionally
> > > > copying out the payload. The algorithm for doing this comes straight out
> > > > of the CXL 2.0 specification.
> > > >
> > > > Primary mailboxes are capable of generating an interrupt when submitting
> > > > a command in the background. That implementation is saved for a later
> > > > time.
> > > >
> > > > Secondary mailboxes aren't implemented at this time.
> > > >
> > > > The flow is proven with one implemented command, "identify". Because the
> > > > class code has already told the driver this is a memory device and the
> > > > identify command is mandatory.
> > > >
> > > > Signed-off-by: Ben Widawsky <[email protected]>
> > > > ---
> > > > drivers/cxl/Kconfig | 14 ++
> > > > drivers/cxl/cxl.h | 39 +++++
> > > > drivers/cxl/mem.c | 342 +++++++++++++++++++++++++++++++++++++++++++-
> > > > 3 files changed, 394 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > > index 3b66b46af8a0..fe591f74af96 100644
> > > > --- a/drivers/cxl/Kconfig
> > > > +++ b/drivers/cxl/Kconfig
> > > > @@ -32,4 +32,18 @@ config CXL_MEM
> > > > Chapter 2.3 Type 3 CXL Device in the CXL 2.0 specification.
> > > >
> > > > If unsure say 'm'.
> > > > +
> > > > +config CXL_MEM_INSECURE_DEBUG
> > > > + bool "CXL.mem debugging"
> > > > + depends on CXL_MEM
> > > > + help
> > > > + Enable debug of all CXL command payloads.
> > > > +
> > > > + Some CXL devices and controllers support encryption and other
> > > > + security features. The payloads for the commands that enable
> > > > + those features may contain sensitive clear-text security
> > > > + material. Disable debug of those command payloads by default.
> > > > + If you are a kernel developer actively working on CXL
> > > > + security enabling say Y, otherwise say N.
> > >
> > > Not specific to this patch, but the reference to encryption made me
> > > curious about integrity: are all CXL.mem devices compatible with DIMP?
> > > Some? None?
> >
> > The encryption here is "device passphrase" similar to the NVDIMM
> > Security Management described here:
> >
> > https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf
> >
> > The LIBNVDIMM enabling wrapped this support with the Linux keys
> > interface which among other things enforces wrapping the clear text
> > passphrase with a Linux "trusted/encrypted" key.
> >
> > Additionally, the CXL.io interface optionally supports PCI IDE:
> >
> > https://www.intel.com/content/dam/www/public/us/en/documents/reference-guides/pcie-device-security-enhancements.pdf
> >
> > I'm otherwise not familiar with the DIMP acronym?
> >
> > > > +
> > > > endif
> > > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > > > index a3da7f8050c4..df3d97154b63 100644
> > > > --- a/drivers/cxl/cxl.h
> > > > +++ b/drivers/cxl/cxl.h
> > > > @@ -31,9 +31,36 @@
> > > > #define CXLDEV_MB_CAPS_OFFSET 0x00
> > > > #define CXLDEV_MB_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
> > > > #define CXLDEV_MB_CTRL_OFFSET 0x04
> > > > +#define CXLDEV_MB_CTRL_DOORBELL BIT(0)
> > > > #define CXLDEV_MB_CMD_OFFSET 0x08
> > > > +#define CXLDEV_MB_CMD_COMMAND_OPCODE_MASK GENMASK(15, 0)
> > > > +#define CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK GENMASK(36, 16)
> > > > #define CXLDEV_MB_STATUS_OFFSET 0x10
> > > > +#define CXLDEV_MB_STATUS_RET_CODE_MASK GENMASK(47, 32)
> > > > #define CXLDEV_MB_BG_CMD_STATUS_OFFSET 0x18
> > > > +#define CXLDEV_MB_PAYLOAD_OFFSET 0x20
> > > > +
> > > > +/* Memory Device (CXL 2.0 - 8.2.8.5.1.1) */
> > > > +#define CXLMDEV_STATUS_OFFSET 0x0
> > > > +#define CXLMDEV_DEV_FATAL BIT(0)
> > > > +#define CXLMDEV_FW_HALT BIT(1)
> > > > +#define CXLMDEV_STATUS_MEDIA_STATUS_MASK GENMASK(3, 2)
> > > > +#define CXLMDEV_MS_NOT_READY 0
> > > > +#define CXLMDEV_MS_READY 1
> > > > +#define CXLMDEV_MS_ERROR 2
> > > > +#define CXLMDEV_MS_DISABLED 3
> > > > +#define CXLMDEV_READY(status) \
> > > > + (CXL_GET_FIELD(status, CXLMDEV_STATUS_MEDIA_STATUS) == CXLMDEV_MS_READY)
> > > > +#define CXLMDEV_MBOX_IF_READY BIT(4)
> > > > +#define CXLMDEV_RESET_NEEDED_SHIFT 5
> > > > +#define CXLMDEV_RESET_NEEDED_MASK GENMASK(7, 5)
> > > > +#define CXLMDEV_RESET_NEEDED_NOT 0
> > > > +#define CXLMDEV_RESET_NEEDED_COLD 1
> > > > +#define CXLMDEV_RESET_NEEDED_WARM 2
> > > > +#define CXLMDEV_RESET_NEEDED_HOT 3
> > > > +#define CXLMDEV_RESET_NEEDED_CXL 4
> > > > +#define CXLMDEV_RESET_NEEDED(status) \
> > > > + (CXL_GET_FIELD(status, CXLMDEV_RESET_NEEDED) != CXLMDEV_RESET_NEEDED_NOT)
> > > >
> > > > /**
> > > > * struct cxl_mem - A CXL memory device
> > > > @@ -44,6 +71,16 @@ struct cxl_mem {
> > > > struct pci_dev *pdev;
> > > > void __iomem *regs;
> > > >
> > > > + struct {
> > > > + struct range range;
> > > > + } pmem;
> > > > +
> > > > + struct {
> > > > + struct range range;
> > > > + } ram;
> > > > +
> > > > + char firmware_version[0x10];
> > > > +
> > > > /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
> > > > struct {
> > > > void __iomem *regs;
> > > > @@ -51,6 +88,7 @@ struct cxl_mem {
> > > >
> > > > /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
> > > > struct {
> > > > + struct mutex mutex; /* Protects device mailbox and firmware */
> > > > void __iomem *regs;
> > > > size_t payload_size;
> > > > } mbox;
> > > > @@ -89,5 +127,6 @@ struct cxl_mem {
> > > >
> > > > cxl_reg(status);
> > > > cxl_reg(mbox);
> > > > +cxl_reg(mem);
> > > >
> > > > #endif /* __CXL_H__ */
> > > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > > index fa14d51243ee..69ed15bfa5d4 100644
> > > > --- a/drivers/cxl/mem.c
> > > > +++ b/drivers/cxl/mem.c
> > > > @@ -6,6 +6,270 @@
> > > > #include "pci.h"
> > > > #include "cxl.h"
> > > >
> > > > +#define cxl_doorbell_busy(cxlm) \
> > > > + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
> > > > + CXLDEV_MB_CTRL_DOORBELL)
> > > > +
> > > > +#define CXL_MAILBOX_TIMEOUT_US 2000
> > >
> > > This should be _MS?
> > >
> > > > +
> > > > +enum opcode {
> > > > + CXL_MBOX_OP_IDENTIFY = 0x4000,
> > > > + CXL_MBOX_OP_MAX = 0x10000
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct mbox_cmd - A command to be submitted to hardware.
> > > > + * @opcode: (input) The command set and command submitted to hardware.
> > > > + * @payload_in: (input) Pointer to the input payload.
> > > > + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> > > > + * the caller.
> > > > + * @size_in: (input) Number of bytes to load from @payload.
> > > > + * @size_out: (output) Number of bytes loaded into @payload.
> > > > + * @return_code: (output) Error code returned from hardware.
> > > > + *
> > > > + * This is the primary mechanism used to send commands to the hardware.
> > > > + * All the fields except @payload_* correspond exactly to the fields described in
> > > > + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> > > > + * @payload_out are written to, and read from the Command Payload Registers
> > > > + * defined in (8.2.8.4.8).
> > > > + */
> > > > +struct mbox_cmd {
> > > > + u16 opcode;
> > > > + void *payload_in;
> > > > + void *payload_out;
> > > > + size_t size_in;
> > > > + size_t size_out;
> > > > + u16 return_code;
> > > > +#define CXL_MBOX_SUCCESS 0
> > > > +};
> > > > +
> > > > +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> > > > +{
> > > > + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> > > > + const unsigned long start = jiffies;
> > > > + unsigned long end = start;
> > > > +
> > > > + while (cxl_doorbell_busy(cxlm)) {
> > > > + end = jiffies;
> > > > +
> > > > + if (time_after(end, start + timeout)) {
> > > > + /* Check again in case preempted before timeout test */
> > > > + if (!cxl_doorbell_busy(cxlm))
> > > > + break;
> > > > + return -ETIMEDOUT;
> > > > + }
> > > > + cpu_relax();
> > > > + }
> > > > +
> > > > + dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
> > > > + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
> > > > + struct mbox_cmd *mbox_cmd)
> > > > +{
> > > > + dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
> > > > + dev_info(&cxlm->pdev->dev,
> > > > + "\topcode: 0x%04x\n"
> > > > + "\tpayload size: %zub\n",
> > > > + mbox_cmd->opcode, mbox_cmd->size_in);
> > > > +
> > > > + if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
> > > > + print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
> > > > + mbox_cmd->payload_in, mbox_cmd->size_in,
> > > > + true);
> > > > + }
> > > > +
> > > > + /* Here's a good place to figure out if a device reset is needed */
> > >
> > > What are the implications if we don't do a reset, as this implementation
> > > does not? IOW, does a timeout require a device to be recovered through a
> > > reset before it can receive additional commands, or is it safe to simply
> > > drop the command that timed out on the floor and proceed?
> >
> > Not a satisfying answer, but "it depends". It's also complicated by
> > the fact that a reset may need to be coordinated with other devices in
> > the interleave-set as the HDM decoders may bounce.
> >
> > For comparison, to date there have been no problems with the "drop on
> > the floor" policy of LIBNVDIMM command timeouts. At the same time
> > there simply was not a software visible reset mechanism for those
> > devices so this problem never came out. This mailbox isn't a fast
> > path, so the device is likely completely dead if this timeout is ever
> > violated, and the firmware reporting a timeout might as well assume
> > that the OS gives up on the device.
> >
> > I'll let Ben chime in on the rest...
>
> Reset handling is next on the TODO list for the driver. I had two main reasons
> for not even taking a stab at it.
> 1. I have no good way to test it. We are working on adding some test conditions
> to QEMU for it.
> 2. The main difficulty in my mind with reset is you can't pull the memory out
> from under the OS here. While the driver doesn't yet handle persistent memory
> capacities, it may have volatile capacity configured by the BIOS. So the goal
> was, get the bits of the driver in that would at least allow developers,
> hardware vendors, and folks contributing to the spec a way to have basic
> interaction with a CXL type 3 device.
Honestly I think in most cases if the firmware decides to return a
"reset required" status the Linux response will be "lol, no" because
the firmware has no concept of the violence that would impose on the
rest of the system.
On Tue, Feb 2, 2021 at 3:51 PM Ben Widawsky <[email protected]> wrote:
>
> On 21-02-01 13:28:48, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> > > The Get Log command returns the actual log entries that are advertised
> > > via the Get Supported Logs command (0400h). CXL device logs are selected
> > > by UUID which is part of the CXL spec. Because the driver tries to
> > > sanitize what is sent to hardware, there becomes a need to restrict the
> > > types of logs which can be accessed by userspace. For example, the
> > > vendor specific log might only be consumable by proprietary, or offline
> > > applications, and therefore a good candidate for userspace.
> > >
> > > The current driver infrastructure does allow basic validation for all
> > > commands, but doesn't inspect any of the payload data. Along with Get
> > > Log support comes new infrastructure to add a hook for payload
> > > validation. This infrastructure is used to filter out the CEL UUID,
> > > which the userspace driver doesn't have business knowing, and taints on
> > > invalid UUIDs being sent to hardware.
> >
> > Perhaps a better option is to reject invalid UUIDs?
> >
> > And if you really really want to use invalid UUIDs then:
> >
> > 1) Make that code wrapped in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?
> >
> > 2) Wrap it with lockdown code so that you can't do this at all
> > when in LOCKDOWN_INTEGRITY or such?
> >
>
> The commit message needs update btw as CEL is allowed in the latest rev of the
> patches.
>
> We could potentially combine this with the now added (in a branch) CONFIG_RAW
> config option. Indeed I think that makes sense. Dan, thoughts?
Yeah, unknown UUIDs carry the same risk as raw commands, as a
vendor can trigger any behavior they want. A "CONFIG_RAW depends on
!CONFIG_INTEGRITY" policy sounds reasonable as well.
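
The lockdown wrapping Konrad mentions would also be small; a sketch
(LOCKDOWN_PCI_ACCESS is only a stand-in for whatever reason code would be
appropriate):

	/* Needs <linux/security.h>; refuse passthrough on locked-down kernels. */
	rc = security_locked_down(LOCKDOWN_PCI_ACCESS);
	if (rc)
		return rc;
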
On 21-02-01 12:00:18, Dan Williams wrote:
> On Sat, Jan 30, 2021 at 3:52 PM David Rientjes <[email protected]> wrote:
> >
> > On Fri, 29 Jan 2021, Ben Widawsky wrote:
> >
> > > Provide enough functionality to utilize the mailbox of a memory device.
> > > The mailbox is used to interact with the firmware running on the memory
> > > device.
> > >
> > > The CXL specification defines separate capabilities for the mailbox and
> > > the memory device. The mailbox interface has a doorbell to indicate
> > > ready to accept commands and the memory device has a capability register
> > > that indicates the mailbox interface is ready. The expectation is that
> > > the doorbell-ready is always later than the memory-device-indication
> > > that the mailbox is ready.
> > >
> > > Create a function to handle sending a command, optionally with a
> > > payload, to the memory device, polling on a result, and then optionally
> > > copying out the payload. The algorithm for doing this comes straight out
> > > of the CXL 2.0 specification.
> > >
> > > Primary mailboxes are capable of generating an interrupt when submitting
> > > a command in the background. That implementation is saved for a later
> > > time.
> > >
> > > Secondary mailboxes aren't implemented at this time.
> > >
> > > The flow is proven with one implemented command, "identify". Because the
> > > class code has already told the driver this is a memory device and the
> > > identify command is mandatory.
> > >
> > > Signed-off-by: Ben Widawsky <[email protected]>
> > > ---
> > > drivers/cxl/Kconfig | 14 ++
> > > drivers/cxl/cxl.h | 39 +++++
> > > drivers/cxl/mem.c | 342 +++++++++++++++++++++++++++++++++++++++++++-
> > > 3 files changed, 394 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > index 3b66b46af8a0..fe591f74af96 100644
> > > --- a/drivers/cxl/Kconfig
> > > +++ b/drivers/cxl/Kconfig
> > > @@ -32,4 +32,18 @@ config CXL_MEM
> > > Chapter 2.3 Type 3 CXL Device in the CXL 2.0 specification.
> > >
> > > If unsure say 'm'.
> > > +
> > > +config CXL_MEM_INSECURE_DEBUG
> > > + bool "CXL.mem debugging"
> > > + depends on CXL_MEM
> > > + help
> > > + Enable debug of all CXL command payloads.
> > > +
> > > + Some CXL devices and controllers support encryption and other
> > > + security features. The payloads for the commands that enable
> > > + those features may contain sensitive clear-text security
> > > + material. Disable debug of those command payloads by default.
> > > + If you are a kernel developer actively working on CXL
> > > + security enabling say Y, otherwise say N.
> >
> > Not specific to this patch, but the reference to encryption made me
> > curious about integrity: are all CXL.mem devices compatible with DIMP?
> > Some? None?
>
> The encryption here is "device passphrase" similar to the NVDIMM
> Security Management described here:
>
> https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf
>
> The LIBNVDIMM enabling wrapped this support with the Linux keys
> interface which among other things enforces wrapping the clear text
> passphrase with a Linux "trusted/encrypted" key.
>
> Additionally, the CXL.io interface optionally supports PCI IDE:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/reference-guides/pcie-device-security-enhancements.pdf
>
> I'm otherwise not familiar with the DIMP acronym?
>
> > > +
> > > endif
> > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > > index a3da7f8050c4..df3d97154b63 100644
> > > --- a/drivers/cxl/cxl.h
> > > +++ b/drivers/cxl/cxl.h
> > > @@ -31,9 +31,36 @@
> > > #define CXLDEV_MB_CAPS_OFFSET 0x00
> > > #define CXLDEV_MB_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
> > > #define CXLDEV_MB_CTRL_OFFSET 0x04
> > > +#define CXLDEV_MB_CTRL_DOORBELL BIT(0)
> > > #define CXLDEV_MB_CMD_OFFSET 0x08
> > > +#define CXLDEV_MB_CMD_COMMAND_OPCODE_MASK GENMASK(15, 0)
> > > +#define CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK GENMASK(36, 16)
> > > #define CXLDEV_MB_STATUS_OFFSET 0x10
> > > +#define CXLDEV_MB_STATUS_RET_CODE_MASK GENMASK(47, 32)
> > > #define CXLDEV_MB_BG_CMD_STATUS_OFFSET 0x18
> > > +#define CXLDEV_MB_PAYLOAD_OFFSET 0x20
> > > +
> > > +/* Memory Device (CXL 2.0 - 8.2.8.5.1.1) */
> > > +#define CXLMDEV_STATUS_OFFSET 0x0
> > > +#define CXLMDEV_DEV_FATAL BIT(0)
> > > +#define CXLMDEV_FW_HALT BIT(1)
> > > +#define CXLMDEV_STATUS_MEDIA_STATUS_MASK GENMASK(3, 2)
> > > +#define CXLMDEV_MS_NOT_READY 0
> > > +#define CXLMDEV_MS_READY 1
> > > +#define CXLMDEV_MS_ERROR 2
> > > +#define CXLMDEV_MS_DISABLED 3
> > > +#define CXLMDEV_READY(status) \
> > > + (CXL_GET_FIELD(status, CXLMDEV_STATUS_MEDIA_STATUS) == CXLMDEV_MS_READY)
> > > +#define CXLMDEV_MBOX_IF_READY BIT(4)
> > > +#define CXLMDEV_RESET_NEEDED_SHIFT 5
> > > +#define CXLMDEV_RESET_NEEDED_MASK GENMASK(7, 5)
> > > +#define CXLMDEV_RESET_NEEDED_NOT 0
> > > +#define CXLMDEV_RESET_NEEDED_COLD 1
> > > +#define CXLMDEV_RESET_NEEDED_WARM 2
> > > +#define CXLMDEV_RESET_NEEDED_HOT 3
> > > +#define CXLMDEV_RESET_NEEDED_CXL 4
> > > +#define CXLMDEV_RESET_NEEDED(status) \
> > > + (CXL_GET_FIELD(status, CXLMDEV_RESET_NEEDED) != CXLMDEV_RESET_NEEDED_NOT)
> > >
> > > /**
> > > * struct cxl_mem - A CXL memory device
> > > @@ -44,6 +71,16 @@ struct cxl_mem {
> > > struct pci_dev *pdev;
> > > void __iomem *regs;
> > >
> > > + struct {
> > > + struct range range;
> > > + } pmem;
> > > +
> > > + struct {
> > > + struct range range;
> > > + } ram;
> > > +
> > > + char firmware_version[0x10];
> > > +
> > > /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
> > > struct {
> > > void __iomem *regs;
> > > @@ -51,6 +88,7 @@ struct cxl_mem {
> > >
> > > /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
> > > struct {
> > > + struct mutex mutex; /* Protects device mailbox and firmware */
> > > void __iomem *regs;
> > > size_t payload_size;
> > > } mbox;
> > > @@ -89,5 +127,6 @@ struct cxl_mem {
> > >
> > > cxl_reg(status);
> > > cxl_reg(mbox);
> > > +cxl_reg(mem);
> > >
> > > #endif /* __CXL_H__ */
> > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > index fa14d51243ee..69ed15bfa5d4 100644
> > > --- a/drivers/cxl/mem.c
> > > +++ b/drivers/cxl/mem.c
> > > @@ -6,6 +6,270 @@
> > > #include "pci.h"
> > > #include "cxl.h"
> > >
> > > +#define cxl_doorbell_busy(cxlm) \
> > > + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
> > > + CXLDEV_MB_CTRL_DOORBELL)
> > > +
> > > +#define CXL_MAILBOX_TIMEOUT_US 2000
> >
> > This should be _MS?
> >
> > > +
> > > +enum opcode {
> > > + CXL_MBOX_OP_IDENTIFY = 0x4000,
> > > + CXL_MBOX_OP_MAX = 0x10000
> > > +};
> > > +
> > > +/**
> > > + * struct mbox_cmd - A command to be submitted to hardware.
> > > + * @opcode: (input) The command set and command submitted to hardware.
> > > + * @payload_in: (input) Pointer to the input payload.
> > > + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> > > + * the caller.
> > > + * @size_in: (input) Number of bytes to load from @payload.
> > > + * @size_out: (output) Number of bytes loaded into @payload.
> > > + * @return_code: (output) Error code returned from hardware.
> > > + *
> > > + * This is the primary mechanism used to send commands to the hardware.
> > > + * All the fields except @payload_* correspond exactly to the fields described in
> > > + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> > > + * @payload_out are written to, and read from the Command Payload Registers
> > > + * defined in (8.2.8.4.8).
> > > + */
> > > +struct mbox_cmd {
> > > + u16 opcode;
> > > + void *payload_in;
> > > + void *payload_out;
> > > + size_t size_in;
> > > + size_t size_out;
> > > + u16 return_code;
> > > +#define CXL_MBOX_SUCCESS 0
> > > +};
> > > +
> > > +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> > > +{
> > > + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> > > + const unsigned long start = jiffies;
> > > + unsigned long end = start;
> > > +
> > > + while (cxl_doorbell_busy(cxlm)) {
> > > + end = jiffies;
> > > +
> > > + if (time_after(end, start + timeout)) {
> > > + /* Check again in case preempted before timeout test */
> > > + if (!cxl_doorbell_busy(cxlm))
> > > + break;
> > > + return -ETIMEDOUT;
> > > + }
> > > + cpu_relax();
> > > + }
> > > +
> > > + dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
> > > + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> > > + return 0;
> > > +}
> > > +
> > > +static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
> > > + struct mbox_cmd *mbox_cmd)
> > > +{
> > > + dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
> > > + dev_info(&cxlm->pdev->dev,
> > > + "\topcode: 0x%04x\n"
> > > + "\tpayload size: %zub\n",
> > > + mbox_cmd->opcode, mbox_cmd->size_in);
> > > +
> > > + if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
> > > + print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
> > > + mbox_cmd->payload_in, mbox_cmd->size_in,
> > > + true);
> > > + }
> > > +
> > > + /* Here's a good place to figure out if a device reset is needed */
> >
> > What are the implications if we don't do a reset, as this implementation
> > does not? IOW, does a timeout require a device to be recovered through a
> > reset before it can receive additional commands, or is it safe to simply
> > drop the command that timed out on the floor and proceed?
>
> Not a satisfying answer, but "it depends". It's also complicated by
> the fact that a reset may need to be coordinated with other devices in
> the interleave-set as the HDM decoders may bounce.
>
> For comparison, to date there have been no problems with the "drop on
> the floor" policy of LIBNVDIMM command timeouts. At the same time
> there simply was not a software visible reset mechanism for those
> devices so this problem never came out. This mailbox isn't a fast
> path, so the device is likely completely dead if this timeout is ever
> violated, and the firmware reporting a timeout might as well assume
> that the OS gives up on the device.
>
> I'll let Ben chime in on the rest...
Reset handling is next on the TODO list for the driver. I had two main reasons
for not even taking a stab at it.
1. I have no good way to test it. We are working on adding some test conditions
to QEMU for it.
2. The main difficulty in my mind with reset is you can't pull the memory out
from under the OS here. While the driver doesn't yet handle persistent memory
capacities, it may have volatile capacity configured by the BIOS. So the goal
was, get the bits of the driver in that would at least allow developers,
hardware vendors, and folks contributing to the spec a way to have basic
interaction with a CXL type 3 device.
On 21-02-02 15:54:03, Dan Williams wrote:
> On Tue, Feb 2, 2021 at 2:57 PM Ben Widawsky <[email protected]> wrote:
> >
> > On 21-02-01 12:00:18, Dan Williams wrote:
> > > On Sat, Jan 30, 2021 at 3:52 PM David Rientjes <[email protected]> wrote:
> > > >
> > > > On Fri, 29 Jan 2021, Ben Widawsky wrote:
> > > >
> > > > > Provide enough functionality to utilize the mailbox of a memory device.
> > > > > The mailbox is used to interact with the firmware running on the memory
> > > > > device.
> > > > >
> > > > > The CXL specification defines separate capabilities for the mailbox and
> > > > > the memory device. The mailbox interface has a doorbell to indicate
> > > > > ready to accept commands and the memory device has a capability register
> > > > > that indicates the mailbox interface is ready. The expectation is that
> > > > > the doorbell-ready is always later than the memory-device-indication
> > > > > that the mailbox is ready.
> > > > >
> > > > > Create a function to handle sending a command, optionally with a
> > > > > payload, to the memory device, polling on a result, and then optionally
> > > > > copying out the payload. The algorithm for doing this comes straight out
> > > > > of the CXL 2.0 specification.
> > > > >
> > > > > Primary mailboxes are capable of generating an interrupt when submitting
> > > > > a command in the background. That implementation is saved for a later
> > > > > time.
> > > > >
> > > > > Secondary mailboxes aren't implemented at this time.
> > > > >
> > > > > The flow is proven with one implemented command, "identify". Because the
> > > > > class code has already told the driver this is a memory device and the
> > > > > identify command is mandatory.
> > > > >
> > > > > Signed-off-by: Ben Widawsky <[email protected]>
> > > > > ---
> > > > > drivers/cxl/Kconfig | 14 ++
> > > > > drivers/cxl/cxl.h | 39 +++++
> > > > > drivers/cxl/mem.c | 342 +++++++++++++++++++++++++++++++++++++++++++-
> > > > > 3 files changed, 394 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > > > index 3b66b46af8a0..fe591f74af96 100644
> > > > > --- a/drivers/cxl/Kconfig
> > > > > +++ b/drivers/cxl/Kconfig
> > > > > @@ -32,4 +32,18 @@ config CXL_MEM
> > > > > Chapter 2.3 Type 3 CXL Device in the CXL 2.0 specification.
> > > > >
> > > > > If unsure say 'm'.
> > > > > +
> > > > > +config CXL_MEM_INSECURE_DEBUG
> > > > > + bool "CXL.mem debugging"
> > > > > + depends on CXL_MEM
> > > > > + help
> > > > > + Enable debug of all CXL command payloads.
> > > > > +
> > > > > + Some CXL devices and controllers support encryption and other
> > > > > + security features. The payloads for the commands that enable
> > > > > + those features may contain sensitive clear-text security
> > > > > + material. Disable debug of those command payloads by default.
> > > > > + If you are a kernel developer actively working on CXL
> > > > > + security enabling say Y, otherwise say N.
> > > >
> > > > Not specific to this patch, but the reference to encryption made me
> > > > curious about integrity: are all CXL.mem devices compatible with DIMP?
> > > > Some? None?
> > >
> > > The encryption here is "device passphrase" similar to the NVDIMM
> > > Security Management described here:
> > >
> > > https://pmem.io/documents/IntelOptanePMem_DSM_Interface-V2.0.pdf
> > >
> > > The LIBNVDIMM enabling wrapped this support with the Linux keys
> > > interface which among other things enforces wrapping the clear text
> > > passphrase with a Linux "trusted/encrypted" key.
> > >
> > > Additionally, the CXL.io interface optionally supports PCI IDE:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/reference-guides/pcie-device-security-enhancements.pdf
> > >
> > > I'm otherwise not familiar with the DIMP acronym?
> > >
> > > > > +
> > > > > endif
> > > > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > > > > index a3da7f8050c4..df3d97154b63 100644
> > > > > --- a/drivers/cxl/cxl.h
> > > > > +++ b/drivers/cxl/cxl.h
> > > > > @@ -31,9 +31,36 @@
> > > > > #define CXLDEV_MB_CAPS_OFFSET 0x00
> > > > > #define CXLDEV_MB_CAP_PAYLOAD_SIZE_MASK GENMASK(4, 0)
> > > > > #define CXLDEV_MB_CTRL_OFFSET 0x04
> > > > > +#define CXLDEV_MB_CTRL_DOORBELL BIT(0)
> > > > > #define CXLDEV_MB_CMD_OFFSET 0x08
> > > > > +#define CXLDEV_MB_CMD_COMMAND_OPCODE_MASK GENMASK(15, 0)
> > > > > +#define CXLDEV_MB_CMD_PAYLOAD_LENGTH_MASK GENMASK(36, 16)
> > > > > #define CXLDEV_MB_STATUS_OFFSET 0x10
> > > > > +#define CXLDEV_MB_STATUS_RET_CODE_MASK GENMASK(47, 32)
> > > > > #define CXLDEV_MB_BG_CMD_STATUS_OFFSET 0x18
> > > > > +#define CXLDEV_MB_PAYLOAD_OFFSET 0x20
> > > > > +
> > > > > +/* Memory Device (CXL 2.0 - 8.2.8.5.1.1) */
> > > > > +#define CXLMDEV_STATUS_OFFSET 0x0
> > > > > +#define CXLMDEV_DEV_FATAL BIT(0)
> > > > > +#define CXLMDEV_FW_HALT BIT(1)
> > > > > +#define CXLMDEV_STATUS_MEDIA_STATUS_MASK GENMASK(3, 2)
> > > > > +#define CXLMDEV_MS_NOT_READY 0
> > > > > +#define CXLMDEV_MS_READY 1
> > > > > +#define CXLMDEV_MS_ERROR 2
> > > > > +#define CXLMDEV_MS_DISABLED 3
> > > > > +#define CXLMDEV_READY(status) \
> > > > > + (CXL_GET_FIELD(status, CXLMDEV_STATUS_MEDIA_STATUS) == CXLMDEV_MS_READY)
> > > > > +#define CXLMDEV_MBOX_IF_READY BIT(4)
> > > > > +#define CXLMDEV_RESET_NEEDED_SHIFT 5
> > > > > +#define CXLMDEV_RESET_NEEDED_MASK GENMASK(7, 5)
> > > > > +#define CXLMDEV_RESET_NEEDED_NOT 0
> > > > > +#define CXLMDEV_RESET_NEEDED_COLD 1
> > > > > +#define CXLMDEV_RESET_NEEDED_WARM 2
> > > > > +#define CXLMDEV_RESET_NEEDED_HOT 3
> > > > > +#define CXLMDEV_RESET_NEEDED_CXL 4
> > > > > +#define CXLMDEV_RESET_NEEDED(status) \
> > > > > + (CXL_GET_FIELD(status, CXLMDEV_RESET_NEEDED) != CXLMDEV_RESET_NEEDED_NOT)
> > > > >
> > > > > /**
> > > > > * struct cxl_mem - A CXL memory device
> > > > > @@ -44,6 +71,16 @@ struct cxl_mem {
> > > > > struct pci_dev *pdev;
> > > > > void __iomem *regs;
> > > > >
> > > > > + struct {
> > > > > + struct range range;
> > > > > + } pmem;
> > > > > +
> > > > > + struct {
> > > > > + struct range range;
> > > > > + } ram;
> > > > > +
> > > > > + char firmware_version[0x10];
> > > > > +
> > > > > /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
> > > > > struct {
> > > > > void __iomem *regs;
> > > > > @@ -51,6 +88,7 @@ struct cxl_mem {
> > > > >
> > > > > /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
> > > > > struct {
> > > > > + struct mutex mutex; /* Protects device mailbox and firmware */
> > > > > void __iomem *regs;
> > > > > size_t payload_size;
> > > > > } mbox;
> > > > > @@ -89,5 +127,6 @@ struct cxl_mem {
> > > > >
> > > > > cxl_reg(status);
> > > > > cxl_reg(mbox);
> > > > > +cxl_reg(mem);
> > > > >
> > > > > #endif /* __CXL_H__ */
> > > > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > > > index fa14d51243ee..69ed15bfa5d4 100644
> > > > > --- a/drivers/cxl/mem.c
> > > > > +++ b/drivers/cxl/mem.c
> > > > > @@ -6,6 +6,270 @@
> > > > > #include "pci.h"
> > > > > #include "cxl.h"
> > > > >
> > > > > +#define cxl_doorbell_busy(cxlm) \
> > > > > + (cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET) & \
> > > > > + CXLDEV_MB_CTRL_DOORBELL)
> > > > > +
> > > > > +#define CXL_MAILBOX_TIMEOUT_US 2000
> > > >
> > > > This should be _MS?
> > > >
> > > > > +
> > > > > +enum opcode {
> > > > > + CXL_MBOX_OP_IDENTIFY = 0x4000,
> > > > > + CXL_MBOX_OP_MAX = 0x10000
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct mbox_cmd - A command to be submitted to hardware.
> > > > > + * @opcode: (input) The command set and command submitted to hardware.
> > > > > + * @payload_in: (input) Pointer to the input payload.
> > > > > + * @payload_out: (output) Pointer to the output payload. Must be allocated by
> > > > > + * the caller.
> > > > > + * @size_in: (input) Number of bytes to load from @payload.
> > > > > + * @size_out: (output) Number of bytes loaded into @payload.
> > > > > + * @return_code: (output) Error code returned from hardware.
> > > > > + *
> > > > > + * This is the primary mechanism used to send commands to the hardware.
> > > > > + * All the fields except @payload_* correspond exactly to the fields described in
> > > > > + * Command Register section of the CXL 2.0 spec (8.2.8.4.5). @payload_in and
> > > > > + * @payload_out are written to, and read from the Command Payload Registers
> > > > > + * defined in (8.2.8.4.8).
> > > > > + */
> > > > > +struct mbox_cmd {
> > > > > + u16 opcode;
> > > > > + void *payload_in;
> > > > > + void *payload_out;
> > > > > + size_t size_in;
> > > > > + size_t size_out;
> > > > > + u16 return_code;
> > > > > +#define CXL_MBOX_SUCCESS 0
> > > > > +};
> > > > > +
> > > > > +static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> > > > > +{
> > > > > + const int timeout = msecs_to_jiffies(CXL_MAILBOX_TIMEOUT_US);
> > > > > + const unsigned long start = jiffies;
> > > > > + unsigned long end = start;
> > > > > +
> > > > > + while (cxl_doorbell_busy(cxlm)) {
> > > > > + end = jiffies;
> > > > > +
> > > > > + if (time_after(end, start + timeout)) {
> > > > > + /* Check again in case preempted before timeout test */
> > > > > + if (!cxl_doorbell_busy(cxlm))
> > > > > + break;
> > > > > + return -ETIMEDOUT;
> > > > > + }
> > > > > + cpu_relax();
> > > > > + }
> > > > > +
> > > > > + dev_dbg(&cxlm->pdev->dev, "Doorbell wait took %dms",
> > > > > + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static void cxl_mem_mbox_timeout(struct cxl_mem *cxlm,
> > > > > + struct mbox_cmd *mbox_cmd)
> > > > > +{
> > > > > + dev_warn(&cxlm->pdev->dev, "Mailbox command timed out\n");
> > > > > + dev_info(&cxlm->pdev->dev,
> > > > > + "\topcode: 0x%04x\n"
> > > > > + "\tpayload size: %zub\n",
> > > > > + mbox_cmd->opcode, mbox_cmd->size_in);
> > > > > +
> > > > > + if (IS_ENABLED(CONFIG_CXL_MEM_INSECURE_DEBUG)) {
> > > > > + print_hex_dump_debug("Payload ", DUMP_PREFIX_OFFSET, 16, 1,
> > > > > + mbox_cmd->payload_in, mbox_cmd->size_in,
> > > > > + true);
> > > > > + }
> > > > > +
> > > > > + /* Here's a good place to figure out if a device reset is needed */
> > > >
> > > > What are the implications if we don't do a reset, as this implementation
> > > > does not? IOW, does a timeout require a device to be recovered through a
> > > > reset before it can receive additional commands, or is it safe to simply
> > > > drop the command that timed out on the floor and proceed?
> > >
> > > Not a satisfying answer, but "it depends". It's also complicated by
> > > the fact that a reset may need to be coordinated with other devices in
> > > the interleave-set as the HDM decoders may bounce.
> > >
> > > For comparison, to date there have been no problems with the "drop on
> > > the floor" policy of LIBNVDIMM command timeouts. At the same time
> > > there simply was not a software visible reset mechanism for those
> > > devices so this problem never came out. This mailbox isn't a fast
> > > path, so the device is likely completely dead if this timeout is ever
> > > violated, and the firmware reporting a timeout might as well assume
> > > that the OS gives up on the device.
> > >
> > > I'll let Ben chime in on the rest...
> >
> > Reset handling is next on the TODO list for the driver. I had two main reasons
> > for not even taking a stab at it.
> > 1. I have no good way to test it. We are working on adding some test conditions
> > to QEMU for it.
> > 2. The main difficulty in my mind with reset is you can't pull the memory out
> > from under the OS here. While the driver doesn't yet handle persistent memory
> > capacities, it may have volatile capacity configured by the BIOS. So the goal
> > was, get the bits of the driver in that would at least allow developers,
> > hardware vendors, and folks contributing to the spec a way to have basic
> > interaction with a CXL type 3 device.
>
> Honestly I think in most cases if the firmware decides to return a
> "reset required" status the Linux response will be "lol, no" because
> the firmware has no concept of the violence that would impose on the
> rest of the system.
How about a UAPI to initiate a reset? I think a sysfs bool would do the trick.
Maybe a sysfs file to display the current error status, and another to trigger the reset?
On Tue, Feb 02, 2021 at 10:24:18AM -0800, Ben Widawsky wrote:
> > > + /* Cap 4000h - CXL_CAP_CAP_ID_MEMDEV */
> > > + struct {
> > > + void __iomem *regs;
> > > + } mem;
> >
> > This style looks massively obfuscated. For one the comments look like
> > absolute gibberish, but also what is the point of all these anonymous
> > structures?
>
> They're not anonymous, and their names are for the register functions below. The
> comments connect the spec reference 'Cap XXXXh' to definitions in cxl.h. I can
> articulate that if it helps.
But why not simply a
void __iomem *mem_regs;
field vs the extra struct?
> The register space for CXL devices is a bit weird since it's all subdivided
> under 1 BAR for now. To clearly distinguish between the different subregions, these
> helpers exist. It's really easy to mess this up as the developer, and I would actually
> disagree: it makes debugging quite a bit easier. It also gets more
> convoluted when you add the other 2 BARs which also each have their own
> subregions.
>
> For example, if my mailbox function does:
> cxl_read_status_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
>
> instead of:
> cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
>
> It's easier to spot than:
> readl(cxlm->regs + cxlm->status_offset + CXLDEV_MB_CAPS_OFFSET)
Well, what I think would be the most obvious is:
readl(cxlm->status_regs + CXLDEV_MB_CAPS_OFFSET);
> > > + /* 8.2.8.4.3 */
> >
> > ????
> >
>
> I had been trying to be consistent with 'CXL2.0 - ' in front of all spec
> reference. I obviously missed this one.
FYI, I generally find section names much easier to find than section
numbers, especially as the numbers change very frequently, sometimes
even for very minor updates to the spec. E.g. in NVMe the numbers might
even change from NVMe 1.X to NVMe 1.Xa because an erratum had to add
a clarification as its own section.
On 21-02-02 15:57:03, Dan Williams wrote:
> On Tue, Feb 2, 2021 at 3:51 PM Ben Widawsky <[email protected]> wrote:
> >
> > On 21-02-01 13:28:48, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> > > > The Get Log command returns the actual log entries that are advertised
> > > > via the Get Supported Logs command (0400h). CXL device logs are selected
> > > > by UUID which is part of the CXL spec. Because the driver tries to
> > > > sanitize what is sent to hardware, there becomes a need to restrict the
> > > > types of logs which can be accessed by userspace. For example, the
> > > > vendor specific log might only be consumable by proprietary, or offline
> > > > applications, and therefore a good candidate for userspace.
> > > >
> > > > The current driver infrastructure does allow basic validation for all
> > > > commands, but doesn't inspect any of the payload data. Along with Get
> > > > Log support comes new infrastructure to add a hook for payload
> > > > validation. This infrastructure is used to filter out the CEL UUID,
> > > > which the userspace driver doesn't have business knowing, and taints on
> > > > invalid UUIDs being sent to hardware.
> > >
> > > Perhaps a better option is to reject invalid UUIDs?
> > >
> > > And if you really really want to use invalid UUIDs then:
> > >
> > > 1) Make that code wrapped in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?
> > >
> > > 2) Wrap it with lockdown code so that you can't do this at all
> > > when in LOCKDOWN_INTEGRITY or such?
> > >
> >
> > The commit message needs update btw as CEL is allowed in the latest rev of the
> > patches.
> >
> > We could potentially combine this with the now added (in a branch) CONFIG_RAW
> > config option. Indeed I think that makes sense. Dan, thoughts?
>
> Yeah, unknown UUIDs blocking is the same risk as raw commands as a
> vendor can trigger any behavior they want. A "CONFIG_RAW depends on
> !CONFIG_INTEGRITY" policy sounds reasonable as well.
What about LOCKDOWN_NONE though? I think we need something runtime for this.
Can we summarize the CONFIG options here?
CXL_MEM_INSECURE_DEBUG // no change
CXL_MEM_RAW_COMMANDS // if !security_locked_down(LOCKDOWN_NONE)
bool cxl_unsafe()
{
#ifndef CXL_MEM_RAW_COMMANDS
	return false;
#else
	return !security_locked_down(LOCKDOWN_NONE);
#endif
}
---
Did I get that right?
On 21-02-03 17:15:34, Christoph Hellwig wrote:
> On Tue, Feb 02, 2021 at 10:24:18AM -0800, Ben Widawsky wrote:
> > > > + /* Cap 4000h - CXL_CAP_CAP_ID_MEMDEV */
> > > > + struct {
> > > > + void __iomem *regs;
> > > > + } mem;
> > >
> > > This style looks massively obsfucated. For one the comments look like
> > > absolute gibberish, but also what is the point of all these anonymous
> > > structures?
> >
> > They're not anonymous, and their names are for the below register functions. The
> > comments are connected spec reference 'Cap XXXXh' to definitions in cxl.h. I can
> > articulate that if it helps.
>
> But why no simply a
>
> void __iomem *mem_regs;
>
> field vs the extra struct?
>
> > The register space for CXL devices is a bit weird since it's all subdivided
> > under 1 BAR for now. To clearly distinguish over the different subregions, these
> > helpers exist. It's really easy to mess this up as the developer and I actually
> > would disagree that it makes debugging quite a bit easier. It also gets more
> > convoluted when you add the other 2 BARs which also each have their own
> > subregions.
> >
> > For example. if my mailbox function does:
> > cxl_read_status_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> >
> > instead of:
> > cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> >
> > It's easier to spot than:
> > readl(cxlm->regs + cxlm->status_offset + CXLDEV_MB_CAPS_OFFSET)
>
> Well, what I think would be the most obvious is:
>
> readl(cxlm->status_regs + CXLDEV_MB_CAPS_OFFSET);
>
Right, so you wrote the buggy version. It should be:
readl(cxlm->mbox_regs + CXLDEV_MB_CAPS_OFFSET);
Admittedly, "MB" for mailbox isn't super obvious. I think you've convinced me to
rename these register definitions
s/MB/MBOX.
I'd prefer to keep the helpers for now as I do find them helpful, and so far
nobody else who has touched the code has complained. If you feel strongly, I
will change it.
> > > > + /* 8.2.8.4.3 */
> > >
> > > ????
> > >
> >
> > I had been trying to be consistent with 'CXL2.0 - ' in front of all spec
> > reference. I obviously missed this one.
>
> FYI, I generally find section names much easier to find than section
> numbers. Especially as the numbers change very frequently, some times
> even for very minor updates to the spec. E.g. in NVMe the numbers might
> even change from NVMe 1.X to NVMe 1.Xa because an errata had to add
> a clarification as its own section.
Why not both?
I ran into this in fact going from version 0.7 to 1.0 of the spec. I did call
out the spec version to address this, but you're right. Section names can change
too in theory.
/*
* CXL 2.0 8.2.8.4.3
* Mailbox Capabilities Register
*/
Too much?
On Wed, Feb 03, 2021 at 09:16:10AM -0800, Ben Widawsky wrote:
> On 21-02-02 15:57:03, Dan Williams wrote:
> > On Tue, Feb 2, 2021 at 3:51 PM Ben Widawsky <[email protected]> wrote:
> > >
> > > On 21-02-01 13:28:48, Konrad Rzeszutek Wilk wrote:
> > > > On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> > > > > The Get Log command returns the actual log entries that are advertised
> > > > > via the Get Supported Logs command (0400h). CXL device logs are selected
> > > > > by UUID which is part of the CXL spec. Because the driver tries to
> > > > > sanitize what is sent to hardware, there becomes a need to restrict the
> > > > > types of logs which can be accessed by userspace. For example, the
> > > > > vendor specific log might only be consumable by proprietary, or offline
> > > > > applications, and therefore a good candidate for userspace.
> > > > >
> > > > > The current driver infrastructure does allow basic validation for all
> > > > > commands, but doesn't inspect any of the payload data. Along with Get
> > > > > Log support comes new infrastructure to add a hook for payload
> > > > > validation. This infrastructure is used to filter out the CEL UUID,
> > > > > which the userspace driver doesn't have business knowing, and taints on
> > > > > invalid UUIDs being sent to hardware.
> > > >
> > > > Perhaps a better option is to reject invalid UUIDs?
> > > >
> > > > And if you really really want to use invalid UUIDs then:
> > > >
> > > > 1) Make that code wrapped in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?
> > > >
> > > > 2) Wrap it with lockdown code so that you can't do this at all
> > > > when in LOCKDOWN_INTEGRITY or such?
> > > >
> > >
> > > The commit message needs update btw as CEL is allowed in the latest rev of the
> > > patches.
> > >
> > > We could potentially combine this with the now added (in a branch) CONFIG_RAW
> > > config option. Indeed I think that makes sense. Dan, thoughts?
> >
> > Yeah, unknown UUIDs blocking is the same risk as raw commands as a
> > vendor can trigger any behavior they want. A "CONFIG_RAW depends on
> > !CONFIG_INTEGRITY" policy sounds reasonable as well.
>
> What about LOCKDOWN_NONE though? I think we need something runtime for this.
>
> Can we summarize the CONFIG options here?
>
> CXL_MEM_INSECURE_DEBUG // no change
> CXL_MEM_RAW_COMMANDS // if !security_locked_down(LOCKDOWN_NONE)
>
> bool cxl_unsafe()
Would it be better if this was inverted? Aka cxl_safe()?
> {
> #ifndef CXL_MEM_RAW_COMMANDS
> return false;
> #else
> return !security_locked_down(LOCKDOWN_NONE);
:thumbsup:
(Naturally this would be inverted if this was cxl_safe().)
> #endif
> }
>
> ---
>
> Did I get that right?
:nods:
On Wed, Feb 3, 2021 at 10:16 AM Konrad Rzeszutek Wilk
<[email protected]> wrote:
>
> On Wed, Feb 03, 2021 at 09:16:10AM -0800, Ben Widawsky wrote:
> > On 21-02-02 15:57:03, Dan Williams wrote:
> > > On Tue, Feb 2, 2021 at 3:51 PM Ben Widawsky <[email protected]> wrote:
> > > >
> > > > On 21-02-01 13:28:48, Konrad Rzeszutek Wilk wrote:
> > > > > On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> > > > > > The Get Log command returns the actual log entries that are advertised
> > > > > > via the Get Supported Logs command (0400h). CXL device logs are selected
> > > > > > by UUID which is part of the CXL spec. Because the driver tries to
> > > > > > sanitize what is sent to hardware, there becomes a need to restrict the
> > > > > > types of logs which can be accessed by userspace. For example, the
> > > > > > vendor specific log might only be consumable by proprietary, or offline
> > > > > > applications, and therefore a good candidate for userspace.
> > > > > >
> > > > > > The current driver infrastructure does allow basic validation for all
> > > > > > commands, but doesn't inspect any of the payload data. Along with Get
> > > > > > Log support comes new infrastructure to add a hook for payload
> > > > > > validation. This infrastructure is used to filter out the CEL UUID,
> > > > > > which the userspace driver doesn't have business knowing, and taints on
> > > > > > invalid UUIDs being sent to hardware.
> > > > >
> > > > > Perhaps a better option is to reject invalid UUIDs?
> > > > >
> > > > > And if you really really want to use invalid UUIDs then:
> > > > >
> > > > > 1) Make that code wrapped in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?
> > > > >
> > > > > 2) Wrap it with lockdown code so that you can't do this at all
> > > > > when in LOCKDOWN_INTEGRITY or such?
> > > > >
> > > >
> > > > The commit message needs update btw as CEL is allowed in the latest rev of the
> > > > patches.
> > > >
> > > > We could potentially combine this with the now added (in a branch) CONFIG_RAW
> > > > config option. Indeed I think that makes sense. Dan, thoughts?
> > >
> > > Yeah, unknown UUIDs blocking is the same risk as raw commands as a
> > > vendor can trigger any behavior they want. A "CONFIG_RAW depends on
> > > !CONFIG_INTEGRITY" policy sounds reasonable as well.
> >
> > What about LOCKDOWN_NONE though? I think we need something runtime for this.
> >
> > Can we summarize the CONFIG options here?
> >
> > CXL_MEM_INSECURE_DEBUG // no change
> > CXL_MEM_RAW_COMMANDS // if !security_locked_down(LOCKDOWN_NONE)
> >
> > bool cxl_unsafe()
>
> Would it be better if this inverted? Aka cxl_safe()..
> ?
> > {
> > #ifndef CXL_MEM_RAW_COMMANDS
nit: use IS_ENABLED() if this function lives in a C file, or provide
whole alternate static inline versions in a header gated by ifdefs (see
the sketch at the end of this message).
> > return false;
> > #else
> > return !security_locked_down(LOCKDOWN_NONE);
>
> :thumbsup:
>
> (Naturally this would inverted if this was cxl_safe()).
>
>
> > #endif
> > }
> >
> > ---
> >
> > Did I get that right?
>
> :nods:
Looks good which means it's time to bikeshed the naming. I'd call it
cxl_raw_allowed(). As "safety" isn't the only reason for blocking raw,
it's also to corral the userspace api. I.e. things like enforcing
security passphrase material through the Linux keys api.
On Wed, Feb 3, 2021 at 9:23 AM Ben Widawsky <[email protected]> wrote:
>
> On 21-02-03 17:15:34, Christoph Hellwig wrote:
> > On Tue, Feb 02, 2021 at 10:24:18AM -0800, Ben Widawsky wrote:
> > > > > + /* Cap 4000h - CXL_CAP_CAP_ID_MEMDEV */
> > > > > + struct {
> > > > > + void __iomem *regs;
> > > > > + } mem;
> > > >
> > > > This style looks massively obsfucated. For one the comments look like
> > > > absolute gibberish, but also what is the point of all these anonymous
> > > > structures?
> > >
> > > They're not anonymous, and their names are for the below register functions. The
> > > comments are connected spec reference 'Cap XXXXh' to definitions in cxl.h. I can
> > > articulate that if it helps.
> >
> > But why no simply a
> >
> > void __iomem *mem_regs;
> >
> > field vs the extra struct?
> >
> > > The register space for CXL devices is a bit weird since it's all subdivided
> > > under 1 BAR for now. To clearly distinguish over the different subregions, these
> > > helpers exist. It's really easy to mess this up as the developer and I actually
> > > would disagree that it makes debugging quite a bit easier. It also gets more
> > > convoluted when you add the other 2 BARs which also each have their own
> > > subregions.
> > >
> > > For example. if my mailbox function does:
> > > cxl_read_status_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > >
> > > instead of:
> > > cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> > >
> > > It's easier to spot than:
> > > readl(cxlm->regs + cxlm->status_offset + CXLDEV_MB_CAPS_OFFSET)
> >
> > Well, what I think would be the most obvious is:
> >
> > readl(cxlm->status_regs + CXLDEV_MB_CAPS_OFFSET);
> >
>
> Right, so you wrote the buggy version. Should be.
> readl(cxlm->mbox_regs + CXLDEV_MB_CAPS_OFFSET);
>
> Admittedly, "MB" for mailbox isn't super obvious. I think you've convinced me to
> rename these register definitions
> s/MB/MBOX.
>
> I'd prefer to keep the helpers for now as I do find them helpful, and so far
> nobody else who has touched the code has complained. If you feel strongly, I
> will change it.
After seeing the options, I think I'd prefer to not have to worry about what
extra magic is happening with cxl_read_mbox_reg32():
cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
readl(cxlm->mbox_regs + CXLDEV_MB_CAPS_OFFSET);
The latter is both shorter and more idiomatic.
>
> > > > > + /* 8.2.8.4.3 */
> > > >
> > > > ????
> > > >
> > >
> > > I had been trying to be consistent with 'CXL2.0 - ' in front of all spec
> > > reference. I obviously missed this one.
> >
> > FYI, I generally find section names much easier to find than section
> > numbers. Especially as the numbers change very frequently, some times
> > even for very minor updates to the spec. E.g. in NVMe the numbers might
> > even change from NVMe 1.X to NVMe 1.Xa because an errata had to add
> > a clarification as its own section.
>
> Why not both?
>
> I ran into this in fact going from version 0.7 to 1.0 of the spec. I did call
> out the spec version to address this, but you're right. Section names can change
> too in theory.
>
> /*
> * CXL 2.0 8.2.8.4.3
> * Mailbox Capabilities Register
> */
>
> Too much?
That seems like too many lines.
/* CXL 2.0 8.2.8.4.3 Mailbox Capabilities Register */
...this looks ok.
On Wed, Feb 03, 2021 at 01:23:31PM -0800, Dan Williams wrote:
> > I'd prefer to keep the helpers for now as I do find them helpful, and so far
> > nobody else who has touched the code has complained. If you feel strongly, I
> > will change it.
>
> After seeing the options, I think I'd prefer to not have to worry what
> extra magic is happening with cxl_read_mbox_reg32()
>
> cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
>
> readl(cxlm->mbox_regs + CXLDEV_MB_CAPS_OFFSET);
>
> The latter is both shorter and more idiomatic.
Same here. That being said, I know some driver maintainers like
wrappers; my main irk was the macro magic used to generate them.
On 21-02-04 07:16:46, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 01:23:31PM -0800, Dan Williams wrote:
> > > I'd prefer to keep the helpers for now as I do find them helpful, and so far
> > > nobody else who has touched the code has complained. If you feel strongly, I
> > > will change it.
> >
> > After seeing the options, I think I'd prefer to not have to worry what
> > extra magic is happening with cxl_read_mbox_reg32()
> >
> > cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> >
> > readl(cxlm->mbox_regs + CXLDEV_MB_CAPS_OFFSET);
> >
> > The latter is both shorter and more idiomatic.
>
> Same here. That being said I know some driver maintainers like
> wrappers, my real main irk was the macro magic to generate them.
I think wrappers are often used as a means of getting some degree of cross-OS
compatibility. Just to be clear, that was *not* the purpose here.
Stating for posterity that I disagree, I'll begin reworking this code and it will be
changed for v2.
Thanks.
Ben
On Thu, Feb 4, 2021 at 10:56 AM Ben Widawsky <[email protected]> wrote:
[..]
> It actually got pushed into cxl_mem_raw_command_allowed()
>
> static bool cxl_mem_raw_command_allowed(u16 opcode)
> {
> 	int i;
>
> 	if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS))
> 		return false;
>
> 	if (security_locked_down(LOCKDOWN_NONE))
> 		return false;
>
> 	if (raw_allow_all)
> 		return true;
>
> 	if (is_security_command(opcode))
> 		return false;
>
> 	for (i = 0; i < ARRAY_SIZE(disabled_raw_commands); i++)
> 		if (disabled_raw_commands[i] == opcode)
> 			return false;
>
> 	return true;
> }
>
> That work for you?
Looks good to me.
On 21-02-03 12:31:00, Dan Williams wrote:
> On Wed, Feb 3, 2021 at 10:16 AM Konrad Rzeszutek Wilk
> <[email protected]> wrote:
> >
> > On Wed, Feb 03, 2021 at 09:16:10AM -0800, Ben Widawsky wrote:
> > > On 21-02-02 15:57:03, Dan Williams wrote:
> > > > On Tue, Feb 2, 2021 at 3:51 PM Ben Widawsky <[email protected]> wrote:
> > > > >
> > > > > On 21-02-01 13:28:48, Konrad Rzeszutek Wilk wrote:
> > > > > > On Fri, Jan 29, 2021 at 04:24:37PM -0800, Ben Widawsky wrote:
> > > > > > > The Get Log command returns the actual log entries that are advertised
> > > > > > > via the Get Supported Logs command (0400h). CXL device logs are selected
> > > > > > > by UUID which is part of the CXL spec. Because the driver tries to
> > > > > > > sanitize what is sent to hardware, there becomes a need to restrict the
> > > > > > > types of logs which can be accessed by userspace. For example, the
> > > > > > > vendor specific log might only be consumable by proprietary, or offline
> > > > > > > applications, and therefore a good candidate for userspace.
> > > > > > >
> > > > > > > The current driver infrastructure does allow basic validation for all
> > > > > > > commands, but doesn't inspect any of the payload data. Along with Get
> > > > > > > Log support comes new infrastructure to add a hook for payload
> > > > > > > validation. This infrastructure is used to filter out the CEL UUID,
> > > > > > > which the userspace driver doesn't have business knowing, and taints on
> > > > > > > invalid UUIDs being sent to hardware.
> > > > > >
> > > > > > Perhaps a better option is to reject invalid UUIDs?
> > > > > >
> > > > > > And if you really really want to use invalid UUIDs then:
> > > > > >
> > > > > > 1) Make that code wrapped in CONFIG_CXL_DEBUG_THIS_IS_GOING_TO..?
> > > > > >
> > > > > > 2) Wrap it with lockdown code so that you can't do this at all
> > > > > > when in LOCKDOWN_INTEGRITY or such?
> > > > > >
> > > > >
> > > > > The commit message needs update btw as CEL is allowed in the latest rev of the
> > > > > patches.
> > > > >
> > > > > We could potentially combine this with the now added (in a branch) CONFIG_RAW
> > > > > config option. Indeed I think that makes sense. Dan, thoughts?
> > > >
> > > > Yeah, unknown UUIDs blocking is the same risk as raw commands as a
> > > > vendor can trigger any behavior they want. A "CONFIG_RAW depends on
> > > > !CONFIG_INTEGRITY" policy sounds reasonable as well.
> > >
> > > What about LOCKDOWN_NONE though? I think we need something runtime for this.
> > >
> > > Can we summarize the CONFIG options here?
> > >
> > > CXL_MEM_INSECURE_DEBUG // no change
> > > CXL_MEM_RAW_COMMANDS // if !security_locked_down(LOCKDOWN_NONE)
> > >
> > > bool cxl_unsafe()
> >
> > Would it be better if this inverted? Aka cxl_safe()..
> > ?
> > > {
> > > #ifndef CXL_MEM_RAW_COMMANDS
>
> nit use IS_ENABLED() if this function lives in a C file, or provide
> whole alternate static inline versions in a header gated by ifdefs.
>
I had done this independently since... but agreed.
> > > return false;
> > > #else
> > > return !security_locked_down(LOCKDOWN_NONE);
> >
> > :thumbsup:
> >
> > (Naturally this would inverted if this was cxl_safe()).
> >
> >
> > > #endif
> > > }
> > >
> > > ---
> > >
> > > Did I get that right?
> >
> > :nods:
>
> Looks good which means it's time to bikeshed the naming. I'd call it
> cxl_raw_allowed(). As "safety" isn't the only reason for blocking raw,
> it's also to corral the userspace api. I.e. things like enforcing
> security passphrase material through the Linux keys api.
It actually got pushed into cxl_mem_raw_command_allowed()
static bool cxl_mem_raw_command_allowed(u16 opcode)
{
	int i;

	if (!IS_ENABLED(CONFIG_CXL_MEM_RAW_COMMANDS))
		return false;

	if (security_locked_down(LOCKDOWN_NONE))
		return false;

	if (raw_allow_all)
		return true;

	if (is_security_command(opcode))
		return false;

	for (i = 0; i < ARRAY_SIZE(disabled_raw_commands); i++)
		if (disabled_raw_commands[i] == opcode)
			return false;

	return true;
}
That work for you?
[ add Jon Corbet as I'd expect him to be Cc'd on anything that
generically touches Documentation/ like this, and add Kees as the last
person who added a taint (tag you're it) ]
Jon, Kees, are either of you willing to ack this concept?
Top-posting to add more context for the below:
This taint is proposed because it has implications for
CONFIG_LOCK_DOWN_KERNEL among other things. These CXL devices
implement memory like DDR would, but unlike DDR there are
administrative / configuration commands that demand kernel
coordination before they can be sent. The posture taken with this
taint is "guilty until proven innocent" for commands that have yet to
be explicitly allowed by the driver. This is different than NVME for
example where an errant vendor-defined command could destroy data on
the device, but there is no wider threat to system integrity. The
taint allows a pressure release valve for any and all commands to be
sent, but flagged with WARN_TAINT_ONCE if the driver has not
explicitly enabled it on an allowed list of known-good / kernel
coordinated commands.
On Fri, Jan 29, 2021 at 4:25 PM Ben Widawsky <[email protected]> wrote:
>
> For drivers that moderate access to the underlying hardware it is
> sometimes desirable to allow userspace to bypass restrictions. Once
> userspace has done this, the driver can no longer guarantee the sanctity
> of either the OS or the hardware. When in this state, it is helpful for
> kernel developers to be made aware (via this taint flag) of this fact
> for subsequent bug reports.
>
> Example usage:
> - Hardware xyzzy accepts 2 commands, waldo and fred.
> - The xyzzy driver provides an interface for using waldo, but not fred.
> - quux is convinced they really need the fred command.
> - xyzzy driver allows quux to frob hardware to initiate fred.
> - kernel gets tainted.
> - turns out fred command is borked, and scribbles over memory.
> - developers laugh while closing quux's subsequent bug report.
>
> Signed-off-by: Ben Widawsky <[email protected]>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 1 +
> Documentation/admin-guide/tainted-kernels.rst | 6 +++++-
> include/linux/kernel.h | 3 ++-
> kernel/panic.c | 1 +
> 4 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 1d56a6b73a4e..3e1eada53504 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1352,6 +1352,7 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
> 32768 `(K)` kernel has been live patched
> 65536 `(X)` Auxiliary taint, defined and used by for distros
> 131072 `(T)` The kernel was built with the struct randomization plugin
> +262144 `(H)` The kernel has allowed vendor shenanigans
> ====== ===== ==============================================================
>
> See :doc:`/admin-guide/tainted-kernels` for more information.
> diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
> index ceeed7b0798d..ee2913316344 100644
> --- a/Documentation/admin-guide/tainted-kernels.rst
> +++ b/Documentation/admin-guide/tainted-kernels.rst
> @@ -74,7 +74,7 @@ a particular type of taint. It's best to leave that to the aforementioned
> script, but if you need something quick you can use this shell command to check
> which bits are set::
>
> - $ for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
> + $ for i in $(seq 19); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
>
> Table for decoding tainted state
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
> 15 _/K 32768 kernel has been live patched
> 16 _/X 65536 auxiliary taint, defined for and used by distros
> 17 _/T 131072 kernel was built with the struct randomization plugin
> + 18 _/H 262144 kernel has allowed vendor shenanigans
> === === ====== ========================================================
>
> Note: The character ``_`` is representing a blank in this table to make reading
> @@ -175,3 +176,6 @@ More detailed explanation for tainting
> produce extremely unusual kernel structure layouts (even performance
> pathological ones), which is important to know when debugging. Set at
> build time.
> +
> + 18) ``H`` Kernel has allowed direct access to hardware and can no longer make
> + any guarantees about the stability of the device or driver.
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index f7902d8c1048..bc95486f817e 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -443,7 +443,8 @@ extern enum system_states {
> #define TAINT_LIVEPATCH 15
> #define TAINT_AUX 16
> #define TAINT_RANDSTRUCT 17
> -#define TAINT_FLAGS_COUNT 18
> +#define TAINT_RAW_PASSTHROUGH 18
> +#define TAINT_FLAGS_COUNT 19
> #define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1)
>
> struct taint_flag {
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 332736a72a58..dff22bd80eaf 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -386,6 +386,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
> [ TAINT_LIVEPATCH ] = { 'K', ' ', true },
> [ TAINT_AUX ] = { 'X', ' ', true },
> [ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
> + [ TAINT_RAW_PASSTHROUGH ] = { 'H', ' ', true },
> };
>
> /**
> --
> 2.30.0
>
On Mon, Feb 08, 2021 at 02:00:33PM -0800, Dan Williams wrote:
> [ add Jon Corbet as I'd expect him to be Cc'd on anything that
> generically touches Documentation/ like this, and add Kees as the last
> person who added a taint (tag you're it) ]
>
> Jon, Kees, are either of you willing to ack this concept?
>
> Top-posting to add more context for the below:
>
> This taint is proposed because it has implications for
> CONFIG_LOCK_DOWN_KERNEL among other things. These CXL devices
> implement memory like DDR would, but unlike DDR there are
> administrative / configuration commands that demand kernel
> coordination before they can be sent. The posture taken with this
> taint is "guilty until proven innocent" for commands that have yet to
> be explicitly allowed by the driver. This is different than NVME for
> example where an errant vendor-defined command could destroy data on
> the device, but there is no wider threat to system integrity. The
> taint allows a pressure release valve for any and all commands to be
> sent, but flagged with WARN_TAINT_ONCE if the driver has not
> explicitly enabled it on an allowed list of known-good / kernel
> coordinated commands.
>
> On Fri, Jan 29, 2021 at 4:25 PM Ben Widawsky <[email protected]> wrote:
> >
> > For drivers that moderate access to the underlying hardware it is
> > sometimes desirable to allow userspace to bypass restrictions. Once
> > userspace has done this, the driver can no longer guarantee the sanctity
> > of either the OS or the hardware. When in this state, it is helpful for
> > kernel developers to be made aware (via this taint flag) of this fact
> > for subsequent bug reports.
> >
> > Example usage:
> > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > - The xyzzy driver provides an interface for using waldo, but not fred.
> > - quux is convinced they really need the fred command.
> > - xyzzy driver allows quux to frob hardware to initiate fred.
> > - kernel gets tainted.
> > - turns out fred command is borked, and scribbles over memory.
> > - developers laugh while closing quux's subsequent bug report.
But a taint flag only lasts for the current boot. If this is a drive, it
could still be compromised after reboot. It sounds like this taint is
really only for ephemeral things? "vendor shenanigans" is a pretty giant
scope ...
-Kees
> >
> > Signed-off-by: Ben Widawsky <[email protected]>
> > ---
> > Documentation/admin-guide/sysctl/kernel.rst | 1 +
> > Documentation/admin-guide/tainted-kernels.rst | 6 +++++-
> > include/linux/kernel.h | 3 ++-
> > kernel/panic.c | 1 +
> > 4 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > index 1d56a6b73a4e..3e1eada53504 100644
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -1352,6 +1352,7 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
> > 32768 `(K)` kernel has been live patched
> > 65536 `(X)` Auxiliary taint, defined and used by for distros
> > 131072 `(T)` The kernel was built with the struct randomization plugin
> > +262144 `(H)` The kernel has allowed vendor shenanigans
> > ====== ===== ==============================================================
> >
> > See :doc:`/admin-guide/tainted-kernels` for more information.
> > diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
> > index ceeed7b0798d..ee2913316344 100644
> > --- a/Documentation/admin-guide/tainted-kernels.rst
> > +++ b/Documentation/admin-guide/tainted-kernels.rst
> > @@ -74,7 +74,7 @@ a particular type of taint. It's best to leave that to the aforementioned
> > script, but if you need something quick you can use this shell command to check
> > which bits are set::
> >
> > - $ for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
> > + $ for i in $(seq 19); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
> >
> > Table for decoding tainted state
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > @@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
> > 15 _/K 32768 kernel has been live patched
> > 16 _/X 65536 auxiliary taint, defined for and used by distros
> > 17 _/T 131072 kernel was built with the struct randomization plugin
> > + 18 _/H 262144 kernel has allowed vendor shenanigans
> > === === ====== ========================================================
> >
> > Note: The character ``_`` is representing a blank in this table to make reading
> > @@ -175,3 +176,6 @@ More detailed explanation for tainting
> > produce extremely unusual kernel structure layouts (even performance
> > pathological ones), which is important to know when debugging. Set at
> > build time.
> > +
> > + 18) ``H`` Kernel has allowed direct access to hardware and can no longer make
> > + any guarantees about the stability of the device or driver.
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index f7902d8c1048..bc95486f817e 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -443,7 +443,8 @@ extern enum system_states {
> > #define TAINT_LIVEPATCH 15
> > #define TAINT_AUX 16
> > #define TAINT_RANDSTRUCT 17
> > -#define TAINT_FLAGS_COUNT 18
> > +#define TAINT_RAW_PASSTHROUGH 18
> > +#define TAINT_FLAGS_COUNT 19
> > #define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1)
> >
> > struct taint_flag {
> > diff --git a/kernel/panic.c b/kernel/panic.c
> > index 332736a72a58..dff22bd80eaf 100644
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -386,6 +386,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
> > [ TAINT_LIVEPATCH ] = { 'K', ' ', true },
> > [ TAINT_AUX ] = { 'X', ' ', true },
> > [ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
> > + [ TAINT_RAW_PASSTHROUGH ] = { 'H', ' ', true },
> > };
> >
> > /**
> > --
> > 2.30.0
> >
--
Kees Cook
On 21-02-08 14:09:19, Kees Cook wrote:
> On Mon, Feb 08, 2021 at 02:00:33PM -0800, Dan Williams wrote:
> > [ add Jon Corbet as I'd expect him to be Cc'd on anything that
> > generically touches Documentation/ like this, and add Kees as the last
> > person who added a taint (tag you're it) ]
> >
> > Jon, Kees, are either of you willing to ack this concept?
> >
> > Top-posting to add more context for the below:
> >
> > This taint is proposed because it has implications for
> > CONFIG_LOCK_DOWN_KERNEL among other things. These CXL devices
> > implement memory like DDR would, but unlike DDR there are
> > administrative / configuration commands that demand kernel
> > coordination before they can be sent. The posture taken with this
> > taint is "guilty until proven innocent" for commands that have yet to
> > be explicitly allowed by the driver. This is different than NVME for
> > example where an errant vendor-defined command could destroy data on
> > the device, but there is no wider threat to system integrity. The
> > taint allows a pressure release valve for any and all commands to be
> > sent, but flagged with WARN_TAINT_ONCE if the driver has not
> > explicitly enabled it on an allowed list of known-good / kernel
> > coordinated commands.
> >
> > On Fri, Jan 29, 2021 at 4:25 PM Ben Widawsky <[email protected]> wrote:
> > >
> > > For drivers that moderate access to the underlying hardware it is
> > > sometimes desirable to allow userspace to bypass restrictions. Once
> > > userspace has done this, the driver can no longer guarantee the sanctity
> > > of either the OS or the hardware. When in this state, it is helpful for
> > > kernel developers to be made aware (via this taint flag) of this fact
> > > for subsequent bug reports.
> > >
> > > Example usage:
> > > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > > - The xyzzy driver provides an interface for using waldo, but not fred.
> > > - quux is convinced they really need the fred command.
> > > - xyzzy driver allows quux to frob hardware to initiate fred.
> > > - kernel gets tainted.
> > > - turns out fred command is borked, and scribbles over memory.
> > > - developers laugh while closing quux's subsequent bug report.
>
> But a taint flag only lasts for the current boot. If this is a drive, it
> could still be compromised after reboot. It sounds like this taint is
> really only for ephemeral things? "vendor shenanigans" is a pretty giant
> scope ...
>
> -Kees
>
Good point. Any suggestions?
> > >
> > > Signed-off-by: Ben Widawsky <[email protected]>
> > > ---
> > > Documentation/admin-guide/sysctl/kernel.rst | 1 +
> > > Documentation/admin-guide/tainted-kernels.rst | 6 +++++-
> > > include/linux/kernel.h | 3 ++-
> > > kernel/panic.c | 1 +
> > > 4 files changed, 9 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > > index 1d56a6b73a4e..3e1eada53504 100644
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -1352,6 +1352,7 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
> > > 32768 `(K)` kernel has been live patched
> > > 65536 `(X)` Auxiliary taint, defined and used by for distros
> > > 131072 `(T)` The kernel was built with the struct randomization plugin
> > > +262144 `(H)` The kernel has allowed vendor shenanigans
> > > ====== ===== ==============================================================
> > >
> > > See :doc:`/admin-guide/tainted-kernels` for more information.
> > > diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
> > > index ceeed7b0798d..ee2913316344 100644
> > > --- a/Documentation/admin-guide/tainted-kernels.rst
> > > +++ b/Documentation/admin-guide/tainted-kernels.rst
> > > @@ -74,7 +74,7 @@ a particular type of taint. It's best to leave that to the aforementioned
> > > script, but if you need something quick you can use this shell command to check
> > > which bits are set::
> > >
> > > - $ for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
> > > + $ for i in $(seq 19); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
> > >
> > > Table for decoding tainted state
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > @@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
> > > 15 _/K 32768 kernel has been live patched
> > > 16 _/X 65536 auxiliary taint, defined for and used by distros
> > > 17 _/T 131072 kernel was built with the struct randomization plugin
> > > + 18 _/H 262144 kernel has allowed vendor shenanigans
> > > === === ====== ========================================================
> > >
> > > Note: The character ``_`` is representing a blank in this table to make reading
> > > @@ -175,3 +176,6 @@ More detailed explanation for tainting
> > > produce extremely unusual kernel structure layouts (even performance
> > > pathological ones), which is important to know when debugging. Set at
> > > build time.
> > > +
> > > + 18) ``H`` Kernel has allowed direct access to hardware and can no longer make
> > > + any guarantees about the stability of the device or driver.
> > > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > > index f7902d8c1048..bc95486f817e 100644
> > > --- a/include/linux/kernel.h
> > > +++ b/include/linux/kernel.h
> > > @@ -443,7 +443,8 @@ extern enum system_states {
> > > #define TAINT_LIVEPATCH 15
> > > #define TAINT_AUX 16
> > > #define TAINT_RANDSTRUCT 17
> > > -#define TAINT_FLAGS_COUNT 18
> > > +#define TAINT_RAW_PASSTHROUGH 18
> > > +#define TAINT_FLAGS_COUNT 19
> > > #define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1)
> > >
> > > struct taint_flag {
> > > diff --git a/kernel/panic.c b/kernel/panic.c
> > > index 332736a72a58..dff22bd80eaf 100644
> > > --- a/kernel/panic.c
> > > +++ b/kernel/panic.c
> > > @@ -386,6 +386,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
> > > [ TAINT_LIVEPATCH ] = { 'K', ' ', true },
> > > [ TAINT_AUX ] = { 'X', ' ', true },
> > > [ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
> > > + [ TAINT_RAW_PASSTHROUGH ] = { 'H', ' ', true },
> > > };
> > >
> > > /**
> > > --
> > > 2.30.0
> > >
>
> --
> Kees Cook
On Mon, Feb 8, 2021 at 2:09 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 08, 2021 at 02:00:33PM -0800, Dan Williams wrote:
> > [ add Jon Corbet as I'd expect him to be Cc'd on anything that
> > generically touches Documentation/ like this, and add Kees as the last
> > person who added a taint (tag you're it) ]
> >
> > Jon, Kees, are either of you willing to ack this concept?
> >
> > Top-posting to add more context for the below:
> >
> > This taint is proposed because it has implications for
> > CONFIG_LOCK_DOWN_KERNEL among other things. These CXL devices
> > implement memory like DDR would, but unlike DDR there are
> > administrative / configuration commands that demand kernel
> > coordination before they can be sent. The posture taken with this
> > taint is "guilty until proven innocent" for commands that have yet to
> > be explicitly allowed by the driver. This is different than NVME for
> > example where an errant vendor-defined command could destroy data on
> > the device, but there is no wider threat to system integrity. The
> > taint allows a pressure release valve for any and all commands to be
> > sent, but flagged with WARN_TAINT_ONCE if the driver has not
> > explicitly enabled it on an allowed list of known-good / kernel
> > coordinated commands.
> >
> > On Fri, Jan 29, 2021 at 4:25 PM Ben Widawsky <[email protected]> wrote:
> > >
> > > For drivers that moderate access to the underlying hardware it is
> > > sometimes desirable to allow userspace to bypass restrictions. Once
> > > userspace has done this, the driver can no longer guarantee the sanctity
> > > of either the OS or the hardware. When in this state, it is helpful for
> > > kernel developers to be made aware (via this taint flag) of this fact
> > > for subsequent bug reports.
> > >
> > > Example usage:
> > > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > > - The xyzzy driver provides an interface for using waldo, but not fred.
> > > - quux is convinced they really need the fred command.
> > > - xyzzy driver allows quux to frob hardware to initiate fred.
> > > - kernel gets tainted.
> > > - turns out fred command is borked, and scribbles over memory.
> > > - developers laugh while closing quux's subsequent bug report.
>
> But a taint flag only lasts for the current boot. If this is a drive, it
> could still be compromised after reboot. It sounds like this taint is
> really only for ephemeral things? "vendor shenanigans" is a pretty giant
> scope ...
>
That is true. This is more about preventing an ecosystem / cottage
industry of tooling built around bypassing the kernel. So the kernel
complains loudly and hopefully prevents vendor tooling from
propagating and instead directs that development effort back to the
native tooling. However for the rare "I know what I'm doing" cases,
this tainted kernel bypass lets some experimentation and debug happen,
but the kernel is transparent that when the capability ships in
production it needs to be a native implementation.
So it's less, "the system integrity is compromised" and more like
"you're bypassing the development process that ensures sanity for CXL
implementations that may take down a system if implemented
incorrectly". For example, NVME reset is a non-invent, CXL reset can
be like surprise removing DDR DIMM.
Should this be more tightly scoped to CXL? I had hoped to use this in
other places in LIBNVDIMM, but I'm ok to lose some generality for the
specific concerns that make CXL devices different than other PCI
endpoints.
On Mon, Feb 8, 2021 at 3:36 PM Dan Williams <[email protected]> wrote:
>
> On Mon, Feb 8, 2021 at 2:09 PM Kees Cook <[email protected]> wrote:
> >
> > On Mon, Feb 08, 2021 at 02:00:33PM -0800, Dan Williams wrote:
> > > [ add Jon Corbet as I'd expect him to be Cc'd on anything that
> > > generically touches Documentation/ like this, and add Kees as the last
> > > person who added a taint (tag you're it) ]
> > >
> > > Jon, Kees, are either of you willing to ack this concept?
> > >
> > > Top-posting to add more context for the below:
> > >
> > > This taint is proposed because it has implications for
> > > CONFIG_LOCK_DOWN_KERNEL among other things. These CXL devices
> > > implement memory like DDR would, but unlike DDR there are
> > > administrative / configuration commands that demand kernel
> > > coordination before they can be sent. The posture taken with this
> > > taint is "guilty until proven innocent" for commands that have yet to
> > > be explicitly allowed by the driver. This is different than NVME for
> > > example where an errant vendor-defined command could destroy data on
> > > the device, but there is no wider threat to system integrity. The
> > > taint allows a pressure release valve for any and all commands to be
> > > sent, but flagged with WARN_TAINT_ONCE if the driver has not
> > > explicitly enabled it on an allowed list of known-good / kernel
> > > coordinated commands.
> > >
> > > On Fri, Jan 29, 2021 at 4:25 PM Ben Widawsky <[email protected]> wrote:
> > > >
> > > > For drivers that moderate access to the underlying hardware it is
> > > > sometimes desirable to allow userspace to bypass restrictions. Once
> > > > userspace has done this, the driver can no longer guarantee the sanctity
> > > > of either the OS or the hardware. When in this state, it is helpful for
> > > > kernel developers to be made aware (via this taint flag) of this fact
> > > > for subsequent bug reports.
> > > >
> > > > Example usage:
> > > > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > > > - The xyzzy driver provides an interface for using waldo, but not fred.
> > > > - quux is convinced they really need the fred command.
> > > > - xyzzy driver allows quux to frob hardware to initiate fred.
> > > > - kernel gets tainted.
> > > > - turns out fred command is borked, and scribbles over memory.
> > > > - developers laugh while closing quux's subsequent bug report.
> >
> > But a taint flag only lasts for the current boot. If this is a drive, it
> > could still be compromised after reboot. It sounds like this taint is
> > really only for ephemeral things? "vendor shenanigans" is a pretty giant
> > scope ...
> >
>
> That is true. This is more about preventing an ecosystem / cottage
> industry of tooling built around bypassing the kernel. So the kernel
> complains loudly and hopefully prevents vendor tooling from
> propagating and instead directs that development effort back to the
> native tooling. However for the rare "I know what I'm doing" cases,
> this tainted kernel bypass lets some experimentation and debug happen,
> but the kernel is transparent that when the capability ships in
> production it needs to be a native implementation.
>
> So it's less, "the system integrity is compromised" and more like
> "you're bypassing the development process that ensures sanity for CXL
> implementations that may take down a system if implemented
> incorrectly". For example, NVME reset is a non-invent, CXL reset can
> be like surprise removing DDR DIMM.
>
> Should this be more tightly scoped to CXL? I had hoped to use this in
> other places in LIBNVDIMM, but I'm ok to lose some generality for the
> specific concerns that make CXL devices different than other PCI
> endpoints.
As I type this out it strikes me that plain WARN already does
TAINT_WARN and meets the spirit of what is trying to be achieved.
Appreciate the skeptical eye Kees, we'll drop this one.
On 21-02-08 17:03:25, Dan Williams wrote:
> On Mon, Feb 8, 2021 at 3:36 PM Dan Williams <[email protected]> wrote:
> >
> > On Mon, Feb 8, 2021 at 2:09 PM Kees Cook <[email protected]> wrote:
> > >
> > > On Mon, Feb 08, 2021 at 02:00:33PM -0800, Dan Williams wrote:
> > > > [ add Jon Corbet as I'd expect him to be Cc'd on anything that
> > > > generically touches Documentation/ like this, and add Kees as the last
> > > > person who added a taint (tag you're it) ]
> > > >
> > > > Jon, Kees, are either of you willing to ack this concept?
> > > >
> > > > Top-posting to add more context for the below:
> > > >
> > > > This taint is proposed because it has implications for
> > > > CONFIG_LOCK_DOWN_KERNEL among other things. These CXL devices
> > > > implement memory like DDR would, but unlike DDR there are
> > > > administrative / configuration commands that demand kernel
> > > > coordination before they can be sent. The posture taken with this
> > > > taint is "guilty until proven innocent" for commands that have yet to
> > > > be explicitly allowed by the driver. This is different than NVME for
> > > > example where an errant vendor-defined command could destroy data on
> > > > the device, but there is no wider threat to system integrity. The
> > > > taint allows a pressure release valve for any and all commands to be
> > > > sent, but flagged with WARN_TAINT_ONCE if the driver has not
> > > > explicitly enabled it on an allowed list of known-good / kernel
> > > > coordinated commands.
> > > >
> > > > On Fri, Jan 29, 2021 at 4:25 PM Ben Widawsky <[email protected]> wrote:
> > > > >
> > > > > For drivers that moderate access to the underlying hardware it is
> > > > > sometimes desirable to allow userspace to bypass restrictions. Once
> > > > > userspace has done this, the driver can no longer guarantee the sanctity
> > > > > of either the OS or the hardware. When in this state, it is helpful for
> > > > > kernel developers to be made aware (via this taint flag) of this fact
> > > > > for subsequent bug reports.
> > > > >
> > > > > Example usage:
> > > > > - Hardware xyzzy accepts 2 commands, waldo and fred.
> > > > > - The xyzzy driver provides an interface for using waldo, but not fred.
> > > > > - quux is convinced they really need the fred command.
> > > > > - xyzzy driver allows quux to frob hardware to initiate fred.
> > > > > - kernel gets tainted.
> > > > > - turns out fred command is borked, and scribbles over memory.
> > > > > - developers laugh while closing quux's subsequent bug report.
> > >
> > > But a taint flag only lasts for the current boot. If this is a drive, it
> > > could still be compromised after reboot. It sounds like this taint is
> > > really only for ephemeral things? "vendor shenanigans" is a pretty giant
> > > scope ...
> > >
> >
> > That is true. This is more about preventing an ecosystem / cottage
> > industry of tooling built around bypassing the kernel. So the kernel
> > complains loudly and hopefully prevents vendor tooling from
> > propagating and instead directs that development effort back to the
> > native tooling. However for the rare "I know what I'm doing" cases,
> > this tainted kernel bypass lets some experimentation and debug happen,
> > but the kernel is transparent that when the capability ships in
> > production it needs to be a native implementation.
> >
> > So it's less, "the system integrity is compromised" and more like
> > "you're bypassing the development process that ensures sanity for CXL
> > implementations that may take down a system if implemented
> > incorrectly". For example, NVME reset is a non-invent, CXL reset can
> > be like surprise removing DDR DIMM.
> >
> > Should this be more tightly scoped to CXL? I had hoped to use this in
> > other places in LIBNVDIMM, but I'm ok to lose some generality for the
> > specific concerns that make CXL devices different than other PCI
> > endpoints.
>
> As I type this out it strikes me that plain WARN already does
> TAINT_WARN and meets the spirit of what is trying to be achieved.
>
> Appreciate the skeptical eye Kees, we'll drop this one.
So I think this is a good compromise for now. However, the point of this taint
was that it specifically called out what tainted the kernel; it'd be great,
when we get a bug report, to know that this specifically was the issue.
Rambling further, I realize now that a taint doesn't tell you which module did
the tainting, which is really what I'd like here.
End ramble.