2022-08-15 21:02:55

by Jeffrey Hugo

Subject: [RFC PATCH 00/14] QAIC DRM accelerator driver

This patchset introduces a Linux Kernel driver (QAIC - Qualcomm AIC) for the
Qualcomm Cloud AI 100 product (AIC100).

Qualcomm Cloud AI 100 is a PCIe adapter card that hosts a dedicated machine
learning inference accelerator. There is extensive documentation in the
first patch of the series.

The driver was a misc device until recently. In accordance with the 2021
Ksummit (per LWN), it has been converted to a DRM driver due to its use of
dma_buf.

For historical purposes, the last revision that was on list is:
https://lore.kernel.org/all/[email protected]/
The driver has evolved quite a bit in the two years since.

Regarding the open userspace, it is currently a work in progress (WIP) but
will be delivered. The motivation for this RFC series is to get some early
feedback on the driver, since Daniel Vetter and David Airlie indicated that
would be a good idea while the userspace is being worked on.

We are a bit new to the DRM area, and appreciate all guidance/feedback.

Questions we are hoping to get answers to:

1. Does Qualcomm Cloud AI 100 fit in DRM?

2. Would a "QAIC" directory in the GPU documentation be acceptable?
We'd like to split the documentation into multiple files as we feel that
would make it more organized. It looks like only AMD has a directory;
everyone else has a single file.

Things that are still a todo (in no particular order):

-Open userspace (see above)

-Figure out what to do with the device partitioning feature. The uAPI for it
is clunky; it seems like it should perhaps fall under a cgroup. The intent is
to start a discussion over in the cgroup area to see what the experts say.

-Add proper documentation for our sysfs additions

-Extend the driver to export a few of the MHI channels to userspace. We are
currently using an old driver which was proposed and rejected; it needs to be
refactored into something QAIC specific.

-Convert the documentation (patch 1) to proper rst syntax

Jeffrey Hugo (14):
drm/qaic: Add documentation for AIC100 accelerator driver
drm/qaic: Add uapi and core driver file
drm/qaic: Add qaic.h internal header
drm/qaic: Add MHI controller
drm/qaic: Add control path
drm/qaic: Add datapath
drm/qaic: Add debugfs
drm/qaic: Add RAS component
drm/qaic: Add ssr component
drm/qaic: Add sysfs
drm/qaic: Add telemetry
drm/qaic: Add tracepoints
drm/qaic: Add qaic driver to the build system
MAINTAINERS: Add entry for QAIC driver

Documentation/gpu/drivers.rst | 1 +
Documentation/gpu/qaic.rst | 567 +++++++++
MAINTAINERS | 7 +
drivers/gpu/drm/Kconfig | 2 +
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/qaic/Kconfig | 33 +
drivers/gpu/drm/qaic/Makefile | 17 +
drivers/gpu/drm/qaic/mhi_controller.c | 575 +++++++++
drivers/gpu/drm/qaic/mhi_controller.h | 18 +
drivers/gpu/drm/qaic/qaic.h | 396 ++++++
drivers/gpu/drm/qaic/qaic_control.c | 1788 +++++++++++++++++++++++++++
drivers/gpu/drm/qaic/qaic_data.c | 2152 +++++++++++++++++++++++++++++++++
drivers/gpu/drm/qaic/qaic_debugfs.c | 335 +++++
drivers/gpu/drm/qaic/qaic_debugfs.h | 33 +
drivers/gpu/drm/qaic/qaic_drv.c | 825 +++++++++++++
drivers/gpu/drm/qaic/qaic_ras.c | 653 ++++++++++
drivers/gpu/drm/qaic/qaic_ras.h | 11 +
drivers/gpu/drm/qaic/qaic_ssr.c | 889 ++++++++++++++
drivers/gpu/drm/qaic/qaic_ssr.h | 13 +
drivers/gpu/drm/qaic/qaic_sysfs.c | 113 ++
drivers/gpu/drm/qaic/qaic_telemetry.c | 851 +++++++++++++
drivers/gpu/drm/qaic/qaic_telemetry.h | 14 +
drivers/gpu/drm/qaic/qaic_trace.h | 493 ++++++++
include/uapi/drm/qaic_drm.h | 283 +++++
24 files changed, 10070 insertions(+)
create mode 100644 Documentation/gpu/qaic.rst
create mode 100644 drivers/gpu/drm/qaic/Kconfig
create mode 100644 drivers/gpu/drm/qaic/Makefile
create mode 100644 drivers/gpu/drm/qaic/mhi_controller.c
create mode 100644 drivers/gpu/drm/qaic/mhi_controller.h
create mode 100644 drivers/gpu/drm/qaic/qaic.h
create mode 100644 drivers/gpu/drm/qaic/qaic_control.c
create mode 100644 drivers/gpu/drm/qaic/qaic_data.c
create mode 100644 drivers/gpu/drm/qaic/qaic_debugfs.c
create mode 100644 drivers/gpu/drm/qaic/qaic_debugfs.h
create mode 100644 drivers/gpu/drm/qaic/qaic_drv.c
create mode 100644 drivers/gpu/drm/qaic/qaic_ras.c
create mode 100644 drivers/gpu/drm/qaic/qaic_ras.h
create mode 100644 drivers/gpu/drm/qaic/qaic_ssr.c
create mode 100644 drivers/gpu/drm/qaic/qaic_ssr.h
create mode 100644 drivers/gpu/drm/qaic/qaic_sysfs.c
create mode 100644 drivers/gpu/drm/qaic/qaic_telemetry.c
create mode 100644 drivers/gpu/drm/qaic/qaic_telemetry.h
create mode 100644 drivers/gpu/drm/qaic/qaic_trace.h
create mode 100644 include/uapi/drm/qaic_drm.h

--
2.7.4


2022-08-15 21:04:42

by Jeffrey Hugo

Subject: [RFC PATCH 04/14] drm/qaic: Add MHI controller

A QAIC device contains an MHI interface with a number of different channels
for controlling different aspects of the device. The MHI controller works
with the MHI bus to enable and drive that interface.
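
For reviewers unfamiliar with MHI, here is a hedged sketch of how a client
driver binds to one of the channel pairs this patch defines. The example
names and the choice of QAIC_LOOPBACK are illustrative only, not part of
this patch; the SSR patch later in this series follows this same pattern
for real:

#include <linux/mhi.h>
#include <linux/module.h>

static int example_probe(struct mhi_device *mhi_dev,
			 const struct mhi_device_id *id)
{
	/* Prepare both directions of the named channel pair */
	return mhi_prepare_for_transfer(mhi_dev);
}

static void example_remove(struct mhi_device *mhi_dev)
{
	mhi_unprepare_from_transfer(mhi_dev);
}

/* Transfer completion callbacks; a real client would process buffers */
static void example_ul_xfer_cb(struct mhi_device *mhi_dev,
			       struct mhi_result *result)
{
}

static void example_dl_xfer_cb(struct mhi_device *mhi_dev,
			       struct mhi_result *result)
{
}

/* Match on a channel name from the controller's channel config */
static const struct mhi_device_id example_match_table[] = {
	{ .chan = "QAIC_LOOPBACK", },
	{},
};

static struct mhi_driver example_driver = {
	.id_table = example_match_table,
	.probe = example_probe,
	.remove = example_remove,
	.ul_xfer_cb = example_ul_xfer_cb,
	.dl_xfer_cb = example_dl_xfer_cb,
	.driver = {
		.name = "qaic_channel_example",
	},
};
module_mhi_driver(example_driver);
MODULE_LICENSE("GPL");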

Change-Id: I77363193b1a2dece7abab287a6acef3cac1b4e1b
Signed-off-by: Jeffrey Hugo <[email protected]>
---
drivers/gpu/drm/qaic/mhi_controller.c | 575 ++++++++++++++++++++++++++++++++++
drivers/gpu/drm/qaic/mhi_controller.h | 18 ++
2 files changed, 593 insertions(+)
create mode 100644 drivers/gpu/drm/qaic/mhi_controller.c
create mode 100644 drivers/gpu/drm/qaic/mhi_controller.h

diff --git a/drivers/gpu/drm/qaic/mhi_controller.c b/drivers/gpu/drm/qaic/mhi_controller.c
new file mode 100644
index 0000000..e88e0fe
--- /dev/null
+++ b/drivers/gpu/drm/qaic/mhi_controller.c
@@ -0,0 +1,575 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* Copyright (c) 2019-2021, The Linux Foundation. All rights reserved. */
+/* Copyright (c) 2021-2022 Qualcomm Innovation Center, Inc. All rights reserved. */
+
+#include <linux/delay.h>
+#include <linux/err.h>
+#include <linux/memblock.h>
+#include <linux/mhi.h>
+#include <linux/moduleparam.h>
+#include <linux/pci.h>
+#include <linux/sizes.h>
+
+#include "mhi_controller.h"
+#include "qaic.h"
+
+#define MAX_RESET_TIME_SEC 25
+
+static unsigned int mhi_timeout = 2000; /* 2 sec default */
+module_param(mhi_timeout, uint, 0600);
+
+static struct mhi_channel_config aic100_channels[] = {
+ {
+ .name = "QAIC_LOOPBACK",
+ .num = 0,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_LOOPBACK",
+ .num = 1,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_SAHARA",
+ .num = 2,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_SBL,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_SAHARA",
+ .num = 3,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_SBL,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_DIAG",
+ .num = 4,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_DIAG",
+ .num = 5,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_SSR",
+ .num = 6,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_SSR",
+ .num = 7,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_QDSS",
+ .num = 8,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_QDSS",
+ .num = 9,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_CONTROL",
+ .num = 10,
+ .num_elements = 128,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_CONTROL",
+ .num = 11,
+ .num_elements = 128,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_LOGGING",
+ .num = 12,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_SBL,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_LOGGING",
+ .num = 13,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_SBL,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_STATUS",
+ .num = 14,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_STATUS",
+ .num = 15,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_TELEMETRY",
+ .num = 16,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_TELEMETRY",
+ .num = 17,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_DEBUG",
+ .num = 18,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_DEBUG",
+ .num = 19,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_TIMESYNC",
+ .num = 20,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_TO_DEVICE,
+ .ee_mask = MHI_CH_EE_SBL | MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+ {
+ .name = "QAIC_TIMESYNC",
+ .num = 21,
+ .num_elements = 32,
+ .local_elements = 0,
+ .event_ring = 0,
+ .dir = DMA_FROM_DEVICE,
+ .ee_mask = MHI_CH_EE_SBL | MHI_CH_EE_AMSS,
+ .pollcfg = 0,
+ .doorbell = MHI_DB_BRST_DISABLE,
+ .lpm_notify = false,
+ .offload_channel = false,
+ .doorbell_mode_switch = false,
+ .auto_queue = false,
+ .wake_capable = false,
+ },
+};
+
+static struct mhi_event_config aic100_events[] = {
+ {
+ .num_elements = 32,
+ .irq_moderation_ms = 0,
+ .irq = 0,
+ .channel = U32_MAX,
+ .priority = 1,
+ .mode = MHI_DB_BRST_DISABLE,
+ .data_type = MHI_ER_CTRL,
+ .hardware_event = false,
+ .client_managed = false,
+ .offload_channel = false,
+ },
+};
+
+static struct mhi_controller_config aic100_config = {
+ .max_channels = 128,
+ .timeout_ms = 0, /* controlled by mhi_timeout */
+ .buf_len = 0,
+ .num_channels = ARRAY_SIZE(aic100_channels),
+ .ch_cfg = aic100_channels,
+ .num_events = ARRAY_SIZE(aic100_events),
+ .event_cfg = aic100_events,
+ .use_bounce_buf = false,
+ .m2_no_db = false,
+};
+
+static int mhi_read_reg(struct mhi_controller *mhi_cntl, void __iomem *addr, u32 *out)
+{
+ u32 tmp = readl_relaxed(addr);
+
+ if (tmp == U32_MAX)
+ return -EIO;
+
+ *out = tmp;
+
+ return 0;
+}
+
+static void mhi_write_reg(struct mhi_controller *mhi_cntl, void __iomem *addr,
+ u32 val)
+{
+ writel_relaxed(val, addr);
+}
+
+static int mhi_runtime_get(struct mhi_controller *mhi_cntl)
+{
+ return 0;
+}
+
+static void mhi_runtime_put(struct mhi_controller *mhi_cntl)
+{
+}
+
+static void mhi_status_cb(struct mhi_controller *mhi_cntl, enum mhi_callback reason)
+{
+ struct qaic_device *qdev = pci_get_drvdata(to_pci_dev(mhi_cntl->cntrl_dev));
+
+ /* this event occurs in atomic context */
+ if (reason == MHI_CB_FATAL_ERROR)
+ pci_err(qdev->pdev, "Fatal error received from device. Attempting to recover\n");
+ /* this event occurs in non-atomic context */
+ if (reason == MHI_CB_SYS_ERROR && !qdev->in_reset)
+ qaic_dev_reset_clean_local_state(qdev, true);
+}
+
+static int mhi_reset_and_async_power_up(struct mhi_controller *mhi_cntl)
+{
+ int time_sec = 1;
+ int current_ee;
+ int ret;
+
+ /* Reset the device to bring the device in PBL EE */
+ mhi_soc_reset(mhi_cntl);
+
+ /*
+ * Keep checking the execution environment (EE) at one second
+ * intervals.
+ */
+ do {
+ msleep(1000);
+ current_ee = mhi_get_exec_env(mhi_cntl);
+ } while (current_ee != MHI_EE_PBL && time_sec++ <= MAX_RESET_TIME_SEC);
+
+ /* If the device is in PBL EE, retry power up */
+ if (current_ee == MHI_EE_PBL)
+ ret = mhi_async_power_up(mhi_cntl);
+ else
+ ret = -EIO;
+
+ return ret;
+}
+
+struct mhi_controller *qaic_mhi_register_controller(struct pci_dev *pci_dev,
+ void __iomem *mhi_bar,
+ int mhi_irq)
+{
+ struct mhi_controller *mhi_cntl;
+ int ret;
+
+ mhi_cntl = kzalloc(sizeof(*mhi_cntl), GFP_KERNEL);
+ if (!mhi_cntl)
+ return ERR_PTR(-ENOMEM);
+
+ mhi_cntl->cntrl_dev = &pci_dev->dev;
+
+ /*
+ * Covers the entire possible physical ram region. Remote side is
+ * going to calculate a size of this range, so subtract 1 to prevent
+ * rollover.
+ */
+ mhi_cntl->iova_start = 0;
+ mhi_cntl->iova_stop = PHYS_ADDR_MAX - 1;
+
+ mhi_cntl->status_cb = mhi_status_cb;
+ mhi_cntl->runtime_get = mhi_runtime_get;
+ mhi_cntl->runtime_put = mhi_runtime_put;
+ mhi_cntl->read_reg = mhi_read_reg;
+ mhi_cntl->write_reg = mhi_write_reg;
+ mhi_cntl->regs = mhi_bar;
+ mhi_cntl->reg_len = SZ_4K;
+ mhi_cntl->nr_irqs = 1;
+ mhi_cntl->irq = kmalloc(sizeof(*mhi_cntl->irq), GFP_KERNEL);
+
+ if (!mhi_cntl->irq) {
+ kfree(mhi_cntl);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ mhi_cntl->irq[0] = mhi_irq;
+
+ mhi_cntl->fw_image = "qcom/aic100/sbl.bin";
+
+ /* use latest configured timeout */
+ aic100_config.timeout_ms = mhi_timeout;
+ ret = mhi_register_controller(mhi_cntl, &aic100_config);
+ if (ret) {
+ pci_err(pci_dev, "mhi_register_controller failed %d\n", ret);
+ kfree(mhi_cntl->irq);
+ kfree(mhi_cntl);
+ return ERR_PTR(ret);
+ }
+
+ ret = mhi_prepare_for_power_up(mhi_cntl);
+ if (ret) {
+ pci_err(pci_dev, "mhi_prepare_for_power_up failed %d\n", ret);
+ mhi_unregister_controller(mhi_cntl);
+ kfree(mhi_cntl->irq);
+ kfree(mhi_cntl);
+ return ERR_PTR(ret);
+ }
+
+ ret = mhi_async_power_up(mhi_cntl);
+ /*
+ * If -EIO is returned, it is possible the device is in SBL EE, which
+ * is undesired. SoC reset the device and try to power up again.
+ */
+ if (ret == -EIO && mhi_get_exec_env(mhi_cntl) == MHI_EE_SBL) {
+ pci_err(pci_dev, "Device is not expected to be in SBL EE. SoC resetting the device to put it in PBL EE and retrying mhi_async_power_up. Error %d\n",
+ ret);
+ ret = mhi_reset_and_async_power_up(mhi_cntl);
+ }
+
+ if (ret) {
+ pci_err(pci_dev, "mhi_async_power_up failed %d\n", ret);
+ mhi_unprepare_after_power_down(mhi_cntl);
+ mhi_unregister_controller(mhi_cntl);
+ kfree(mhi_cntl->irq);
+ kfree(mhi_cntl);
+ return ERR_PTR(ret);
+ }
+
+ return mhi_cntl;
+}
+
+void qaic_mhi_free_controller(struct mhi_controller *mhi_cntl, bool link_up)
+{
+ mhi_power_down(mhi_cntl, link_up);
+ mhi_unprepare_after_power_down(mhi_cntl);
+ mhi_unregister_controller(mhi_cntl);
+ kfree(mhi_cntl->irq);
+ kfree(mhi_cntl);
+}
+
+void qaic_mhi_start_reset(struct mhi_controller *mhi_cntl)
+{
+ mhi_power_down(mhi_cntl, true);
+}
+
+void qaic_mhi_reset_done(struct mhi_controller *mhi_cntl)
+{
+ struct pci_dev *pci_dev = container_of(mhi_cntl->cntrl_dev,
+ struct pci_dev, dev);
+ int ret;
+
+ ret = mhi_async_power_up(mhi_cntl);
+ if (ret)
+ pci_err(pci_dev, "mhi_async_power_up failed after reset %d\n", ret);
+}
diff --git a/drivers/gpu/drm/qaic/mhi_controller.h b/drivers/gpu/drm/qaic/mhi_controller.h
new file mode 100644
index 0000000..5a739bb
--- /dev/null
+++ b/drivers/gpu/drm/qaic/mhi_controller.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * Copyright (c) 2019-2020, The Linux Foundation. All rights reserved.
+ */
+
+#ifndef MHICONTROLLERQAIC_H_
+#define MHICONTROLLERQAIC_H_
+
+struct mhi_controller *qaic_mhi_register_controller(struct pci_dev *pci_dev,
+ void __iomem *mhi_bar,
+ int mhi_irq);
+
+void qaic_mhi_free_controller(struct mhi_controller *mhi_cntl, bool link_up);
+
+void qaic_mhi_start_reset(struct mhi_controller *mhi_cntl);
+void qaic_mhi_reset_done(struct mhi_controller *mhi_cntl);
+
+#endif /* MHICONTROLLERQAIC_H_ */
--
2.7.4

2022-08-15 21:09:34

by Jeffrey Hugo

Subject: [RFC PATCH 13/14] drm/qaic: Add qaic driver to the build system

Add the infrastructure that allows the QAIC driver to be built.

Change-Id: I5b609b2e91b6a99939bdac35849813263ad874af
Signed-off-by: Jeffrey Hugo <[email protected]>
---
drivers/gpu/drm/Kconfig | 2 ++
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/qaic/Kconfig | 33 +++++++++++++++++++++++++++++++++
drivers/gpu/drm/qaic/Makefile | 17 +++++++++++++++++
4 files changed, 53 insertions(+)
create mode 100644 drivers/gpu/drm/qaic/Kconfig
create mode 100644 drivers/gpu/drm/qaic/Makefile

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index b1f22e4..b614940 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -390,6 +390,8 @@ source "drivers/gpu/drm/gud/Kconfig"

source "drivers/gpu/drm/sprd/Kconfig"

+source "drivers/gpu/drm/qaic/Kconfig"
+
config DRM_HYPERV
tristate "DRM Support for Hyper-V synthetic video device"
depends on DRM && PCI && MMU && HYPERV
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 301a44d..28b0f1b 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -135,3 +135,4 @@ obj-y += xlnx/
obj-y += gud/
obj-$(CONFIG_DRM_HYPERV) += hyperv/
obj-$(CONFIG_DRM_SPRD) += sprd/
+obj-$(CONFIG_DRM_QAIC) += qaic/
diff --git a/drivers/gpu/drm/qaic/Kconfig b/drivers/gpu/drm/qaic/Kconfig
new file mode 100644
index 0000000..eca2bcb
--- /dev/null
+++ b/drivers/gpu/drm/qaic/Kconfig
@@ -0,0 +1,33 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Qualcomm Cloud AI accelerators driver
+#
+
+config DRM_QAIC
+ tristate "Qualcomm Cloud AI accelerators"
+ depends on PCI && HAS_IOMEM
+ depends on MHI_BUS
+ depends on DRM
+ depends on MMU
+ select CRC32
+ help
+ Enables driver for Qualcomm's Cloud AI accelerator PCIe cards that are
+ designed to accelerate Deep Learning inference workloads.
+
+ The driver manages the PCIe devices and provides an IOCTL interface
+ for users to submit workloads to the devices.
+
+ If unsure, say N.
+
+ To compile this driver as a module, choose M here: the
+ module will be called qaic.
+
+config QAIC_HWMON
+ bool "Qualcomm Cloud AI accelerator telemetry"
+ depends on DRM_QAIC
+ depends on HWMON
+ help
+ Enables telemetry via the HWMON interface for Qualcomm's Cloud AI
+ accelerator PCIe cards.
+
+ If unsure, say N.
diff --git a/drivers/gpu/drm/qaic/Makefile b/drivers/gpu/drm/qaic/Makefile
new file mode 100644
index 0000000..4a5daff
--- /dev/null
+++ b/drivers/gpu/drm/qaic/Makefile
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for Qualcomm Cloud AI accelerators driver
+#
+
+obj-$(CONFIG_DRM_QAIC) := qaic.o
+
+qaic-y := \
+ qaic_drv.o \
+ mhi_controller.o \
+ qaic_control.o \
+ qaic_data.o \
+ qaic_debugfs.o \
+ qaic_telemetry.o \
+ qaic_ras.o \
+ qaic_ssr.o \
+ qaic_sysfs.o
--
2.7.4

2022-08-15 21:12:23

by Jeffrey Hugo

Subject: [RFC PATCH 10/14] drm/qaic: Add sysfs

The QAIC driver can advertise the state of individual dma_bridge channels
to userspace. Userspace can use this information to manage userspace
state when a channel crashes.
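
As a rough illustration of the intended usage (the sysfs path below is an
assumption derived from the dbc%d_state attribute names this patch creates
on the parent device of the DRM node; it is not a documented ABI path):

#include <stdio.h>

int main(void)
{
	char buf[16];
	/* Hypothetical path; adjust for the actual device */
	FILE *f = fopen("/sys/class/drm/card0/device/dbc0_state", "r");

	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("dbc0 state: %s", buf); /* numeric DBC state value */
	fclose(f);
	return 0;
}

Instead of polling, userspace can also listen for the KOBJ_CHANGE uevents
(DBC_ID/DBC_STATE) that this patch emits on state transitions.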

Change-Id: Ifc7435c53cec6aa326bdcd9bfcb77ea7f2a63bab
Signed-off-by: Jeffrey Hugo <[email protected]>
---
drivers/gpu/drm/qaic/qaic_sysfs.c | 113 ++++++++++++++++++++++++++++++++++++++
1 file changed, 113 insertions(+)
create mode 100644 drivers/gpu/drm/qaic/qaic_sysfs.c

diff --git a/drivers/gpu/drm/qaic/qaic_sysfs.c b/drivers/gpu/drm/qaic/qaic_sysfs.c
new file mode 100644
index 0000000..5ee1696
--- /dev/null
+++ b/drivers/gpu/drm/qaic/qaic_sysfs.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* Copyright (c) 2020-2021, The Linux Foundation. All rights reserved. */
+
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/sysfs.h>
+
+#include "qaic.h"
+
+#define NAME_LEN 14
+
+struct dbc_attribute {
+ struct device_attribute dev_attr;
+ u32 dbc_id;
+ char name[NAME_LEN];
+};
+
+static ssize_t dbc_state_show(struct device *dev,
+ struct device_attribute *a, char *buf)
+{
+ struct dbc_attribute *attr = container_of(a, struct dbc_attribute, dev_attr);
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+
+ return sprintf(buf, "%d\n", qdev->dbc[attr->dbc_id].state);
+}
+
+void set_dbc_state(struct qaic_device *qdev, u32 dbc_id, unsigned int state)
+{
+ char id_str[12];
+ char state_str[16];
+ char *envp[] = { id_str, state_str, NULL };
+ struct qaic_drm_device *qddev;
+
+ if (state >= DBC_STATE_MAX) {
+ pci_dbg(qdev->pdev, "%s invalid state %d\n", __func__, state);
+ return;
+ }
+ if (dbc_id >= qdev->num_dbc) {
+ pci_dbg(qdev->pdev, "%s invalid dbc_id %d\n", __func__, dbc_id);
+ return;
+ }
+ if (state == qdev->dbc[dbc_id].state) {
+ pci_dbg(qdev->pdev, "%s already at state %d\n", __func__, state);
+ return;
+ }
+
+ snprintf(id_str, ARRAY_SIZE(id_str), "DBC_ID=%d", dbc_id);
+ snprintf(state_str, ARRAY_SIZE(state_str), "DBC_STATE=%d", state);
+
+ qdev->dbc[dbc_id].state = state;
+ mutex_lock(&qdev->qaic_drm_devices_mutex);
+ list_for_each_entry(qddev, &qdev->qaic_drm_devices, node)
+ kobject_uevent_env(&qddev->ddev->dev->kobj, KOBJ_CHANGE, envp);
+ mutex_unlock(&qdev->qaic_drm_devices_mutex);
+}
+
+int qaic_sysfs_init(struct qaic_drm_device *qddev)
+{
+ u32 num_dbc = qddev->qdev->num_dbc;
+ struct dbc_attribute *dbc_attrs;
+ int i, ret = 0;
+
+ dbc_attrs = kcalloc(num_dbc, sizeof(*dbc_attrs), GFP_KERNEL);
+ if (!dbc_attrs)
+ return -ENOMEM;
+
+ qddev->sysfs_attrs = dbc_attrs;
+
+ for (i = 0; i < num_dbc; ++i) {
+ struct dbc_attribute *dbc = &dbc_attrs[i];
+
+ sysfs_attr_init(&dbc->dev_attr.attr);
+ dbc->dbc_id = i;
+ snprintf(dbc->name, NAME_LEN, "dbc%d_state", i);
+ dbc->dev_attr.attr.name = dbc->name;
+ dbc->dev_attr.attr.mode = 0444;
+ dbc->dev_attr.show = dbc_state_show;
+ ret = sysfs_create_file(&qddev->ddev->dev->kobj,
+ &dbc->dev_attr.attr);
+ if (ret) {
+ int j;
+
+ for (j = 0; j < i; ++j) {
+ dbc = &dbc_attrs[j];
+ sysfs_remove_file(&qddev->ddev->dev->kobj,
+ &dbc->dev_attr.attr);
+ }
+ break;
+ }
+ }
+
+ if (ret) {
+ qddev->sysfs_attrs = NULL;
+ kfree(dbc_attrs);
+ }
+
+ return ret;
+}
+
+void qaic_sysfs_remove(struct qaic_drm_device *qddev)
+{
+ struct dbc_attribute *dbc_attrs = qddev->sysfs_attrs;
+ u32 num_dbc = qddev->qdev->num_dbc;
+ int i;
+
+ for (i = 0; i < num_dbc; ++i)
+ sysfs_remove_file(&qddev->ddev->dev->kobj,
+ &dbc_attrs[i].dev_attr.attr);
+
+ kfree(dbc_attrs);
+}
--
2.7.4

2022-08-15 21:13:11

by Jeffrey Hugo

Subject: [RFC PATCH 09/14] drm/qaic: Add ssr component

A QAIC device supports the concept of subsystem restart (ssr). If a
processing unit for a workload crashes, it is possible to reset that unit
instead of crashing the device. Since such an error is likely related to
the workload code that was running, it is possible to collect a crashdump
of the workload for offline analysis.
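
As a hedged sketch of the consumer side (not part of this patch), the
crashdump handed to dev_coredumpv() can be retrieved through the standard
devcoredump class device; the instance number below is hypothetical since
devcoredump numbers instances dynamically:

#include <stdio.h>

int main(void)
{
	char buf[4096];
	size_t n;
	FILE *in = fopen("/sys/class/devcoredump/devcd1/data", "rb");
	FILE *out = fopen("qaic_ssr.dump", "wb");

	if (!in || !out)
		return 1;
	/* Copy the dump out; writing to "data" instead would discard it */
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
		fwrite(buf, 1, n, out);
	fclose(in);
	fclose(out);
	return 0;
}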

Change-Id: I77aa21ecbf0f730d8736a7465285ce5290ed3745
Signed-off-by: Jeffrey Hugo <[email protected]>
---
drivers/gpu/drm/qaic/qaic_ssr.c | 889 ++++++++++++++++++++++++++++++++++++++++
drivers/gpu/drm/qaic/qaic_ssr.h | 13 +
2 files changed, 902 insertions(+)
create mode 100644 drivers/gpu/drm/qaic/qaic_ssr.c
create mode 100644 drivers/gpu/drm/qaic/qaic_ssr.h

diff --git a/drivers/gpu/drm/qaic/qaic_ssr.c b/drivers/gpu/drm/qaic/qaic_ssr.c
new file mode 100644
index 0000000..826361b
--- /dev/null
+++ b/drivers/gpu/drm/qaic/qaic_ssr.c
@@ -0,0 +1,889 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* Copyright (c) 2020-2021, The Linux Foundation. All rights reserved. */
+/* Copyright (c) 2021-2022 Qualcomm Innovation Center, Inc. All rights reserved. */
+
+#include <asm/byteorder.h>
+#include <linux/devcoredump.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/mhi.h>
+#include <linux/workqueue.h>
+
+#include "qaic.h"
+#include "qaic_ssr.h"
+#include "qaic_trace.h"
+
+#define MSG_BUF_SZ 32
+#define MAX_PAGE_DUMP_RESP 4 /* Must be a power of 2 */
+
+enum ssr_cmds {
+ DEBUG_TRANSFER_INFO = BIT(0),
+ DEBUG_TRANSFER_INFO_RSP = BIT(1),
+ MEMORY_READ = BIT(2),
+ MEMORY_READ_RSP = BIT(3),
+ DEBUG_TRANSFER_DONE = BIT(4),
+ DEBUG_TRANSFER_DONE_RSP = BIT(5),
+ SSR_EVENT = BIT(8),
+ SSR_EVENT_RSP = BIT(9),
+};
+
+enum ssr_events {
+ SSR_EVENT_NACK = BIT(0),
+ BEFORE_SHUTDOWN = BIT(1),
+ AFTER_SHUTDOWN = BIT(2),
+ BEFORE_POWER_UP = BIT(3),
+ AFTER_POWER_UP = BIT(4),
+};
+
+struct debug_info_table {
+ /* Save preferences. Default is mandatory */
+ u64 save_perf;
+ /* Base address of the debug region */
+ u64 mem_base;
+ /* Size of debug region in bytes */
+ u64 len;
+ /* Description */
+ char desc[20];
+ /* Filename of debug region */
+ char filename[20];
+};
+
+struct _ssr_hdr {
+ __le32 cmd;
+ __le32 len;
+ __le32 dbc_id;
+};
+
+struct ssr_hdr {
+ u32 cmd;
+ u32 len;
+ u32 dbc_id;
+};
+
+struct ssr_debug_transfer_info {
+ struct ssr_hdr hdr;
+ u32 resv;
+ u64 tbl_addr;
+ u64 tbl_len;
+} __packed;
+
+struct ssr_debug_transfer_info_rsp {
+ struct _ssr_hdr hdr;
+ __le32 ret;
+} __packed;
+
+struct ssr_memory_read {
+ struct _ssr_hdr hdr;
+ __le32 resv;
+ __le64 addr;
+ __le64 len;
+} __packed;
+
+struct ssr_memory_read_rsp {
+ struct _ssr_hdr hdr;
+ __le32 resv;
+ u8 data[];
+} __packed;
+
+struct ssr_debug_transfer_done {
+ struct _ssr_hdr hdr;
+ __le32 resv;
+} __packed;
+
+struct ssr_debug_transfer_done_rsp {
+ struct _ssr_hdr hdr;
+ __le32 ret;
+} __packed;
+
+struct ssr_event {
+ struct ssr_hdr hdr;
+ u32 event;
+} __packed;
+
+struct ssr_event_rsp {
+ struct _ssr_hdr hdr;
+ __le32 event;
+} __packed;
+
+struct ssr_resp {
+ /* Work struct to schedule work coming on QAIC_SSR channel */
+ struct work_struct work;
+ /* Root struct of device, used to access device resources */
+ struct qaic_device *qdev;
+ /* Buffer used by MHI for transfer requests */
+ u8 data[] __aligned(8);
+};
+
+/* SSR crashdump bookkeeping structure */
+struct ssr_dump_info {
+ /* DBC associated with this SSR crashdump */
+ struct dma_bridge_chan *dbc;
+ /*
+ * It will be used when we complete the crashdump download and switch
+ * to waiting on SSR events
+ */
+ struct ssr_resp *resp;
+ /* We use this buffer to queue Crashdump downloading requests */
+ struct ssr_resp *dump_resp;
+ /* TRUE: dump_resp is queued for MHI transaction. FALSE: Otherwise */
+ bool dump_resp_queued;
+ /* TRUE: mem_rd_buf is queued for MHI transaction. FALSE: Otherwise */
+ bool mem_rd_buf_queued;
+ /* MEMORY READ request MHI buffer */
+ struct ssr_memory_read *mem_rd_buf;
+ /* Address of table in host */
+ void *tbl_addr;
+ /* Ptr to the entire dump */
+ void *dump_addr;
+ /* Address of table in device/target */
+ u64 tbl_addr_dev;
+ /* Total size of table */
+ u64 tbl_len;
+ /* Entire crashdump size */
+ u64 dump_sz;
+ /* Size of the buffer queued in for MHI transfer */
+ u64 resp_buf_sz;
+ /*
+ * Crashdump will be collected chunk by chunk and this is max size of
+ * one chunk
+ */
+ u64 chunk_sz;
+ /* Offset of table (tbl_addr) where the new chunk will be dumped */
+ u64 tbl_off;
+ /* Points to the table entry we are currently downloading */
+ struct debug_info_table *tbl_ent;
+ /* Number of bytes downloaded for current entry in table */
+ u64 tbl_ent_rd;
+ /* Offset of crashdump (dump_addr) where the new chunk will be dumped */
+ u64 dump_off;
+};
+
+struct dump_file_meta {
+ u64 size; /* Total size of the entire dump */
+ u64 tbl_len; /* Length of the table in bytes */
+};
+
+/*
+ * Layout of crashdump
+ * +------------------------------------------+
+ * | Crashdump Meta structure |
+ * | type: struct dump_file_meta |
+ * +------------------------------------------+
+ * | Crashdump Table |
+ * | type: array of struct debug_info_table |
+ * | |
+ * | |
+ * | |
+ * +------------------------------------------+
+ * | Crashdump |
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * +------------------------------------------+
+ */
+
+static void free_ssr_dump_buf(struct ssr_dump_info *dump_info)
+{
+ if (!dump_info)
+ return;
+ if (!dump_info->mem_rd_buf_queued)
+ kfree(dump_info->mem_rd_buf);
+ if (!dump_info->dump_resp_queued)
+ kfree(dump_info->dump_resp);
+ trace_qaic_ssr_dump(dump_info->dbc->qdev, "SSR releasing resources required during crashdump collection");
+ vfree(dump_info->tbl_addr);
+ vfree(dump_info->dump_addr);
+ dump_info->dbc->dump_info = NULL;
+ kfree(dump_info);
+}
+
+void clean_up_ssr(struct qaic_device *qdev, u32 dbc_id)
+{
+ dbc_exit_ssr(qdev, dbc_id);
+ free_ssr_dump_buf(qdev->dbc[dbc_id].dump_info);
+}
+
+static int alloc_dump(struct ssr_dump_info *dump_info)
+{
+ struct debug_info_table *tbl_ent = dump_info->tbl_addr;
+ struct dump_file_meta *dump_meta;
+ u64 tbl_sz_lp = 0;
+ u64 sz = 0;
+
+ while (tbl_sz_lp < dump_info->tbl_len) {
+ le64_to_cpus(&tbl_ent->save_perf);
+ le64_to_cpus(&tbl_ent->mem_base);
+ le64_to_cpus(&tbl_ent->len);
+
+ if (tbl_ent->len == 0) {
+ pci_warn(dump_info->dump_resp->qdev->pdev, "An entry in dump table points to 0 len segment. Entry index %llu desc %.20s filename %.20s.\n",
+ tbl_sz_lp / sizeof(*tbl_ent), tbl_ent->desc,
+ tbl_ent->filename);
+ return -EINVAL;
+ }
+
+ sz += tbl_ent->len;
+ tbl_ent++;
+ tbl_sz_lp += sizeof(*tbl_ent);
+ }
+
+ dump_info->dump_sz = sz + dump_info->tbl_len + sizeof(*dump_meta);
+ /* The actual crashdump is offset by the crashdump meta and table */
+ dump_info->dump_off = dump_info->tbl_len + sizeof(*dump_meta);
+
+ dump_info->dump_addr = vzalloc(dump_info->dump_sz);
+ if (!dump_info->dump_addr) {
+ pci_warn(dump_info->dump_resp->qdev->pdev, "Failed to allocate crashdump memory. Virtual memory requested %llu\n",
+ dump_info->dump_sz);
+ return -ENOMEM;
+ }
+
+ trace_qaic_ssr_dump(dump_info->dbc->qdev, "SSR crashdump memory is allocated. Crashdump collection will be initiated");
+
+ /* Copy crashdump meta and table */
+ dump_meta = dump_info->dump_addr;
+ dump_meta->size = dump_info->dump_sz;
+ dump_meta->tbl_len = dump_info->tbl_len;
+ memcpy(dump_info->dump_addr + sizeof(*dump_meta), dump_info->tbl_addr,
+ dump_info->tbl_len);
+
+ return 0;
+}
+
+static int send_xfer_done(struct qaic_device *qdev, void *resp, u32 dbc_id)
+{
+ struct ssr_debug_transfer_done *xfer_done;
+ int ret;
+
+ xfer_done = kmalloc(sizeof(*xfer_done), GFP_KERNEL);
+ if (!xfer_done) {
+ pci_warn(qdev->pdev, "Failed to allocate SSR transfer done request struct. DBC ID %u. Physical memory requested %lu\n",
+ dbc_id, sizeof(*xfer_done));
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = mhi_queue_buf(qdev->ssr_ch, DMA_FROM_DEVICE, resp,
+ MSG_BUF_SZ, MHI_EOT);
+ if (ret) {
+ pci_warn(qdev->pdev, "Could not queue SSR transfer done response %d. DBC ID %u.\n",
+ ret, dbc_id);
+ goto free_xfer_done;
+ }
+
+ xfer_done->hdr.cmd = cpu_to_le32(DEBUG_TRANSFER_DONE);
+ xfer_done->hdr.len = cpu_to_le32(sizeof(*xfer_done));
+ xfer_done->hdr.dbc_id = cpu_to_le32(dbc_id);
+
+ ret = mhi_queue_buf(qdev->ssr_ch, DMA_TO_DEVICE, xfer_done,
+ sizeof(*xfer_done), MHI_EOT);
+ if (ret) {
+ pci_warn(qdev->pdev, "Could not send DEBUG TRANSFER DONE %d. DBC ID %u.\n",
+ ret, dbc_id);
+ goto free_xfer_done;
+ }
+
+ return 0;
+
+free_xfer_done:
+ kfree(xfer_done);
+out:
+ return ret;
+}
+
+static int send_mem_rd(struct qaic_device *qdev, struct ssr_dump_info *dump_info,
+ u64 dest_addr, u64 dest_len)
+{
+ u32 dbc_id = dump_info->dbc->id;
+ int ret;
+
+ ret = mhi_queue_buf(qdev->ssr_ch, DMA_FROM_DEVICE,
+ dump_info->dump_resp->data,
+ dump_info->resp_buf_sz, MHI_EOT);
+ if (ret) {
+ pci_warn(qdev->pdev, "Could not queue SSR dump buf %d. DBC ID %u.\n",
+ ret, dbc_id);
+ goto out;
+ } else {
+ dump_info->dump_resp_queued = true;
+ }
+
+ dump_info->mem_rd_buf->hdr.cmd = cpu_to_le32(MEMORY_READ);
+ dump_info->mem_rd_buf->hdr.len =
+ cpu_to_le32(sizeof(*dump_info->mem_rd_buf));
+ dump_info->mem_rd_buf->hdr.dbc_id = cpu_to_le32(dbc_id);
+ dump_info->mem_rd_buf->addr = cpu_to_le64(dest_addr);
+ dump_info->mem_rd_buf->len = cpu_to_le64(dest_len);
+
+ ret = mhi_queue_buf(qdev->ssr_ch, DMA_TO_DEVICE,
+ dump_info->mem_rd_buf,
+ sizeof(*dump_info->mem_rd_buf), MHI_EOT);
+ if (ret)
+ pci_warn(qdev->pdev, "Could not send MEMORY READ %d. DBC ID %u.\n",
+ ret, dbc_id);
+ else
+ dump_info->mem_rd_buf_queued = true;
+
+out:
+ return ret;
+}
+
+static int ssr_copy_table(struct ssr_dump_info *dump_info, void *data, u64 len)
+{
+ if (len > dump_info->tbl_len - dump_info->tbl_off) {
+ pci_warn(dump_info->dump_resp->qdev->pdev, "Invalid data length of table chunk. Length provided %llu & at most expected length %llu\n",
+ len, dump_info->tbl_len - dump_info->tbl_off);
+ return -EINVAL;
+ }
+
+ memcpy(dump_info->tbl_addr + dump_info->tbl_off, data, len);
+
+ dump_info->tbl_off += len;
+
+ /* Entire table has been downloaded, alloc dump memory */
+ if (dump_info->tbl_off == dump_info->tbl_len) {
+ dump_info->tbl_ent = dump_info->tbl_addr;
+ trace_qaic_ssr_dump(dump_info->dbc->qdev, "SSR debug table download complete");
+ return alloc_dump(dump_info);
+ }
+
+ return 0;
+}
+
+static int ssr_copy_dump(struct ssr_dump_info *dump_info, void *data, u64 len)
+{
+ struct debug_info_table *tbl_ent;
+
+ tbl_ent = dump_info->tbl_ent;
+
+ if (len > tbl_ent->len - dump_info->tbl_ent_rd) {
+ pci_warn(dump_info->dump_resp->qdev->pdev, "Invalid data length of dump chunk. Length provided %llu & at most expected length %llu. Segment details base_addr: 0x%llx len: %llu desc: %.20s filename: %.20s.\n",
+ len, tbl_ent->len - dump_info->tbl_ent_rd,
+ tbl_ent->mem_base, tbl_ent->len, tbl_ent->desc,
+ tbl_ent->filename);
+ return -EINVAL;
+ }
+
+ memcpy(dump_info->dump_addr + dump_info->dump_off, data, len);
+
+ dump_info->dump_off += len;
+ dump_info->tbl_ent_rd += len;
+
+ /* Current segment of the crashdump is complete, move to next one */
+ if (tbl_ent->len == dump_info->tbl_ent_rd) {
+ dump_info->tbl_ent++;
+ dump_info->tbl_ent_rd = 0;
+ }
+
+ return 0;
+}
+
+static void ssr_dump_worker(struct work_struct *work)
+{
+ struct ssr_resp *dump_resp =
+ container_of(work, struct ssr_resp, work);
+ struct qaic_device *qdev = dump_resp->qdev;
+ struct ssr_memory_read_rsp *mem_rd_resp;
+ struct debug_info_table *tbl_ent;
+ struct ssr_dump_info *dump_info;
+ u64 dest_addr, dest_len;
+ struct _ssr_hdr *_hdr;
+ struct ssr_hdr hdr;
+ u64 data_len;
+ int ret;
+
+ mem_rd_resp = (struct ssr_memory_read_rsp *)dump_resp->data;
+ _hdr = &mem_rd_resp->hdr;
+ hdr.cmd = le32_to_cpu(_hdr->cmd);
+ hdr.len = le32_to_cpu(_hdr->len);
+ hdr.dbc_id = le32_to_cpu(_hdr->dbc_id);
+
+ if (hdr.dbc_id >= qdev->num_dbc) {
+ pci_warn(qdev->pdev, "Dropping SSR message with invalid DBC ID %u. DBC ID should be less than %u.\n",
+ hdr.dbc_id, qdev->num_dbc);
+ goto reset_device;
+ }
+ dump_info = qdev->dbc[hdr.dbc_id].dump_info;
+
+ if (!dump_info) {
+ pci_warn(qdev->pdev, "Dropping SSR message with invalid dbc id %u. Crashdump is not initiated for this DBC ID.\n",
+ hdr.dbc_id);
+ goto reset_device;
+ }
+
+ dump_info->dump_resp_queued = false;
+
+ if (hdr.cmd != MEMORY_READ_RSP) {
+ pci_warn(qdev->pdev, "Dropping SSR message with invalid CMD %u. Expected command is %u.\n",
+ hdr.cmd, MEMORY_READ_RSP);
+ goto free_dump_info;
+ }
+
+ if (hdr.len > dump_info->resp_buf_sz) {
+ pci_warn(qdev->pdev, "Dropping SSR message with invalid length %u. At most length expected is %llu.\n",
+ hdr.len, dump_info->resp_buf_sz);
+ goto free_dump_info;
+ }
+
+ data_len = hdr.len - sizeof(*mem_rd_resp);
+
+ if (dump_info->tbl_off < dump_info->tbl_len)
+ /* Chunk belongs to table */
+ ret = ssr_copy_table(dump_info, mem_rd_resp->data, data_len);
+ else
+ /* Chunk belongs to crashdump */
+ ret = ssr_copy_dump(dump_info, mem_rd_resp->data, data_len);
+
+ if (ret)
+ goto free_dump_info;
+
+ if (dump_info->tbl_off < dump_info->tbl_len) {
+ /* Continue downloading table */
+ dest_addr = dump_info->tbl_addr_dev + dump_info->tbl_off;
+ dest_len = min(dump_info->chunk_sz,
+ dump_info->tbl_len - dump_info->tbl_off);
+ ret = send_mem_rd(qdev, dump_info, dest_addr, dest_len);
+ } else if (dump_info->dump_off < dump_info->dump_sz) {
+ /* Continue downloading crashdump */
+ tbl_ent = dump_info->tbl_ent;
+ dest_addr = tbl_ent->mem_base + dump_info->tbl_ent_rd;
+ dest_len = min(dump_info->chunk_sz,
+ tbl_ent->len - dump_info->tbl_ent_rd);
+ ret = send_mem_rd(qdev, dump_info, dest_addr, dest_len);
+ } else {
+ /* Crashdump download complete */
+ trace_qaic_ssr_dump(qdev, "SSR crashdump download complete");
+ ret = send_xfer_done(qdev, dump_info->resp->data, hdr.dbc_id);
+ }
+
+ if (ret)
+ /* Most likely a MHI xfer has failed */
+ goto free_dump_info;
+
+ return;
+
+free_dump_info:
+ /* Free the allocated memory */
+ free_ssr_dump_buf(dump_info);
+reset_device:
+ /*
+ * After a subsystem crash, crashdump collection begins, but something
+ * went wrong while collecting the crashdump. Instead of handling this
+ * error, just reset the device since a best effort has been made.
+ */
+ mhi_soc_reset(qdev->mhi_cntl);
+}
+
+static struct ssr_dump_info *alloc_dump_info(struct qaic_device *qdev,
+ struct ssr_debug_transfer_info *debug_info)
+{
+ struct ssr_dump_info *dump_info;
+ int nr_page;
+ int ret;
+
+ le64_to_cpus(&debug_info->tbl_len);
+ le64_to_cpus(&debug_info->tbl_addr);
+
+ if (debug_info->tbl_len == 0 ||
+ debug_info->tbl_len % sizeof(struct debug_info_table) != 0) {
+ pci_warn(qdev->pdev, "Invalid table length %llu passed. Table length should be non-zero & multiple of %lu\n",
+ debug_info->tbl_len, sizeof(struct debug_info_table));
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Allocate SSR crashdump bookkeeping structure */
+ dump_info = kzalloc(sizeof(*dump_info), GFP_KERNEL);
+ if (!dump_info) {
+ pci_warn(qdev->pdev, "Failed to allocate SSR dump book keeping buffer. Physical memory requested %lu\n",
+ sizeof(*dump_info));
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* Allocate SSR crashdump request buffer, used for SSR MEMORY READ */
+ nr_page = MAX_PAGE_DUMP_RESP;
+ while (nr_page > 0) {
+ dump_info->dump_resp = kzalloc(nr_page * PAGE_SIZE,
+ GFP_KERNEL | __GFP_NOWARN);
+ if (dump_info->dump_resp)
+ break;
+ nr_page >>= 1;
+ }
+
+ if (!dump_info->dump_resp) {
+ pci_warn(qdev->pdev, "Failed to allocate SSR dump response buffer. Physical memory requested %lu\n",
+ PAGE_SIZE);
+ ret = -ENOMEM;
+ goto free_dump_info;
+ }
+
+ INIT_WORK(&dump_info->dump_resp->work, ssr_dump_worker);
+ dump_info->dump_resp->qdev = qdev;
+
+ dump_info->tbl_addr_dev = debug_info->tbl_addr;
+ dump_info->tbl_len = debug_info->tbl_len;
+ dump_info->resp_buf_sz = nr_page * PAGE_SIZE -
+ sizeof(*dump_info->dump_resp);
+ dump_info->chunk_sz = dump_info->resp_buf_sz -
+ sizeof(struct ssr_memory_read_rsp);
+
+ dump_info->tbl_addr = vzalloc(dump_info->tbl_len);
+ if (!dump_info->tbl_addr) {
+ pci_warn(qdev->pdev, "Failed to allocate SSR table struct. Virtual memory requested %llu\n",
+ dump_info->tbl_len);
+ ret = -ENOMEM;
+ goto free_dump_resp;
+ }
+
+ dump_info->mem_rd_buf = kzalloc(sizeof(*dump_info->mem_rd_buf),
+ GFP_KERNEL);
+ if (!dump_info->mem_rd_buf) {
+ pci_warn(qdev->pdev, "Failed to allocate memory read request buffer for MHI transactions. Physical memory requested %lu\n",
+ sizeof(*dump_info->mem_rd_buf));
+ ret = -ENOMEM;
+ goto free_dump_tbl;
+ }
+
+ return dump_info;
+
+free_dump_tbl:
+ vfree(dump_info->tbl_addr);
+free_dump_resp:
+ kfree(dump_info->dump_resp);
+free_dump_info:
+ kfree(dump_info);
+out:
+ return ERR_PTR(ret);
+}
+
+static void ssr_worker(struct work_struct *work)
+{
+ struct ssr_resp *resp = container_of(work, struct ssr_resp, work);
+ struct ssr_hdr *hdr = (struct ssr_hdr *)resp->data;
+ struct ssr_debug_transfer_info_rsp *debug_rsp;
+ struct ssr_debug_transfer_done_rsp *xfer_rsp;
+ struct ssr_debug_transfer_info *debug_info;
+ struct ssr_dump_info *dump_info = NULL;
+ struct qaic_device *qdev = resp->qdev;
+ struct ssr_event_rsp *event_rsp;
+ struct dma_bridge_chan *dbc;
+ struct ssr_event *event;
+ bool debug_nack = false;
+ u32 ssr_event_ack;
+ int ret;
+
+ le32_to_cpus(&hdr->cmd);
+ le32_to_cpus(&hdr->len);
+ le32_to_cpus(&hdr->dbc_id);
+
+ if (hdr->len > MSG_BUF_SZ) {
+ pci_warn(qdev->pdev, "Dropping SSR message with invalid len %d\n", hdr->len);
+ goto out;
+ }
+
+ if (hdr->dbc_id >= qdev->num_dbc) {
+ pci_warn(qdev->pdev, "Dropping SSR message with invalid dbc_id %d\n", hdr->dbc_id);
+ goto out;
+ }
+
+ dbc = &qdev->dbc[hdr->dbc_id];
+
+ switch (hdr->cmd) {
+ case DEBUG_TRANSFER_INFO:
+ trace_qaic_ssr_cmd(qdev, "SSR received DEBUG_TRANSFER_INFO command");
+ debug_info = (struct ssr_debug_transfer_info *)resp->data;
+
+ debug_rsp = kmalloc(sizeof(*debug_rsp), GFP_KERNEL);
+ if (!debug_rsp)
+ break;
+
+ if (dbc->state != DBC_STATE_BEFORE_POWER_UP) {
+ /* NACK */
+ pci_warn(qdev->pdev, "Invalid command received. DEBUG_TRANSFER_INFO is expected when DBC is in %d state and actual DBC state is %u. DBC ID %u.\n",
+ DBC_STATE_BEFORE_POWER_UP, dbc->state,
+ hdr->dbc_id);
+ debug_nack = true;
+ }
+
+ /* Skip buffer allocations for Crashdump downloading */
+ if (!debug_nack) {
+ /* Buffer for MEMORY READ request */
+ dump_info = alloc_dump_info(qdev, debug_info);
+ if (IS_ERR(dump_info)) {
+ /* NACK */
+ ret = PTR_ERR(dump_info);
+ pci_warn(qdev->pdev, "Failed to allocate dump resp memory %d. DBC ID %u.\n",
+ ret, hdr->dbc_id);
+ debug_nack = true;
+ } else {
+ /* ACK */
+ debug_nack = false;
+ }
+ }
+
+ debug_rsp->hdr.cmd = cpu_to_le32(DEBUG_TRANSFER_INFO_RSP);
+ debug_rsp->hdr.len = cpu_to_le32(sizeof(*debug_rsp));
+ debug_rsp->hdr.dbc_id = cpu_to_le32(hdr->dbc_id);
+ /* 1 = NACK and 0 = ACK */
+ debug_rsp->ret = cpu_to_le32(debug_nack ? 1 : 0);
+
+ ret = mhi_queue_buf(qdev->ssr_ch, DMA_TO_DEVICE,
+ debug_rsp, sizeof(*debug_rsp), MHI_EOT);
+ if (ret) {
+ pci_warn(qdev->pdev, "Could not send DEBUG_TRANSFER_INFO_RSP %d\n", ret);
+ free_ssr_dump_buf(dump_info);
+ kfree(debug_rsp);
+ break;
+ }
+
+ /* Command has been NACKed, skip the crashdump */
+ if (debug_nack)
+ break;
+
+ dbc->dump_info = dump_info;
+ dump_info->dbc = dbc;
+ dump_info->resp = resp;
+
+ trace_qaic_ssr_dump(qdev, "SSR debug table download initiated");
+ ret = send_mem_rd(qdev, dump_info, dump_info->tbl_addr_dev,
+ min(dump_info->tbl_len, dump_info->chunk_sz));
+ if (ret) {
+ free_ssr_dump_buf(dump_info);
+ break;
+ }
+
+ /*
+ * Everything has gone fine so far, which means we will be
+ * collecting the crashdump chunk by chunk. Do not queue a response
+ * buffer for SSR cmds until the crashdump is complete.
+ */
+ return;
+ case SSR_EVENT:
+ trace_qaic_ssr_cmd(qdev, "SSR received SSR_EVENT command");
+ event = (struct ssr_event *)hdr;
+ le32_to_cpus(&event->event);
+ ssr_event_ack = event->event;
+
+ switch (event->event) {
+ case BEFORE_SHUTDOWN:
+ trace_qaic_ssr_event(qdev, "SSR received BEFORE_SHUTDOWN event");
+ set_dbc_state(qdev, hdr->dbc_id,
+ DBC_STATE_BEFORE_SHUTDOWN);
+ dbc_enter_ssr(qdev, hdr->dbc_id);
+ break;
+ case AFTER_SHUTDOWN:
+ trace_qaic_ssr_event(qdev, "SSR received AFTER_SHUTDOWN event");
+ set_dbc_state(qdev, hdr->dbc_id,
+ DBC_STATE_AFTER_SHUTDOWN);
+ break;
+ case BEFORE_POWER_UP:
+ trace_qaic_ssr_event(qdev, "SSR received BEFORE_POWER_UP event");
+ set_dbc_state(qdev, hdr->dbc_id,
+ DBC_STATE_BEFORE_POWER_UP);
+ break;
+ case AFTER_POWER_UP:
+ trace_qaic_ssr_event(qdev, "SSR received AFTER_POWER_UP event");
+ /*
+ * If dump_info is non-NULL, we received this SSR
+ * event while a crashdump download for this DBC is
+ * still in progress. NACK the SSR event.
+ */
+ if (dbc->dump_info) {
+ free_ssr_dump_buf(dbc->dump_info);
+ ssr_event_ack = SSR_EVENT_NACK;
+ break;
+ }
+
+ set_dbc_state(qdev, hdr->dbc_id,
+ DBC_STATE_AFTER_POWER_UP);
+ break;
+ default:
+ pci_warn(qdev->pdev, "Unknown event %d\n", event->event);
+ break;
+ }
+
+ event_rsp = kmalloc(sizeof(*event_rsp), GFP_KERNEL);
+ if (!event_rsp)
+ break;
+
+ event_rsp->hdr.cmd = cpu_to_le32(SSR_EVENT_RSP);
+ event_rsp->hdr.len = cpu_to_le32(sizeof(*event_rsp));
+ event_rsp->hdr.dbc_id = cpu_to_le32(hdr->dbc_id);
+ event_rsp->event = cpu_to_le32(ssr_event_ack);
+
+ ret = mhi_queue_buf(qdev->ssr_ch, DMA_TO_DEVICE,
+ event_rsp, sizeof(*event_rsp), MHI_EOT);
+ if (ret) {
+ pci_warn(qdev->pdev, "Could not send SSR_EVENT_RSP %d\n", ret);
+ kfree(event_rsp);
+ }
+
+ if (event->event == AFTER_POWER_UP &&
+ ssr_event_ack != SSR_EVENT_NACK) {
+ dbc_exit_ssr(qdev, hdr->dbc_id);
+ set_dbc_state(qdev, hdr->dbc_id, DBC_STATE_IDLE);
+ }
+
+ break;
+ case DEBUG_TRANSFER_DONE_RSP:
+ trace_qaic_ssr_cmd(qdev, "SSR received DEBUG_TRANSFER_DONE_RSP command");
+ xfer_rsp = (struct ssr_debug_transfer_done_rsp *)hdr;
+ dump_info = dbc->dump_info;
+
+ if (!dump_info) {
+ pci_warn(qdev->pdev, "Crashdump download is not in progress for this DBC ID %u\n",
+ hdr->dbc_id);
+ break;
+ }
+
+ if (xfer_rsp->ret) {
+ pci_warn(qdev->pdev, "Device has NACKed SSR transfer done with %u\n",
+ xfer_rsp->ret);
+ free_ssr_dump_buf(dump_info);
+ break;
+ }
+
+ dev_coredumpv(qdev->base_dev->ddev->dev, dump_info->dump_addr,
+ dump_info->dump_sz, GFP_KERNEL);
+ /* dev_coredumpv will free dump_info->dump_addr */
+ dump_info->dump_addr = NULL;
+ free_ssr_dump_buf(dump_info);
+
+ break;
+ default:
+ pci_warn(qdev->pdev, "Dropping SSR message with invalid cmd %d\n", hdr->cmd);
+ break;
+ }
+
+out:
+ ret = mhi_queue_buf(qdev->ssr_ch, DMA_FROM_DEVICE, resp->data,
+ MSG_BUF_SZ, MHI_EOT);
+ if (ret) {
+ pci_warn(qdev->pdev, "Could not requeue SSR recv buf %d\n", ret);
+ kfree(resp);
+ }
+}
+
+static int qaic_ssr_mhi_probe(struct mhi_device *mhi_dev,
+ const struct mhi_device_id *id)
+{
+ struct qaic_device *qdev;
+ struct ssr_resp *resp;
+ int ret;
+
+ qdev = pci_get_drvdata(to_pci_dev(mhi_dev->mhi_cntrl->cntrl_dev));
+
+ dev_set_drvdata(&mhi_dev->dev, qdev);
+ qdev->ssr_ch = mhi_dev;
+ ret = mhi_prepare_for_transfer(qdev->ssr_ch);
+
+ if (ret)
+ return ret;
+
+ resp = kmalloc(sizeof(*resp) + MSG_BUF_SZ, GFP_KERNEL);
+ if (!resp) {
+ mhi_unprepare_from_transfer(qdev->ssr_ch);
+ return -ENOMEM;
+ }
+
+ resp->qdev = qdev;
+ INIT_WORK(&resp->work, ssr_worker);
+
+ ret = mhi_queue_buf(qdev->ssr_ch, DMA_FROM_DEVICE, resp->data,
+ MSG_BUF_SZ, MHI_EOT);
+ if (ret) {
+ mhi_unprepare_from_transfer(qdev->ssr_ch);
+ kfree(resp);
+ return ret;
+ }
+
+ return 0;
+}
+
+static void qaic_ssr_mhi_remove(struct mhi_device *mhi_dev)
+{
+ struct qaic_device *qdev;
+
+ qdev = dev_get_drvdata(&mhi_dev->dev);
+ mhi_unprepare_from_transfer(qdev->ssr_ch);
+ qdev->ssr_ch = NULL;
+}
+
+static void qaic_ssr_mhi_ul_xfer_cb(struct mhi_device *mhi_dev,
+ struct mhi_result *mhi_result)
+{
+ struct qaic_device *qdev = dev_get_drvdata(&mhi_dev->dev);
+ struct _ssr_hdr *hdr = mhi_result->buf_addr;
+ struct ssr_dump_info *dump_info;
+
+ if (mhi_result->transaction_status) {
+ kfree(mhi_result->buf_addr);
+ return;
+ }
+
+ /*
+ * MEMORY READ is used to download the crashdump, chunk by chunk, in
+ * a series of MEMORY READ SSR commands. Hence, to avoid repeated
+ * kmalloc() and kfree() of the same MEMORY READ request buffer, we
+ * allocate only one such buffer and free it only once.
+ */
+ dump_info = qdev->dbc[le32_to_cpu(hdr->dbc_id)].dump_info;
+ if (le32_to_cpu(hdr->cmd) == MEMORY_READ) {
+ dump_info->mem_rd_buf_queued = false;
+ return;
+ }
+
+ kfree(mhi_result->buf_addr);
+}
+
+static void qaic_ssr_mhi_dl_xfer_cb(struct mhi_device *mhi_dev,
+ struct mhi_result *mhi_result)
+{
+ struct ssr_resp *resp = container_of(mhi_result->buf_addr,
+ struct ssr_resp, data);
+
+ if (mhi_result->transaction_status) {
+ kfree(resp);
+ return;
+ }
+
+ queue_work(resp->qdev->ssr_wq, &resp->work);
+}
+
+static const struct mhi_device_id qaic_ssr_mhi_match_table[] = {
+ { .chan = "QAIC_SSR", },
+ {},
+};
+
+static struct mhi_driver qaic_ssr_mhi_driver = {
+ .id_table = qaic_ssr_mhi_match_table,
+ .remove = qaic_ssr_mhi_remove,
+ .probe = qaic_ssr_mhi_probe,
+ .ul_xfer_cb = qaic_ssr_mhi_ul_xfer_cb,
+ .dl_xfer_cb = qaic_ssr_mhi_dl_xfer_cb,
+ .driver = {
+ .name = "qaic_ssr",
+ .owner = THIS_MODULE,
+ },
+};
+
+void qaic_ssr_register(void)
+{
+ int ret;
+
+ ret = mhi_driver_register(&qaic_ssr_mhi_driver);
+ if (ret)
+ pr_debug("qaic: ssr register failed %d\n", ret);
+}
+
+void qaic_ssr_unregister(void)
+{
+ mhi_driver_unregister(&qaic_ssr_mhi_driver);
+}
diff --git a/drivers/gpu/drm/qaic/qaic_ssr.h b/drivers/gpu/drm/qaic/qaic_ssr.h
new file mode 100644
index 0000000..a3a02f7
--- /dev/null
+++ b/drivers/gpu/drm/qaic/qaic_ssr.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * Copyright (c) 2020, The Linux Foundation. All rights reserved.
+ * Copyright (c) 2021 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#ifndef __QAIC_SSR_H__
+#define __QAIC_SSR_H__
+
+void qaic_ssr_register(void);
+void qaic_ssr_unregister(void);
+void clean_up_ssr(struct qaic_device *qdev, u32 dbc_id);
+#endif /* __QAIC_SSR_H__ */
--
2.7.4

2022-08-15 21:18:05

by Jeffrey Hugo

Subject: [RFC PATCH 11/14] drm/qaic: Add telemetry

A QAIC device has a number of attributes, like thermal limits, which can be
read, and in some cases controlled, from the host. Expose these attributes
via hwmon. Use the pre-defined interface where possible, and define custom
interfaces where that is not possible.
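
For illustration, the attributes can be read like any other hwmon sensor;
the hwmon instance number and the presence of the custom throttle_percent
attribute (added later in this patch) are assumptions for the example:

#include <stdio.h>

static long read_attr(const char *path)
{
	long val = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	/* Hypothetical hwmon instance for a QAIC device */
	printf("throttle %%: %ld\n",
	       read_attr("/sys/class/hwmon/hwmon0/throttle_percent"));
	return 0;
}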

Change-Id: I3b559baed4016e27457658c9286f4c529f95dbbb
Signed-off-by: Jeffrey Hugo <[email protected]>
---
drivers/gpu/drm/qaic/qaic_telemetry.c | 851 ++++++++++++++++++++++++++++++++++
drivers/gpu/drm/qaic/qaic_telemetry.h | 14 +
2 files changed, 865 insertions(+)
create mode 100644 drivers/gpu/drm/qaic/qaic_telemetry.c
create mode 100644 drivers/gpu/drm/qaic/qaic_telemetry.h

diff --git a/drivers/gpu/drm/qaic/qaic_telemetry.c b/drivers/gpu/drm/qaic/qaic_telemetry.c
new file mode 100644
index 0000000..44950d1
--- /dev/null
+++ b/drivers/gpu/drm/qaic/qaic_telemetry.c
@@ -0,0 +1,851 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* Copyright (c) 2020-2021, The Linux Foundation. All rights reserved. */
+/* Copyright (c) 2021-2022 Qualcomm Innovation Center, Inc. All rights reserved. */
+
+#include <asm/byteorder.h>
+#include <linux/completion.h>
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/kernel.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/mhi.h>
+#include <linux/mutex.h>
+#include <linux/srcu.h>
+#include <linux/workqueue.h>
+
+#include "qaic.h"
+#include "qaic_telemetry.h"
+
+#if defined(CONFIG_QAIC_HWMON)
+
+#define MAGIC 0x55AA
+#define VERSION 0x1
+#define RESP_TIMEOUT (1 * HZ)
+
+enum cmds {
+ CMD_THERMAL_SOC_TEMP,
+ CMD_THERMAL_SOC_MAX_TEMP,
+ CMD_THERMAL_BOARD_TEMP,
+ CMD_THERMAL_BOARD_MAX_TEMP,
+ CMD_THERMAL_DDR_TEMP,
+ CMD_THERMAL_WARNING_TEMP,
+ CMD_THERMAL_SHUTDOWN_TEMP,
+ CMD_CURRENT_TDP,
+ CMD_BOARD_POWER,
+ CMD_POWER_STATE,
+ CMD_POWER_MAX,
+ CMD_THROTTLE_PERCENT,
+ CMD_THROTTLE_TIME,
+ CMD_UPTIME,
+ CMD_THERMAL_SOC_FLOOR_TEMP,
+ CMD_THERMAL_SOC_CEILING_TEMP,
+};
+
+enum cmd_type {
+ TYPE_READ, /* read value from device */
+ TYPE_WRITE, /* write value to device */
+};
+
+enum msg_type {
+ MSG_PUSH, /* async push from device */
+ MSG_REQ, /* sync request to device */
+ MSG_RESP, /* sync response from device */
+};
+
+struct telemetry_data {
+ u8 cmd;
+ u8 cmd_type;
+ u8 status;
+ __le64 val; /* signed */
+} __packed;
+
+struct telemetry_header {
+ __le16 magic;
+ __le16 ver;
+ __le32 seq_num;
+ u8 type;
+ u8 id;
+ __le16 len;
+} __packed;
+
+struct telemetry_msg { /* little endian encoded */
+ struct telemetry_header hdr;
+ struct telemetry_data data;
+} __packed;
+
+struct wrapper_msg {
+ struct kref ref_count;
+ struct telemetry_msg msg;
+};
+
+struct xfer_queue_elem {
+ /*
+ * Node in the list of ongoing transfer requests on the telemetry
+ * channel. Maintained by the root device struct
+ */
+ struct list_head list;
+ /* Sequence number of this transfer request */
+ u32 seq_num;
+ /* This is used to wait on until completion of transfer request */
+ struct completion xfer_done;
+ /* Received data from device */
+ void *buf;
+};
+
+struct resp_work {
+ /* Work struct to schedule work coming on QAIC_TELEMETRY channel */
+ struct work_struct work;
+ /* Root struct of device, used to access device resources */
+ struct qaic_device *qdev;
+ /* Buffer used by MHI for transfer requests */
+ void *buf;
+};
+
+static void free_wrapper(struct kref *ref)
+{
+ struct wrapper_msg *wrapper = container_of(ref, struct wrapper_msg,
+ ref_count);
+
+ kfree(wrapper);
+}
+
+static int telemetry_request(struct qaic_device *qdev, u8 cmd, u8 cmd_type,
+ s64 *val)
+{
+ struct wrapper_msg *wrapper;
+ struct xfer_queue_elem elem;
+ struct telemetry_msg *resp;
+ struct telemetry_msg *req;
+ long ret = 0;
+
+ wrapper = kzalloc(sizeof(*wrapper), GFP_KERNEL);
+ if (!wrapper)
+ return -ENOMEM;
+
+ kref_init(&wrapper->ref_count);
+ req = &wrapper->msg;
+
+ ret = mutex_lock_interruptible(&qdev->tele_mutex);
+ if (ret)
+ goto free_req;
+
+ req->hdr.magic = cpu_to_le16(MAGIC);
+ req->hdr.ver = cpu_to_le16(VERSION);
+ req->hdr.seq_num = cpu_to_le32(qdev->tele_next_seq_num++);
+ req->hdr.type = MSG_REQ;
+ req->hdr.id = 0;
+ req->hdr.len = cpu_to_le16(sizeof(req->data));
+
+ req->data.cmd = cmd;
+ req->data.cmd_type = cmd_type;
+ req->data.status = 0;
+ if (cmd_type == TYPE_READ)
+ req->data.val = cpu_to_le64(0);
+ else
+ req->data.val = cpu_to_le64(*val);
+
+ elem.seq_num = qdev->tele_next_seq_num - 1;
+ elem.buf = NULL;
+ init_completion(&elem.xfer_done);
+ if (likely(!qdev->tele_lost_buf)) {
+ resp = kmalloc(sizeof(*resp), GFP_KERNEL);
+ if (!resp) {
+ mutex_unlock(&qdev->tele_mutex);
+ ret = -ENOMEM;
+ goto free_req;
+ }
+
+ ret = mhi_queue_buf(qdev->tele_ch, DMA_FROM_DEVICE,
+ resp, sizeof(*resp), MHI_EOT);
+ if (ret) {
+ mutex_unlock(&qdev->tele_mutex);
+ goto free_resp;
+ }
+ } else {
+		/*
+		 * We lost a buffer because we queued a recv buf, but then
+		 * queuing the corresponding tx buf failed. The recv buf is
+		 * still queued with MHI, so reclaim it for this transaction
+		 * to avoid a memory leak.
+		 */
+ qdev->tele_lost_buf = false;
+ }
+
+ kref_get(&wrapper->ref_count);
+ ret = mhi_queue_buf(qdev->tele_ch, DMA_TO_DEVICE, req, sizeof(*req),
+ MHI_EOT);
+ if (ret) {
+ qdev->tele_lost_buf = true;
+ kref_put(&wrapper->ref_count, free_wrapper);
+ mutex_unlock(&qdev->tele_mutex);
+ goto free_req;
+ }
+
+ list_add_tail(&elem.list, &qdev->tele_xfer_list);
+ mutex_unlock(&qdev->tele_mutex);
+
+ ret = wait_for_completion_interruptible_timeout(&elem.xfer_done,
+ RESP_TIMEOUT);
+	/*
+	 * Not using mutex_lock_interruptible() here because we must remove
+	 * elem from the transfer list even if we are interrupted, otherwise
+	 * a late response would write to the stale stack frame and likely
+	 * cause memory corruption.
+	 */
+ mutex_lock(&qdev->tele_mutex);
+ if (!list_empty(&elem.list))
+ list_del(&elem.list);
+ if (!ret && !elem.buf)
+ ret = -ETIMEDOUT;
+ else if (ret > 0 && !elem.buf)
+ ret = -EIO;
+ mutex_unlock(&qdev->tele_mutex);
+
+ resp = elem.buf;
+
+ if (ret < 0)
+ goto free_resp;
+
+ if (le16_to_cpu(resp->hdr.magic) != MAGIC ||
+ le16_to_cpu(resp->hdr.ver) != VERSION ||
+ resp->hdr.type != MSG_RESP ||
+ resp->hdr.id != 0 ||
+ le16_to_cpu(resp->hdr.len) != sizeof(resp->data) ||
+ resp->data.cmd != cmd ||
+ resp->data.cmd_type != cmd_type ||
+ resp->data.status) {
+ ret = -EINVAL;
+ goto free_resp;
+ }
+
+ if (cmd_type == TYPE_READ)
+ *val = le64_to_cpu(resp->data.val);
+
+ ret = 0;
+
+free_resp:
+ kfree(resp);
+free_req:
+ kref_put(&wrapper->ref_count, free_wrapper);
+
+ return ret;
+}
+
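+/*
+ * The custom sysfs attributes below all follow the same pattern: take the
+ * SRCU read lock to guard against a device reset, bail out with -ENODEV if
+ * a reset is in progress, then service the attribute with a single
+ * telemetry request.
+ */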
+static ssize_t throttle_percent_show(struct device *dev,
+ struct device_attribute *a, char *buf)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ s64 val = 0;
+ int rcu_id;
+ int ret;
+
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return -ENODEV;
+ }
+
+ ret = telemetry_request(qdev, CMD_THROTTLE_PERCENT, TYPE_READ, &val);
+ if (ret) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+ }
+
+	/*
+	 * The percentage the device performance is being throttled by to
+	 * meet its limits, e.g. a value of 20 means performance is throttled
+	 * 20% to stay within power/thermal/etc limits.
+	 */
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return sprintf(buf, "%lld\n", val);
+}
+
+static SENSOR_DEVICE_ATTR_RO(throttle_percent, throttle_percent, 0);
+
+static ssize_t throttle_time_show(struct device *dev,
+ struct device_attribute *a, char *buf)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ s64 val = 0;
+ int rcu_id;
+ int ret;
+
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return -ENODEV;
+ }
+
+ ret = telemetry_request(qdev, CMD_THROTTLE_TIME, TYPE_READ, &val);
+ if (ret) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+ }
+
+ /* The time, in seconds, the device has been in a throttled state */
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return sprintf(buf, "%lld\n", val);
+}
+
+static SENSOR_DEVICE_ATTR_RO(throttle_time, throttle_time, 0);
+
+static ssize_t power_level_show(struct device *dev, struct device_attribute *a,
+ char *buf)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ s64 val = 0;
+ int rcu_id;
+ int ret;
+
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return -ENODEV;
+ }
+
+ ret = telemetry_request(qdev, CMD_POWER_STATE, TYPE_READ, &val);
+ if (ret) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+ }
+
+	/*
+	 * The power level the device is operating at, which sets the upper
+	 * limit on the power it is allowed to consume.
+	 * 1 - full power
+	 * 2 - reduced power
+	 * 3 - minimal power
+	 */
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return sprintf(buf, "%lld\n", val);
+}
+
+static ssize_t power_level_store(struct device *dev, struct device_attribute *a,
+ const char *buf, size_t count)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ int rcu_id;
+ s64 val;
+ int ret;
+
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return -ENODEV;
+ }
+
+	if (kstrtoll(buf, 10, &val)) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return -EINVAL;
+ }
+
+ ret = telemetry_request(qdev, CMD_POWER_STATE, TYPE_WRITE, &val);
+ if (ret) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+ }
+
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return count;
+}
+
+static SENSOR_DEVICE_ATTR_RW(power_level, power_level, 0);
+
+static struct attribute *power_attrs[] = {
+ &sensor_dev_attr_power_level.dev_attr.attr,
+ &sensor_dev_attr_throttle_percent.dev_attr.attr,
+ &sensor_dev_attr_throttle_time.dev_attr.attr,
+ NULL,
+};
+
+static const struct attribute_group power_group = {
+ .attrs = power_attrs,
+};
+
+static ssize_t uptime_show(struct device *dev,
+ struct device_attribute *a, char *buf)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ s64 val = 0;
+ int rcu_id;
+ int ret;
+
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return -ENODEV;
+ }
+
+ ret = telemetry_request(qdev, CMD_UPTIME, TYPE_READ, &val);
+ if (ret) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+ }
+
+ /* The time, in seconds, the device has been up */
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return sprintf(buf, "%lld\n", val);
+}
+
+static SENSOR_DEVICE_ATTR_RO(uptime, uptime, 0);
+
+static struct attribute *uptime_attrs[] = {
+ &sensor_dev_attr_uptime.dev_attr.attr,
+ NULL,
+};
+
+static const struct attribute_group uptime_group = {
+ .attrs = uptime_attrs,
+};
+
+static ssize_t soc_temp_floor_show(struct device *dev,
+ struct device_attribute *a, char *buf)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ int rcu_id;
+ int ret;
+ s64 val;
+
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ ret = -ENODEV;
+ goto exit;
+ }
+
+ ret = telemetry_request(qdev, CMD_THERMAL_SOC_FLOOR_TEMP,
+ TYPE_READ, &val);
+ if (ret)
+ goto exit;
+
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return sprintf(buf, "%lld\n", val * 1000);
+
+exit:
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+}
+
+static SENSOR_DEVICE_ATTR_RO(temp2_floor, soc_temp_floor, 0);
+
+static ssize_t soc_temp_ceiling_show(struct device *dev,
+ struct device_attribute *a, char *buf)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ int rcu_id;
+ int ret;
+ s64 val;
+
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ ret = -ENODEV;
+ goto exit;
+ }
+
+ ret = telemetry_request(qdev, CMD_THERMAL_SOC_CEILING_TEMP,
+ TYPE_READ, &val);
+ if (ret)
+ goto exit;
+
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return sprintf(buf, "%lld\n", val * 1000);
+
+exit:
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+}
+
+static SENSOR_DEVICE_ATTR_RO(temp2_ceiling, soc_temp_ceiling, 0);
+
+static struct attribute *temp2_attrs[] = {
+ &sensor_dev_attr_temp2_floor.dev_attr.attr,
+ &sensor_dev_attr_temp2_ceiling.dev_attr.attr,
+ NULL,
+};
+
+static const struct attribute_group temp2_group = {
+ .attrs = temp2_attrs,
+};
+
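+/*
+ * Map hwmon attributes to sysfs permissions. Configurable limits (power
+ * max, temp crit/emergency) are writable; measured values are read-only.
+ */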
+static umode_t qaic_is_visible(const void *data, enum hwmon_sensor_types type,
+ u32 attr, int channel)
+{
+ switch (type) {
+ case hwmon_power:
+ switch (attr) {
+ case hwmon_power_max:
+ return 0644;
+ default:
+ return 0444;
+ }
+ break;
+ case hwmon_temp:
+ switch (attr) {
+		case hwmon_temp_input:
+		case hwmon_temp_highest:
+		case hwmon_temp_alarm:
+			return 0444;
+		case hwmon_temp_crit:
+		case hwmon_temp_emergency:
+			return 0644;
+ }
+ break;
+ default:
+ return 0;
+ }
+ return 0;
+}
+
+static int qaic_read(struct device *dev, enum hwmon_sensor_types type,
+ u32 attr, int channel, long *vall)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ int ret = -EOPNOTSUPP;
+ s64 val = 0;
+ int rcu_id;
+ u8 cmd;
+
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return -ENODEV;
+ }
+
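+	/*
+	 * Scale the telemetry values to the units the hwmon core expects:
+	 * microwatts for power and millidegrees Celsius for temperature.
+	 */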
+ switch (type) {
+ case hwmon_power:
+ switch (attr) {
+ case hwmon_power_max:
+ ret = telemetry_request(qdev, CMD_CURRENT_TDP,
+ TYPE_READ, &val);
+ val *= 1000000;
+ goto exit;
+ case hwmon_power_input:
+ ret = telemetry_request(qdev, CMD_BOARD_POWER,
+ TYPE_READ, &val);
+ val *= 1000000;
+ goto exit;
+ default:
+ goto exit;
+ }
+ case hwmon_temp:
+ switch (attr) {
+ case hwmon_temp_crit:
+ ret = telemetry_request(qdev, CMD_THERMAL_WARNING_TEMP,
+ TYPE_READ, &val);
+ val *= 1000;
+ goto exit;
+ case hwmon_temp_emergency:
+ ret = telemetry_request(qdev, CMD_THERMAL_SHUTDOWN_TEMP,
+ TYPE_READ, &val);
+ val *= 1000;
+ goto exit;
+ case hwmon_temp_alarm:
+ ret = telemetry_request(qdev, CMD_THERMAL_DDR_TEMP,
+ TYPE_READ, &val);
+ goto exit;
+ case hwmon_temp_input:
+ if (channel == 0)
+ cmd = CMD_THERMAL_BOARD_TEMP;
+ else if (channel == 1)
+ cmd = CMD_THERMAL_SOC_TEMP;
+ else
+ goto exit;
+ ret = telemetry_request(qdev, cmd, TYPE_READ, &val);
+ val *= 1000;
+ goto exit;
+ case hwmon_temp_highest:
+ if (channel == 0)
+ cmd = CMD_THERMAL_BOARD_MAX_TEMP;
+ else if (channel == 1)
+ cmd = CMD_THERMAL_SOC_MAX_TEMP;
+ else
+ goto exit;
+ ret = telemetry_request(qdev, cmd, TYPE_READ, &val);
+ val *= 1000;
+ goto exit;
+ default:
+ goto exit;
+ }
+ default:
+ goto exit;
+ }
+
+exit:
+ *vall = (long)val;
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+}
+
+static int qaic_write(struct device *dev, enum hwmon_sensor_types type,
+ u32 attr, int channel, long vall)
+{
+ struct qaic_device *qdev = dev_get_drvdata(dev);
+ int ret = -EOPNOTSUPP;
+ int rcu_id;
+ s64 val;
+
+ val = vall;
+ rcu_id = srcu_read_lock(&qdev->dev_lock);
+ if (qdev->in_reset) {
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return -ENODEV;
+ }
+
+ switch (type) {
+ case hwmon_power:
+ switch (attr) {
+ case hwmon_power_max:
+ val /= 1000000;
+ ret = telemetry_request(qdev, CMD_CURRENT_TDP,
+ TYPE_WRITE, &val);
+ goto exit;
+ default:
+ goto exit;
+ }
+ case hwmon_temp:
+ switch (attr) {
+ case hwmon_temp_crit:
+ val /= 1000;
+ ret = telemetry_request(qdev, CMD_THERMAL_WARNING_TEMP,
+ TYPE_WRITE, &val);
+ goto exit;
+ case hwmon_temp_emergency:
+ val /= 1000;
+ ret = telemetry_request(qdev, CMD_THERMAL_SHUTDOWN_TEMP,
+ TYPE_WRITE, &val);
+ goto exit;
+ default:
+ goto exit;
+ }
+ default:
+ goto exit;
+ }
+
+exit:
+ srcu_read_unlock(&qdev->dev_lock, rcu_id);
+ return ret;
+}
+
+static const struct attribute_group *special_groups[] = {
+ &power_group,
+ &uptime_group,
+ &temp2_group,
+ NULL,
+};
+
+static const struct hwmon_ops qaic_ops = {
+ .is_visible = qaic_is_visible,
+ .read = qaic_read,
+ .write = qaic_write,
+};
+
+static const u32 qaic_config_temp[] = {
+ /* board level */
+ HWMON_T_INPUT | HWMON_T_HIGHEST,
+ /* SoC level */
+ HWMON_T_INPUT | HWMON_T_HIGHEST | HWMON_T_CRIT | HWMON_T_EMERGENCY,
+ /* DDR level */
+ HWMON_T_ALARM,
+ 0
+};
+
+static const struct hwmon_channel_info qaic_temp = {
+ .type = hwmon_temp,
+ .config = qaic_config_temp,
+};
+
+static const u32 qaic_config_power[] = {
+ HWMON_P_INPUT | HWMON_P_MAX, /* board level */
+ 0
+};
+
+static const struct hwmon_channel_info qaic_power = {
+ .type = hwmon_power,
+ .config = qaic_config_power,
+};
+
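+/*
+ * Channel layout registered with the hwmon core: one power channel
+ * (power1_*) and three temperature channels (temp1_* board, temp2_* SoC,
+ * temp3_* DDR).
+ */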
+static const struct hwmon_channel_info *qaic_info[] = {
+ &qaic_power,
+ &qaic_temp,
+ NULL
+};
+
+static const struct hwmon_chip_info qaic_chip_info = {
+ .ops = &qaic_ops,
+ .info = qaic_info
+};
+
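+/*
+ * Invoked when the QAIC_TELEMETRY MHI channel comes up. Prepares the
+ * channel for transfers and registers the hwmon device which exposes the
+ * telemetry to userspace.
+ */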
+static int qaic_telemetry_mhi_probe(struct mhi_device *mhi_dev,
+ const struct mhi_device_id *id)
+{
+ struct qaic_device *qdev;
+ int ret;
+
+ qdev = pci_get_drvdata(to_pci_dev(mhi_dev->mhi_cntrl->cntrl_dev));
+
+ dev_set_drvdata(&mhi_dev->dev, qdev);
+ qdev->tele_ch = mhi_dev;
+ qdev->tele_lost_buf = false;
+ ret = mhi_prepare_for_transfer(qdev->tele_ch);
+ if (ret)
+ return ret;
+
+ qdev->hwmon = hwmon_device_register_with_info(&qdev->pdev->dev, "qaic",
+ qdev, &qaic_chip_info,
+ special_groups);
+	if (IS_ERR(qdev->hwmon)) {
+		mhi_unprepare_from_transfer(qdev->tele_ch);
+		return PTR_ERR(qdev->hwmon);
+	}
+
+ return 0;
+}
+
+static void qaic_telemetry_mhi_remove(struct mhi_device *mhi_dev)
+{
+ struct qaic_device *qdev;
+
+ qdev = dev_get_drvdata(&mhi_dev->dev);
+ hwmon_device_unregister(qdev->hwmon);
+ mhi_unprepare_from_transfer(qdev->tele_ch);
+ qdev->tele_ch = NULL;
+ qdev->hwmon = NULL;
+}
+
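+/*
+ * Process a response received on the telemetry channel: match it to a
+ * waiting request by sequence number, hand over the buffer, and wake the
+ * waiter. An unmatched response belongs to a request that already timed
+ * out and is dropped.
+ */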
+static void resp_worker(struct work_struct *work)
+{
+ struct resp_work *resp = container_of(work, struct resp_work, work);
+ struct qaic_device *qdev = resp->qdev;
+ struct telemetry_msg *msg = resp->buf;
+ struct xfer_queue_elem *elem;
+ struct xfer_queue_elem *i;
+ bool found = false;
+
+ if (le16_to_cpu(msg->hdr.magic) != MAGIC) {
+ kfree(msg);
+ kfree(resp);
+ return;
+ }
+
+ mutex_lock(&qdev->tele_mutex);
+ list_for_each_entry_safe(elem, i, &qdev->tele_xfer_list, list) {
+ if (elem->seq_num == le32_to_cpu(msg->hdr.seq_num)) {
+ found = true;
+ list_del_init(&elem->list);
+ elem->buf = msg;
+ complete_all(&elem->xfer_done);
+ break;
+ }
+ }
+ mutex_unlock(&qdev->tele_mutex);
+
+ if (!found)
+ /* request must have timed out, drop packet */
+ kfree(msg);
+
+ kfree(resp);
+}
+
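+/* The device consumed a request buffer; drop our reference to it */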
+static void qaic_telemetry_mhi_ul_xfer_cb(struct mhi_device *mhi_dev,
+ struct mhi_result *mhi_result)
+{
+ struct telemetry_msg *msg = mhi_result->buf_addr;
+ struct wrapper_msg *wrapper = container_of(msg, struct wrapper_msg,
+ msg);
+
+ kref_put(&wrapper->ref_count, free_wrapper);
+}
+
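+/*
+ * A response arrived on the telemetry channel. This runs in MHI callback
+ * context, so defer the list walk under tele_mutex to a workqueue.
+ */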
+static void qaic_telemetry_mhi_dl_xfer_cb(struct mhi_device *mhi_dev,
+ struct mhi_result *mhi_result)
+{
+ struct qaic_device *qdev = dev_get_drvdata(&mhi_dev->dev);
+ struct telemetry_msg *msg = mhi_result->buf_addr;
+ struct resp_work *resp;
+
+ if (mhi_result->transaction_status) {
+ kfree(msg);
+ return;
+ }
+
+ resp = kmalloc(sizeof(*resp), GFP_ATOMIC);
+ if (!resp) {
+ pci_err(qdev->pdev, "dl_xfer_cb alloc fail, dropping message\n");
+ kfree(msg);
+ return;
+ }
+
+ INIT_WORK(&resp->work, resp_worker);
+ resp->qdev = qdev;
+ resp->buf = msg;
+ queue_work(qdev->tele_wq, &resp->work);
+}
+
+static const struct mhi_device_id qaic_telemetry_mhi_match_table[] = {
+ { .chan = "QAIC_TELEMETRY", },
+ {},
+};
+
+static struct mhi_driver qaic_telemetry_mhi_driver = {
+ .id_table = qaic_telemetry_mhi_match_table,
+ .remove = qaic_telemetry_mhi_remove,
+ .probe = qaic_telemetry_mhi_probe,
+ .ul_xfer_cb = qaic_telemetry_mhi_ul_xfer_cb,
+ .dl_xfer_cb = qaic_telemetry_mhi_dl_xfer_cb,
+ .driver = {
+ .name = "qaic_telemetry",
+ .owner = THIS_MODULE,
+ },
+};
+
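+/*
+ * Telemetry is a best effort feature; a registration failure is logged but
+ * otherwise ignored so it does not block the rest of driver init.
+ */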
+void qaic_telemetry_register(void)
+{
+ int ret;
+
+ ret = mhi_driver_register(&qaic_telemetry_mhi_driver);
+ if (ret)
+ pr_debug("qaic: telemetry register failed %d\n", ret);
+}
+
+void qaic_telemetry_unregister(void)
+{
+ mhi_driver_unregister(&qaic_telemetry_mhi_driver);
+}
+
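+/*
+ * Unblock every thread waiting on a telemetry response, for example when
+ * the device is being reset. Woken waiters observe a completed transfer
+ * with no buffer and return an error to their callers.
+ */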
+void wake_all_telemetry(struct qaic_device *qdev)
+{
+ struct xfer_queue_elem *elem;
+ struct xfer_queue_elem *i;
+
+ mutex_lock(&qdev->tele_mutex);
+ list_for_each_entry_safe(elem, i, &qdev->tele_xfer_list, list) {
+ list_del_init(&elem->list);
+ complete_all(&elem->xfer_done);
+ }
+ qdev->tele_lost_buf = false;
+ mutex_unlock(&qdev->tele_mutex);
+}
+
+#else
+
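+/* Stubs used when hwmon support (CONFIG_QAIC_HWMON) is not enabled */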
+void qaic_telemetry_register(void)
+{
+}
+
+void qaic_telemetry_unregister(void)
+{
+}
+
+void wake_all_telemetry(struct qaic_device *qdev)
+{
+}
+
+#endif /* CONFIG_QAIC_HWMON */
diff --git a/drivers/gpu/drm/qaic/qaic_telemetry.h b/drivers/gpu/drm/qaic/qaic_telemetry.h
new file mode 100644
index 0000000..01e178f4
--- /dev/null
+++ b/drivers/gpu/drm/qaic/qaic_telemetry.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * Copyright (c) 2020, The Linux Foundation. All rights reserved.
+ */
+
+#ifndef __QAIC_TELEMETRY_H__
+#define __QAIC_TELEMETRY_H__
+
+#include "qaic.h"
+
+void qaic_telemetry_register(void);
+void qaic_telemetry_unregister(void);
+void wake_all_telemetry(struct qaic_device *qdev);
+#endif /* __QAIC_TELEMETRY_H__ */
--
2.7.4