LinuxLists.cc - [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling

2021-06-10 21:45:31

Subject: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling

From: Rob Clark <[email protected]>

This picks up an earlier series[1] from Jordan, and adds additional
support needed to generate GPU devcore dumps on iova faults. Original
description:

This is a stack to add an Adreno GPU specific handler for pagefaults. The first
patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds
an adreno-smmu-priv function hook to capture a handful of important debugging
registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
third patch to print more detailed information on a page fault, such as the
TTBR0 for the pagetable that caused the fault and the source of the fault as
determined by a combination of the FSYNR1 register and an internal GPU
register.

This code provides a solid base that we can expand on later for even more
extensive GPU side page fault debugging capabilities.

v5: [Rob] Use RBBM_STATUS3.SMMU_STALLED_ON_FAULT to detect the case where
    GPU snapshotting needs to avoid the crashdumper, and check
    RBBM_STATUS3.SMMU_STALLED_ON_FAULT in the GPU hang irq paths
v4: [Rob] Add support to stall the SMMU on fault, and let the GPU driver
    resume translation after it has had a chance to snapshot the GPU's
    state
v3: Always clear FSR even if the target driver is going to handle resume
v2: Fix comment wording and function pointer check per Rob Clark

[1] https://lore.kernel.org/dri-devel/[email protected]/

Jordan Crouse (3):
  iommu/arm-smmu: Add support for driver IOMMU fault handlers
  iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
    info
  drm/msm: Improve the a6xx page fault handler

Rob Clark (2):
  iommu/arm-smmu-qcom: Add stall support
  drm/msm: devcoredump iommu fault support

drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 23 +++-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 110 +++++++++++++++++++-
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 42 ++++++--
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 +++
drivers/gpu/drm/msm/msm_gem.h | 1 +
drivers/gpu/drm/msm/msm_gem_submit.c | 1 +
drivers/gpu/drm/msm/msm_gpu.c | 48 +++++++++
drivers/gpu/drm/msm/msm_gpu.h | 17 +++
drivers/gpu/drm/msm/msm_gpummu.c | 5 +
drivers/gpu/drm/msm/msm_iommu.c | 22 +++-
drivers/gpu/drm/msm/msm_mmu.h | 5 +-
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 50 +++++++++
drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +-
drivers/iommu/arm/arm-smmu/arm-smmu.h | 2 +
include/linux/adreno-smmu-priv.h | 38 ++++++-
15 files changed, 367 insertions(+), 21 deletions(-)

--
2.31.1

2021-06-10 21:45:49

Subject: [PATCH v5 2/5] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info

From: Jordan Crouse <[email protected]>

Add a callback in adreno-smmu-priv to read interesting SMMU
registers to provide an opportunity for a richer debug experience
in the GPU driver.
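
For illustration only (not part of the patch): a minimal user-space sketch
of the callback pattern described above, where a provider exposes an opaque
cookie plus a function pointer and the consumer receives a filled-in struct.
The field list mirrors struct adreno_smmu_fault_info from this patch;
struct fake_context_bank, fake_get_fault_info() and query_fault() are
hypothetical names invented for the sketch, and the register reads are
replaced by plain struct copies.

```c
#include <stddef.h>
#include <stdint.h>

/* Same fields as struct adreno_smmu_fault_info in the patch. */
struct fault_info {
	uint64_t far;
	uint64_t ttbr0;
	uint32_t contextidr;
	uint32_t fsr;
	uint32_t fsynr0;
	uint32_t fsynr1;
	uint32_t cbfrsynra;
};

/* Hypothetical stand-in for the registers of one context bank. */
struct fake_context_bank {
	uint64_t far;
	uint64_t ttbr0;
	uint32_t contextidr;
	uint32_t fsr;
	uint32_t fsynr0;
	uint32_t fsynr1;
	uint32_t cbfrsynra;
};

/* Private interface: the provider fills in the cookie and the callback. */
struct smmu_priv {
	const void *cookie;
	void (*get_fault_info)(const void *cookie, struct fault_info *info);
};

/* Provider side: copy the "registers" into the caller's struct. */
static void fake_get_fault_info(const void *cookie, struct fault_info *info)
{
	const struct fake_context_bank *cb = cookie;

	info->far = cb->far;
	info->ttbr0 = cb->ttbr0;
	info->contextidr = cb->contextidr;
	info->fsr = cb->fsr;
	info->fsynr0 = cb->fsynr0;
	info->fsynr1 = cb->fsynr1;
	info->cbfrsynra = cb->cbfrsynra;
}

/*
 * Consumer side: returns NULL when the provider has no callback, so the
 * caller can fall back to a less detailed message.
 */
static struct fault_info *query_fault(const struct smmu_priv *priv,
				      struct fault_info *storage)
{
	if (!priv->get_fault_info)
		return NULL;

	priv->get_fault_info(priv->cookie, storage);
	return storage;
}
```

The cookie keeps the provider's internal types out of the consumer: the
consumer only ever passes back the pointer it was handed.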

Signed-off-by: Jordan Crouse <[email protected]>
Signed-off-by: Rob Clark <[email protected]>
---
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17 ++++++++++++
drivers/iommu/arm/arm-smmu/arm-smmu.h | 2 ++
include/linux/adreno-smmu-priv.h | 31 +++++++++++++++++++++-
3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 98b3a1c2a181..b2e31ea84128 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
}

+static void qcom_adreno_smmu_get_fault_info(const void *cookie,
+ struct adreno_smmu_fault_info *info)
+{
+ struct arm_smmu_domain *smmu_domain = (void *)cookie;
+ struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+ struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+ info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
+ info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
+ info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
+ info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
+ info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
+ info->ttbr0 = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
+ info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
+}
+
#define QCOM_ADRENO_SMMU_GPU_SID 0

static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
priv->cookie = smmu_domain;
priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
+ priv->get_fault_info = qcom_adreno_smmu_get_fault_info;

return 0;
}
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index c31a59d35c64..84c21c4b0691 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -224,6 +224,8 @@ enum arm_smmu_cbar_type {
#define ARM_SMMU_CB_FSYNR0 0x68
#define ARM_SMMU_FSYNR0_WNR BIT(4)

+#define ARM_SMMU_CB_FSYNR1 0x6c
+
#define ARM_SMMU_CB_S1_TLBIVA 0x600
#define ARM_SMMU_CB_S1_TLBIASID 0x610
#define ARM_SMMU_CB_S1_TLBIVAL 0x620
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index a889f28afb42..53fe32fb9214 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -8,6 +8,32 @@

#include <linux/io-pgtable.h>

+/**
+ * struct adreno_smmu_fault_info - container for key fault information
+ *
+ * @far: The faulting IOVA from ARM_SMMU_CB_FAR
+ * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0
+ * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR
+ * @fsr: The fault status from ARM_SMMU_CB_FSR
+ * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0
+ * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR1
+ * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx)
+ *
+ * This struct passes back key page fault information to the GPU driver
+ * through the get_fault_info function pointer.
+ * The GPU driver can use this information to print informative
+ * log messages and provide deeper GPU specific insight into the fault.
+ */
+struct adreno_smmu_fault_info {
+ u64 far;
+ u64 ttbr0;
+ u32 contextidr;
+ u32 fsr;
+ u32 fsynr0;
+ u32 fsynr1;
+ u32 cbfrsynra;
+};
+
/**
* struct adreno_smmu_priv - private interface between adreno-smmu and GPU
*
@@ -17,6 +43,8 @@
* @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank. A
* NULL config disables TTBR0 translation, otherwise
* TTBR0 translation is enabled with the specified cfg
+ * @get_fault_info: Called by the GPU fault handler to get information about
+ * the fault
*
* The GPU driver (drm/msm) and adreno-smmu work together for controlling
* the GPU's SMMU instance. This is by necessity, as the GPU is directly
@@ -31,6 +59,7 @@ struct adreno_smmu_priv {
const void *cookie;
const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
+ void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
};

-#endif /* __ADRENO_SMMU_PRIV_H */
\ No newline at end of file
+#endif /* __ADRENO_SMMU_PRIV_H */
--
2.31.1

2021-06-10 21:47:06

Subject: [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support

From: Rob Clark <[email protected]>

Add, via the adreno-smmu-priv interface, a way for the GPU to request
the SMMU to stall translation on faults, and then later resume the
translation, either retrying or terminating the current translation.

This will be used on the GPU side to "freeze" the GPU while we snapshot
useful state for devcoredump.
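
For illustration only (not part of the patch): a small user-space model of
the bookkeeping described above. One bit per context bank records whether
stalling is requested, that bit decides whether CFCFG is OR-ed into the
SCTLR value, and the resume value carries a terminate flag. The bit
positions below are assumptions made for the sketch; the real definitions
live in arm-smmu.h. struct stall_state, compose_sctlr() and resume_value()
are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)			(1u << (n))

/* Assumed bit positions, for the sketch only. */
#define SCTLR_CFCFG		BIT(7)	/* stall on context fault */
#define SCTLR_HUPCF		BIT(8)	/* keep processing after a fault */
#define RESUME_TERMINATE	BIT(0)

/* One bit per context bank, like qcom_smmu.stall_enabled in the patch. */
struct stall_state {
	uint32_t stall_enabled;
};

static void set_stall(struct stall_state *s, int cbndx, bool enabled)
{
	if (enabled)
		s->stall_enabled |= BIT(cbndx);
	else
		s->stall_enabled &= ~BIT(cbndx);
}

/* Value that would be written to SCTLR for context bank idx. */
static uint32_t compose_sctlr(const struct stall_state *s, int idx,
			      uint32_t reg)
{
	reg |= SCTLR_HUPCF;

	if (s->stall_enabled & BIT(idx))
		reg |= SCTLR_CFCFG;

	return reg;
}

/* Value that would be written to the RESUME register. */
static uint32_t resume_value(bool terminate)
{
	return terminate ? RESUME_TERMINATE : 0;
}
```

Because the flag is only consulted when SCTLR is written, it has to be set
before the write that enables translation, which matches the "call before
set_ttbr0_cfg()" note in the kerneldoc below.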

Signed-off-by: Rob Clark <[email protected]>
---
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++++++++++++++++++++++
include/linux/adreno-smmu-priv.h | 7 +++++
2 files changed, 40 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index b2e31ea84128..61fc645c1325 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -13,6 +13,7 @@ struct qcom_smmu {
struct arm_smmu_device smmu;
bool bypass_quirk;
u8 bypass_cbndx;
+ u32 stall_enabled;
};

static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
@@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
u32 reg)
{
+ struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
+
/*
* On the GPU device we want to process subsequent transactions after a
* fault to keep the GPU from hanging
*/
reg |= ARM_SMMU_SCTLR_HUPCF;

+ if (qsmmu->stall_enabled & BIT(idx))
+ reg |= ARM_SMMU_SCTLR_CFCFG;
+
arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
}

@@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
}

+static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
+{
+ struct arm_smmu_domain *smmu_domain = (void *)cookie;
+ struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+ struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
+
+ if (enabled)
+ qsmmu->stall_enabled |= BIT(cfg->cbndx);
+ else
+ qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
+}
+
+static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
+{
+ struct arm_smmu_domain *smmu_domain = (void *)cookie;
+ struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+ struct arm_smmu_device *smmu = smmu_domain->smmu;
+ u32 reg = 0;
+
+ if (terminate)
+ reg |= ARM_SMMU_RESUME_TERMINATE;
+
+ arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
+}
+
#define QCOM_ADRENO_SMMU_GPU_SID 0

static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
+ priv->set_stall = qcom_adreno_smmu_set_stall;
+ priv->resume_translation = qcom_adreno_smmu_resume_translation;

return 0;
}
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index 53fe32fb9214..c637e0997f6d 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
* TTBR0 translation is enabled with the specified cfg
* @get_fault_info: Called by the GPU fault handler to get information about
* the fault
+ * @set_stall: Configure whether stall on fault (CFCFG) is enabled. Call
+ * before set_ttbr0_cfg(). If stalling on fault is enabled,
+ * the GPU driver must call resume_translation()
+ * @resume_translation: Resume translation after a fault
+ *
*
* The GPU driver (drm/msm) and adreno-smmu work together for controlling
* the GPU's SMMU instance. This is by necessity, as the GPU is directly
@@ -60,6 +65,8 @@ struct adreno_smmu_priv {
const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
+ void (*set_stall)(const void *cookie, bool enabled);
+ void (*resume_translation)(const void *cookie, bool terminate);
};

#endif /* __ADRENO_SMMU_PRIV_H */
--
2.31.1

2021-06-10 21:47:29

Subject: [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler

From: Jordan Crouse <[email protected]>

Use the new adreno-smmu-priv fault info function to get more SMMU
debug registers and print the current TTBR0 to debug per-instance
pagetables and figure out which GPU block generated the request.
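
For illustration only (not part of the patch): a standalone sketch of the
two decode steps the handler performs, turning the FSR bits into a fault
type and the low byte of FSYNR1 (the "mid") into a source block. The FSR
bit positions and block names are taken from the diff below. The
uche_client parameter stands in for the value the driver reads from
REG_A6XX_UCHE_CLIENT_PF, and the eighth "UNKNOWN" table entry is an
addition of this sketch so that a client id of 7 stays in bounds.

```c
#include <stdint.h>

#define FSR_TF	(1u << 1)	/* translation fault */
#define FSR_PF	(1u << 3)	/* permission fault */
#define FSR_EF	(1u << 4)	/* external fault */

/* Map fault status bits to a name; the first matching bit wins. */
static const char *fault_type(uint32_t fsr)
{
	if (fsr & FSR_TF)
		return "TRANSLATION";
	if (fsr & FSR_PF)
		return "PERMISSION";
	if (fsr & FSR_EF)
		return "EXTERNAL";
	return "UNKNOWN";
}

/* Map the mid (fsynr1 & 0xff) and the UCHE client id to a block name. */
static const char *fault_block(uint32_t mid, uint32_t uche_client)
{
	static const char *const uche_clients[8] = {
		"VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ", "UNKNOWN",
	};

	switch (mid) {
	case 0:
		return "CP";
	case 4:
		return "CCU";
	case 6:
		return "CDP Prefetch";
	case 3:
		/* Most precise: one block per client id. */
		return uche_clients[uche_client & 7];
	case 2:
		return (uche_client & 7) == 0 ? "TP" : "TP|VFD";
	case 1:
		/* Catch-all for everything else behind UCHE. */
		return "UCHE";
	default:
		return "UNKNOWN";
	}
}
```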

Signed-off-by: Jordan Crouse <[email protected]>
Signed-off-by: Rob Clark <[email protected]>
---
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 4 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 76 +++++++++++++++++++++++++--
drivers/gpu/drm/msm/msm_iommu.c | 11 +++-
drivers/gpu/drm/msm/msm_mmu.h | 4 +-
4 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index f46562c12022..eb030b00bff4 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1075,7 +1075,7 @@ bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
return true;
}

-static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
+static int a5xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
{
struct msm_gpu *gpu = arg;
pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
@@ -1085,7 +1085,7 @@ static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(6)),
gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(7)));

- return -EFAULT;
+ return 0;
}

static void a5xx_cp_err_irq(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index c7f0ddb12d8f..fc19db10bff1 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1032,18 +1032,88 @@ static void a6xx_recover(struct msm_gpu *gpu)
msm_gpu_hw_init(gpu);
}

-static int a6xx_fault_handler(void *arg, unsigned long iova, int flags)
+static const char *a6xx_uche_fault_block(struct msm_gpu *gpu, u32 mid)
+{
+ static const char *uche_clients[8] = {
+ "VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ", "UNKNOWN",
+ };
+ u32 val;
+
+ if (mid < 1 || mid > 3)
+ return "UNKNOWN";
+
+ /*
+ * The source of the data depends on the mid ID read from FSYNR1
+ * and the client ID read from the UCHE block
+ */
+ val = gpu_read(gpu, REG_A6XX_UCHE_CLIENT_PF);
+
+ /* mid = 3 is most precise and refers to only one block per client */
+ if (mid == 3)
+ return uche_clients[val & 7];
+
+ /* For mid=2 the source is TP or VFD except when the client id is 0 */
+ if (mid == 2)
+ return ((val & 7) == 0) ? "TP" : "TP|VFD";
+
+ /* For mid=1 just return "UCHE" as a catchall for everything else */
+ return "UCHE";
+}
+
+static const char *a6xx_fault_block(struct msm_gpu *gpu, u32 id)
+{
+ if (id == 0)
+ return "CP";
+ else if (id == 4)
+ return "CCU";
+ else if (id == 6)
+ return "CDP Prefetch";
+
+ return a6xx_uche_fault_block(gpu, id);
+}
+
+#define ARM_SMMU_FSR_TF BIT(1)
+#define ARM_SMMU_FSR_PF BIT(3)
+#define ARM_SMMU_FSR_EF BIT(4)
+
+static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
{
struct msm_gpu *gpu = arg;
+ struct adreno_smmu_fault_info *info = data;
+ const char *type = "UNKNOWN";

- pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
+ /*
+ * Print a default message if we couldn't get the data from the
+ * adreno-smmu-priv
+ */
+ if (!info) {
+ pr_warn_ratelimited("*** gpu fault: iova=%.16lx flags=%d (%u,%u,%u,%u)\n",
iova, flags,
gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));

- return -EFAULT;
+ return 0;
+ }
+
+ if (info->fsr & ARM_SMMU_FSR_TF)
+ type = "TRANSLATION";
+ else if (info->fsr & ARM_SMMU_FSR_PF)
+ type = "PERMISSION";
+ else if (info->fsr & ARM_SMMU_FSR_EF)
+ type = "EXTERNAL";
+
+ pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
+ info->ttbr0, iova,
+ flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
+ a6xx_fault_block(gpu, info->fsynr1 & 0xff),
+ gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
+ gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
+ gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
+ gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
+
+ return 0;
}

static void a6xx_cp_hw_err_irq(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 50d881794758..6975b95c3c29 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
unsigned long iova, int flags, void *arg)
{
struct msm_iommu *iommu = arg;
+ struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
+ struct adreno_smmu_fault_info info, *ptr = NULL;
+
+ if (adreno_smmu->get_fault_info) {
+ adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);
+ ptr = &info;
+ }
+
if (iommu->base.handler)
- return iommu->base.handler(iommu->base.arg, iova, flags);
+ return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
+
pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
return 0;
}
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index 61ade89d9e48..a88f44c3268d 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -26,7 +26,7 @@ enum msm_mmu_type {
struct msm_mmu {
const struct msm_mmu_funcs *funcs;
struct device *dev;
- int (*handler)(void *arg, unsigned long iova, int flags);
+ int (*handler)(void *arg, unsigned long iova, int flags, void *data);
void *arg;
enum msm_mmu_type type;
};
@@ -43,7 +43,7 @@ struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain);
struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu);

static inline void msm_mmu_set_fault_handler(struct msm_mmu *mmu, void *arg,
- int (*handler)(void *arg, unsigned long iova, int flags))
+ int (*handler)(void *arg, unsigned long iova, int flags, void *data))
{
mmu->arg = arg;
mmu->handler = handler;
--
2.31.1

2021-06-10 21:48:09

Subject: [PATCH v5 5/5] drm/msm: devcoredump iommu fault support

From: Rob Clark <[email protected]>

Wire up support to stall the SMMU on iova fault, and collect a
devcoredump snapshot for easier debugging of faults.

Currently this is a6xx-only, mostly because a6xx is so far the only
user of adreno-smmu-priv.
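
For illustration only (not part of the patch): a user-space model of two
decisions this patch makes. The first is which capture path to use: the
crashdumper needs memory access, so it is skipped while the SMMU is
stalled or when it failed to initialize. The second is the
once-per-submit limit on snapshots. struct submit is reduced to the one
flag that matters here, and pick_capture_path() and should_dump() are
hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for struct msm_gem_submit. */
struct submit {
	bool fault_dumped;
};

enum capture_path {
	CAPTURE_CRASHDUMPER,	/* GPU writes its own state to memory */
	CAPTURE_CPU_REGS,	/* CPU reads registers directly */
};

/* The crashdumper is usable only when not stalled and initialized. */
static enum capture_path pick_capture_path(bool smmu_stalled,
					   bool crashdumper_ok)
{
	if (!smmu_stalled && crashdumper_ok)
		return CAPTURE_CRASHDUMPER;

	return CAPTURE_CPU_REGS;
}

/*
 * Returns true when a snapshot should be taken for this fault. A submit
 * is dumped at most once; a fault with no matching submit is still dumped.
 */
static bool should_dump(struct submit *submit)
{
	if (submit && submit->fault_dumped)
		return false;

	if (submit)
		submit->fault_dumped = true;

	return true;
}
```

Whichever way should_dump() goes, translation still has to be resumed
afterwards; in the diff below that is what the resume_smmu label in
fault_worker() is for.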

Signed-off-by: Rob Clark <[email protected]>
---
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 19 +++++++-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 38 +++++++++++++++-
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 42 ++++++++++++++----
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 +++++++
drivers/gpu/drm/msm/msm_gem.h | 1 +
drivers/gpu/drm/msm/msm_gem_submit.c | 1 +
drivers/gpu/drm/msm/msm_gpu.c | 48 +++++++++++++++++++++
drivers/gpu/drm/msm/msm_gpu.h | 17 ++++++++
drivers/gpu/drm/msm/msm_gpummu.c | 5 +++
drivers/gpu/drm/msm/msm_iommu.c | 11 +++++
drivers/gpu/drm/msm/msm_mmu.h | 1 +
11 files changed, 186 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index eb030b00bff4..7a271de9a212 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1200,6 +1200,15 @@ static void a5xx_fault_detect_irq(struct msm_gpu *gpu)
struct drm_device *dev = gpu->dev;
struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);

+ /*
+ * If stalled on SMMU fault, we could trip the GPU's hang detection,
+ * but the fault handler will trigger the devcore dump, and we want
+ * to otherwise resume normally rather than killing the submit, so
+ * just bail.
+ */
+ if (gpu_read(gpu, REG_A5XX_RBBM_STATUS3) & BIT(24))
+ return;
+
DRM_DEV_ERROR(dev->dev, "gpu fault ring %d fence %x status %8.8X rb %4.4x/%4.4x ib1 %16.16llX/%4.4x ib2 %16.16llX/%4.4x\n",
ring ? ring->id : -1, ring ? ring->seqno : 0,
gpu_read(gpu, REG_A5XX_RBBM_STATUS),
@@ -1523,6 +1532,7 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
{
struct a5xx_gpu_state *a5xx_state = kzalloc(sizeof(*a5xx_state),
GFP_KERNEL);
+ bool stalled = !!(gpu_read(gpu, REG_A5XX_RBBM_STATUS3) & BIT(24));

if (!a5xx_state)
return ERR_PTR(-ENOMEM);
@@ -1535,8 +1545,13 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)

a5xx_state->base.rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS);

- /* Get the HLSQ regs with the help of the crashdumper */
- a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
+ /*
+ * Get the HLSQ regs with the help of the crashdumper, but only if
+ * we are not stalled in an iommu fault (in which case the crashdumper
+ * would not have access to memory)
+ */
+ if (!stalled)
+ a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);

a5xx_set_hwcg(gpu, true);

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index fc19db10bff1..c3699408bd1f 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1081,6 +1081,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
struct msm_gpu *gpu = arg;
struct adreno_smmu_fault_info *info = data;
const char *type = "UNKNOWN";
+ const char *block;
+ bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
+
+ /*
+ * If we aren't going to be resuming later from fault_worker, then do
+ * it now.
+ */
+ if (!do_devcoredump) {
+ gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
+ }

/*
* Print a default message if we couldn't get the data from the
@@ -1104,15 +1114,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
else if (info->fsr & ARM_SMMU_FSR_EF)
type = "EXTERNAL";

+ block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
+
pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
info->ttbr0, iova,
- flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
- a6xx_fault_block(gpu, info->fsynr1 & 0xff),
+ flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
+ type, block,
gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));

+ if (do_devcoredump) {
+ /* Turn off the hangcheck timer to keep it from bothering us */
+ del_timer(&gpu->hangcheck_timer);
+
+ gpu->fault_info.ttbr0 = info->ttbr0;
+ gpu->fault_info.iova = iova;
+ gpu->fault_info.flags = flags;
+ gpu->fault_info.type = type;
+ gpu->fault_info.block = block;
+
+ kthread_queue_work(gpu->worker, &gpu->fault_work);
+ }
+
return 0;
}

@@ -1164,6 +1189,15 @@ static void a6xx_fault_detect_irq(struct msm_gpu *gpu)
struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);

+ /*
+ * If stalled on SMMU fault, we could trip the GPU's hang detection,
+ * but the fault handler will trigger the devcore dump, and we want
+ * to otherwise resume normally rather than killing the submit, so
+ * just bail.
+ */
+ if (gpu_read(gpu, REG_A6XX_RBBM_STATUS3) & A6XX_RBBM_STATUS3_SMMU_STALLED_ON_FAULT)
+ return;
+
/*
* Force the GPU to stay on until after we finish
* collecting information
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
index 21c49c5b4519..ad4ea0ed5d99 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
@@ -832,6 +832,20 @@ static void a6xx_get_registers(struct msm_gpu *gpu,
a6xx_get_ahb_gpu_registers(gpu,
a6xx_state, &a6xx_vbif_reglist,
&a6xx_state->registers[index++]);
+ if (!dumper) {
+ /*
+ * We can't use the crashdumper when the SMMU is stalled,
+ * because the GPU has no memory access until we resume
+ * translation (but we don't want to do that until after
+ * we have captured as much useful GPU state as possible).
+ * So instead collect registers via the CPU:
+ */
+ for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
+ a6xx_get_ahb_gpu_registers(gpu,
+ a6xx_state, &a6xx_reglist[i],
+ &a6xx_state->registers[index++]);
+ return;
+ }

for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
a6xx_get_crashdumper_registers(gpu,
@@ -905,11 +919,13 @@ static void a6xx_get_indexed_registers(struct msm_gpu *gpu,

struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
{
- struct a6xx_crashdumper dumper = { 0 };
+ struct a6xx_crashdumper _dumper = { 0 }, *dumper = NULL;
struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
struct a6xx_gpu_state *a6xx_state = kzalloc(sizeof(*a6xx_state),
GFP_KERNEL);
+ bool stalled = !!(gpu_read(gpu, REG_A6XX_RBBM_STATUS3) &
+ A6XX_RBBM_STATUS3_SMMU_STALLED_ON_FAULT);

if (!a6xx_state)
return ERR_PTR(-ENOMEM);
@@ -928,14 +944,24 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
/* Get the banks of indexed registers */
a6xx_get_indexed_registers(gpu, a6xx_state);

- /* Try to initialize the crashdumper */
- if (!a6xx_crashdumper_init(gpu, &dumper)) {
- a6xx_get_registers(gpu, a6xx_state, &dumper);
- a6xx_get_shaders(gpu, a6xx_state, &dumper);
- a6xx_get_clusters(gpu, a6xx_state, &dumper);
- a6xx_get_dbgahb_clusters(gpu, a6xx_state, &dumper);
+ /*
+ * Try to initialize the crashdumper, if we are not dumping state
+ * with the SMMU stalled. The crashdumper needs memory access to
+ * write out GPU state, so we need to skip this when the SMMU is
+ * stalled in response to an iova fault
+ */
+ if (!stalled && !a6xx_crashdumper_init(gpu, &_dumper)) {
+ dumper = &_dumper;
+ }
+
+ a6xx_get_registers(gpu, a6xx_state, dumper);
+
+ if (dumper) {
+ a6xx_get_shaders(gpu, a6xx_state, dumper);
+ a6xx_get_clusters(gpu, a6xx_state, dumper);
+ a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);

- msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
+ msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
}

if (snapshot_debugbus)
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index c1b02f790804..2bfe014995c7 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
adreno_gpu->info->revn, adreno_gpu->rev.core,
adreno_gpu->rev.major, adreno_gpu->rev.minor,
adreno_gpu->rev.patchid);
+ /*
+ * If this is state collected due to iova fault, show fault related info
+ *
+ * TTBR0 would not be zero, so this is a good way to distinguish
+ */
+ if (state->fault_info.ttbr0) {
+ const struct msm_gpu_fault_info *info = &state->fault_info;
+
+ drm_puts(p, "fault-info:\n");
+ drm_printf(p, " - ttbr0=%.16llx\n", info->ttbr0);
+ drm_printf(p, " - iova=%.16lx\n", info->iova);
+ drm_printf(p, " - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
+ drm_printf(p, " - type=%s\n", info->type);
+ drm_printf(p, " - source=%s\n", info->block);
+ }

drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);

diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 03e2cc2a2ce1..405f8411e395 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -328,6 +328,7 @@ struct msm_gem_submit {
struct dma_fence *fence;
struct msm_gpu_submitqueue *queue;
struct pid *pid; /* submitting process */
+ bool fault_dumped; /* Limit devcoredump dumping to one per submit */
bool valid; /* true if no cmdstream patching needed */
bool in_rb; /* "sudo" mode, copy cmds into RB */
struct msm_ringbuffer *ring;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 5480852bdeda..44f84bfd0c0e 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
submit->cmd = (void *)&submit->bos[nr_bos];
submit->queue = queue;
submit->ring = gpu->rb[queue->prio];
+ submit->fault_dumped = false;

/* initially, until copy_from_user() and bo lookup succeeds: */
submit->nr_bos = 0;
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index fa7691cb4614..414ba2dd34e5 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -400,6 +400,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
/* Fill in the additional crash state information */
state->comm = kstrdup(comm, GFP_KERNEL);
state->cmd = kstrdup(cmd, GFP_KERNEL);
+ state->fault_info = gpu->fault_info;

if (submit) {
int i, nr = 0;
@@ -572,6 +573,52 @@ static void recover_worker(struct kthread_work *work)
msm_gpu_retire(gpu);
}

+static void fault_worker(struct kthread_work *work)
+{
+ struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
+ struct drm_device *dev = gpu->dev;
+ struct msm_gem_submit *submit;
+ struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
+ char *comm = NULL, *cmd = NULL;
+
+ mutex_lock(&dev->struct_mutex);
+
+ submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
+ if (submit && submit->fault_dumped)
+ goto resume_smmu;
+
+ if (submit) {
+ struct task_struct *task;
+
+ task = get_pid_task(submit->pid, PIDTYPE_PID);
+ if (task) {
+ comm = kstrdup(task->comm, GFP_KERNEL);
+ cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
+ put_task_struct(task);
+ }
+
+ /*
+ * When we get GPU iova faults, we can get 1000s of them,
+ * but we really only want to log the first one.
+ */
+ submit->fault_dumped = true;
+ }
+
+ /* Record the crash state */
+ pm_runtime_get_sync(&gpu->pdev->dev);
+ msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
+ pm_runtime_put_sync(&gpu->pdev->dev);
+
+ kfree(cmd);
+ kfree(comm);
+
+resume_smmu:
+ memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
+ gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
+
+ mutex_unlock(&dev->struct_mutex);
+}
+
static void hangcheck_timer_reset(struct msm_gpu *gpu)
{
mod_timer(&gpu->hangcheck_timer,
@@ -948,6 +995,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
INIT_LIST_HEAD(&gpu->active_list);
kthread_init_work(&gpu->retire_work, retire_worker);
kthread_init_work(&gpu->recover_work, recover_worker);
+ kthread_init_work(&gpu->fault_work, fault_worker);

timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);

diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 7a082a12d98f..8eefb3aeca10 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -71,6 +71,15 @@ struct msm_gpu_funcs {
uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
};

+/* Additional state for iommu faults: */
+struct msm_gpu_fault_info {
+ u64 ttbr0;
+ unsigned long iova;
+ int flags;
+ const char *type;
+ const char *block;
+};
+
struct msm_gpu {
const char *name;
struct drm_device *dev;
@@ -135,6 +144,12 @@ struct msm_gpu {
#define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
struct timer_list hangcheck_timer;

+ /* Fault info for most recent iova fault: */
+ struct msm_gpu_fault_info fault_info;
+
+ /* work for handling GPU iova faults: */
+ struct kthread_work fault_work;
+
/* work for handling GPU recovery: */
struct kthread_work recover_work;

@@ -243,6 +258,8 @@ struct msm_gpu_state {
char *comm;
char *cmd;

+ struct msm_gpu_fault_info fault_info;
+
int nr_bos;
struct msm_gpu_state_bo *bos;
};
diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
index 379496186c7f..f7d1945e0c9f 100644
--- a/drivers/gpu/drm/msm/msm_gpummu.c
+++ b/drivers/gpu/drm/msm/msm_gpummu.c
@@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
return 0;
}

+static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
+{
+}
+
static void msm_gpummu_destroy(struct msm_mmu *mmu)
{
struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
@@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
.map = msm_gpummu_map,
.unmap = msm_gpummu_unmap,
.destroy = msm_gpummu_destroy,
+ .resume_translation = msm_gpummu_resume_translation,
};

struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 6975b95c3c29..eed2a762e9dd 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
* the arm-smmu driver as a trigger to set up TTBR0
*/
if (atomic_inc_return(&iommu->pagetables) == 1) {
+ /* Enable stall on iommu fault: */
+ adreno_smmu->set_stall(adreno_smmu->cookie, true);
+
ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
if (ret) {
free_io_pgtable_ops(pagetable->pgtbl_ops);
@@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
return 0;
}

+static void msm_iommu_resume_translation(struct msm_mmu *mmu)
+{
+ struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
+
+ adreno_smmu->resume_translation(adreno_smmu->cookie, true);
+}
+
static void msm_iommu_detach(struct msm_mmu *mmu)
{
struct msm_iommu *iommu = to_msm_iommu(mmu);
@@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
.map = msm_iommu_map,
.unmap = msm_iommu_unmap,
.destroy = msm_iommu_destroy,
+ .resume_translation = msm_iommu_resume_translation,
};

struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index a88f44c3268d..de158e1bf765 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -15,6 +15,7 @@ struct msm_mmu_funcs {
size_t len, int prot);
int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
void (*destroy)(struct msm_mmu *mmu);
+ void (*resume_translation)(struct msm_mmu *mmu);
};

enum msm_mmu_type {
--
2.31.1

2021-06-10 21:48:18

Subject: [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers

From: Jordan Crouse <[email protected]>

Call report_iommu_fault() to allow upper-level drivers to register their
own fault handlers.

Signed-off-by: Jordan Crouse <[email protected]>
Signed-off-by: Rob Clark <[email protected]>
Acked-by: Will Deacon <[email protected]>
---
drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..b4b32d31fc06 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct arm_smmu_device *smmu = smmu_domain->smmu;
int idx = smmu_domain->cfg.cbndx;
+ int ret;

fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
if (!(fsr & ARM_SMMU_FSR_FAULT))
@@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));

- dev_err_ratelimited(smmu->dev,
- "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
+ ret = report_iommu_fault(domain, NULL, iova,
+ fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+
+ if (ret == -ENOSYS)
+ dev_err_ratelimited(smmu->dev,
+ "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
fsr, iova, fsynr, cbfrsynra, idx);

arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
--
2.31.1
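
The fallback behaviour in this patch can be modelled outside the kernel. The sketch below is a user-space illustration only, not the arm-smmu code: struct domain, report_fault(), context_fault() and count_writes() are hypothetical stand-ins, and it assumes that the real report_iommu_fault() returns -ENOSYS when no handler is registered, which is what the "ret == -ENOSYS" check in the diff relies on.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for IOMMU_FAULT_READ/WRITE and ARM_SMMU_FSYNR0_WNR. */
#define FAULT_READ  0x0
#define FAULT_WRITE 0x1
#define FSYNR0_WNR  (1u << 4)

typedef int (*fault_handler_t)(void *token, unsigned long iova, int flags);

/* Minimal model of a domain that may or may not have a driver handler. */
struct domain {
	fault_handler_t handler;
	void *token;
};

/* Assumed behaviour: -ENOSYS when nothing is registered. */
static int report_fault(struct domain *d, unsigned long iova, int flags)
{
	if (!d->handler)
		return -ENOSYS;
	return d->handler(d->token, iova, flags);
}

/* Mirrors the shape of the patched IRQ path: report first, log only if unhandled. */
static int context_fault(struct domain *d, unsigned long iova, unsigned int fsynr)
{
	int flags = (fsynr & FSYNR0_WNR) ? FAULT_WRITE : FAULT_READ;
	int ret = report_fault(d, iova, flags);

	if (ret == -ENOSYS)
		fprintf(stderr, "Unhandled context fault: iova=0x%08lx, fsynr=0x%x\n",
			iova, fsynr);
	return ret;
}

/* Example upper-level handler: counts write faults and claims the fault. */
static int count_writes(void *token, unsigned long iova, int flags)
{
	int *writes = token;

	(void)iova;
	if (flags & FAULT_WRITE)
		(*writes)++;
	return 0;
}
```

With no handler installed the model logs the generic message; once a handler such as count_writes() is installed the generic message is suppressed and the driver decides what to print.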

2021-06-11 13:52:39

by Jordan Crouse

Subject: Re: [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support

On Thu, Jun 10, 2021 at 02:44:12PM -0700, Rob Clark wrote:
> From: Rob Clark <[email protected]>
>
> Add, via the adreno-smmu-priv interface, a way for the GPU to request
> the SMMU to stall translation on faults, and then later resume the
> translation, either retrying or terminating the current translation.
>
> This will be used on the GPU side to "freeze" the GPU while we snapshot
> useful state for devcoredump.
>

Acked-by: Jordan Crouse <[email protected]>

> Signed-off-by: Rob Clark <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++++++++++++++++++++++
> include/linux/adreno-smmu-priv.h | 7 +++++
> 2 files changed, 40 insertions(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index b2e31ea84128..61fc645c1325 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -13,6 +13,7 @@ struct qcom_smmu {
> struct arm_smmu_device smmu;
> bool bypass_quirk;
> u8 bypass_cbndx;
> + u32 stall_enabled;
> };
>
> static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
> @@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
> static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
> u32 reg)
> {
> + struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
> +
> /*
> * On the GPU device we want to process subsequent transactions after a
> * fault to keep the GPU from hanging
> */
> reg |= ARM_SMMU_SCTLR_HUPCF;
>
> + if (qsmmu->stall_enabled & BIT(idx))
> + reg |= ARM_SMMU_SCTLR_CFCFG;
> +
> arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
> }
>
> @@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
> info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
> }
>
> +static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
> +{
> + struct arm_smmu_domain *smmu_domain = (void *)cookie;
> + struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> + struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
> +
> + if (enabled)
> + qsmmu->stall_enabled |= BIT(cfg->cbndx);
> + else
> + qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
> +}
> +
> +static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
> +{
> + struct arm_smmu_domain *smmu_domain = (void *)cookie;
> + struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> + struct arm_smmu_device *smmu = smmu_domain->smmu;
> + u32 reg = 0;
> +
> + if (terminate)
> + reg |= ARM_SMMU_RESUME_TERMINATE;
> +
> + arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
> +}
> +
> #define QCOM_ADRENO_SMMU_GPU_SID 0
>
> static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
> @@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
> priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
> priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
> priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
> + priv->set_stall = qcom_adreno_smmu_set_stall;
> + priv->resume_translation = qcom_adreno_smmu_resume_translation;
>
> return 0;
> }
> diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
> index 53fe32fb9214..c637e0997f6d 100644
> --- a/include/linux/adreno-smmu-priv.h
> +++ b/include/linux/adreno-smmu-priv.h
> @@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
> * TTBR0 translation is enabled with the specified cfg
> * @get_fault_info: Called by the GPU fault handler to get information about
> * the fault
> + * @set_stall: Configure whether stall on fault (CFCFG) is enabled. Call
> + * before set_ttbr0_cfg(). If stalling on fault is enabled,
> + * the GPU driver must call resume_translation()
> + * @resume_translation: Resume translation after a fault
> + *
> *
> * The GPU driver (drm/msm) and adreno-smmu work together for controlling
> * the GPU's SMMU instance. This is by necessity, as the GPU is directly
> @@ -60,6 +65,8 @@ struct adreno_smmu_priv {
> const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
> int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
> void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
> + void (*set_stall)(const void *cookie, bool enabled);
> + void (*resume_translation)(const void *cookie, bool terminate);
> };
>
> #endif /* __ADRENO_SMMU_PRIV_H */
> --
> 2.31.1
>
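
The interaction between set_stall() and the SCTLR write in the quoted patch can be illustrated with a small user-space model. This is a sketch under assumptions, not the driver code: struct smmu_model and the model_*() helpers are hypothetical, and the bit positions used for CFCFG, HUPCF and the terminate flag are placeholders for illustration (the real definitions live in arm-smmu.h).

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)            (1u << (n))
/* Placeholder bit positions, assumed for illustration only. */
#define SCTLR_CFCFG       BIT(7)
#define SCTLR_HUPCF       BIT(8)
#define RESUME_TERMINATE  BIT(0)

/* One bit per context bank, like qcom_smmu::stall_enabled in the patch. */
struct smmu_model {
	uint32_t stall_enabled;
};

/* Only records the request; nothing reaches the "hardware" yet. */
static void model_set_stall(struct smmu_model *s, int cbndx, bool enabled)
{
	if (enabled)
		s->stall_enabled |= BIT(cbndx);
	else
		s->stall_enabled &= ~BIT(cbndx);
}

/* Value that would be written to SCTLR for context bank cbndx. */
static uint32_t model_sctlr(const struct smmu_model *s, int cbndx, uint32_t reg)
{
	reg |= SCTLR_HUPCF;
	if (s->stall_enabled & BIT(cbndx))
		reg |= SCTLR_CFCFG;
	return reg;
}

/* Value that would be written to the RESUME register. */
static uint32_t model_resume_value(bool terminate)
{
	return terminate ? RESUME_TERMINATE : 0;
}
```

Because the setter only updates the bitmask, the stall setting takes effect the next time SCTLR is written, which is consistent with the kerneldoc note that set_stall() should be called before set_ttbr0_cfg().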

2021-06-11 13:53:25

by Jordan Crouse

Subject: Re: [PATCH v5 5/5] drm/msm: devcoredump iommu fault support

On Thu, Jun 10, 2021 at 02:44:13PM -0700, Rob Clark wrote:
> From: Rob Clark <[email protected]>
>
> Wire up support to stall the SMMU on iova fault, and collect a
> devcoredump snapshot for easier debugging of faults.
>
> Currently this is a6xx-only, but mostly only because so far it is the
> only one using adreno-smmu-priv.

Acked-by: Jordan Crouse <[email protected]>

> Signed-off-by: Rob Clark <[email protected]>
> ---
> drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 19 +++++++-
> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 38 +++++++++++++++-
> drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 42 ++++++++++++++----
> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 +++++++
> drivers/gpu/drm/msm/msm_gem.h | 1 +
> drivers/gpu/drm/msm/msm_gem_submit.c | 1 +
> drivers/gpu/drm/msm/msm_gpu.c | 48 +++++++++++++++++++++
> drivers/gpu/drm/msm/msm_gpu.h | 17 ++++++++
> drivers/gpu/drm/msm/msm_gpummu.c | 5 +++
> drivers/gpu/drm/msm/msm_iommu.c | 11 +++++
> drivers/gpu/drm/msm/msm_mmu.h | 1 +
> 11 files changed, 186 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index eb030b00bff4..7a271de9a212 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> @@ -1200,6 +1200,15 @@ static void a5xx_fault_detect_irq(struct msm_gpu *gpu)
> struct drm_device *dev = gpu->dev;
> struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);
>
> + /*
> + * If stalled on SMMU fault, we could trip the GPU's hang detection,
> + * but the fault handler will trigger the devcore dump, and we want
> + * to otherwise resume normally rather than killing the submit, so
> + * just bail.
> + */
> + if (gpu_read(gpu, REG_A5XX_RBBM_STATUS3) & BIT(24))
> + return;
> +
> DRM_DEV_ERROR(dev->dev, "gpu fault ring %d fence %x status %8.8X rb %4.4x/%4.4x ib1 %16.16llX/%4.4x ib2 %16.16llX/%4.4x\n",
> ring ? ring->id : -1, ring ? ring->seqno : 0,
> gpu_read(gpu, REG_A5XX_RBBM_STATUS),
> @@ -1523,6 +1532,7 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
> {
> struct a5xx_gpu_state *a5xx_state = kzalloc(sizeof(*a5xx_state),
> GFP_KERNEL);
> + bool stalled = !!(gpu_read(gpu, REG_A5XX_RBBM_STATUS3) & BIT(24));
>
> if (!a5xx_state)
> return ERR_PTR(-ENOMEM);
> @@ -1535,8 +1545,13 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
>
> a5xx_state->base.rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS);
>
> - /* Get the HLSQ regs with the help of the crashdumper */
> - a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
> + /*
> + * Get the HLSQ regs with the help of the crashdumper, but only if
> + * we are not stalled in an iommu fault (in which case the crashdumper
> + * would not have access to memory)
> + */
> + if (!stalled)
> + a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
>
> a5xx_set_hwcg(gpu, true);
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index fc19db10bff1..c3699408bd1f 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1081,6 +1081,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
> struct msm_gpu *gpu = arg;
> struct adreno_smmu_fault_info *info = data;
> const char *type = "UNKNOWN";
> + const char *block;
> + bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
> +
> + /*
> + * If we aren't going to be resuming later from fault_worker, then do
> + * it now.
> + */
> + if (!do_devcoredump) {
> + gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> + }
>
> /*
> * Print a default message if we couldn't get the data from the
> @@ -1104,15 +1114,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
> else if (info->fsr & ARM_SMMU_FSR_EF)
> type = "EXTERNAL";
>
> + block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
> +
> pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
> info->ttbr0, iova,
> - flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
> - a6xx_fault_block(gpu, info->fsynr1 & 0xff),
> + flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
> + type, block,
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
>
> + if (do_devcoredump) {
> + /* Turn off the hangcheck timer to keep it from bothering us */
> + del_timer(&gpu->hangcheck_timer);
> +
> + gpu->fault_info.ttbr0 = info->ttbr0;
> + gpu->fault_info.iova = iova;
> + gpu->fault_info.flags = flags;
> + gpu->fault_info.type = type;
> + gpu->fault_info.block = block;
> +
> + kthread_queue_work(gpu->worker, &gpu->fault_work);
> + }
> +
> return 0;
> }
>
> @@ -1164,6 +1189,15 @@ static void a6xx_fault_detect_irq(struct msm_gpu *gpu)
> struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
> struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);
>
> + /*
> + * If stalled on SMMU fault, we could trip the GPU's hang detection,
> + * but the fault handler will trigger the devcore dump, and we want
> + * to otherwise resume normally rather than killing the submit, so
> + * just bail.
> + */
> + if (gpu_read(gpu, REG_A6XX_RBBM_STATUS3) & A6XX_RBBM_STATUS3_SMMU_STALLED_ON_FAULT)
> + return;
> +
> /*
> * Force the GPU to stay on until after we finish
> * collecting information
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> index 21c49c5b4519..ad4ea0ed5d99 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> @@ -832,6 +832,20 @@ static void a6xx_get_registers(struct msm_gpu *gpu,
> a6xx_get_ahb_gpu_registers(gpu,
> a6xx_state, &a6xx_vbif_reglist,
> &a6xx_state->registers[index++]);
> + if (!dumper) {
> + /*
> + * We can't use the crashdumper when the SMMU is stalled,
> + * because the GPU has no memory access until we resume
> + * translation (but we don't want to do that until after
> + * we have captured as much useful GPU state as possible).
> + * So instead collect registers via the CPU:
> + */
> + for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
> + a6xx_get_ahb_gpu_registers(gpu,
> + a6xx_state, &a6xx_reglist[i],
> + &a6xx_state->registers[index++]);
> + return;
> + }
>
> for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
> a6xx_get_crashdumper_registers(gpu,
> @@ -905,11 +919,13 @@ static void a6xx_get_indexed_registers(struct msm_gpu *gpu,
>
> struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
> {
> - struct a6xx_crashdumper dumper = { 0 };
> + struct a6xx_crashdumper _dumper = { 0 }, *dumper = NULL;
> struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
> struct a6xx_gpu_state *a6xx_state = kzalloc(sizeof(*a6xx_state),
> GFP_KERNEL);
> + bool stalled = !!(gpu_read(gpu, REG_A6XX_RBBM_STATUS3) &
> + A6XX_RBBM_STATUS3_SMMU_STALLED_ON_FAULT);
>
> if (!a6xx_state)
> return ERR_PTR(-ENOMEM);
> @@ -928,14 +944,24 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
> /* Get the banks of indexed registers */
> a6xx_get_indexed_registers(gpu, a6xx_state);
>
> - /* Try to initialize the crashdumper */
> - if (!a6xx_crashdumper_init(gpu, &dumper)) {
> - a6xx_get_registers(gpu, a6xx_state, &dumper);
> - a6xx_get_shaders(gpu, a6xx_state, &dumper);
> - a6xx_get_clusters(gpu, a6xx_state, &dumper);
> - a6xx_get_dbgahb_clusters(gpu, a6xx_state, &dumper);
> + /*
> + * Try to initialize the crashdumper, if we are not dumping state
> + * with the SMMU stalled. The crashdumper needs memory access to
> + * write out GPU state, so we need to skip this when the SMMU is
> + * stalled in response to an iova fault
> + */
> + if (!stalled && !a6xx_crashdumper_init(gpu, &_dumper)) {
> + dumper = &_dumper;
> + }
> +
> + a6xx_get_registers(gpu, a6xx_state, dumper);
> +
> + if (dumper) {
> + a6xx_get_shaders(gpu, a6xx_state, dumper);
> + a6xx_get_clusters(gpu, a6xx_state, dumper);
> + a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);
>
> - msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
> + msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
> }
>
> if (snapshot_debugbus)
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index c1b02f790804..2bfe014995c7 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
> adreno_gpu->info->revn, adreno_gpu->rev.core,
> adreno_gpu->rev.major, adreno_gpu->rev.minor,
> adreno_gpu->rev.patchid);
> + /*
> + * If this is state collected due to iova fault, show fault related info
> + *
> + * TTBR0 would not be zero, so this is a good way to distinguish
> + */
> + if (state->fault_info.ttbr0) {
> + const struct msm_gpu_fault_info *info = &state->fault_info;
> +
> + drm_puts(p, "fault-info:\n");
> + drm_printf(p, " - ttbr0=%.16llx\n", info->ttbr0);
> + drm_printf(p, " - iova=%.16lx\n", info->iova);
> + drm_printf(p, " - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
> + drm_printf(p, " - type=%s\n", info->type);
> + drm_printf(p, " - source=%s\n", info->block);
> + }
>
> drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);
>
> diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> index 03e2cc2a2ce1..405f8411e395 100644
> --- a/drivers/gpu/drm/msm/msm_gem.h
> +++ b/drivers/gpu/drm/msm/msm_gem.h
> @@ -328,6 +328,7 @@ struct msm_gem_submit {
> struct dma_fence *fence;
> struct msm_gpu_submitqueue *queue;
> struct pid *pid; /* submitting process */
> + bool fault_dumped; /* Limit devcoredump dumping to one per submit */
> bool valid; /* true if no cmdstream patching needed */
> bool in_rb; /* "sudo" mode, copy cmds into RB */
> struct msm_ringbuffer *ring;
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 5480852bdeda..44f84bfd0c0e 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> submit->cmd = (void *)&submit->bos[nr_bos];
> submit->queue = queue;
> submit->ring = gpu->rb[queue->prio];
> + submit->fault_dumped = false;
>
> /* initially, until copy_from_user() and bo lookup succeeds: */
> submit->nr_bos = 0;
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index fa7691cb4614..414ba2dd34e5 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -400,6 +400,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
> /* Fill in the additional crash state information */
> state->comm = kstrdup(comm, GFP_KERNEL);
> state->cmd = kstrdup(cmd, GFP_KERNEL);
> + state->fault_info = gpu->fault_info;
>
> if (submit) {
> int i, nr = 0;
> @@ -572,6 +573,52 @@ static void recover_worker(struct kthread_work *work)
> msm_gpu_retire(gpu);
> }
>
> +static void fault_worker(struct kthread_work *work)
> +{
> + struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
> + struct drm_device *dev = gpu->dev;
> + struct msm_gem_submit *submit;
> + struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
> + char *comm = NULL, *cmd = NULL;
> +
> + mutex_lock(&dev->struct_mutex);
> +
> + submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
> + if (submit && submit->fault_dumped)
> + goto resume_smmu;
> +
> + if (submit) {
> + struct task_struct *task;
> +
> + task = get_pid_task(submit->pid, PIDTYPE_PID);
> + if (task) {
> + comm = kstrdup(task->comm, GFP_KERNEL);
> + cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
> + put_task_struct(task);
> + }
> +
> + /*
> + * When we get GPU iova faults, we can get 1000s of them,
> + * but we really only want to log the first one.
> + */
> + submit->fault_dumped = true;
> + }
> +
> + /* Record the crash state */
> + pm_runtime_get_sync(&gpu->pdev->dev);
> + msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
> + pm_runtime_put_sync(&gpu->pdev->dev);
> +
> + kfree(cmd);
> + kfree(comm);
> +
> +resume_smmu:
> + memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
> + gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> +
> + mutex_unlock(&dev->struct_mutex);
> +}
> +
> static void hangcheck_timer_reset(struct msm_gpu *gpu)
> {
> mod_timer(&gpu->hangcheck_timer,
> @@ -948,6 +995,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> INIT_LIST_HEAD(&gpu->active_list);
> kthread_init_work(&gpu->retire_work, retire_worker);
> kthread_init_work(&gpu->recover_work, recover_worker);
> + kthread_init_work(&gpu->fault_work, fault_worker);
>
> timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
>
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index 7a082a12d98f..8eefb3aeca10 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -71,6 +71,15 @@ struct msm_gpu_funcs {
> uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
> };
>
> +/* Additional state for iommu faults: */
> +struct msm_gpu_fault_info {
> + u64 ttbr0;
> + unsigned long iova;
> + int flags;
> + const char *type;
> + const char *block;
> +};
> +
> struct msm_gpu {
> const char *name;
> struct drm_device *dev;
> @@ -135,6 +144,12 @@ struct msm_gpu {
> #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
> struct timer_list hangcheck_timer;
>
> + /* Fault info for most recent iova fault: */
> + struct msm_gpu_fault_info fault_info;
> +
> + /* work for handling GPU iova faults: */
> + struct kthread_work fault_work;
> +
> /* work for handling GPU recovery: */
> struct kthread_work recover_work;
>
> @@ -243,6 +258,8 @@ struct msm_gpu_state {
> char *comm;
> char *cmd;
>
> + struct msm_gpu_fault_info fault_info;
> +
> int nr_bos;
> struct msm_gpu_state_bo *bos;
> };
> diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
> index 379496186c7f..f7d1945e0c9f 100644
> --- a/drivers/gpu/drm/msm/msm_gpummu.c
> +++ b/drivers/gpu/drm/msm/msm_gpummu.c
> @@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
> return 0;
> }
>
> +static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
> +{
> +}
> +
> static void msm_gpummu_destroy(struct msm_mmu *mmu)
> {
> struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
> @@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
> .map = msm_gpummu_map,
> .unmap = msm_gpummu_unmap,
> .destroy = msm_gpummu_destroy,
> + .resume_translation = msm_gpummu_resume_translation,
> };
>
> struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index 6975b95c3c29..eed2a762e9dd 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
> * the arm-smmu driver as a trigger to set up TTBR0
> */
> if (atomic_inc_return(&iommu->pagetables) == 1) {
> + /* Enable stall on iommu fault: */
> + adreno_smmu->set_stall(adreno_smmu->cookie, true);
> +
> ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
> if (ret) {
> free_io_pgtable_ops(pagetable->pgtbl_ops);
> @@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
> return 0;
> }
>
> +static void msm_iommu_resume_translation(struct msm_mmu *mmu)
> +{
> + struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
> +
> + adreno_smmu->resume_translation(adreno_smmu->cookie, true);
> +}
> +
> static void msm_iommu_detach(struct msm_mmu *mmu)
> {
> struct msm_iommu *iommu = to_msm_iommu(mmu);
> @@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
> .map = msm_iommu_map,
> .unmap = msm_iommu_unmap,
> .destroy = msm_iommu_destroy,
> + .resume_translation = msm_iommu_resume_translation,
> };
>
> struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
> diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
> index a88f44c3268d..de158e1bf765 100644
> --- a/drivers/gpu/drm/msm/msm_mmu.h
> +++ b/drivers/gpu/drm/msm/msm_mmu.h
> @@ -15,6 +15,7 @@ struct msm_mmu_funcs {
> size_t len, int prot);
> int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
> void (*destroy)(struct msm_mmu *mmu);
> + void (*resume_translation)(struct msm_mmu *mmu);
> };
>
> enum msm_mmu_type {
> --
> 2.31.1
>
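
The split between the fault handler and fault_worker in the quoted patch decides who resumes translation and how often a dump is taken. The following user-space model is a sketch only: struct gpu_model, struct submit_model and the model_*() functions are hypothetical, counters stand in for the real side effects, and it assumes that a successful capture leaves a crash state recorded on the GPU until it is read out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for msm_gem_submit::fault_dumped. */
struct submit_model {
	bool fault_dumped;
};

/* Counters replace the real resume/capture/queue side effects. */
struct gpu_model {
	bool have_crashstate;	/* models gpu->crashstate != NULL */
	int captures;
	int resumes;
	int queued;
};

/* IRQ-side decision: resume now, or defer to the worker. */
static void model_fault_handler(struct gpu_model *gpu, bool have_info)
{
	bool do_devcoredump = have_info && !gpu->have_crashstate;

	if (!do_devcoredump) {
		gpu->resumes++;
		return;
	}
	gpu->queued++;
}

/* Worker side: at most one capture per submit, but always resume. */
static void model_fault_worker(struct gpu_model *gpu, struct submit_model *submit)
{
	if (!(submit && submit->fault_dumped)) {
		if (submit)
			submit->fault_dumped = true;
		gpu->captures++;
		gpu->have_crashstate = true;	/* assumption, see lead-in */
	}
	gpu->resumes++;
}
```

In this model every fault ends in exactly one resume, either directly in the handler or at the end of the worker, while repeated faults from the same submit produce only a single capture.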

2021-06-14 17:28:20

by Bjorn Andersson

Subject: Re: [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Jordan Crouse <[email protected]>
>
> Call report_iommu_fault() to allow upper-level drivers to register their
> own fault handlers.
>
> Signed-off-by: Jordan Crouse <[email protected]>
> Signed-off-by: Rob Clark <[email protected]>
> Acked-by: Will Deacon <[email protected]>

Reviewed-by: Bjorn Andersson <[email protected]>

Regards,
Bjorn

> ---
> drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> index 6f72c4d208ca..b4b32d31fc06 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> @@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> struct arm_smmu_device *smmu = smmu_domain->smmu;
> int idx = smmu_domain->cfg.cbndx;
> + int ret;
>
> fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
> if (!(fsr & ARM_SMMU_FSR_FAULT))
> @@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
> cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
>
> - dev_err_ratelimited(smmu->dev,
> - "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
> + ret = report_iommu_fault(domain, NULL, iova,
> + fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
> +
> + if (ret == -ENOSYS)
> + dev_err_ratelimited(smmu->dev,
> + "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
> fsr, iova, fsynr, cbfrsynra, idx);
>
> arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
> --
> 2.31.1
>

2021-06-14 17:35:20

by Bjorn Andersson

Subject: Re: [PATCH v5 2/5] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Jordan Crouse <[email protected]>
>
> Add a callback in adreno-smmu-priv to read interesting SMMU
> registers to provide an opportunity for a richer debug experience
> in the GPU driver.
>
> Signed-off-by: Jordan Crouse <[email protected]>
> Signed-off-by: Rob Clark <[email protected]>

I presume this implies that more generic options have been discussed.
Regardless, if further conclusions are made in that regard I expect that
this could serve as a base for such efforts.

Reviewed-by: Bjorn Andersson <[email protected]>

Regards,
Bjorn

> ---
> drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17 ++++++++++++
> drivers/iommu/arm/arm-smmu/arm-smmu.h | 2 ++
> include/linux/adreno-smmu-priv.h | 31 +++++++++++++++++++++-
> 3 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index 98b3a1c2a181..b2e31ea84128 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
> arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
> }
>
> +static void qcom_adreno_smmu_get_fault_info(const void *cookie,
> + struct adreno_smmu_fault_info *info)
> +{
> + struct arm_smmu_domain *smmu_domain = (void *)cookie;
> + struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> + struct arm_smmu_device *smmu = smmu_domain->smmu;
> +
> + info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
> + info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
> + info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
> + info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
> + info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
> + info->ttbr0 = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
> + info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
> +}
> +
> #define QCOM_ADRENO_SMMU_GPU_SID 0
>
> static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
> @@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
> priv->cookie = smmu_domain;
> priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
> priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
> + priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
>
> return 0;
> }
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> index c31a59d35c64..84c21c4b0691 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> @@ -224,6 +224,8 @@ enum arm_smmu_cbar_type {
> #define ARM_SMMU_CB_FSYNR0 0x68
> #define ARM_SMMU_FSYNR0_WNR BIT(4)
>
> +#define ARM_SMMU_CB_FSYNR1 0x6c
> +
> #define ARM_SMMU_CB_S1_TLBIVA 0x600
> #define ARM_SMMU_CB_S1_TLBIASID 0x610
> #define ARM_SMMU_CB_S1_TLBIVAL 0x620
> diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
> index a889f28afb42..53fe32fb9214 100644
> --- a/include/linux/adreno-smmu-priv.h
> +++ b/include/linux/adreno-smmu-priv.h
> @@ -8,6 +8,32 @@
>
> #include <linux/io-pgtable.h>
>
> +/**
> + * struct adreno_smmu_fault_info - container for key fault information
> + *
> + * @far: The faulting IOVA from ARM_SMMU_CB_FAR
> + * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0
> + * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR
> + * @fsr: The fault status from ARM_SMMU_CB_FSR
> + * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0
> + * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR1
> + * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx)
> + *
> + * This struct passes back key page fault information to the GPU driver
> + * through the get_fault_info function pointer.
> + * The GPU driver can use this information to print informative
> + * log messages and provide deeper GPU specific insight into the fault.
> + */
> +struct adreno_smmu_fault_info {
> + u64 far;
> + u64 ttbr0;
> + u32 contextidr;
> + u32 fsr;
> + u32 fsynr0;
> + u32 fsynr1;
> + u32 cbfrsynra;
> +};
> +
> /**
> * struct adreno_smmu_priv - private interface between adreno-smmu and GPU
> *
> @@ -17,6 +43,8 @@
> * @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank. A
> * NULL config disables TTBR0 translation, otherwise
> * TTBR0 translation is enabled with the specified cfg
> + * @get_fault_info: Called by the GPU fault handler to get information about
> + * the fault
> *
> * The GPU driver (drm/msm) and adreno-smmu work together for controlling
> * the GPU's SMMU instance. This is by necessity, as the GPU is directly
> @@ -31,6 +59,7 @@ struct adreno_smmu_priv {
> const void *cookie;
> const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
> int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
> + void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
> };
>
> -#endif /* __ADRENO_SMMU_PRIV_H */
> \ No newline at end of file
> +#endif /* __ADRENO_SMMU_PRIV_H */
> --
> 2.31.1
>
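
As an illustration of how a GPU-side handler might consume the snapshot introduced in the quoted patch, the sketch below decodes two of the fields. It is a user-space model with hypothetical names (struct fault_info_model, model_fault_dir(), model_fault_mid()); the WNR bit position follows the ARM_SMMU_FSYNR0_WNR definition shown in the diff, and the 0xff mask on FSYNR1 follows the a6xx handler elsewhere in this series.

```c
#include <stdint.h>

/* Same layout as struct adreno_smmu_fault_info in the patch. */
struct fault_info_model {
	uint64_t far;
	uint64_t ttbr0;
	uint32_t contextidr;
	uint32_t fsr;
	uint32_t fsynr0;
	uint32_t fsynr1;
	uint32_t cbfrsynra;
};

#define FSYNR0_WNR (1u << 4)	/* write-not-read, per the diff */

/* Direction of the faulting access, taken from FSYNR0. */
static const char *model_fault_dir(const struct fault_info_model *info)
{
	return (info->fsynr0 & FSYNR0_WNR) ? "WRITE" : "READ";
}

/* Low byte of FSYNR1, which the a6xx handler maps to a GPU block name. */
static uint32_t model_fault_mid(const struct fault_info_model *info)
{
	return info->fsynr1 & 0xff;
}
```

Mapping the returned id to a block name is GPU-specific and is left to the driver; the model only shows which bits are extracted.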

2021-06-14 17:48:31

by Bjorn Andersson

Subject: Re: [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Jordan Crouse <[email protected]>
>
> Use the new adreno-smmu-priv fault info function to get more SMMU
> debug registers and print the current TTBR0 to debug per-instance
> pagetables and figure out which GPU block generated the request.
>

Acked-by: Bjorn Andersson <[email protected]>

Regards,
Bjorn

> Signed-off-by: Jordan Crouse <[email protected]>
> Signed-off-by: Rob Clark <[email protected]>
> ---
> drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 4 +-
> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 76 +++++++++++++++++++++++++--
> drivers/gpu/drm/msm/msm_iommu.c | 11 +++-
> drivers/gpu/drm/msm/msm_mmu.h | 4 +-
> 4 files changed, 87 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index f46562c12022..eb030b00bff4 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> @@ -1075,7 +1075,7 @@ bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
> return true;
> }
>
> -static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
> +static int a5xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
> {
> struct msm_gpu *gpu = arg;
> pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
> @@ -1085,7 +1085,7 @@ static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
> gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(6)),
> gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(7)));
>
> - return -EFAULT;
> + return 0;
> }
>
> static void a5xx_cp_err_irq(struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index c7f0ddb12d8f..fc19db10bff1 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1032,18 +1032,88 @@ static void a6xx_recover(struct msm_gpu *gpu)
> msm_gpu_hw_init(gpu);
> }
>
> -static int a6xx_fault_handler(void *arg, unsigned long iova, int flags)
> +static const char *a6xx_uche_fault_block(struct msm_gpu *gpu, u32 mid)
> +{
> + static const char *uche_clients[7] = {
> + "VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ",
> + };
> + u32 val;
> +
> + if (mid < 1 || mid > 3)
> + return "UNKNOWN";
> +
> + /*
> + * The source of the data depends on the mid ID read from FSYNR1
> + * and the client ID read from the UCHE block
> + */
> + val = gpu_read(gpu, REG_A6XX_UCHE_CLIENT_PF);
> +
> + /* mid = 3 is most precise and refers to only one block per client */
> + if (mid == 3)
> + return uche_clients[val & 7];
> +
> + /* For mid=2 the source is TP or VFD except when the client id is 0 */
> + if (mid == 2)
> + return ((val & 7) == 0) ? "TP" : "TP|VFD";
> +
> + /* For mid=1 just return "UCHE" as a catchall for everything else */
> + return "UCHE";
> +}
> +
> +static const char *a6xx_fault_block(struct msm_gpu *gpu, u32 id)
> +{
> + if (id == 0)
> + return "CP";
> + else if (id == 4)
> + return "CCU";
> + else if (id == 6)
> + return "CDP Prefetch";
> +
> + return a6xx_uche_fault_block(gpu, id);
> +}
> +
> +#define ARM_SMMU_FSR_TF BIT(1)
> +#define ARM_SMMU_FSR_PF BIT(3)
> +#define ARM_SMMU_FSR_EF BIT(4)
> +
> +static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
> {
> struct msm_gpu *gpu = arg;
> + struct adreno_smmu_fault_info *info = data;
> + const char *type = "UNKNOWN";
>
> - pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
> + /*
> + * Print a default message if we couldn't get the data from the
> + * adreno-smmu-priv
> + */
> + if (!info) {
> + pr_warn_ratelimited("*** gpu fault: iova=%.16lx flags=%d (%u,%u,%u,%u)\n",
> iova, flags,
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
>
> - return -EFAULT;
> + return 0;
> + }
> +
> + if (info->fsr & ARM_SMMU_FSR_TF)
> + type = "TRANSLATION";
> + else if (info->fsr & ARM_SMMU_FSR_PF)
> + type = "PERMISSION";
> + else if (info->fsr & ARM_SMMU_FSR_EF)
> + type = "EXTERNAL";
> +
> + pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
> + info->ttbr0, iova,
> + flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
> + a6xx_fault_block(gpu, info->fsynr1 & 0xff),
> + gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
> + gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
> + gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
> + gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
> +
> + return 0;
> }
>
> static void a6xx_cp_hw_err_irq(struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index 50d881794758..6975b95c3c29 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
> unsigned long iova, int flags, void *arg)
> {
> struct msm_iommu *iommu = arg;
> + struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
> + struct adreno_smmu_fault_info info, *ptr = NULL;
> +
> + if (adreno_smmu->get_fault_info) {
> + adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);
> + ptr = &info;
> + }
> +
> if (iommu->base.handler)
> - return iommu->base.handler(iommu->base.arg, iova, flags);
> + return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
> +
> pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
> return 0;
> }
> diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
> index 61ade89d9e48..a88f44c3268d 100644
> --- a/drivers/gpu/drm/msm/msm_mmu.h
> +++ b/drivers/gpu/drm/msm/msm_mmu.h
> @@ -26,7 +26,7 @@ enum msm_mmu_type {
> struct msm_mmu {
> const struct msm_mmu_funcs *funcs;
> struct device *dev;
> - int (*handler)(void *arg, unsigned long iova, int flags);
> + int (*handler)(void *arg, unsigned long iova, int flags, void *data);
> void *arg;
> enum msm_mmu_type type;
> };
> @@ -43,7 +43,7 @@ struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain);
> struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu);
>
> static inline void msm_mmu_set_fault_handler(struct msm_mmu *mmu, void *arg,
> - int (*handler)(void *arg, unsigned long iova, int flags))
> + int (*handler)(void *arg, unsigned long iova, int flags, void *data))
> {
> mmu->arg = arg;
> mmu->handler = handler;
> --
> 2.31.1
>
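The block lookup in the a6xx hunk above can be tried outside the kernel with a sketch like the following. It assumes the value of REG_A6XX_UCHE_CLIENT_PF is passed in as a plain parameter (client_pf) instead of being read with gpu_read(), and it pads the client table to eight entries so that a client id of 7 maps to "UNKNOWN"; that padding is an addition of this sketch, not something the patch does.

```c
#include <stdint.h>

/*
 * Same names as the table in the patch, padded to 8 entries so that every
 * value of (client_pf & 7) has a defined string. The padding entry is an
 * assumption of this sketch.
 */
static const char *const uche_clients[8] = {
	"VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ", "UNKNOWN",
};

/* mid comes from FSYNR1, client_pf stands in for REG_A6XX_UCHE_CLIENT_PF. */
static const char *uche_fault_block(uint32_t mid, uint32_t client_pf)
{
	if (mid < 1 || mid > 3)
		return "UNKNOWN";

	/* mid = 3 refers to exactly one block per client id */
	if (mid == 3)
		return uche_clients[client_pf & 7];

	/* mid = 2: mirrors the patch, "TP" for client id 0, else "TP|VFD" */
	if (mid == 2)
		return ((client_pf & 7) == 0) ? "TP" : "TP|VFD";

	/* mid = 1: catchall */
	return "UCHE";
}

static const char *fault_block(uint32_t fsynr1, uint32_t client_pf)
{
	uint32_t id = fsynr1 & 0xff;

	if (id == 0)
		return "CP";
	if (id == 4)
		return "CCU";
	if (id == 6)
		return "CDP Prefetch";

	return uche_fault_block(id, client_pf);
}
```

For example, fault_block(3, 1) yields "SP", while fault_block(5, 0) falls through to "UNKNOWN".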

2021-06-14 17:56:44

by Bjorn Andersson

Subject: Re: [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Rob Clark <[email protected]>
>
> Add, via the adreno-smmu-priv interface, a way for the GPU to request
> the SMMU to stall translation on faults, and then later resume the
> translation, either retrying or terminating the current translation.
>
> This will be used on the GPU side to "freeze" the GPU while we snapshot
> useful state for devcoredump.
>
> Signed-off-by: Rob Clark <[email protected]>

Reviewed-by: Bjorn Andersson <[email protected]>

Regards,
Bjorn

> ---
> drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++++++++++++++++++++++
> include/linux/adreno-smmu-priv.h | 7 +++++
> 2 files changed, 40 insertions(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index b2e31ea84128..61fc645c1325 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -13,6 +13,7 @@ struct qcom_smmu {
> struct arm_smmu_device smmu;
> bool bypass_quirk;
> u8 bypass_cbndx;
> + u32 stall_enabled;
> };
>
> static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
> @@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
> static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
> u32 reg)
> {
> + struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
> +
> /*
> * On the GPU device we want to process subsequent transactions after a
> * fault to keep the GPU from hanging
> */
> reg |= ARM_SMMU_SCTLR_HUPCF;
>
> + if (qsmmu->stall_enabled & BIT(idx))
> + reg |= ARM_SMMU_SCTLR_CFCFG;
> +
> arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
> }
>
> @@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
> info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
> }
>
> +static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
> +{
> + struct arm_smmu_domain *smmu_domain = (void *)cookie;
> + struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> + struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
> +
> + if (enabled)
> + qsmmu->stall_enabled |= BIT(cfg->cbndx);
> + else
> + qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
> +}
> +
> +static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
> +{
> + struct arm_smmu_domain *smmu_domain = (void *)cookie;
> + struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> + struct arm_smmu_device *smmu = smmu_domain->smmu;
> + u32 reg = 0;
> +
> + if (terminate)
> + reg |= ARM_SMMU_RESUME_TERMINATE;
> +
> + arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
> +}
> +
> #define QCOM_ADRENO_SMMU_GPU_SID 0
>
> static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
> @@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
> priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
> priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
> priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
> + priv->set_stall = qcom_adreno_smmu_set_stall;
> + priv->resume_translation = qcom_adreno_smmu_resume_translation;
>
> return 0;
> }
> diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
> index 53fe32fb9214..c637e0997f6d 100644
> --- a/include/linux/adreno-smmu-priv.h
> +++ b/include/linux/adreno-smmu-priv.h
> @@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
> * TTBR0 translation is enabled with the specified cfg
> * @get_fault_info: Called by the GPU fault handler to get information about
> * the fault
> + * @set_stall: Configure whether stall on fault (CFCFG) is enabled. Call
> + * before set_ttbr0_cfg(). If stalling on fault is enabled,
> + * the GPU driver must call resume_translation()
> + * @resume_translation: Resume translation after a fault
> + *
> *
> * The GPU driver (drm/msm) and adreno-smmu work together for controlling
> * the GPU's SMMU instance. This is by necessity, as the GPU is directly
> @@ -60,6 +65,8 @@ struct adreno_smmu_priv {
> const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
> int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
> void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
> + void (*set_stall)(const void *cookie, bool enabled);
> + void (*resume_translation)(const void *cookie, bool terminate);
> };
>
> #endif /* __ADRENO_SMMU_PRIV_H */
> --
> 2.31.1
>
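A small userspace model may help show how the pieces of this patch interact: set_stall only records a per-context-bank bit, and the bit takes effect the next time SCTLR is written, which is presumably why the kernel-doc asks for set_stall to be called before set_ttbr0_cfg(). The struct model_smmu and the model_* functions are hypothetical, and the bit positions below are assumptions made for illustration; the real definitions live in arm-smmu.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Bit positions assumed for this sketch; see arm-smmu.h for the real ones. */
#define SCTLR_CFCFG      BIT(7)
#define SCTLR_HUPCF      BIT(8)
#define RESUME_TERMINATE BIT(0)

/* Hypothetical model of the state touched by the patch. */
struct model_smmu {
	uint32_t stall_enabled;	/* one bit per context bank */
	uint32_t sctlr[32];	/* last value written to each bank's SCTLR */
	uint32_t resume[32];	/* last value written to each bank's RESUME */
};

/* Mirrors qcom_adreno_smmu_set_stall(): only the bitmask changes. */
static void model_set_stall(struct model_smmu *s, int cbndx, bool enabled)
{
	if (enabled)
		s->stall_enabled |= BIT(cbndx);
	else
		s->stall_enabled &= ~BIT(cbndx);
}

/* Mirrors qcom_adreno_smmu_write_sctlr(): HUPCF always, CFCFG on request. */
static void model_write_sctlr(struct model_smmu *s, int idx, uint32_t reg)
{
	reg |= SCTLR_HUPCF;

	if (s->stall_enabled & BIT(idx))
		reg |= SCTLR_CFCFG;

	s->sctlr[idx] = reg;
}

/* Mirrors qcom_adreno_smmu_resume_translation(): retry or terminate. */
static void model_resume_translation(struct model_smmu *s, int cbndx,
				     bool terminate)
{
	uint32_t reg = 0;

	if (terminate)
		reg |= RESUME_TERMINATE;

	s->resume[cbndx] = reg;
}
```

In this model, calling model_set_stall() after the SCTLR write leaves the bank unchanged until something writes SCTLR again.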

2021-06-25 03:40:15

by Bjorn Andersson

Subject: Re: [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:
[..]
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index 50d881794758..6975b95c3c29 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
> unsigned long iova, int flags, void *arg)
> {
> struct msm_iommu *iommu = arg;
> + struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
> + struct adreno_smmu_fault_info info, *ptr = NULL;
> +
> + if (adreno_smmu->get_fault_info) {

This seemed reasonable when I read it last time, but I didn't realize
that the msm_fault_handler() is installed for all msm_iommu instances.

So while we're trying to recover from the boot splash and set up the new
framebuffer we end up here with iommu->base.dev being the mdss device.
Naturally drvdata of mdss is not a struct adreno_smmu_priv.

> + adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);

So here we just jump straight out into hyperspace, never to return.

Not sure how to wire this up to avoid the problem, but right now I don't
think we can boot any device with a boot splash.

Regards,
Bjorn

> + ptr = &info;
> + }
> +
> if (iommu->base.handler)
> - return iommu->base.handler(iommu->base.arg, iova, flags);
> + return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
> +
> pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
> return 0;
> }

2021-06-25 15:40:06

Subject: Re: [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler

On Thu, Jun 24, 2021 at 8:39 PM Bjorn Andersson
<[email protected]> wrote:
>
> On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:
> [..]
> > diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> > index 50d881794758..6975b95c3c29 100644
> > --- a/drivers/gpu/drm/msm/msm_iommu.c
> > +++ b/drivers/gpu/drm/msm/msm_iommu.c
> > @@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
> > unsigned long iova, int flags, void *arg)
> > {
> > struct msm_iommu *iommu = arg;
> > + struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
> > + struct adreno_smmu_fault_info info, *ptr = NULL;
> > +
> > + if (adreno_smmu->get_fault_info) {
>
> This seemed reasonable when I read it last time, but I didn't realize
> that the msm_fault_handler() is installed for all msm_iommu instances.
>
> So while we're trying to recover from the boot splash and set up the new
> framebuffer we end up here with iommu->base.dev being the mdss device.
> Naturally drvdata of mdss is not a struct adreno_smmu_priv.
>
> > + adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);
>
> So here we just jump straight out into hyperspace, never to return.
>
> Not sure how to wire this up to avoid the problem, but right now I don't
> think we can boot any device with a boot splash.
>

I think we could do:

------------------------
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index eed2a762e9dd..30ee8866154e 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -29,6 +29,9 @@ static struct msm_iommu_pagetable *to_pagetable(struct msm_mmu *mmu)
return container_of(mmu, struct msm_iommu_pagetable, base);
}

+static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
+ unsigned long iova, int flags, void *arg);
+
static int msm_iommu_pagetable_unmap(struct msm_mmu *mmu, u64 iova,
size_t size)
{
@@ -151,6 +154,8 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
struct io_pgtable_cfg ttbr0_cfg;
int ret;

+ iommu_set_fault_handler(iommu->domain, msm_fault_handler, iommu);
+
/* Get the pagetable configuration from the domain */
if (adreno_smmu->cookie)
ttbr1_cfg = adreno_smmu->get_ttbr1_cfg(adreno_smmu->cookie);
@@ -300,7 +305,6 @@ struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)

iommu->domain = domain;
msm_mmu_init(&iommu->base, dev, &funcs, MSM_MMU_IOMMU);
- iommu_set_fault_handler(domain, msm_fault_handler, iommu);

atomic_set(&iommu->pagetables, 0);

------------------------

That would have the result of setting the same fault handler multiple
times, but that looks harmless. The fault handling stuff is mostly there
to make it easier to debug userspace issues; the fallback dmesg spam from
arm-smmu should be sufficient for any kernel side issues.

BR,
-R
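The effect of the proposed change can be modelled in userspace roughly as follows. This is a sketch under assumptions, not the driver: the model_* names are hypothetical, model_set_fault_handler() stands in for iommu_set_fault_handler(), and model_report_fault() stands in for report_iommu_fault(), which as far as I know returns -ENOSYS when no handler is installed so that arm-smmu prints its own message.

```c
#include <errno.h>
#include <stddef.h>

struct model_domain;

typedef int (*fault_fn)(struct model_domain *domain, unsigned long iova,
			int flags, void *token);

/* Hypothetical stand-in for the handler state of an iommu_domain. */
struct model_domain {
	fault_fn handler;
	void *token;
};

/* Stand-in for iommu_set_fault_handler(): the last caller wins. */
static void model_set_fault_handler(struct model_domain *d, fault_fn fn,
				    void *token)
{
	d->handler = fn;
	d->token = token;
}

/* Stand-in for report_iommu_fault(): -ENOSYS when nothing is installed. */
static int model_report_fault(struct model_domain *d, unsigned long iova,
			      int flags)
{
	if (!d->handler)
		return -ENOSYS;

	return d->handler(d, iova, flags, d->token);
}

static int gpu_fault_calls;

/* Stand-in for msm_fault_handler(), which assumes an adreno-smmu device. */
static int model_gpu_fault_handler(struct model_domain *d, unsigned long iova,
				   int flags, void *token)
{
	(void)d; (void)iova; (void)flags; (void)token;
	gpu_fault_calls++;
	return 0;
}

/* msm_iommu_new() after the change: shared by display and GPU, no handler. */
static void model_iommu_new(struct model_domain *d)
{
	d->handler = NULL;
	d->token = NULL;
}

/* msm_iommu_pagetable_create(): GPU-only path, installs the handler. */
static void model_pagetable_create(struct model_domain *d, void *iommu)
{
	model_set_fault_handler(d, model_gpu_fault_handler, iommu);
}
```

Calling model_pagetable_create() twice on the same domain simply stores the same handler and token again, which matches the "harmless" repeat described above, while a display domain never reaches the GPU-specific handler.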

2021-07-04 12:57:34

by Dmitry Baryshkov

Subject: Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling

Hi,

I had the splash screen disabled on my RB3. However, once I enabled it,
I got the attached crash during boot on msm/msm-next. It
looks like it is related to this particular set of changes.

--
With best wishes
Dmitry

Attachments:

log-rb3-crash.txt (28.79 kB)

2021-07-04 18:18:31

Subject: Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling

I suspect you are getting a dpu fault, and need:

https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=hA@mail.gmail.com/

I suppose Bjorn was expecting me to send that patch.

BR,
-R

2021-07-06 15:09:41

by Yassine Oudjana

Subject: Re: [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler

In-Reply-To: <[email protected]>

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Jordan Crouse <[email protected]>
>
> Use the new adreno-smmu-priv fault info function to get more SMMU
> debug registers and print the current TTBR0 to debug per-instance
> pagetables and figure out which GPU block generated the request.
>
> Signed-off-by: Jordan Crouse <[email protected]>
> Signed-off-by: Rob Clark <[email protected]>
> ---
> drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 4 +-
> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 76 +++++++++++++++++++++++++--
> drivers/gpu/drm/msm/msm_iommu.c | 11 +++-
> drivers/gpu/drm/msm/msm_mmu.h | 4 +-
> 4 files changed, 87 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index f46562c12022..eb030b00bff4 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> @@ -1075,7 +1075,7 @@ bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
> return true;
> }
>
> -static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
> +static int a5xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
> {
> struct msm_gpu *gpu = arg;
> pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
> @@ -1085,7 +1085,7 @@ static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
> gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(6)),
> gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(7)));
>
> - return -EFAULT;
> + return 0;
> }
>
> static void a5xx_cp_err_irq(struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index c7f0ddb12d8f..fc19db10bff1 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1032,18 +1032,88 @@ static void a6xx_recover(struct msm_gpu *gpu)
> msm_gpu_hw_init(gpu);
> }
>
> -static int a6xx_fault_handler(void *arg, unsigned long iova, int flags)
> +static const char *a6xx_uche_fault_block(struct msm_gpu *gpu, u32 mid)
> +{
> + static const char *uche_clients[7] = {
> + "VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ",
> + };
> + u32 val;
> +
> + if (mid < 1 || mid > 3)
> + return "UNKNOWN";
> +
> + /*
> + * The source of the data depends on the mid ID read from FSYNR1
> + * and the client ID read from the UCHE block
> + */
> + val = gpu_read(gpu, REG_A6XX_UCHE_CLIENT_PF);
> +
> + /* mid = 3 is most precise and refers to only one block per client */
> + if (mid == 3)
> + return uche_clients[val & 7];
> +
> + /* For mid=2 the source is TP or VFD except when the client id is 0 */
> + if (mid == 2)
> + return ((val & 7) == 0) ? "TP" : "TP|VFD";
> +
> + /* For mid=1 just return "UCHE" as a catchall for everything else */
> + return "UCHE";
> +}
> +
> +static const char *a6xx_fault_block(struct msm_gpu *gpu, u32 id)
> +{
> + if (id == 0)
> + return "CP";
> + else if (id == 4)
> + return "CCU";
> + else if (id == 6)
> + return "CDP Prefetch";
> +
> + return a6xx_uche_fault_block(gpu, id);
> +}
> +
> +#define ARM_SMMU_FSR_TF BIT(1)
> +#define ARM_SMMU_FSR_PF BIT(3)
> +#define ARM_SMMU_FSR_EF BIT(4)
> +
> +static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
> {
> struct msm_gpu *gpu = arg;
> + struct adreno_smmu_fault_info *info = data;
> + const char *type = "UNKNOWN";
>
> - pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
> + /*
> + * Print a default message if we couldn't get the data from the
> + * adreno-smmu-priv
> + */
> + if (!info) {
> + pr_warn_ratelimited("*** gpu fault: iova=%.16lx flags=%d (%u,%u,%u,%u)\n",
> iova, flags,
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
> gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
>
> - return -EFAULT;
> + return 0;
> + }
> +
> + if (info->fsr & ARM_SMMU_FSR_TF)
> + type = "TRANSLATION";
> + else if (info->fsr & ARM_SMMU_FSR_PF)
> + type = "PERMISSION";
> + else if (info->fsr & ARM_SMMU_FSR_EF)
> + type = "EXTERNAL";
> +
> + pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
> + info->ttbr0, iova,
> + flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
> + a6xx_fault_block(gpu, info->fsynr1 & 0xff),
> + gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
> + gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
> + gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
> + gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
> +
> + return 0;
> }
>
> static void a6xx_cp_hw_err_irq(struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index 50d881794758..6975b95c3c29 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
> unsigned long iova, int flags, void *arg)
> {
> struct msm_iommu *iommu = arg;
> + struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
> + struct adreno_smmu_fault_info info, *ptr = NULL;
> +
> + if (adreno_smmu->get_fault_info) {
> + adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);

Calling this on msm8996 causes a panic:

[ 12.098593] [drm] Initialized msm 1.8.0 20130625 for 900000.mdss on minor 0
[ 12.115952] msm 900000.mdss: [drm:adreno_request_fw [msm]] loaded qcom/a530_pm4.fw from new location
[ 12.117173] msm 900000.mdss: [drm:adreno_request_fw [msm]] loaded qcom/a530_pfp.fw from new location
[ 12.118040] msm 900000.mdss: [drm:adreno_request_fw [msm]] loaded qcom/a530v3_gpmu.fw2 from new location
[ 12.159807] fb0: switching to msm from simple
[ 12.345745] q6asm-dai 9300000.remoteproc:smd-edge:apr:q6asm:dais: Adding to iommu group 3
[ 12.710132] Console: switching to colour dummy device 80x25
[ 12.967929] Unable to handle kernel execute from non-executable memory at virtual address ffff00008340a200
[ 12.967949] Mem abort info:
[ 12.967952] ESR = 0x8600000f
[ 12.967956] EC = 0x21: IABT (current EL), IL = 32 bits
[ 12.967962] SET = 0, FnV = 0
[ 12.967965] EA = 0, S1PTW = 0
[ 12.967969] FSC = 0x0f: level 3 permission fault
[ 12.967974] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000080ed2000
[ 12.967980] [ffff00008340a200] pgd=180000017e548003, p4d=180000017e548003, pud=180000017e1b5003, pmd=180000017e19a003, pte=006800010340af07
[ 12.968004] Internal error: Oops: 8600000f [#1] PREEMPT SMP
[ 12.968011] Modules linked in: q6asm_dai q6routing q6afe_dai q6adm q6asm q6dsp_common q6afe q6core venus_enc venus_dec videobuf2_dma_contig videobuf2_memops panel_lgphilips_sw43101 apr ath10k_pci ath10k_core ath mac80211 libarc4 cfg80211 venus_core v4l2_mem2mem videobuf2_v4l2 videobuf2_common videodev mc hci_uart nxp_nci_i2c nxp_nci btqca nci bluetooth nfc rfkill msm pmi8998_haptics qcom_fg drm_kms_helper syscopyarea slim_qcom_ngd_ctrl sysfillrect pdr_interface sysimgblt fb_sys_fops qcom_q6v5_mss qcom_q6v5_pas qcom_pil_info qcom_q6v5 qcom_sysmon qcom_common qmi_helpers mdt_loader snd_soc_apq8096 snd_soc_qcom_common socinfo rpmsg_char pwm_ir_tx rmtfs_mem snd_soc_wcd9335 regmap_slimbus snd_soc_core snd_compress slimbus snd_pcm snd_timer snd soundcore atmel_mxt_ts drm drm_panel_orientation_quirks
[ 12.968248] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 5.13.0+ #46
[ 12.968255] Hardware name: Xiaomi Mi Note 2 (DT)
[ 12.968261] pstate: 400000c5 (nZcv daIF -PAN -UAO -TCO BTYPE=--)
[ 12.968268] pc : 0xffff00008340a200
[ 12.968277] lr : msm_fault_handler+0x50/0xd0 [msm]
[ 12.968381] sp : ffff800010003e30
[ 12.968384] x29: ffff800010003e30 x28: ffff8000110c2ac0 x27: 0000000000000000
[ 12.968398] x26: ffff800010d8c6e0 x25: ffff800000000000 x24: 0000000083401000
[ 12.968412] x23: 0000000000000021 x22: ffff00008eeb2358 x21: 0000000000000000
[ 12.968426] x20: 0000000083401000 x19: ffff00008c5a9780 x18: 0000000000000001
[ 12.968439] x17: ffff8000ecd8b000 x16: ffff800010004000 x15: 0000000000004000
[ 12.968453] x14: 002093508d830af8 x13: 00000000000046aa x12: 0000000000000040
[ 12.968466] x11: ffff00008044a480 x10: ffff00008044a482 x9 : ffff00008001f068
[ 12.968479] x8 : ffff000080450028 x7 : 0000000000000000 x6 : ffff000080450128
[ 12.968492] x5 : ffff800008db1d60 x4 : ffff00008c5a9780 x3 : 0000000000000000
[ 12.968505] x2 : ffff00008340a200 x1 : ffff800010003e60 x0 : ffff800008dda9b0
[ 12.968519] Call trace:
[ 12.968524] 0xffff00008340a200
[ 12.968529] report_iommu_fault+0x20/0x3c
[ 12.968543] arm_smmu_context_fault+0x120/0x24c
[ 12.968550] __handle_irq_event_percpu+0x54/0x170
[ 12.968559] handle_irq_event+0x64/0x140
[ 12.968566] handle_fasteoi_irq+0xa4/0x1a0
[ 12.968574] handle_domain_irq+0x60/0x90
[ 12.968581] gic_handle_irq+0xb8/0x140
[ 12.968590] call_on_irq_stack+0x2c/0x5c
[ 12.968599] do_interrupt_handler+0x54/0x60
[ 12.968606] el1_interrupt+0x30/0x80
[ 12.968615] el1h_64_irq_handler+0x18/0x24
[ 12.968621] el1h_64_irq+0x78/0x7c
[ 12.968626] cpuidle_enter_state+0x12c/0x2f0
[ 12.968638] cpuidle_enter+0x38/0x50
[ 12.968645] do_idle+0x21c/0x2ac
[ 12.968655] cpu_startup_entry+0x28/0x80
[ 12.968661] rest_init+0xe4/0xf4
[ 12.968668] arch_call_rest_init+0x10/0x1c
[ 12.968676] start_kernel+0x64c/0x68c
[ 12.968682] __primary_switched+0xc0/0xc8
[ 12.968692] Code: 00000000 00000000 00000000 00000000 (08debf98)
[ 12.968705] ---[ end trace f35e302241530712 ]---
[ 12.975177] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 12.975188] SMP: stopping secondary CPUs
[ 13.175353] Kernel Offset: 0x80000 from 0xffff800010000000
[ 13.175359] PHYS_OFFSET: 0x80000000
[ 13.175363] CPU features: 0x0c000251,20000842
[ 13.175370] Memory Limit: none

It happens when I would usually get context faults caused by reserving continuous splash memory
for simplefb, like these:

[ 7.830005] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x834e0d00, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.836573] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x8374e300, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.843071] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x839bca00, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.849514] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x835eab00, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.855912] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x8384df00, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.862270] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x83471f00, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.868568] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x836ccf00, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.874846] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x83923900, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.881100] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x8353dd00, fsynr=0x21, cbfrsynra=0x0, cb=0
[ 7.887329] arm-smmu d00000.iommu: Unhandled context fault: fsr=0x402, iova=0x8378f800, fsynr=0x21, cbfrsynra=0x0, cb=0

But now I get a panic instead. Removing the memory reservation stops it from panicking,
although I do not think any fault should cause a panic.

> + ptr = &info;
> + }
> +
> if (iommu->base.handler)
> - return iommu->base.handler(iommu->base.arg, iova, flags);
> + return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
> +
> pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
> return 0;
> }
> diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
> index 61ade89d9e48..a88f44c3268d 100644
> --- a/drivers/gpu/drm/msm/msm_mmu.h
> +++ b/drivers/gpu/drm/msm/msm_mmu.h
> @@ -26,7 +26,7 @@ enum msm_mmu_type {
> struct msm_mmu {
> const struct msm_mmu_funcs *funcs;
> struct device *dev;
> - int (*handler)(void *arg, unsigned long iova, int flags);
> + int (*handler)(void *arg, unsigned long iova, int flags, void *data);
> void *arg;
> enum msm_mmu_type type;
> };
> @@ -43,7 +43,7 @@ struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain);
> struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu);
>
> static inline void msm_mmu_set_fault_handler(struct msm_mmu *mmu, void *arg,
> - int (*handler)(void *arg, unsigned long iova, int flags))
> + int (*handler)(void *arg, unsigned long iova, int flags, void *data))
> {
> mmu->arg = arg;
> mmu->handler = handler;
> --
> 2.31.1
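
The handler change in the quoted diff can be modeled outside the kernel. The following is a minimal userspace sketch, not the driver code: `struct mmu`, `struct fault_info`, `mmu_report_fault()` and `example_handler()` are hypothetical stand-ins, and only the four-argument handler signature and the "call the handler if installed, otherwise log and return 0" flow are taken from the diff. It assumes the extra `data` pointer may be NULL when no fault details are available, so a handler has to check it before use.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the fault details an SMMU driver might supply. */
struct fault_info {
	unsigned long long ttbr0;
	unsigned int fsr;
};

/* Four-argument handler signature, as introduced by the quoted diff. */
typedef int (*fault_handler_t)(void *arg, unsigned long iova, int flags,
			       void *data);

/* Hypothetical reduced model of struct msm_mmu: only handler and arg. */
struct mmu {
	fault_handler_t handler;
	void *arg;
};

/* Modeled on msm_mmu_set_fault_handler() from the quoted diff. */
static void mmu_set_fault_handler(struct mmu *mmu, void *arg,
				  fault_handler_t handler)
{
	mmu->arg = arg;
	mmu->handler = handler;
}

/*
 * Dispatch a fault: call the installed handler if there is one,
 * otherwise fall back to logging and report the fault as handled (0).
 * info may be NULL when no extra fault details could be collected.
 */
static int mmu_report_fault(struct mmu *mmu, unsigned long iova, int flags,
			    struct fault_info *info)
{
	if (mmu->handler)
		return mmu->handler(mmu->arg, iova, flags, info);

	fprintf(stderr, "*** fault: iova=%16lx, flags=%d\n", iova, flags);
	return 0;
}

/* Example handler: counts faults and tolerates a NULL data pointer. */
static int example_handler(void *arg, unsigned long iova, int flags,
			   void *data)
{
	int *count = arg;
	struct fault_info *info = data;

	(void)iova;
	(void)flags;
	(*count)++;

	/* Return 1 when fault details were supplied, 2 when they were not. */
	return info ? 1 : 2;
}
```

With no handler installed, `mmu_report_fault()` only logs and returns 0. Once `example_handler` is registered through `mmu_set_fault_handler()`, it is called for every fault, whether or not a `struct fault_info` is passed.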

2021-07-06 21:37:46

by Bjorn Andersson

Subject: Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling

On Sun 04 Jul 13:20 CDT 2021, Rob Clark wrote:

> I suspect you are getting a dpu fault, and need:
>
> https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=hA@mail.gmail.com/
>
> I suppose Bjorn was expecting me to send that patch
>

No, I left that discussion with the same understanding as you... But I
ended up sidetracked by some other craziness.

Did you post this somewhere or would you still like me to test it and
spin a patch?

Regards,
Bjorn

> BR,
> -R
>
> On Sun, Jul 4, 2021 at 5:53 AM Dmitry Baryshkov
> <[email protected]> wrote:
> >
> > Hi,
> >
> > I've had splash screen disabled on my RB3. However once I've enabled it,
> > I've got the attached crash during the boot on the msm/msm-next. It
> > looks like it is related to this particular set of changes.
> >
> > On 11/06/2021 00:44, Rob Clark wrote:
> > > From: Rob Clark <[email protected]>
> > >
> > > This picks up an earlier series[1] from Jordan, and adds additional
> > > support needed to generate GPU devcore dumps on iova faults. Original
> > > description:
> > >
> > > This is a stack to add an Adreno GPU specific handler for pagefaults. The first
> > > patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds
> > > a adreno-smmu-priv function hook to capture a handful of important debugging
> > > registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
> > > third patch to print more detailed information on page fault such as the TTBR0
> > > for the pagetable that caused the fault and the source of the fault as
> > > determined by a combination of the FSYNR1 register and an internal GPU
> > > register.
> > >
> > > This code provides a solid base that we can expand on later for even more
> > > extensive GPU side page fault debugging capabilities.
> > >
> > > v5: [Rob] Use RBBM_STATUS3.SMMU_STALLED_ON_FAULT to detect case where
> > > GPU snapshotting needs to avoid crashdumper, and check the
> > > RBBM_STATUS3.SMMU_STALLED_ON_FAULT in GPU hang irq paths
> > > v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
> > > resume translation after it has had a chance to snapshot the GPUs
> > > state
> > > v3: Always clear FSR even if the target driver is going to handle resume
> > > v2: Fix comment wording and function pointer check per Rob Clark
> > >
> > > [1] https://lore.kernel.org/dri-devel/[email protected]/
> > >
> > > Jordan Crouse (3):
> > > iommu/arm-smmu: Add support for driver IOMMU fault handlers
> > > iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
> > > info
> > > drm/msm: Improve the a6xx page fault handler
> > >
> > > Rob Clark (2):
> > > iommu/arm-smmu-qcom: Add stall support
> > > drm/msm: devcoredump iommu fault support
> > >
> > > drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 23 +++-
> > > drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 110 +++++++++++++++++++-
> > > drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 42 ++++++--
> > > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 +++
> > > drivers/gpu/drm/msm/msm_gem.h | 1 +
> > > drivers/gpu/drm/msm/msm_gem_submit.c | 1 +
> > > drivers/gpu/drm/msm/msm_gpu.c | 48 +++++++++
> > > drivers/gpu/drm/msm/msm_gpu.h | 17 +++
> > > drivers/gpu/drm/msm/msm_gpummu.c | 5 +
> > > drivers/gpu/drm/msm/msm_iommu.c | 22 +++-
> > > drivers/gpu/drm/msm/msm_mmu.h | 5 +-
> > > drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 50 +++++++++
> > > drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +-
> > > drivers/iommu/arm/arm-smmu/arm-smmu.h | 2 +
> > > include/linux/adreno-smmu-priv.h | 38 ++++++-
> > > 15 files changed, 367 insertions(+), 21 deletions(-)
> > >
> >
> >
> > --
> > With best wishes
> > Dmitry

2021-07-07 05:14:26

Subject: Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling

On Sun, Jul 4, 2021 at 11:16 AM Rob Clark <[email protected]> wrote:
>
> I suspect you are getting a dpu fault, and need:
>
> https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=hA@mail.gmail.com/
>
> I suppose Bjorn was expecting me to send that patch

If it's helpful, I applied that and it got the db845c booting mainline
again for me (along with some reverts for a separate ext4 shrinker
crash).
Tested-by: John Stultz <[email protected]>

thanks
-john

2021-07-07 17:52:38

Subject: Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling

On Tue, Jul 6, 2021 at 10:12 PM John Stultz <[email protected]> wrote:
>
> On Sun, Jul 4, 2021 at 11:16 AM Rob Clark <[email protected]> wrote:
> >
> > I suspect you are getting a dpu fault, and need:
> >
> > https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=hA@mail.gmail.com/
> >
> > I suppose Bjorn was expecting me to send that patch
>
> If it's helpful, I applied that and it got the db845c booting mainline
> again for me (along with some reverts for a separate ext4 shrinker
> crash).
> Tested-by: John Stultz <[email protected]>
>

Thanks, I'll send a patch shortly

BR,
-R