2021-07-28 01:04:07

by Rob Clark

Subject: [PATCH v4 00/13] drm/msm: drm scheduler conversion and cleanups

From: Rob Clark <[email protected]>

Conversion to gpu_scheduler, and bonus removal of
drm_gem_object_put_locked()

v2: Fix priority mixup (msm UAPI has lower numeric priority value as
higher priority, inverse of drm/scheduler) and add some comments
in the UAPI header to clarify.

Now that we move active refcnt get into msm_gem_submit, add a
patch to mark all bos busy before pinning, to avoid evicting bos
used in same batch.

Fix bo locking for cmdstream dumping ($debugfs/n/{rd,hangrd})

v3: Add a patch to drop submit bo_list and instead use -EALREADY
to detect errors with same obj appearing multiple times in the
submit ioctl bos table. Otherwise, with struct_mutex locking
dropped, we'd need to move insertion into and removal from
bo_list under the obj lock.

v4: One last small tweak, drop unused wait_queue_head_t in
msm_fence_context
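
Note on the v2 priority fix: the two conventions run in opposite directions, so a conversion has to mirror the index rather than pass it through. The following is only a minimal sketch of that idea. The helper name, the nr_levels parameter and the -1 error return are assumptions made for illustration and are not taken from this series.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the series: map an msm UAPI
 * priority (0 is the highest priority) onto a scheduler-style level
 * where a larger number means higher priority.
 *
 * nr_levels is the number of distinct levels and must be non-zero.
 * Returns -1 if uapi_prio is out of range.
 */
static int uapi_prio_to_sched_level(uint32_t uapi_prio, uint32_t nr_levels)
{
	if (nr_levels == 0 || uapi_prio >= nr_levels)
		return -1;

	/* Mirror the index so that UAPI priority 0 lands on the top level. */
	return (int)(nr_levels - 1 - uapi_prio);
}
```

With three levels, UAPI priority 0 maps to level 2 and UAPI priority 2 maps to level 0. The mapping actually used by the driver may differ.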

Rob Clark (13):
drm/msm: Docs and misc cleanup
drm/msm: Small submitqueue creation cleanup
drm/msm: drop drm_gem_object_put_locked()
drm: Drop drm_gem_object_put_locked()
drm/msm/submit: Simplify out-fence-fd handling
drm/msm: Consolidate submit bo state
drm/msm: Track "seqno" fences by idr
drm/msm: Return ERR_PTR() from submit_create()
drm/msm: Conversion to drm scheduler
drm/msm: Drop submit bo_list
drm/msm: Drop struct_mutex in submit path
drm/msm: Utilize gpu scheduler priorities
drm/msm/gem: Mark active before pinning

drivers/gpu/drm/drm_gem.c | 22 --
drivers/gpu/drm/msm/Kconfig | 1 +
drivers/gpu/drm/msm/adreno/a5xx_debugfs.c | 4 +-
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 6 +-
drivers/gpu/drm/msm/adreno/a5xx_power.c | 2 +-
drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 7 +-
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 12 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 2 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 4 +-
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 6 +-
drivers/gpu/drm/msm/msm_drv.c | 30 +-
drivers/gpu/drm/msm/msm_fence.c | 42 ---
drivers/gpu/drm/msm/msm_fence.h | 3 -
drivers/gpu/drm/msm/msm_gem.c | 94 +-----
drivers/gpu/drm/msm/msm_gem.h | 47 +--
drivers/gpu/drm/msm/msm_gem_submit.c | 344 ++++++++++++--------
drivers/gpu/drm/msm/msm_gpu.c | 46 +--
drivers/gpu/drm/msm/msm_gpu.h | 78 ++++-
drivers/gpu/drm/msm/msm_rd.c | 6 +-
drivers/gpu/drm/msm/msm_ringbuffer.c | 70 +++-
drivers/gpu/drm/msm/msm_ringbuffer.h | 12 +
drivers/gpu/drm/msm/msm_submitqueue.c | 53 ++-
include/drm/drm_gem.h | 2 -
include/uapi/drm/msm_drm.h | 14 +-
24 files changed, 516 insertions(+), 391 deletions(-)

--
2.31.1
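
Note on the v3 change: the cover letter says -EALREADY is used to detect the same object appearing more than once in the submit bos table. A toy userspace model of that idea follows. It assumes the error comes from a per-object lock that notices when its current holder tries to take it again. The toy_obj type and the toy_lock helpers are hypothetical stand-ins, not driver code.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a per-object lock that remembers which submit holds it. */
struct toy_obj {
	const void *owner;	/* NULL when unlocked */
};

/*
 * Lock obj on behalf of ctx. Returns 0 on success, -EALREADY if ctx
 * already holds the lock (the same object was listed twice), or -EBUSY
 * if another context holds it (a real lock would wait or back off).
 */
static int toy_lock(struct toy_obj *obj, const void *ctx)
{
	if (obj->owner == ctx)
		return -EALREADY;
	if (obj->owner != NULL)
		return -EBUSY;
	obj->owner = ctx;
	return 0;
}

/*
 * Lock every object in the table. On any failure, unlock the objects
 * taken so far and report the error.
 */
static int toy_lock_all(struct toy_obj **bos, size_t nr, const void *ctx)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < nr; i++) {
		ret = toy_lock(bos[i], ctx);
		if (ret)
			break;
	}
	if (ret) {
		/* Unwind entries 0..i-1, which are the ones we locked. */
		while (i-- > 0)
			bos[i]->owner = NULL;
	}
	return ret;
}
```

In this model no separate list is needed to spot duplicates: the second attempt to lock an object that the same context already holds reports the problem.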



2021-07-28 01:04:21

by Rob Clark

Subject: [PATCH v4 01/13] drm/msm: Docs and misc cleanup

From: Rob Clark <[email protected]>

Fix a couple of incorrect or misspelt comments, and add a submitqueue doc
comment.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
drivers/gpu/drm/msm/msm_gem.h | 3 +--
drivers/gpu/drm/msm/msm_gem_submit.c | 1 +
drivers/gpu/drm/msm/msm_gpu.h | 15 +++++++++++++++
drivers/gpu/drm/msm/msm_ringbuffer.c | 2 +-
drivers/gpu/drm/msm/msm_submitqueue.c | 9 +++++----
5 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 405f8411e395..d69fcb37ce17 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -313,8 +313,7 @@ void msm_gem_vunmap(struct drm_gem_object *obj);

/* Created per submit-ioctl, to track bo's and cmdstream bufs, etc,
* associated with the cmdstream submission for synchronization (and
- * make it easier to unwind when things go wrong, etc). This only
- * lasts for the duration of the submit-ioctl.
+ * make it easier to unwind when things go wrong, etc).
*/
struct msm_gem_submit {
struct kref ref;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 44f84bfd0c0e..6d46f9275a40 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -655,6 +655,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
bool has_ww_ticket = false;
unsigned i;
int ret, submitid;
+
if (!gpu)
return -ENXIO;

diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 710c3fedfbf3..96efcb31e502 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -250,6 +250,21 @@ struct msm_gpu_perfcntr {
const char *name;
};

+/**
+ * A submitqueue is associated with a gl context or vk queue (or equiv)
+ * in userspace.
+ *
+ * @id: userspace id for the submitqueue, unique within the drm_file
+ * @flags: userspace flags for the submitqueue, specified at creation
+ * (currently unusued)
+ * @prio: the submitqueue priority
+ * @faults: the number of GPU hangs associated with this submitqueue
+ * @ctx: the per-drm_file context associated with the submitqueue (ie.
+ * which set of pgtables do submits jobs associated with the
+ * submitqueue use)
+ * @node: node in the context's list of submitqueues
+ * @ref: reference count
+ */
struct msm_gpu_submitqueue {
int id;
u32 flags;
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 7e92d9532454..054461662af5 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -32,7 +32,7 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,

if (IS_ERR(ring->start)) {
ret = PTR_ERR(ring->start);
- ring->start = 0;
+ ring->start = NULL;
goto fail;
}

diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
index c3d206105d28..e5eef11ed014 100644
--- a/drivers/gpu/drm/msm/msm_submitqueue.c
+++ b/drivers/gpu/drm/msm/msm_submitqueue.c
@@ -98,17 +98,18 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
return 0;
}

+/*
+ * Create the default submit-queue (id==0), used for backwards compatibility
+ * for userspace that pre-dates the introduction of submitqueues.
+ */
int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
{
struct msm_drm_private *priv = drm->dev_private;
int default_prio;

- if (!ctx)
- return 0;
-
/*
* Select priority 2 as the "default priority" unless nr_rings is less
- * than 2 and then pick the lowest pirority
+ * than 2 and then pick the lowest priority
*/
default_prio = priv->gpu ?
clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1) : 0;
--
2.31.1


2021-07-28 01:04:25

by Rob Clark

Subject: [PATCH v4 02/13] drm/msm: Small submitqueue creation cleanup

From: Rob Clark <[email protected]>

If we don't have a gpu, there is no need to create a submitqueue, which
lets us simplify the error handling and submitqueue creation.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
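Note: the simplification in this patch comes from checking preconditions before allocating, so that the error paths have nothing to free. A minimal userspace sketch of that ordering follows. The toy_gpu and toy_queue types and both helpers are hypothetical stand-ins for the driver structures, and calloc() stands in for kzalloc().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct toy_gpu { uint32_t nr_rings; };
struct toy_queue { uint32_t prio; uint32_t flags; };

/*
 * Check every precondition before allocating, so that the error paths
 * have nothing to free.
 */
static int toy_queue_create(const struct toy_gpu *gpu, uint32_t prio,
			    uint32_t flags, struct toy_queue **out)
{
	struct toy_queue *queue;

	if (!gpu)
		return -ENODEV;
	if (prio >= gpu->nr_rings)
		return -EINVAL;

	queue = calloc(1, sizeof(*queue));
	if (!queue)
		return -ENOMEM;

	queue->prio = prio;
	queue->flags = flags;
	*out = queue;
	return 0;
}

/*
 * Default priority: 2, clamped to the last valid ring index.
 * Assumes gpu->nr_rings is at least 1.
 */
static uint32_t toy_default_prio(const struct toy_gpu *gpu)
{
	uint32_t max = gpu->nr_rings - 1;

	return max < 2 ? max : 2;
}
```

The caller owns the returned queue and releases it with free(). With the validation done up front, the only failure after the allocation attempt is the allocation itself.
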
drivers/gpu/drm/msm/msm_submitqueue.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
index e5eef11ed014..9e9fec61d629 100644
--- a/drivers/gpu/drm/msm/msm_submitqueue.c
+++ b/drivers/gpu/drm/msm/msm_submitqueue.c
@@ -66,6 +66,12 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
if (!ctx)
return -ENODEV;

+ if (!priv->gpu)
+ return -ENODEV;
+
+ if (prio >= priv->gpu->nr_rings)
+ return -EINVAL;
+
queue = kzalloc(sizeof(*queue), GFP_KERNEL);

if (!queue)
@@ -73,15 +79,7 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,

kref_init(&queue->ref);
queue->flags = flags;
-
- if (priv->gpu) {
- if (prio >= priv->gpu->nr_rings) {
- kfree(queue);
- return -EINVAL;
- }
-
- queue->prio = prio;
- }
+ queue->prio = prio;

write_lock(&ctx->queuelock);

@@ -107,12 +105,14 @@ int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
struct msm_drm_private *priv = drm->dev_private;
int default_prio;

+ if (!priv->gpu)
+ return -ENODEV;
+
/*
* Select priority 2 as the "default priority" unless nr_rings is less
* than 2 and then pick the lowest priority
*/
- default_prio = priv->gpu ?
- clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1) : 0;
+ default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1);

INIT_LIST_HEAD(&ctx->submitqueues);

--
2.31.1


2021-07-28 01:04:34

by Rob Clark

Subject: [PATCH v4 03/13] drm/msm: drop drm_gem_object_put_locked()

From: Rob Clark <[email protected]>

No idea why we were still using this. It certainly hasn't been needed
for some time. So drop the pointless twin codepaths.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
drivers/gpu/drm/msm/adreno/a5xx_debugfs.c | 4 +-
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 6 +--
drivers/gpu/drm/msm/adreno/a5xx_power.c | 2 +-
drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 7 ++-
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 12 ++---
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 2 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 4 +-
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +-
drivers/gpu/drm/msm/msm_gem.c | 56 ++++-----------------
drivers/gpu/drm/msm/msm_gem.h | 7 +--
drivers/gpu/drm/msm/msm_gem_submit.c | 2 +-
drivers/gpu/drm/msm/msm_gpu.c | 4 +-
drivers/gpu/drm/msm/msm_ringbuffer.c | 2 +-
13 files changed, 33 insertions(+), 77 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_debugfs.c b/drivers/gpu/drm/msm/adreno/a5xx_debugfs.c
index fc2c905b6c9e..c9d11d57aed6 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_debugfs.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_debugfs.c
@@ -117,13 +117,13 @@ reset_set(void *data, u64 val)

if (a5xx_gpu->pm4_bo) {
msm_gem_unpin_iova(a5xx_gpu->pm4_bo, gpu->aspace);
- drm_gem_object_put_locked(a5xx_gpu->pm4_bo);
+ drm_gem_object_put(a5xx_gpu->pm4_bo);
a5xx_gpu->pm4_bo = NULL;
}

if (a5xx_gpu->pfp_bo) {
msm_gem_unpin_iova(a5xx_gpu->pfp_bo, gpu->aspace);
- drm_gem_object_put_locked(a5xx_gpu->pfp_bo);
+ drm_gem_object_put(a5xx_gpu->pfp_bo);
a5xx_gpu->pfp_bo = NULL;
}

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index 7a271de9a212..0a93ed1d6b06 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1415,7 +1415,7 @@ struct a5xx_gpu_state {
static int a5xx_crashdumper_init(struct msm_gpu *gpu,
struct a5xx_crashdumper *dumper)
{
- dumper->ptr = msm_gem_kernel_new_locked(gpu->dev,
+ dumper->ptr = msm_gem_kernel_new(gpu->dev,
SZ_1M, MSM_BO_WC, gpu->aspace,
&dumper->bo, &dumper->iova);

@@ -1517,7 +1517,7 @@ static void a5xx_gpu_state_get_hlsq_regs(struct msm_gpu *gpu,

if (a5xx_crashdumper_run(gpu, &dumper)) {
kfree(a5xx_state->hlsqregs);
- msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
+ msm_gem_kernel_put(dumper.bo, gpu->aspace);
return;
}

@@ -1525,7 +1525,7 @@ static void a5xx_gpu_state_get_hlsq_regs(struct msm_gpu *gpu,
memcpy(a5xx_state->hlsqregs, dumper.ptr + (256 * SZ_1K),
count * sizeof(u32));

- msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
+ msm_gem_kernel_put(dumper.bo, gpu->aspace);
}

static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_power.c b/drivers/gpu/drm/msm/adreno/a5xx_power.c
index cdb165236a88..0e63a1429189 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_power.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_power.c
@@ -362,7 +362,7 @@ void a5xx_gpmu_ucode_init(struct msm_gpu *gpu)
*/
bosize = (cmds_size + (cmds_size / TYPE4_MAX_PAYLOAD) + 1) << 2;

- ptr = msm_gem_kernel_new_locked(drm, bosize,
+ ptr = msm_gem_kernel_new(drm, bosize,
MSM_BO_WC | MSM_BO_GPU_READONLY, gpu->aspace,
&a5xx_gpu->gpmu_bo, &a5xx_gpu->gpmu_iova);
if (IS_ERR(ptr))
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
index ee72510ff8ce..8abc9a2b114a 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
@@ -240,7 +240,7 @@ static int preempt_init_ring(struct a5xx_gpu *a5xx_gpu,
A5XX_PREEMPT_COUNTER_SIZE,
MSM_BO_WC, gpu->aspace, &counters_bo, &counters_iova);
if (IS_ERR(counters)) {
- msm_gem_kernel_put(bo, gpu->aspace, true);
+ msm_gem_kernel_put(bo, gpu->aspace);
return PTR_ERR(counters);
}

@@ -272,9 +272,8 @@ void a5xx_preempt_fini(struct msm_gpu *gpu)
int i;

for (i = 0; i < gpu->nr_rings; i++) {
- msm_gem_kernel_put(a5xx_gpu->preempt_bo[i], gpu->aspace, true);
- msm_gem_kernel_put(a5xx_gpu->preempt_counters_bo[i],
- gpu->aspace, true);
+ msm_gem_kernel_put(a5xx_gpu->preempt_bo[i], gpu->aspace);
+ msm_gem_kernel_put(a5xx_gpu->preempt_counters_bo[i], gpu->aspace);
}
}

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index b349692219b7..d7cec7f0dde0 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -1129,12 +1129,12 @@ int a6xx_gmu_stop(struct a6xx_gpu *a6xx_gpu)

static void a6xx_gmu_memory_free(struct a6xx_gmu *gmu)
{
- msm_gem_kernel_put(gmu->hfi.obj, gmu->aspace, false);
- msm_gem_kernel_put(gmu->debug.obj, gmu->aspace, false);
- msm_gem_kernel_put(gmu->icache.obj, gmu->aspace, false);
- msm_gem_kernel_put(gmu->dcache.obj, gmu->aspace, false);
- msm_gem_kernel_put(gmu->dummy.obj, gmu->aspace, false);
- msm_gem_kernel_put(gmu->log.obj, gmu->aspace, false);
+ msm_gem_kernel_put(gmu->hfi.obj, gmu->aspace);
+ msm_gem_kernel_put(gmu->debug.obj, gmu->aspace);
+ msm_gem_kernel_put(gmu->icache.obj, gmu->aspace);
+ msm_gem_kernel_put(gmu->dcache.obj, gmu->aspace);
+ msm_gem_kernel_put(gmu->dummy.obj, gmu->aspace);
+ msm_gem_kernel_put(gmu->log.obj, gmu->aspace);

gmu->aspace->mmu->funcs->detach(gmu->aspace->mmu);
msm_gem_address_space_put(gmu->aspace);
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 91f637b908f4..55ea136b8933 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1035,7 +1035,7 @@ static int a6xx_hw_init(struct msm_gpu *gpu)

if (adreno_gpu->base.hw_apriv || a6xx_gpu->has_whereami) {
if (!a6xx_gpu->shadow_bo) {
- a6xx_gpu->shadow = msm_gem_kernel_new_locked(gpu->dev,
+ a6xx_gpu->shadow = msm_gem_kernel_new(gpu->dev,
sizeof(u32) * gpu->nr_rings,
MSM_BO_WC | MSM_BO_MAP_PRIV,
gpu->aspace, &a6xx_gpu->shadow_bo,
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
index ad4ea0ed5d99..e8f65cd8eca6 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
@@ -112,7 +112,7 @@ static void *state_kmemdup(struct a6xx_gpu_state *a6xx_state, void *src,
static int a6xx_crashdumper_init(struct msm_gpu *gpu,
struct a6xx_crashdumper *dumper)
{
- dumper->ptr = msm_gem_kernel_new_locked(gpu->dev,
+ dumper->ptr = msm_gem_kernel_new(gpu->dev,
SZ_1M, MSM_BO_WC, gpu->aspace,
&dumper->bo, &dumper->iova);

@@ -961,7 +961,7 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
a6xx_get_clusters(gpu, a6xx_state, dumper);
a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);

- msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
+ msm_gem_kernel_put(dumper->bo, gpu->aspace);
}

if (snapshot_debugbus)
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index 9f5a30234b33..bad4809b68ef 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -390,7 +390,7 @@ struct drm_gem_object *adreno_fw_create_bo(struct msm_gpu *gpu,
struct drm_gem_object *bo;
void *ptr;

- ptr = msm_gem_kernel_new_locked(gpu->dev, fw->size - 4,
+ ptr = msm_gem_kernel_new(gpu->dev, fw->size - 4,
MSM_BO_WC | MSM_BO_GPU_READONLY, gpu->aspace, &bo, iova);

if (IS_ERR(ptr))
diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 5b665ed8a605..4e99c448b83a 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -1064,7 +1064,7 @@ void msm_gem_describe_objects(struct list_head *list, struct seq_file *m)
}
#endif

-/* don't call directly! Use drm_gem_object_put_locked() and friends */
+/* don't call directly! Use drm_gem_object_put() */
void msm_gem_free_object(struct drm_gem_object *obj)
{
struct msm_gem_object *msm_obj = to_msm_bo(obj);
@@ -1195,8 +1195,7 @@ static int msm_gem_new_impl(struct drm_device *dev,
return 0;
}

-static struct drm_gem_object *_msm_gem_new(struct drm_device *dev,
- uint32_t size, uint32_t flags, bool struct_mutex_locked)
+struct drm_gem_object *msm_gem_new(struct drm_device *dev, uint32_t size, uint32_t flags)
{
struct msm_drm_private *priv = dev->dev_private;
struct msm_gem_object *msm_obj;
@@ -1283,26 +1282,10 @@ static struct drm_gem_object *_msm_gem_new(struct drm_device *dev,
return obj;

fail:
- if (struct_mutex_locked) {
- drm_gem_object_put_locked(obj);
- } else {
- drm_gem_object_put(obj);
- }
+ drm_gem_object_put(obj);
return ERR_PTR(ret);
}

-struct drm_gem_object *msm_gem_new_locked(struct drm_device *dev,
- uint32_t size, uint32_t flags)
-{
- return _msm_gem_new(dev, size, flags, true);
-}
-
-struct drm_gem_object *msm_gem_new(struct drm_device *dev,
- uint32_t size, uint32_t flags)
-{
- return _msm_gem_new(dev, size, flags, false);
-}
-
struct drm_gem_object *msm_gem_import(struct drm_device *dev,
struct dma_buf *dmabuf, struct sg_table *sgt)
{
@@ -1361,12 +1344,12 @@ struct drm_gem_object *msm_gem_import(struct drm_device *dev,
return ERR_PTR(ret);
}

-static void *_msm_gem_kernel_new(struct drm_device *dev, uint32_t size,
+void *msm_gem_kernel_new(struct drm_device *dev, uint32_t size,
uint32_t flags, struct msm_gem_address_space *aspace,
- struct drm_gem_object **bo, uint64_t *iova, bool locked)
+ struct drm_gem_object **bo, uint64_t *iova)
{
void *vaddr;
- struct drm_gem_object *obj = _msm_gem_new(dev, size, flags, locked);
+ struct drm_gem_object *obj = msm_gem_new(dev, size, flags);
int ret;

if (IS_ERR(obj))
@@ -1390,42 +1373,21 @@ static void *_msm_gem_kernel_new(struct drm_device *dev, uint32_t size,

return vaddr;
err:
- if (locked)
- drm_gem_object_put_locked(obj);
- else
- drm_gem_object_put(obj);
+ drm_gem_object_put(obj);

return ERR_PTR(ret);

}

-void *msm_gem_kernel_new(struct drm_device *dev, uint32_t size,
- uint32_t flags, struct msm_gem_address_space *aspace,
- struct drm_gem_object **bo, uint64_t *iova)
-{
- return _msm_gem_kernel_new(dev, size, flags, aspace, bo, iova, false);
-}
-
-void *msm_gem_kernel_new_locked(struct drm_device *dev, uint32_t size,
- uint32_t flags, struct msm_gem_address_space *aspace,
- struct drm_gem_object **bo, uint64_t *iova)
-{
- return _msm_gem_kernel_new(dev, size, flags, aspace, bo, iova, true);
-}
-
void msm_gem_kernel_put(struct drm_gem_object *bo,
- struct msm_gem_address_space *aspace, bool locked)
+ struct msm_gem_address_space *aspace)
{
if (IS_ERR_OR_NULL(bo))
return;

msm_gem_put_vaddr(bo);
msm_gem_unpin_iova(bo, aspace);
-
- if (locked)
- drm_gem_object_put_locked(bo);
- else
- drm_gem_object_put(bo);
+ drm_gem_object_put(bo);
}

void msm_gem_object_set_name(struct drm_gem_object *bo, const char *fmt, ...)
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index d69fcb37ce17..71ccf87a646b 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -154,16 +154,11 @@ int msm_gem_new_handle(struct drm_device *dev, struct drm_file *file,
uint32_t size, uint32_t flags, uint32_t *handle, char *name);
struct drm_gem_object *msm_gem_new(struct drm_device *dev,
uint32_t size, uint32_t flags);
-struct drm_gem_object *msm_gem_new_locked(struct drm_device *dev,
- uint32_t size, uint32_t flags);
void *msm_gem_kernel_new(struct drm_device *dev, uint32_t size,
uint32_t flags, struct msm_gem_address_space *aspace,
struct drm_gem_object **bo, uint64_t *iova);
-void *msm_gem_kernel_new_locked(struct drm_device *dev, uint32_t size,
- uint32_t flags, struct msm_gem_address_space *aspace,
- struct drm_gem_object **bo, uint64_t *iova);
void msm_gem_kernel_put(struct drm_gem_object *bo,
- struct msm_gem_address_space *aspace, bool locked);
+ struct msm_gem_address_space *aspace);
struct drm_gem_object *msm_gem_import(struct drm_device *dev,
struct dma_buf *dmabuf, struct sg_table *sgt);
__printf(2, 3)
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 6d46f9275a40..e789f68d5be1 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -452,7 +452,7 @@ static void submit_cleanup(struct msm_gem_submit *submit)
struct msm_gem_object *msm_obj = submit->bos[i].obj;
submit_unlock_unpin_bo(submit, i, false);
list_del_init(&msm_obj->submit_entry);
- drm_gem_object_put_locked(&msm_obj->base);
+ drm_gem_object_put(&msm_obj->base);
}
}

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index c4ed8694f721..a0589666b1a3 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -992,7 +992,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
gpu->rb[i] = NULL;
}

- msm_gem_kernel_put(gpu->memptrs_bo, gpu->aspace, false);
+ msm_gem_kernel_put(gpu->memptrs_bo, gpu->aspace);

platform_set_drvdata(pdev, NULL);
return ret;
@@ -1011,7 +1011,7 @@ void msm_gpu_cleanup(struct msm_gpu *gpu)
gpu->rb[i] = NULL;
}

- msm_gem_kernel_put(gpu->memptrs_bo, gpu->aspace, false);
+ msm_gem_kernel_put(gpu->memptrs_bo, gpu->aspace);

if (!IS_ERR_OR_NULL(gpu->aspace)) {
gpu->aspace->mmu->funcs->detach(gpu->aspace->mmu);
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 054461662af5..437cca57d005 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -67,7 +67,7 @@ void msm_ringbuffer_destroy(struct msm_ringbuffer *ring)

msm_fence_context_free(ring->fctx);

- msm_gem_kernel_put(ring->bo, ring->gpu->aspace, false);
+ msm_gem_kernel_put(ring->bo, ring->gpu->aspace);

kfree(ring);
}
--
2.31.1


2021-07-28 01:04:36

by Rob Clark

Subject: [PATCH v4 04/13] drm: Drop drm_gem_object_put_locked()

From: Rob Clark <[email protected]>

Now that no one is using it, remove it.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
---
drivers/gpu/drm/drm_gem.c | 22 ----------------------
include/drm/drm_gem.h | 2 --
2 files changed, 24 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index d62fb1a3c916..a34525332bef 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -973,28 +973,6 @@ drm_gem_object_free(struct kref *kref)
}
EXPORT_SYMBOL(drm_gem_object_free);

-/**
- * drm_gem_object_put_locked - release a GEM buffer object reference
- * @obj: GEM buffer object
- *
- * This releases a reference to @obj. Callers must hold the
- * &drm_device.struct_mutex lock when calling this function, even when the
- * driver doesn't use &drm_device.struct_mutex for anything.
- *
- * For drivers not encumbered with legacy locking use
- * drm_gem_object_put() instead.
- */
-void
-drm_gem_object_put_locked(struct drm_gem_object *obj)
-{
- if (obj) {
- WARN_ON(!mutex_is_locked(&obj->dev->struct_mutex));
-
- kref_put(&obj->refcount, drm_gem_object_free);
- }
-}
-EXPORT_SYMBOL(drm_gem_object_put_locked);
-
/**
* drm_gem_vm_open - vma->ops->open implementation for GEM
* @vma: VM area structure
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index 240049566592..35e7f44c2a75 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -384,8 +384,6 @@ drm_gem_object_put(struct drm_gem_object *obj)
__drm_gem_object_put(obj);
}

-void drm_gem_object_put_locked(struct drm_gem_object *obj);
-
int drm_gem_handle_create(struct drm_file *file_priv,
struct drm_gem_object *obj,
u32 *handlep);
--
2.31.1


2021-07-28 01:04:49

by Rob Clark

Subject: [PATCH v4 05/13] drm/msm/submit: Simplify out-fence-fd handling

From: Rob Clark <[email protected]>

No need for this to be split in two parts.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
drivers/gpu/drm/msm/msm_gem_submit.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index e789f68d5be1..8abd743adfb0 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -645,7 +645,6 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
struct msm_file_private *ctx = file->driver_priv;
struct msm_gem_submit *submit;
struct msm_gpu *gpu = priv->gpu;
- struct sync_file *sync_file = NULL;
struct msm_gpu_submitqueue *queue;
struct msm_ringbuffer *ring;
struct msm_submit_post_dep *post_deps = NULL;
@@ -824,22 +823,19 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
}

if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
- sync_file = sync_file_create(submit->fence);
+ struct sync_file *sync_file = sync_file_create(submit->fence);
if (!sync_file) {
ret = -ENOMEM;
goto out;
}
+ fd_install(out_fence_fd, sync_file->file);
+ args->fence_fd = out_fence_fd;
}

msm_gpu_submit(gpu, submit);

args->fence = submit->fence->seqno;

- if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
- fd_install(out_fence_fd, sync_file->file);
- args->fence_fd = out_fence_fd;
- }
-
msm_reset_syncobjs(syncobjs_to_reset, args->nr_in_syncobjs);
msm_process_post_deps(post_deps, args->nr_out_syncobjs,
submit->fence);
--
2.31.1


2021-07-28 01:04:58

by Rob Clark

Subject: [PATCH v4 06/13] drm/msm: Consolidate submit bo state

From: Rob Clark <[email protected]>

Move all the locked/active/pinned state handling to msm_gem_submit.c.
In particular, for drm/scheduler, we'll need to do all this before
pushing the submit job to the scheduler. But while we're at it we can
get rid of the duplicate pin and refcnt.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
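Note: the core of this patch is a single unwind helper driven by a flag mask. The end of the ioctl always drops the lock and drops the pin only on error, while retire drops the pin later. The sketch below models just that flag logic in userspace. The BO_LOCKED and BO_PINNED values match the patch, but struct toy_bo, its fields and the toy_* helpers are hypothetical stand-ins for the real lock, pin and active-reference calls.

```c
#include <stdbool.h>

#define BO_LOCKED 0x4000u	/* obj lock is held */
#define BO_PINNED 0x2000u	/* obj is pinned and on active list */

/* Hypothetical stand-in for one entry of the submit bos table. */
struct toy_bo {
	unsigned flags;
	int pin_count;	/* models the pin plus active reference */
	bool locked;	/* models the object lock */
};

/* Undo only the state that is both set on the bo and requested. */
static void toy_cleanup_bo(struct toy_bo *bo, unsigned cleanup_flags)
{
	unsigned flags = bo->flags & cleanup_flags;

	if (flags & BO_PINNED)
		bo->pin_count--;
	if (flags & BO_LOCKED)
		bo->locked = false;

	bo->flags &= ~cleanup_flags;
}

/* End of ioctl: always unlock; on error also drop the pin. */
static void toy_submit_cleanup(struct toy_bo *bos, unsigned nr, bool error)
{
	unsigned cleanup_flags = BO_LOCKED;
	unsigned i;

	if (error)
		cleanup_flags |= BO_PINNED;

	for (i = 0; i < nr; i++)
		toy_cleanup_bo(&bos[i], cleanup_flags);
}

/* Retire: drop the pin that a successful ioctl left in place. */
static void toy_submit_retire(struct toy_bo *bos, unsigned nr)
{
	unsigned i;

	for (i = 0; i < nr; i++)
		toy_cleanup_bo(&bos[i], BO_PINNED);
}
```

Because the helper masks the bo's flags with the requested cleanup flags, a second cleanup pass over the same state does nothing, which makes the error and success paths easier to reason about.
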
drivers/gpu/drm/msm/msm_gem.h | 2 +
drivers/gpu/drm/msm/msm_gem_submit.c | 92 ++++++++++++++++++++++------
drivers/gpu/drm/msm/msm_gpu.c | 29 +--------
3 files changed, 75 insertions(+), 48 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 71ccf87a646b..da3af702a6c8 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -361,6 +361,8 @@ static inline void msm_gem_submit_put(struct msm_gem_submit *submit)
kref_put(&submit->ref, __msm_gem_submit_destroy);
}

+void msm_submit_retire(struct msm_gem_submit *submit);
+
/* helper to determine of a buffer in submit should be dumped, used for both
* devcoredump and debugfs cmdstream dumping:
*/
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 8abd743adfb0..4f02fa3c78f9 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -23,8 +23,8 @@

/* make sure these don't conflict w/ MSM_SUBMIT_BO_x */
#define BO_VALID 0x8000 /* is current addr in cmdstream correct/valid? */
-#define BO_LOCKED 0x4000
-#define BO_PINNED 0x2000
+#define BO_LOCKED 0x4000 /* obj lock is held */
+#define BO_PINNED 0x2000 /* obj is pinned and on active list */

static struct msm_gem_submit *submit_create(struct drm_device *dev,
struct msm_gpu *gpu,
@@ -220,21 +220,33 @@ static int submit_lookup_cmds(struct msm_gem_submit *submit,
return ret;
}

-static void submit_unlock_unpin_bo(struct msm_gem_submit *submit,
- int i, bool backoff)
+/* Unwind bo state, according to cleanup_flags. In the success case, only
+ * the lock is dropped at the end of the submit (and active/pin ref is dropped
+ * later when the submit is retired).
+ */
+static void submit_cleanup_bo(struct msm_gem_submit *submit, int i,
+ unsigned cleanup_flags)
{
- struct msm_gem_object *msm_obj = submit->bos[i].obj;
+ struct drm_gem_object *obj = &submit->bos[i].obj->base;
+ unsigned flags = submit->bos[i].flags & cleanup_flags;

- if (submit->bos[i].flags & BO_PINNED)
- msm_gem_unpin_iova_locked(&msm_obj->base, submit->aspace);
+ if (flags & BO_PINNED) {
+ msm_gem_unpin_iova_locked(obj, submit->aspace);
+ msm_gem_active_put(obj);
+ }

- if (submit->bos[i].flags & BO_LOCKED)
- dma_resv_unlock(msm_obj->base.resv);
+ if (flags & BO_LOCKED)
+ dma_resv_unlock(obj->resv);

- if (backoff && !(submit->bos[i].flags & BO_VALID))
- submit->bos[i].iova = 0;
+ submit->bos[i].flags &= ~cleanup_flags;
+}

- submit->bos[i].flags &= ~(BO_LOCKED | BO_PINNED);
+static void submit_unlock_unpin_bo(struct msm_gem_submit *submit, int i)
+{
+ submit_cleanup_bo(submit, i, BO_PINNED | BO_LOCKED);
+
+ if (!(submit->bos[i].flags & BO_VALID))
+ submit->bos[i].iova = 0;
}

/* This is where we make sure all the bo's are reserved and pin'd: */
@@ -266,10 +278,10 @@ static int submit_lock_objects(struct msm_gem_submit *submit)

fail:
for (; i >= 0; i--)
- submit_unlock_unpin_bo(submit, i, true);
+ submit_unlock_unpin_bo(submit, i);

if (slow_locked > 0)
- submit_unlock_unpin_bo(submit, slow_locked, true);
+ submit_unlock_unpin_bo(submit, slow_locked);

if (ret == -EDEADLK) {
struct msm_gem_object *msm_obj = submit->bos[contended].obj;
@@ -325,16 +337,18 @@ static int submit_pin_objects(struct msm_gem_submit *submit)
submit->valid = true;

for (i = 0; i < submit->nr_bos; i++) {
- struct msm_gem_object *msm_obj = submit->bos[i].obj;
+ struct drm_gem_object *obj = &submit->bos[i].obj->base;
uint64_t iova;

/* if locking succeeded, pin bo: */
- ret = msm_gem_get_and_pin_iova_locked(&msm_obj->base,
+ ret = msm_gem_get_and_pin_iova_locked(obj,
submit->aspace, &iova);

if (ret)
break;

+ msm_gem_active_get(obj, submit->gpu);
+
submit->bos[i].flags |= BO_PINNED;

if (iova == submit->bos[i].iova) {
@@ -350,6 +364,20 @@ static int submit_pin_objects(struct msm_gem_submit *submit)
return ret;
}

+static void submit_attach_object_fences(struct msm_gem_submit *submit)
+{
+ int i;
+
+ for (i = 0; i < submit->nr_bos; i++) {
+ struct drm_gem_object *obj = &submit->bos[i].obj->base;
+
+ if (submit->bos[i].flags & MSM_SUBMIT_BO_WRITE)
+ dma_resv_add_excl_fence(obj->resv, submit->fence);
+ else if (submit->bos[i].flags & MSM_SUBMIT_BO_READ)
+ dma_resv_add_shared_fence(obj->resv, submit->fence);
+ }
+}
+
static int submit_bo(struct msm_gem_submit *submit, uint32_t idx,
struct msm_gem_object **obj, uint64_t *iova, bool *valid)
{
@@ -444,18 +472,40 @@ static int submit_reloc(struct msm_gem_submit *submit, struct msm_gem_object *ob
return ret;
}

-static void submit_cleanup(struct msm_gem_submit *submit)
+/* Cleanup submit at end of ioctl. In the error case, this also drops
+ * references, unpins, and drops active refcnt. In the non-error case,
+ * this is done when the submit is retired.
+ */
+static void submit_cleanup(struct msm_gem_submit *submit, bool error)
{
+ unsigned cleanup_flags = BO_LOCKED;
unsigned i;

+ if (error)
+ cleanup_flags |= BO_PINNED;
+
for (i = 0; i < submit->nr_bos; i++) {
struct msm_gem_object *msm_obj = submit->bos[i].obj;
- submit_unlock_unpin_bo(submit, i, false);
+ submit_cleanup_bo(submit, i, cleanup_flags);
list_del_init(&msm_obj->submit_entry);
- drm_gem_object_put(&msm_obj->base);
+ if (error)
+ drm_gem_object_put(&msm_obj->base);
}
}

+void msm_submit_retire(struct msm_gem_submit *submit)
+{
+ int i;
+
+ for (i = 0; i < submit->nr_bos; i++) {
+ struct drm_gem_object *obj = &submit->bos[i].obj->base;
+
+ msm_gem_lock(obj);
+ submit_cleanup_bo(submit, i, BO_PINNED);
+ msm_gem_unlock(obj);
+ drm_gem_object_put(obj);
+ }
+}

struct msm_submit_post_dep {
struct drm_syncobj *syncobj;
@@ -832,6 +882,8 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
args->fence_fd = out_fence_fd;
}

+ submit_attach_object_fences(submit);
+
msm_gpu_submit(gpu, submit);

args->fence = submit->fence->seqno;
@@ -844,7 +896,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
out:
pm_runtime_put(&gpu->pdev->dev);
out_pre_pm:
- submit_cleanup(submit);
+ submit_cleanup(submit, !!ret);
if (has_ww_ticket)
ww_acquire_fini(&submit->ticket);
msm_gem_submit_put(submit);
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index a0589666b1a3..5bfc4d24a956 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -647,7 +647,6 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
volatile struct msm_gpu_submit_stats *stats;
u64 elapsed, clock = 0;
unsigned long flags;
- int i;

stats = &ring->memptrs->stats[index];
/* Convert 19.2Mhz alwayson ticks to nanoseconds for elapsed time */
@@ -663,15 +662,7 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
trace_msm_gpu_submit_retired(submit, elapsed, clock,
stats->alwayson_start, stats->alwayson_end);

- for (i = 0; i < submit->nr_bos; i++) {
- struct drm_gem_object *obj = &submit->bos[i].obj->base;
-
- msm_gem_lock(obj);
- msm_gem_active_put(obj);
- msm_gem_unpin_iova_locked(obj, submit->aspace);
- msm_gem_unlock(obj);
- drm_gem_object_put(obj);
- }
+ msm_submit_retire(submit);

pm_runtime_mark_last_busy(&gpu->pdev->dev);
pm_runtime_put_autosuspend(&gpu->pdev->dev);
@@ -748,7 +739,6 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
struct msm_drm_private *priv = dev->dev_private;
struct msm_ringbuffer *ring = submit->ring;
unsigned long flags;
- int i;

WARN_ON(!mutex_is_locked(&dev->struct_mutex));

@@ -762,23 +752,6 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)

update_sw_cntrs(gpu);

- for (i = 0; i < submit->nr_bos; i++) {
- struct msm_gem_object *msm_obj = submit->bos[i].obj;
- struct drm_gem_object *drm_obj = &msm_obj->base;
- uint64_t iova;
-
- /* submit takes a reference to the bo and iova until retired: */
- drm_gem_object_get(&msm_obj->base);
- msm_gem_get_and_pin_iova_locked(&msm_obj->base, submit->aspace, &iova);
-
- if (submit->bos[i].flags & MSM_SUBMIT_BO_WRITE)
- dma_resv_add_excl_fence(drm_obj->resv, submit->fence);
- else if (submit->bos[i].flags & MSM_SUBMIT_BO_READ)
- dma_resv_add_shared_fence(drm_obj->resv, submit->fence);
-
- msm_gem_active_get(drm_obj, gpu);
- }
-
/*
* ring->submits holds a ref to the submit, to deal with the case
* that a submit completes before msm_ioctl_gem_submit() returns.
--
2.31.1


2021-07-28 01:05:02

by Rob Clark

Subject: [PATCH v4 07/13] drm/msm: Track "seqno" fences by idr

From: Rob Clark <[email protected]>

Previously the (non-fd) fence returned from submit ioctl was a raw
seqno, which is scoped to the ring. But from UABI standpoint, the
ioctls related to seqno fences all specify a submitqueue. We can
take advantage of that to replace the seqno fences with a cyclic idr
handle.

This is in preparation for moving to drm scheduler, at which point
the submit ioctl will return after queuing the submit job to the
scheduler, but before the submit is written into the ring (and
therefore before a ring seqno has been assigned). That means we
need to replace the dma_fence that userspace may need to wait on
with a scheduler fence.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
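For illustration only, a minimal userspace sketch of the "cyclic id as
fence handle" idea. It assumes a small fixed-size table in place of the
kernel idr, and every name in it (fence_table, table_alloc_cyclic, ...)
is hypothetical. Unlike the patch, the sketch starts ids at 1 so that 0
can mean "no id".

```c
#include <stddef.h>

#define TABLE_SIZE 8	/* toy capacity; a real idr grows on demand */

struct fence_table {
	void *slots[TABLE_SIZE + 1];	/* index 0 unused: 0 means "no id" */
	int next;			/* where the next cyclic search starts */
};

/* Hand out the next free id after the most recently allocated one. */
int table_alloc_cyclic(struct fence_table *t, void *fence)
{
	for (int n = 0; n < TABLE_SIZE; n++) {
		int id = 1 + (t->next + n) % TABLE_SIZE;

		if (!t->slots[id]) {
			t->slots[id] = fence;
			t->next = id % TABLE_SIZE;
			return id;
		}
	}
	return -1;	/* table full */
}

/* Map an id back to its fence; NULL means "nothing to wait for". */
void *table_find(const struct fence_table *t, int id)
{
	if (id < 1 || id > TABLE_SIZE)
		return NULL;
	return t->slots[id];
}

/* Drop the id, as done when the submit is destroyed. */
void table_remove(struct fence_table *t, int id)
{
	if (id >= 1 && id <= TABLE_SIZE)
		t->slots[id] = NULL;
}
```

Because allocation continues from the last id instead of reusing the
lowest free one, a stale handle held by userspace is less likely to
alias a newer submit right after the older one retires.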
drivers/gpu/drm/msm/msm_drv.c | 30 +++++++++++++++++--
drivers/gpu/drm/msm/msm_fence.c | 42 ---------------------------
drivers/gpu/drm/msm/msm_fence.h | 3 --
drivers/gpu/drm/msm/msm_gem.h | 1 +
drivers/gpu/drm/msm/msm_gem_submit.c | 23 ++++++++++++++-
drivers/gpu/drm/msm/msm_gpu.h | 5 ++++
drivers/gpu/drm/msm/msm_submitqueue.c | 5 ++++
7 files changed, 61 insertions(+), 48 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 9b8fa2ad0d84..1594ae39d54f 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -911,6 +911,7 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
ktime_t timeout = to_ktime(args->timeout);
struct msm_gpu_submitqueue *queue;
struct msm_gpu *gpu = priv->gpu;
+ struct dma_fence *fence;
int ret;

if (args->pad) {
@@ -925,10 +926,35 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
if (!queue)
return -ENOENT;

- ret = msm_wait_fence(gpu->rb[queue->prio]->fctx, args->fence, &timeout,
- true);
+ /*
+ * Map submitqueue scoped "seqno" (which is actually an idr key)
+ * back to underlying dma-fence
+ *
+ * The fence is removed from the fence_idr when the submit is
+ * retired, so if the fence is not found it means there is nothing
+ * to wait for
+ */
+ ret = mutex_lock_interruptible(&queue->lock);
+ if (ret)
+ return ret;
+ fence = idr_find(&queue->fence_idr, args->fence);
+ if (fence)
+ fence = dma_fence_get_rcu(fence);
+ mutex_unlock(&queue->lock);
+
+ if (!fence)
+ return 0;

+ ret = dma_fence_wait_timeout(fence, true, timeout_to_jiffies(&timeout));
+ if (ret == 0) {
+ ret = -ETIMEDOUT;
+ } else if (ret != -ERESTARTSYS) {
+ ret = 0;
+ }
+
+ dma_fence_put(fence);
msm_submitqueue_put(queue);
+
return ret;
}

diff --git a/drivers/gpu/drm/msm/msm_fence.c b/drivers/gpu/drm/msm/msm_fence.c
index b92a9091a1e2..f2cece542c3f 100644
--- a/drivers/gpu/drm/msm/msm_fence.c
+++ b/drivers/gpu/drm/msm/msm_fence.c
@@ -24,7 +24,6 @@ msm_fence_context_alloc(struct drm_device *dev, volatile uint32_t *fenceptr,
strncpy(fctx->name, name, sizeof(fctx->name));
fctx->context = dma_fence_context_alloc(1);
fctx->fenceptr = fenceptr;
- init_waitqueue_head(&fctx->event);
spin_lock_init(&fctx->spinlock);

return fctx;
@@ -45,53 +44,12 @@ static inline bool fence_completed(struct msm_fence_context *fctx, uint32_t fenc
(int32_t)(*fctx->fenceptr - fence) >= 0;
}

-/* legacy path for WAIT_FENCE ioctl: */
-int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
- ktime_t *timeout, bool interruptible)
-{
- int ret;
-
- if (fence > fctx->last_fence) {
- DRM_ERROR_RATELIMITED("%s: waiting on invalid fence: %u (of %u)\n",
- fctx->name, fence, fctx->last_fence);
- return -EINVAL;
- }
-
- if (!timeout) {
- /* no-wait: */
- ret = fence_completed(fctx, fence) ? 0 : -EBUSY;
- } else {
- unsigned long remaining_jiffies = timeout_to_jiffies(timeout);
-
- if (interruptible)
- ret = wait_event_interruptible_timeout(fctx->event,
- fence_completed(fctx, fence),
- remaining_jiffies);
- else
- ret = wait_event_timeout(fctx->event,
- fence_completed(fctx, fence),
- remaining_jiffies);
-
- if (ret == 0) {
- DBG("timeout waiting for fence: %u (completed: %u)",
- fence, fctx->completed_fence);
- ret = -ETIMEDOUT;
- } else if (ret != -ERESTARTSYS) {
- ret = 0;
- }
- }
-
- return ret;
-}
-
/* called from workqueue */
void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence)
{
spin_lock(&fctx->spinlock);
fctx->completed_fence = max(fence, fctx->completed_fence);
spin_unlock(&fctx->spinlock);
-
- wake_up_all(&fctx->event);
}

struct msm_fence {
diff --git a/drivers/gpu/drm/msm/msm_fence.h b/drivers/gpu/drm/msm/msm_fence.h
index 6ab97062ff1a..4783db528bcc 100644
--- a/drivers/gpu/drm/msm/msm_fence.h
+++ b/drivers/gpu/drm/msm/msm_fence.h
@@ -49,7 +49,6 @@ struct msm_fence_context {
*/
volatile uint32_t *fenceptr;

- wait_queue_head_t event;
spinlock_t spinlock;
};

@@ -57,8 +56,6 @@ struct msm_fence_context * msm_fence_context_alloc(struct drm_device *dev,
volatile uint32_t *fenceptr, const char *name);
void msm_fence_context_free(struct msm_fence_context *fctx);

-int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
- ktime_t *timeout, bool interruptible);
void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence);

struct dma_fence * msm_fence_alloc(struct msm_fence_context *fctx);
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index da3af702a6c8..e0579abda5b9 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -320,6 +320,7 @@ struct msm_gem_submit {
struct ww_acquire_ctx ticket;
uint32_t seqno; /* Sequence number of the submit on the ring */
struct dma_fence *fence;
+ int fence_id; /* key into queue->fence_idr */
struct msm_gpu_submitqueue *queue;
struct pid *pid; /* submitting process */
bool fault_dumped; /* Limit devcoredump dumping to one per submit */
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 4f02fa3c78f9..f6f595aae2c5 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -68,7 +68,14 @@ void __msm_gem_submit_destroy(struct kref *kref)
container_of(kref, struct msm_gem_submit, ref);
unsigned i;

+ if (submit->fence_id) {
+ mutex_lock(&submit->queue->lock);
+ idr_remove(&submit->queue->fence_idr, submit->fence_id);
+ mutex_unlock(&submit->queue->lock);
+ }
+
dma_fence_put(submit->fence);
+
put_pid(submit->pid);
msm_submitqueue_put(submit->queue);

@@ -872,6 +879,20 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
goto out;
}

+ /*
+ * Allocate an id which can be used by WAIT_FENCE ioctl to map back
+ * to the underlying fence.
+ */
+ mutex_lock(&queue->lock);
+ submit->fence_id = idr_alloc_cyclic(&queue->fence_idr,
+ submit->fence, 0, INT_MAX, GFP_KERNEL);
+ mutex_unlock(&queue->lock);
+ if (submit->fence_id < 0) {
+ ret = submit->fence_id;
+ submit->fence_id = 0;
+ goto out;
+ }
+
if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
struct sync_file *sync_file = sync_file_create(submit->fence);
if (!sync_file) {
@@ -886,7 +907,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,

msm_gpu_submit(gpu, submit);

- args->fence = submit->fence->seqno;
+ args->fence = submit->fence_id;

msm_reset_syncobjs(syncobjs_to_reset, args->nr_in_syncobjs);
msm_process_post_deps(post_deps, args->nr_out_syncobjs,
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 96efcb31e502..579627252540 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -263,6 +263,9 @@ struct msm_gpu_perfcntr {
* which set of pgtables do submits jobs associated with the
* submitqueue use)
* @node: node in the context's list of submitqueues
+ * @fence_idr: maps fence-id to dma_fence for userspace visible fence
+ * seqno, protected by submitqueue lock
+ * @lock: submitqueue lock
* @ref: reference count
*/
struct msm_gpu_submitqueue {
@@ -272,6 +275,8 @@ struct msm_gpu_submitqueue {
int faults;
struct msm_file_private *ctx;
struct list_head node;
+ struct idr fence_idr;
+ struct mutex lock;
struct kref ref;
};

diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
index 9e9fec61d629..66f8d0fb38b0 100644
--- a/drivers/gpu/drm/msm/msm_submitqueue.c
+++ b/drivers/gpu/drm/msm/msm_submitqueue.c
@@ -12,6 +12,8 @@ void msm_submitqueue_destroy(struct kref *kref)
struct msm_gpu_submitqueue *queue = container_of(kref,
struct msm_gpu_submitqueue, ref);

+ idr_destroy(&queue->fence_idr);
+
msm_file_private_put(queue->ctx);

kfree(queue);
@@ -89,6 +91,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
if (id)
*id = queue->id;

+ idr_init(&queue->fence_idr);
+ mutex_init(&queue->lock);
+
list_add_tail(&queue->node, &ctx->submitqueues);

write_unlock(&ctx->queuelock);
--
2.31.1


2021-07-28 01:05:13

by Rob Clark

Subject: [PATCH v4 08/13] drm/msm: Return ERR_PTR() from submit_create()

From: Rob Clark <[email protected]>

In the next patch, we start having more than a single potential failure
reason.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
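For illustration only, a userspace sketch of the ERR_PTR() convention
this patch switches to. The three helpers are simplified stand-ins for
the kernel helpers of the same names, and struct job / job_create() are
hypothetical. The point is that one pointer return value can carry
several distinct failure reasons.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified userspace stand-ins for the kernel helpers. */
void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
int IS_ERR(const void *ptr)
{
	/* the top MAX_ERRNO addresses are reserved for encoded errnos */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct job {
	size_t nr_items;
	int items[];
};

/* Hypothetical constructor with two different failure reasons. */
struct job *job_create(size_t nr_items, size_t limit)
{
	struct job *job;

	if (nr_items > limit)
		return ERR_PTR(-EINVAL);

	job = calloc(1, sizeof(*job) + nr_items * sizeof(job->items[0]));
	if (!job)
		return ERR_PTR(-ENOMEM);

	job->nr_items = nr_items;
	return job;
}
```

A caller checks IS_ERR() rather than comparing against NULL and
propagates PTR_ERR() unchanged, which is the same shape as the
msm_ioctl_gem_submit() hunk below.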
drivers/gpu/drm/msm/msm_gem_submit.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index f6f595aae2c5..f570155bc086 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -32,30 +32,27 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
uint32_t nr_cmds)
{
struct msm_gem_submit *submit;
- uint64_t sz = struct_size(submit, bos, nr_bos) +
- ((u64)nr_cmds * sizeof(submit->cmd[0]));
+ uint64_t sz;
+
+ sz = struct_size(submit, bos, nr_bos) +
+ ((u64)nr_cmds * sizeof(submit->cmd[0]));

if (sz > SIZE_MAX)
- return NULL;
+ return ERR_PTR(-ENOMEM);

- submit = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+ submit = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
if (!submit)
- return NULL;
+ return ERR_PTR(-ENOMEM);

kref_init(&submit->ref);
submit->dev = dev;
submit->aspace = queue->ctx->aspace;
submit->gpu = gpu;
- submit->fence = NULL;
submit->cmd = (void *)&submit->bos[nr_bos];
submit->queue = queue;
submit->ring = gpu->rb[queue->prio];
submit->fault_dumped = false;

- /* initially, until copy_from_user() and bo lookup succeeds: */
- submit->nr_bos = 0;
- submit->nr_cmds = 0;
-
INIT_LIST_HEAD(&submit->node);
INIT_LIST_HEAD(&submit->bo_list);

@@ -799,8 +796,8 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,

submit = submit_create(dev, gpu, queue, args->nr_bos,
args->nr_cmds);
- if (!submit) {
- ret = -ENOMEM;
+ if (IS_ERR(submit)) {
+ ret = PTR_ERR(submit);
goto out_unlock;
}

--
2.31.1


2021-07-28 01:05:17

by Rob Clark

Subject: [PATCH v4 10/13] drm/msm: Drop submit bo_list

From: Rob Clark <[email protected]>

This was only used to detect userspace including the same bo multiple
times in a submit. But ww_mutex can already tell us this: locking a bo
that is already locked under the same acquire ticket fails with
-EALREADY.

When we drop struct_mutex around the submit ioctl, we'd otherwise need
to lock the bo before adding it to the bo_list. Since the lock itself
already detects the duplicate, it is simpler just to remove the bo_list.

Signed-off-by: Rob Clark <[email protected]>
---
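For illustration only, a toy userspace model of duplicate detection via
-EALREADY. toy_lock() is a hypothetical stand-in for a ww_mutex-style
lock: it only tracks the owning context, and it returns -EBUSY where a
real ww_mutex would sleep or ask the caller to back off.

```c
#include <errno.h>
#include <stddef.h>

struct toy_ctx { int unused; };			/* stands in for the acquire ticket */
struct toy_lock { const struct toy_ctx *owner; };

int toy_lock(struct toy_lock *l, const struct toy_ctx *ctx)
{
	if (l->owner == ctx)
		return -EALREADY;	/* same context locked it before */
	if (l->owner)
		return -EBUSY;		/* toy simplification, see above */
	l->owner = ctx;
	return 0;
}

/* Lock every entry of the table, or none of them. */
int lock_all(struct toy_lock **locks, int n, const struct toy_ctx *ctx)
{
	int i, ret = 0;

	for (i = 0; i < n; i++) {
		ret = toy_lock(locks[i], ctx);
		if (ret)
			break;
	}
	if (!ret)
		return 0;

	if (ret == -EALREADY)
		ret = -EINVAL;	/* same object listed twice in the table */

	/* unwind the locks taken before the failing index */
	while (--i >= 0)
		locks[i]->owner = NULL;

	return ret;
}
```

No per-object list membership is needed: the second attempt to lock the
same object is what reports the duplicate.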
drivers/gpu/drm/msm/msm_gem.c | 1 -
drivers/gpu/drm/msm/msm_gem.h | 8 --------
drivers/gpu/drm/msm/msm_gem_submit.c | 28 +++++++++++++---------------
3 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index a527a6b1d6ba..af199ef53d2f 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -1151,7 +1151,6 @@ static int msm_gem_new_impl(struct drm_device *dev,
msm_obj->flags = flags;
msm_obj->madv = MSM_MADV_WILLNEED;

- INIT_LIST_HEAD(&msm_obj->submit_entry);
INIT_LIST_HEAD(&msm_obj->vmas);

*obj = &msm_obj->base;
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index a48114058ff9..f9e3ffb2309a 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -88,13 +88,6 @@ struct msm_gem_object {
*/
struct list_head mm_list;

- /* Transiently in the process of submit ioctl, objects associated
- * with the submit are on submit->bo_list.. this only lasts for
- * the duration of the ioctl, so one bo can never be on multiple
- * submit lists.
- */
- struct list_head submit_entry;
-
struct page **pages;
struct sg_table *sgt;
void *vaddr;
@@ -316,7 +309,6 @@ struct msm_gem_submit {
struct msm_gpu *gpu;
struct msm_gem_address_space *aspace;
struct list_head node; /* node in ring submit list */
- struct list_head bo_list;
struct ww_acquire_ctx ticket;
uint32_t seqno; /* Sequence number of the submit on the ring */

diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 2b158433a6e5..e11e4bb63695 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -63,7 +63,6 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
submit->fault_dumped = false;

INIT_LIST_HEAD(&submit->node);
- INIT_LIST_HEAD(&submit->bo_list);

return submit;
}
@@ -143,7 +142,6 @@ static int submit_lookup_objects(struct msm_gem_submit *submit,

for (i = 0; i < args->nr_bos; i++) {
struct drm_gem_object *obj;
- struct msm_gem_object *msm_obj;

/* normally use drm_gem_object_lookup(), but for bulk lookup
* all under single table_lock just hit object_idr directly:
@@ -155,20 +153,9 @@ static int submit_lookup_objects(struct msm_gem_submit *submit,
goto out_unlock;
}

- msm_obj = to_msm_bo(obj);
-
- if (!list_empty(&msm_obj->submit_entry)) {
- DRM_ERROR("handle %u at index %u already on submit list\n",
- submit->bos[i].handle, i);
- ret = -EINVAL;
- goto out_unlock;
- }
-
drm_gem_object_get(obj);

- submit->bos[i].obj = msm_obj;
-
- list_add_tail(&msm_obj->submit_entry, &submit->bo_list);
+ submit->bos[i].obj = to_msm_bo(obj);
}

out_unlock:
@@ -299,6 +286,12 @@ static int submit_lock_objects(struct msm_gem_submit *submit)
return 0;

fail:
+ if (ret == -EALREADY) {
+ DRM_ERROR("handle %u at index %u already on submit list\n",
+ submit->bos[i].handle, i);
+ ret = -EINVAL;
+ }
+
for (; i >= 0; i--)
submit_unlock_unpin_bo(submit, i);

@@ -315,6 +308,12 @@ static int submit_lock_objects(struct msm_gem_submit *submit)
slow_locked = contended;
goto retry;
}
+
+ /* Not expecting -EALREADY here, if the bo was already
+ * locked, we should have gotten -EALREADY already from
+ * the dma_resv_lock_interruptible() call.
+ */
+ WARN_ON_ONCE(ret == -EALREADY);
}

return ret;
@@ -508,7 +507,6 @@ static void submit_cleanup(struct msm_gem_submit *submit, bool error)
for (i = 0; i < submit->nr_bos; i++) {
struct msm_gem_object *msm_obj = submit->bos[i].obj;
submit_cleanup_bo(submit, i, cleanup_flags);
- list_del_init(&msm_obj->submit_entry);
if (error)
drm_gem_object_put(&msm_obj->base);
}
--
2.31.1


2021-07-28 01:05:55

by Rob Clark

Subject: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

From: Rob Clark <[email protected]>

The drm/scheduler provides additional prioritization on top of that
provided by the ringbuffers (each with its own priority level)
supported on a given generation. Expose the additional levels of
priority to userspace, and map the userspace priority back to a ring
(first level of priority) and a scheduler priority (additional
priority levels within the ring).

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
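For illustration only, a userspace model of the priority mapping
described above. It assumes three usable scheduler levels numbered 0..2
(MIN, NORMAL, HIGH), so NR_SCHED_PRIORITIES is 3. The enum values and
the names convert_priority() and default_priority() are stand-ins for
this sketch, not the kernel definitions.

```c
#include <errno.h>

/* Assumption: three usable scheduler levels, higher value = higher priority. */
enum toy_sched_prio { TOY_PRIO_MIN = 0, TOY_PRIO_NORMAL = 1, TOY_PRIO_HIGH = 2 };
#define NR_SCHED_PRIORITIES (1 + TOY_PRIO_HIGH - TOY_PRIO_MIN)

/* Userspace priority: lower numeric value = higher priority. */
int convert_priority(unsigned int nr_rings, unsigned int prio,
		     unsigned int *ring_nr, unsigned int *sched_prio)
{
	unsigned int rn = prio / NR_SCHED_PRIORITIES;
	unsigned int sp = prio % NR_SCHED_PRIORITIES;

	if (rn >= nr_rings)
		return -EINVAL;

	*ring_nr = rn;
	/* invert to the scheduler's higher-numeric-is-higher convention */
	*sched_prio = NR_SCHED_PRIORITIES - sp - 1;
	return 0;
}

/* Default level: the middle of the range, rounded up (toward lower priority). */
unsigned int default_priority(unsigned int nr_rings)
{
	unsigned int max_priority = nr_rings * NR_SCHED_PRIORITIES - 1;

	return (max_priority + 1) / 2;	/* same as DIV_ROUND_UP(max_priority, 2) */
}
```

Under these assumptions a single-ring GPU gets priorities 0..2 with a
default of 1 (ring 0, NORMAL), and a four-ring GPU gets priorities
0..11 with a default of 6 (ring 2, HIGH).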
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
include/uapi/drm/msm_drm.h | 14 +++++-
5 files changed, 88 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index bad4809b68ef..748665232d29 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
return ret;
}
return -EINVAL;
- case MSM_PARAM_NR_RINGS:
- *value = gpu->nr_rings;
+ case MSM_PARAM_PRIORITIES:
+ *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
return 0;
case MSM_PARAM_PP_PGTABLE:
*value = 0;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 450efe59abb5..c2ecec5b11c4 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
submit->gpu = gpu;
submit->cmd = (void *)&submit->bos[nr_bos];
submit->queue = queue;
- submit->ring = gpu->rb[queue->prio];
+ submit->ring = gpu->rb[queue->ring_nr];
submit->fault_dumped = false;

INIT_LIST_HEAD(&submit->node);
@@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
/* Get a unique identifier for the submission for logging purposes */
submitid = atomic_inc_return(&ident) - 1;

- ring = gpu->rb[queue->prio];
+ ring = gpu->rb[queue->ring_nr];
trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
args->nr_bos, args->nr_cmds);

diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index b912cacaecc0..0e4b45bff2e6 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
const char *name;
};

+/*
+ * The number of priority levels provided by drm gpu scheduler. The
+ * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
+ * cases, so we don't use it (no need for kernel generated jobs).
+ */
+#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
+
+/**
+ * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
+ *
+ * @gpu: the gpu instance
+ * @prio: the userspace priority level
+ * @ring_nr: [out] the ringbuffer the userspace priority maps to
+ * @sched_prio: [out] the gpu scheduler priority level which the userspace
+ * priority maps to
+ *
+ * With drm/scheduler providing its own level of prioritization, our total
+ * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
+ * Each ring is associated with its own scheduler instance. However, our
+ * UABI is that lower numerical values are higher priority. So mapping the
+ * single userspace priority level into ring_nr and sched_prio takes some
+ * care. The userspace provided priority (when a submitqueue is created)
+ * is mapped to ring nr and scheduler priority as such:
+ *
+ * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
+ * sched_prio = NR_SCHED_PRIORITIES -
+ * (userspace_prio % NR_SCHED_PRIORITIES) - 1
+ *
+ * This allows generations without preemption (nr_rings==1) to have some
+ * amount of prioritization, and provides more priority levels for gens
+ * that do have preemption.
+ */
+static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio,
+ unsigned *ring_nr, enum drm_sched_priority *sched_prio)
+{
+ unsigned rn, sp;
+
+ rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp);
+
+ /* invert sched priority to map to higher-numeric-is-higher-
+ * priority convention
+ */
+ sp = NR_SCHED_PRIORITIES - sp - 1;
+
+ if (rn >= gpu->nr_rings)
+ return -EINVAL;
+
+ *ring_nr = rn;
+ *sched_prio = sp;
+
+ return 0;
+}
+
/**
* A submitqueue is associated with a gl context or vk queue (or equiv)
* in userspace.
@@ -257,7 +310,8 @@ struct msm_gpu_perfcntr {
* @id: userspace id for the submitqueue, unique within the drm_file
* @flags: userspace flags for the submitqueue, specified at creation
* (currently unusued)
- * @prio: the submitqueue priority
+ * @ring_nr: the ringbuffer used by this submitqueue, which is determined
+ * by the submitqueue's priority
* @faults: the number of GPU hangs associated with this submitqueue
* @ctx: the per-drm_file context associated with the submitqueue (ie.
* which set of pgtables do submits jobs associated with the
@@ -272,7 +326,7 @@ struct msm_gpu_perfcntr {
struct msm_gpu_submitqueue {
int id;
u32 flags;
- u32 prio;
+ u32 ring_nr;
int faults;
struct msm_file_private *ctx;
struct list_head node;
diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
index 682ba2a7c0ec..32a55d81b58b 100644
--- a/drivers/gpu/drm/msm/msm_submitqueue.c
+++ b/drivers/gpu/drm/msm/msm_submitqueue.c
@@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
struct msm_gpu_submitqueue *queue;
struct msm_ringbuffer *ring;
struct drm_gpu_scheduler *sched;
+ enum drm_sched_priority sched_prio;
+ unsigned ring_nr;
int ret;

if (!ctx)
@@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
if (!priv->gpu)
return -ENODEV;

- if (prio >= priv->gpu->nr_rings)
- return -EINVAL;
+ ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio);
+ if (ret)
+ return ret;

queue = kzalloc(sizeof(*queue), GFP_KERNEL);

@@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,

kref_init(&queue->ref);
queue->flags = flags;
- queue->prio = prio;
+ queue->ring_nr = ring_nr;

- ring = priv->gpu->rb[prio];
+ ring = priv->gpu->rb[ring_nr];
sched = &ring->sched;

- /*
- * TODO we can allow more priorities than we have ringbuffers by
- * mapping:
- *
- * ring = prio / 3;
- * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3);
- *
- * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is
- * treated specially in places.
- */
ret = drm_sched_entity_init(&queue->entity,
- DRM_SCHED_PRIORITY_NORMAL,
- &sched, 1, NULL);
+ sched_prio, &sched, 1, NULL);
if (ret) {
kfree(queue);
return ret;
@@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
{
struct msm_drm_private *priv = drm->dev_private;
- int default_prio;
+ int default_prio, max_priority;

if (!priv->gpu)
return -ENODEV;

+ max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1;
+
/*
- * Select priority 2 as the "default priority" unless nr_rings is less
- * than 2 and then pick the lowest priority
+ * Pick a medium priority level as default. Lower numeric value is
+ * higher priority, so round-up to pick a priority that is not higher
+ * than the middle priority level.
*/
- default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1);
+ default_prio = DIV_ROUND_UP(max_priority, 2);

INIT_LIST_HEAD(&ctx->submitqueues);

diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
index f075851021c3..6b8fffc28a50 100644
--- a/include/uapi/drm/msm_drm.h
+++ b/include/uapi/drm/msm_drm.h
@@ -73,11 +73,19 @@ struct drm_msm_timespec {
#define MSM_PARAM_MAX_FREQ 0x04
#define MSM_PARAM_TIMESTAMP 0x05
#define MSM_PARAM_GMEM_BASE 0x06
-#define MSM_PARAM_NR_RINGS 0x07
+#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */
#define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */
#define MSM_PARAM_FAULTS 0x09
#define MSM_PARAM_SUSPENDS 0x0a

+/* For backwards compat. The original support for preemption was based on
+ * a single ring per priority level so # of priority levels equals the #
+ * of rings. With drm/scheduler providing additional levels of priority,
+ * the number of priorities is greater than the # of rings. The param is
+ * renamed to better reflect this.
+ */
+#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES
+
struct drm_msm_param {
__u32 pipe; /* in, MSM_PIPE_x */
__u32 param; /* in, MSM_PARAM_x */
@@ -304,6 +312,10 @@ struct drm_msm_gem_madvise {

#define MSM_SUBMITQUEUE_FLAGS (0)

+/*
+ * The submitqueue priority should be between 0 and the value reported by
+ * MSM_PARAM_PRIORITIES minus 1. A lower numeric value is higher priority.
+ */
struct drm_msm_submitqueue {
__u32 flags; /* in, MSM_SUBMITQUEUE_x */
__u32 prio; /* in, Priority level */
--
2.31.1


2021-07-28 01:06:49

by Rob Clark

Subject: [PATCH v4 11/13] drm/msm: Drop struct_mutex in submit path

From: Rob Clark <[email protected]>

It is sufficient to serialize on the submit queue now.

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
drivers/gpu/drm/msm/msm_gem_submit.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index e11e4bb63695..450efe59abb5 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -709,7 +709,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
struct msm_drm_private *priv = dev->dev_private;
struct drm_msm_gem_submit *args = data;
struct msm_file_private *ctx = file->driver_priv;
- struct msm_gem_submit *submit;
+ struct msm_gem_submit *submit = NULL;
struct msm_gpu *gpu = priv->gpu;
struct msm_gpu_submitqueue *queue;
struct msm_ringbuffer *ring;
@@ -753,7 +753,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
args->nr_bos, args->nr_cmds);

- ret = mutex_lock_interruptible(&dev->struct_mutex);
+ ret = mutex_lock_interruptible(&queue->lock);
if (ret)
goto out_post_unlock;

@@ -874,10 +874,8 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
* Allocate an id which can be used by WAIT_FENCE ioctl to map back
* to the underlying fence.
*/
- mutex_lock(&queue->lock);
submit->fence_id = idr_alloc_cyclic(&queue->fence_idr,
submit->user_fence, 0, INT_MAX, GFP_KERNEL);
- mutex_unlock(&queue->lock);
if (submit->fence_id < 0) {
ret = submit->fence_id;
submit->fence_id = 0;
@@ -912,12 +910,12 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
submit_cleanup(submit, !!ret);
if (has_ww_ticket)
ww_acquire_fini(&submit->ticket);
- msm_gem_submit_put(submit);
out_unlock:
if (ret && (out_fence_fd >= 0))
put_unused_fd(out_fence_fd);
- mutex_unlock(&dev->struct_mutex);
-
+ mutex_unlock(&queue->lock);
+ if (submit)
+ msm_gem_submit_put(submit);
out_post_unlock:
if (!IS_ERR_OR_NULL(post_deps)) {
for (i = 0; i < args->nr_out_syncobjs; ++i) {
--
2.31.1


2021-07-28 01:07:07

by Rob Clark

Subject: [PATCH v4 13/13] drm/msm/gem: Mark active before pinning

From: Rob Clark <[email protected]>

Mark all the bos in the submit as active, before pinning, to prevent
evicting a buffer in the same submit to make room for a buffer earlier
in the table.

Signed-off-by: Rob Clark <[email protected]>
---
drivers/gpu/drm/msm/msm_gem.c | 2 --
drivers/gpu/drm/msm/msm_gem_submit.c | 28 ++++++++++++++++++++--------
2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index af199ef53d2f..15b1804fa64e 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -131,7 +131,6 @@ static struct page **get_pages(struct drm_gem_object *obj)
if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
sync_for_device(msm_obj);

- GEM_WARN_ON(msm_obj->active_count);
update_inactive(msm_obj);
}

@@ -815,7 +814,6 @@ void msm_gem_active_get(struct drm_gem_object *obj, struct msm_gpu *gpu)
GEM_WARN_ON(!msm_gem_is_locked(obj));
GEM_WARN_ON(msm_obj->madv != MSM_MADV_WILLNEED);
GEM_WARN_ON(msm_obj->dontneed);
- GEM_WARN_ON(!msm_obj->sgt);

if (msm_obj->active_count++ == 0) {
mutex_lock(&priv->mm_lock);
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index c2ecec5b11c4..fc25a85eb1ca 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -24,7 +24,8 @@
/* make sure these don't conflict w/ MSM_SUBMIT_BO_x */
#define BO_VALID 0x8000 /* is current addr in cmdstream correct/valid? */
#define BO_LOCKED 0x4000 /* obj lock is held */
-#define BO_PINNED 0x2000 /* obj is pinned and on active list */
+#define BO_ACTIVE 0x2000 /* active refcnt is held */
+#define BO_PINNED 0x1000 /* obj is pinned and on active list */

static struct msm_gem_submit *submit_create(struct drm_device *dev,
struct msm_gpu *gpu,
@@ -239,10 +240,11 @@ static void submit_cleanup_bo(struct msm_gem_submit *submit, int i,
struct drm_gem_object *obj = &submit->bos[i].obj->base;
unsigned flags = submit->bos[i].flags & cleanup_flags;

- if (flags & BO_PINNED) {
+ if (flags & BO_PINNED)
msm_gem_unpin_iova_locked(obj, submit->aspace);
+
+ if (flags & BO_ACTIVE)
msm_gem_active_put(obj);
- }

if (flags & BO_LOCKED)
dma_resv_unlock(obj->resv);
@@ -252,7 +254,7 @@ static void submit_cleanup_bo(struct msm_gem_submit *submit, int i,

static void submit_unlock_unpin_bo(struct msm_gem_submit *submit, int i)
{
- submit_cleanup_bo(submit, i, BO_PINNED | BO_LOCKED);
+ submit_cleanup_bo(submit, i, BO_PINNED | BO_ACTIVE | BO_LOCKED);

if (!(submit->bos[i].flags & BO_VALID))
submit->bos[i].iova = 0;
@@ -356,6 +358,18 @@ static int submit_pin_objects(struct msm_gem_submit *submit)

submit->valid = true;

+ /*
+ * Increment active_count first, so if under memory pressure, we
+ * don't inadvertently evict a bo needed by the submit in order
+ * to pin an earlier bo in the same submit.
+ */
+ for (i = 0; i < submit->nr_bos; i++) {
+ struct drm_gem_object *obj = &submit->bos[i].obj->base;
+
+ msm_gem_active_get(obj, submit->gpu);
+ submit->bos[i].flags |= BO_ACTIVE;
+ }
+
for (i = 0; i < submit->nr_bos; i++) {
struct drm_gem_object *obj = &submit->bos[i].obj->base;
uint64_t iova;
@@ -367,8 +381,6 @@ static int submit_pin_objects(struct msm_gem_submit *submit)
if (ret)
break;

- msm_gem_active_get(obj, submit->gpu);
-
submit->bos[i].flags |= BO_PINNED;

if (iova == submit->bos[i].iova) {
@@ -502,7 +514,7 @@ static void submit_cleanup(struct msm_gem_submit *submit, bool error)
unsigned i;

if (error)
- cleanup_flags |= BO_PINNED;
+ cleanup_flags |= BO_PINNED | BO_ACTIVE;

for (i = 0; i < submit->nr_bos; i++) {
struct msm_gem_object *msm_obj = submit->bos[i].obj;
@@ -520,7 +532,7 @@ void msm_submit_retire(struct msm_gem_submit *submit)
struct drm_gem_object *obj = &submit->bos[i].obj->base;

msm_gem_lock(obj);
- submit_cleanup_bo(submit, i, BO_PINNED);
+ submit_cleanup_bo(submit, i, BO_PINNED | BO_ACTIVE);
msm_gem_unlock(obj);
drm_gem_object_put(obj);
}
--
2.31.1


2021-07-28 01:07:23

by Rob Clark

Subject: [PATCH v4 09/13] drm/msm: Conversion to drm scheduler

From: Rob Clark <[email protected]>

For existing adrenos, there are one or more ringbuffers, depending on
whether preemption is supported. When preemption is supported, each
ringbuffer has its own priority. A submitqueue (which maps to a
gl context or vk queue in userspace) is mapped to a specific
ringbuffer at creation time, based on the submitqueue's priority.

Each ringbuffer has its own drm_gpu_scheduler. Each submitqueue
maps to a drm_sched_entity. And each submit maps to a drm_sched_job.

Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/4
Signed-off-by: Rob Clark <[email protected]>
Acked-by: Christian König <[email protected]>
---
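The mapping described in the commit message above (ringbuffer to
scheduler, submitqueue to entity, submit to job) can be sketched in
plain userspace C. This is only an illustration under stated
assumptions: every toy_* type and function below is a hypothetical
stand-in, not the real drm_gpu_scheduler, drm_sched_entity or
drm_sched_job API, and only the embedding and pointer relationships
are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler objects; only nesting matters. */
struct toy_sched  { const char *name; };
struct toy_entity { struct toy_sched *sched; };
struct toy_job    { struct toy_entity *entity; };

/* One scheduler per ring, one entity per submitqueue, one job per submit. */
struct toy_ring   { int id; struct toy_sched sched; };
struct toy_sq     { int prio; struct toy_entity entity; };
struct toy_submit { struct toy_job base; struct toy_sq *queue; };

/* Simplified container_of: recover the outer struct from a member pointer. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Mirrors the idea of to_msm_submit(): job -> enclosing submit. */
static struct toy_submit *toy_to_submit(struct toy_job *job)
{
	assert(job != NULL);
	return toy_container_of(job, struct toy_submit, base);
}

/* A queue is bound to exactly one ring's scheduler when it is created. */
static void toy_sq_init(struct toy_sq *q, struct toy_ring *rings, int prio)
{
	q->prio = prio;
	q->entity.sched = &rings[prio].sched;
}

/* A submit is queued through its submitqueue's entity. */
static void toy_submit_init(struct toy_submit *s, struct toy_sq *q)
{
	s->queue = q;
	s->base.entity = &q->entity;
}

/* Walk submit -> entity -> scheduler -> ring. */
static struct toy_ring *toy_submit_ring(struct toy_submit *s)
{
	return toy_container_of(s->base.entity->sched, struct toy_ring, sched);
}
```

In this sketch the ring a submit ends up on is fixed by the queue it
was created against, which matches the "mapped at creation time"
wording above.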
drivers/gpu/drm/msm/Kconfig | 1 +
drivers/gpu/drm/msm/msm_gem.c | 35 ------
drivers/gpu/drm/msm/msm_gem.h | 26 ++++-
drivers/gpu/drm/msm/msm_gem_submit.c | 161 +++++++++++++-------------
drivers/gpu/drm/msm/msm_gpu.c | 13 +--
drivers/gpu/drm/msm/msm_gpu.h | 2 +
drivers/gpu/drm/msm/msm_rd.c | 6 +-
drivers/gpu/drm/msm/msm_ringbuffer.c | 66 +++++++++++
drivers/gpu/drm/msm/msm_ringbuffer.h | 12 ++
drivers/gpu/drm/msm/msm_submitqueue.c | 26 +++++
10 files changed, 217 insertions(+), 131 deletions(-)

diff --git a/drivers/gpu/drm/msm/Kconfig b/drivers/gpu/drm/msm/Kconfig
index 52536e7adb95..dc7f3e40850b 100644
--- a/drivers/gpu/drm/msm/Kconfig
+++ b/drivers/gpu/drm/msm/Kconfig
@@ -14,6 +14,7 @@ config DRM_MSM
select REGULATOR
select DRM_KMS_HELPER
select DRM_PANEL
+ select DRM_SCHED
select SHMEM
select TMPFS
select QCOM_SCM if ARCH_QCOM
diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 4e99c448b83a..a527a6b1d6ba 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -806,41 +806,6 @@ void msm_gem_vunmap(struct drm_gem_object *obj)
msm_obj->vaddr = NULL;
}

-/* must be called before _move_to_active().. */
-int msm_gem_sync_object(struct drm_gem_object *obj,
- struct msm_fence_context *fctx, bool exclusive)
-{
- struct dma_resv_list *fobj;
- struct dma_fence *fence;
- int i, ret;
-
- fobj = dma_resv_shared_list(obj->resv);
- if (!fobj || (fobj->shared_count == 0)) {
- fence = dma_resv_excl_fence(obj->resv);
- /* don't need to wait on our own fences, since ring is fifo */
- if (fence && (fence->context != fctx->context)) {
- ret = dma_fence_wait(fence, true);
- if (ret)
- return ret;
- }
- }
-
- if (!exclusive || !fobj)
- return 0;
-
- for (i = 0; i < fobj->shared_count; i++) {
- fence = rcu_dereference_protected(fobj->shared[i],
- dma_resv_held(obj->resv));
- if (fence->context != fctx->context) {
- ret = dma_fence_wait(fence, true);
- if (ret)
- return ret;
- }
- }
-
- return 0;
-}
-
void msm_gem_active_get(struct drm_gem_object *obj, struct msm_gpu *gpu)
{
struct msm_gem_object *msm_obj = to_msm_bo(obj);
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index e0579abda5b9..a48114058ff9 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -9,6 +9,7 @@

#include <linux/kref.h>
#include <linux/dma-resv.h>
+#include "drm/gpu_scheduler.h"
#include "msm_drv.h"

/* Make all GEM related WARN_ON()s ratelimited.. when things go wrong they
@@ -143,8 +144,6 @@ void *msm_gem_get_vaddr_active(struct drm_gem_object *obj);
void msm_gem_put_vaddr_locked(struct drm_gem_object *obj);
void msm_gem_put_vaddr(struct drm_gem_object *obj);
int msm_gem_madvise(struct drm_gem_object *obj, unsigned madv);
-int msm_gem_sync_object(struct drm_gem_object *obj,
- struct msm_fence_context *fctx, bool exclusive);
void msm_gem_active_get(struct drm_gem_object *obj, struct msm_gpu *gpu);
void msm_gem_active_put(struct drm_gem_object *obj);
int msm_gem_cpu_prep(struct drm_gem_object *obj, uint32_t op, ktime_t *timeout);
@@ -311,6 +310,7 @@ void msm_gem_vunmap(struct drm_gem_object *obj);
* make it easier to unwind when things go wrong, etc).
*/
struct msm_gem_submit {
+ struct drm_sched_job base;
struct kref ref;
struct drm_device *dev;
struct msm_gpu *gpu;
@@ -319,7 +319,22 @@ struct msm_gem_submit {
struct list_head bo_list;
struct ww_acquire_ctx ticket;
uint32_t seqno; /* Sequence number of the submit on the ring */
- struct dma_fence *fence;
+
+ /* Array of struct dma_fence * to block on before submitting this job.
+ */
+ struct xarray deps;
+ unsigned long last_dep;
+
+ /* Hw fence, which is created when the scheduler executes the job, and
+ * is signaled when the hw finishes (via seqno write from cmdstream)
+ */
+ struct dma_fence *hw_fence;
+
+ /* Userspace visible fence, which is signaled by the scheduler after
+ * the hw_fence is signaled.
+ */
+ struct dma_fence *user_fence;
+
int fence_id; /* key into queue->fence_idr */
struct msm_gpu_submitqueue *queue;
struct pid *pid; /* submitting process */
@@ -350,6 +365,11 @@ struct msm_gem_submit {
} bos[];
};

+static inline struct msm_gem_submit *to_msm_submit(struct drm_sched_job *job)
+{
+ return container_of(job, struct msm_gem_submit, base);
+}
+
void __msm_gem_submit_destroy(struct kref *kref);

static inline void msm_gem_submit_get(struct msm_gem_submit *submit)
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index f570155bc086..2b158433a6e5 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -33,6 +33,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
{
struct msm_gem_submit *submit;
uint64_t sz;
+ int ret;

sz = struct_size(submit, bos, nr_bos) +
((u64)nr_cmds * sizeof(submit->cmd[0]));
@@ -44,6 +45,14 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
if (!submit)
return ERR_PTR(-ENOMEM);

+ ret = drm_sched_job_init(&submit->base, &queue->entity, queue);
+ if (ret) {
+ kfree(submit);
+ return ERR_PTR(ret);
+ }
+
+ xa_init_flags(&submit->deps, XA_FLAGS_ALLOC);
+
kref_init(&submit->ref);
submit->dev = dev;
submit->aspace = queue->ctx->aspace;
@@ -63,6 +72,8 @@ void __msm_gem_submit_destroy(struct kref *kref)
{
struct msm_gem_submit *submit =
container_of(kref, struct msm_gem_submit, ref);
+ unsigned long index;
+ struct dma_fence *fence;
unsigned i;

if (submit->fence_id) {
@@ -71,7 +82,14 @@ void __msm_gem_submit_destroy(struct kref *kref)
mutex_unlock(&submit->queue->lock);
}

- dma_fence_put(submit->fence);
+ xa_for_each (&submit->deps, index, fence) {
+ dma_fence_put(fence);
+ }
+
+ xa_destroy(&submit->deps);
+
+ dma_fence_put(submit->user_fence);
+ dma_fence_put(submit->hw_fence);

put_pid(submit->pid);
msm_submitqueue_put(submit->queue);
@@ -307,7 +325,7 @@ static int submit_fence_sync(struct msm_gem_submit *submit, bool no_implicit)
int i, ret = 0;

for (i = 0; i < submit->nr_bos; i++) {
- struct msm_gem_object *msm_obj = submit->bos[i].obj;
+ struct drm_gem_object *obj = &submit->bos[i].obj->base;
bool write = submit->bos[i].flags & MSM_SUBMIT_BO_WRITE;

if (!write) {
@@ -316,8 +334,7 @@ static int submit_fence_sync(struct msm_gem_submit *submit, bool no_implicit)
* strange place to call it. OTOH this is a
* convenient can-fail point to hook it in.
*/
- ret = dma_resv_reserve_shared(msm_obj->base.resv,
- 1);
+ ret = dma_resv_reserve_shared(obj->resv, 1);
if (ret)
return ret;
}
@@ -325,7 +342,7 @@ static int submit_fence_sync(struct msm_gem_submit *submit, bool no_implicit)
if (no_implicit)
continue;

- ret = msm_gem_sync_object(&msm_obj->base, submit->ring->fctx,
+ ret = drm_gem_fence_array_add_implicit(&submit->deps, obj,
write);
if (ret)
break;
@@ -376,9 +393,9 @@ static void submit_attach_object_fences(struct msm_gem_submit *submit)
struct drm_gem_object *obj = &submit->bos[i].obj->base;

if (submit->bos[i].flags & MSM_SUBMIT_BO_WRITE)
- dma_resv_add_excl_fence(obj->resv, submit->fence);
+ dma_resv_add_excl_fence(obj->resv, submit->user_fence);
else if (submit->bos[i].flags & MSM_SUBMIT_BO_READ)
- dma_resv_add_shared_fence(obj->resv, submit->fence);
+ dma_resv_add_shared_fence(obj->resv, submit->user_fence);
}
}

@@ -517,12 +534,12 @@ struct msm_submit_post_dep {
struct dma_fence_chain *chain;
};

-static struct drm_syncobj **msm_wait_deps(struct drm_device *dev,
- struct drm_file *file,
- uint64_t in_syncobjs_addr,
- uint32_t nr_in_syncobjs,
- size_t syncobj_stride,
- struct msm_ringbuffer *ring)
+static struct drm_syncobj **msm_parse_deps(struct msm_gem_submit *submit,
+ struct drm_file *file,
+ uint64_t in_syncobjs_addr,
+ uint32_t nr_in_syncobjs,
+ size_t syncobj_stride,
+ struct msm_ringbuffer *ring)
{
struct drm_syncobj **syncobjs = NULL;
struct drm_msm_gem_submit_syncobj syncobj_desc = {0};
@@ -546,7 +563,7 @@ static struct drm_syncobj **msm_wait_deps(struct drm_device *dev,
}

if (syncobj_desc.point &&
- !drm_core_check_feature(dev, DRIVER_SYNCOBJ_TIMELINE)) {
+ !drm_core_check_feature(submit->dev, DRIVER_SYNCOBJ_TIMELINE)) {
ret = -EOPNOTSUPP;
break;
}
@@ -561,10 +578,7 @@ static struct drm_syncobj **msm_wait_deps(struct drm_device *dev,
if (ret)
break;

- if (!dma_fence_match_context(fence, ring->fctx->context))
- ret = dma_fence_wait(fence, true);
-
- dma_fence_put(fence);
+ ret = drm_gem_fence_array_add(&submit->deps, fence);
if (ret)
break;

@@ -741,47 +755,6 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
args->nr_bos, args->nr_cmds);

- if (args->flags & MSM_SUBMIT_FENCE_FD_IN) {
- struct dma_fence *in_fence;
-
- in_fence = sync_file_get_fence(args->fence_fd);
-
- if (!in_fence)
- return -EINVAL;
-
- /*
- * Wait if the fence is from a foreign context, or if the fence
- * array contains any fence from a foreign context.
- */
- ret = 0;
- if (!dma_fence_match_context(in_fence, ring->fctx->context))
- ret = dma_fence_wait(in_fence, true);
-
- dma_fence_put(in_fence);
- if (ret)
- return ret;
- }
-
- if (args->flags & MSM_SUBMIT_SYNCOBJ_IN) {
- syncobjs_to_reset = msm_wait_deps(dev, file,
- args->in_syncobjs,
- args->nr_in_syncobjs,
- args->syncobj_stride, ring);
- if (IS_ERR(syncobjs_to_reset))
- return PTR_ERR(syncobjs_to_reset);
- }
-
- if (args->flags & MSM_SUBMIT_SYNCOBJ_OUT) {
- post_deps = msm_parse_post_deps(dev, file,
- args->out_syncobjs,
- args->nr_out_syncobjs,
- args->syncobj_stride);
- if (IS_ERR(post_deps)) {
- ret = PTR_ERR(post_deps);
- goto out_post_unlock;
- }
- }
-
ret = mutex_lock_interruptible(&dev->struct_mutex);
if (ret)
goto out_post_unlock;
@@ -807,22 +780,50 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
if (args->flags & MSM_SUBMIT_SUDO)
submit->in_rb = true;

+ if (args->flags & MSM_SUBMIT_FENCE_FD_IN) {
+ struct dma_fence *in_fence;
+
+ in_fence = sync_file_get_fence(args->fence_fd);
+
+ if (!in_fence) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ ret = drm_gem_fence_array_add(&submit->deps, in_fence);
+ if (ret)
+ goto out_unlock;
+ }
+
+ if (args->flags & MSM_SUBMIT_SYNCOBJ_IN) {
+ syncobjs_to_reset = msm_parse_deps(submit, file,
+ args->in_syncobjs,
+ args->nr_in_syncobjs,
+ args->syncobj_stride, ring);
+ if (IS_ERR(syncobjs_to_reset)) {
+ ret = PTR_ERR(syncobjs_to_reset);
+ goto out_unlock;
+ }
+ }
+
+ if (args->flags & MSM_SUBMIT_SYNCOBJ_OUT) {
+ post_deps = msm_parse_post_deps(dev, file,
+ args->out_syncobjs,
+ args->nr_out_syncobjs,
+ args->syncobj_stride);
+ if (IS_ERR(post_deps)) {
+ ret = PTR_ERR(post_deps);
+ goto out_unlock;
+ }
+ }
+
ret = submit_lookup_objects(submit, args, file);
if (ret)
- goto out_pre_pm;
+ goto out;

ret = submit_lookup_cmds(submit, args, file);
if (ret)
- goto out_pre_pm;
-
- /*
- * Thanks to dev_pm_opp opp_table_lock interactions with mm->mmap_sem
- * in the resume path, we need to to rpm get before we lock objs.
- * Which unfortunately might involve powering up the GPU sooner than
- * is necessary. But at least in the explicit fencing case, we will
- * have already done all the fence waiting.
- */
- pm_runtime_get_sync(&gpu->pdev->dev);
+ goto out;

/* copy_*_user while holding a ww ticket upsets lockdep */
ww_acquire_init(&submit->ticket, &reservation_ww_class);
@@ -869,12 +870,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,

submit->nr_cmds = i;

- submit->fence = msm_fence_alloc(ring->fctx);
- if (IS_ERR(submit->fence)) {
- ret = PTR_ERR(submit->fence);
- submit->fence = NULL;
- goto out;
- }
+ submit->user_fence = dma_fence_get(&submit->base.s_fence->finished);

/*
* Allocate an id which can be used by WAIT_FENCE ioctl to map back
@@ -882,7 +878,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
*/
mutex_lock(&queue->lock);
submit->fence_id = idr_alloc_cyclic(&queue->fence_idr,
- submit->fence, 0, INT_MAX, GFP_KERNEL);
+ submit->user_fence, 0, INT_MAX, GFP_KERNEL);
mutex_unlock(&queue->lock);
if (submit->fence_id < 0) {
ret = submit->fence_id = 0;
@@ -891,7 +887,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
}

if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
- struct sync_file *sync_file = sync_file_create(submit->fence);
+ struct sync_file *sync_file = sync_file_create(submit->user_fence);
if (!sync_file) {
ret = -ENOMEM;
goto out;
@@ -902,18 +898,19 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,

submit_attach_object_fences(submit);

- msm_gpu_submit(gpu, submit);
+ /* The scheduler owns a ref now: */
+ msm_gem_submit_get(submit);
+
+ drm_sched_entity_push_job(&submit->base, &queue->entity);

args->fence = submit->fence_id;

msm_reset_syncobjs(syncobjs_to_reset, args->nr_in_syncobjs);
msm_process_post_deps(post_deps, args->nr_out_syncobjs,
- submit->fence);
+ submit->user_fence);


out:
- pm_runtime_put(&gpu->pdev->dev);
-out_pre_pm:
submit_cleanup(submit, !!ret);
if (has_ww_ticket)
ww_acquire_fini(&submit->ticket);
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 5bfc4d24a956..8a3a592da3a4 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -176,8 +176,8 @@ static void update_fences(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
break;

msm_update_fence(submit->ring->fctx,
- submit->fence->seqno);
- dma_fence_signal(submit->fence);
+ submit->hw_fence->seqno);
+ dma_fence_signal(submit->hw_fence);
}
spin_unlock_irqrestore(&ring->submit_lock, flags);
}
@@ -380,10 +380,6 @@ static void recover_worker(struct kthread_work *work)
put_task_struct(task);
}

- /* msm_rd_dump_submit() needs bo locked to dump: */
- for (i = 0; i < submit->nr_bos; i++)
- msm_gem_lock(&submit->bos[i].obj->base);
-
if (comm && cmd) {
DRM_DEV_ERROR(dev->dev, "%s: offending task: %s (%s)\n",
gpu->name, comm, cmd);
@@ -393,9 +389,6 @@ static void recover_worker(struct kthread_work *work)
} else {
msm_rd_dump_submit(priv->hangrd, submit, NULL);
}
-
- for (i = 0; i < submit->nr_bos; i++)
- msm_gem_unlock(&submit->bos[i].obj->base);
}

/* Record the crash state */
@@ -704,7 +697,7 @@ static void retire_submits(struct msm_gpu *gpu)
* been signalled, then later submits are not signalled
* either, so we are also done.
*/
- if (submit && dma_fence_is_signaled(submit->fence)) {
+ if (submit && dma_fence_is_signaled(submit->hw_fence)) {
retire_submit(gpu, ring, submit);
} else {
break;
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 579627252540..b912cacaecc0 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -267,6 +267,7 @@ struct msm_gpu_perfcntr {
* seqno, protected by submitqueue lock
* @lock: submitqueue lock
* @ref: reference count
+ * @entity: the submit job-queue
*/
struct msm_gpu_submitqueue {
int id;
@@ -278,6 +279,7 @@ struct msm_gpu_submitqueue {
struct idr fence_idr;
struct mutex lock;
struct kref ref;
+ struct drm_sched_entity entity;
};

struct msm_gpu_state_bo {
diff --git a/drivers/gpu/drm/msm/msm_rd.c b/drivers/gpu/drm/msm/msm_rd.c
index 659e5cc4b40a..b55398a34fa4 100644
--- a/drivers/gpu/drm/msm/msm_rd.c
+++ b/drivers/gpu/drm/msm/msm_rd.c
@@ -325,15 +325,19 @@ static void snapshot_buf(struct msm_rd_state *rd,
if (!(submit->bos[idx].flags & MSM_SUBMIT_BO_READ))
return;

+ msm_gem_lock(&obj->base);
buf = msm_gem_get_vaddr_active(&obj->base);
if (IS_ERR(buf))
- return;
+ goto out_unlock;

buf += offset;

rd_write_section(rd, RD_BUFFER_CONTENTS, buf, size);

msm_gem_put_vaddr_locked(&obj->base);
+
+out_unlock:
+ msm_gem_unlock(&obj->base);
}

/* called under struct_mutex */
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 437cca57d005..5643f579ac46 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -7,10 +7,64 @@
#include "msm_ringbuffer.h"
#include "msm_gpu.h"

+/**
+ * The max # of jobs to write into the hw ringbuffer.
+ */
+static uint num_hw_submissions = 8;
+MODULE_PARM_DESC(num_hw_submissions, "The max number of HW submissions (default 8)");
+module_param(num_hw_submissions, uint, 0600);
+
+static struct dma_fence *msm_job_dependency(struct drm_sched_job *job,
+ struct drm_sched_entity *s_entity)
+{
+ struct msm_gem_submit *submit = to_msm_submit(job);
+
+ if (!xa_empty(&submit->deps))
+ return xa_erase(&submit->deps, submit->last_dep++);
+
+ return NULL;
+}
+
+static struct dma_fence *msm_job_run(struct drm_sched_job *job)
+{
+ struct msm_gem_submit *submit = to_msm_submit(job);
+ struct msm_gpu *gpu = submit->gpu;
+
+ submit->hw_fence = msm_fence_alloc(submit->ring->fctx);
+
+ pm_runtime_get_sync(&gpu->pdev->dev);
+
+ /* TODO move submit path over to using a per-ring lock.. */
+ mutex_lock(&gpu->dev->struct_mutex);
+
+ msm_gpu_submit(gpu, submit);
+
+ mutex_unlock(&gpu->dev->struct_mutex);
+
+ pm_runtime_put(&gpu->pdev->dev);
+
+ return dma_fence_get(submit->hw_fence);
+}
+
+static void msm_job_free(struct drm_sched_job *job)
+{
+ struct msm_gem_submit *submit = to_msm_submit(job);
+
+ drm_sched_job_cleanup(job);
+ msm_gem_submit_put(submit);
+}
+
+const struct drm_sched_backend_ops msm_sched_ops = {
+ .dependency = msm_job_dependency,
+ .run_job = msm_job_run,
+ .free_job = msm_job_free
+};
+
struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
void *memptrs, uint64_t memptrs_iova)
{
struct msm_ringbuffer *ring;
+ long sched_timeout;
char name[32];
int ret;

@@ -45,6 +99,16 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
ring->memptrs = memptrs;
ring->memptrs_iova = memptrs_iova;

+ /* currently managing hangcheck ourselves: */
+ sched_timeout = MAX_SCHEDULE_TIMEOUT;
+
+ ret = drm_sched_init(&ring->sched, &msm_sched_ops,
+ num_hw_submissions, 0, sched_timeout,
+ NULL, to_msm_bo(ring->bo)->name);
+ if (ret) {
+ goto fail;
+ }
+
INIT_LIST_HEAD(&ring->submits);
spin_lock_init(&ring->submit_lock);
spin_lock_init(&ring->preempt_lock);
@@ -65,6 +129,8 @@ void msm_ringbuffer_destroy(struct msm_ringbuffer *ring)
if (IS_ERR_OR_NULL(ring))
return;

+ drm_sched_fini(&ring->sched);
+
msm_fence_context_free(ring->fctx);

msm_gem_kernel_put(ring->bo, ring->gpu->aspace);
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.h b/drivers/gpu/drm/msm/msm_ringbuffer.h
index fe55d4a1aa16..d8c63df4e9ca 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.h
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.h
@@ -7,6 +7,7 @@
#ifndef __MSM_RINGBUFFER_H__
#define __MSM_RINGBUFFER_H__

+#include "drm/gpu_scheduler.h"
#include "msm_drv.h"

#define rbmemptr(ring, member) \
@@ -40,8 +41,19 @@ struct msm_ringbuffer {
struct drm_gem_object *bo;
uint32_t *start, *end, *cur, *next;

+ /*
+ * The job scheduler for this ring.
+ */
+ struct drm_gpu_scheduler sched;
+
/*
* List of in-flight submits on this ring. Protected by submit_lock.
+ *
+ * Currently just submits that are already written into the ring, not
+ * submits that are still in drm_gpu_scheduler's queues. At a later
+ * step we could probably move to letting drm_gpu_scheduler manage
+ * hangcheck detection and keep track of submit jobs that are in-
+ * flight.
*/
struct list_head submits;
spinlock_t submit_lock;
diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
index 66f8d0fb38b0..682ba2a7c0ec 100644
--- a/drivers/gpu/drm/msm/msm_submitqueue.c
+++ b/drivers/gpu/drm/msm/msm_submitqueue.c
@@ -14,6 +14,8 @@ void msm_submitqueue_destroy(struct kref *kref)

idr_destroy(&queue->fence_idr);

+ drm_sched_entity_destroy(&queue->entity);
+
msm_file_private_put(queue->ctx);

kfree(queue);
@@ -64,6 +66,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
{
struct msm_drm_private *priv = drm->dev_private;
struct msm_gpu_submitqueue *queue;
+ struct msm_ringbuffer *ring;
+ struct drm_gpu_scheduler *sched;
+ int ret;

if (!ctx)
return -ENODEV;
@@ -83,6 +88,27 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
queue->flags = flags;
queue->prio = prio;

+ ring = priv->gpu->rb[prio];
+ sched = &ring->sched;
+
+ /*
+ * TODO we can allow more priorities than we have ringbuffers by
+ * mapping:
+ *
+ * ring = prio / 3;
+ * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3);
+ *
+ * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is
+ * treated specially in places.
+ */
+ ret = drm_sched_entity_init(&queue->entity,
+ DRM_SCHED_PRIORITY_NORMAL,
+ &sched, 1, NULL);
+ if (ret) {
+ kfree(queue);
+ return ret;
+ }
+
write_lock(&ctx->queuelock);

queue->ctx = msm_file_private_get(ctx);
--
2.31.1


2021-11-10 15:33:39

by Akhil P Oommen

Subject: Re: [PATCH v4 07/13] drm/msm: Track "seqno" fences by idr

On 7/28/2021 6:36 AM, Rob Clark wrote:
> From: Rob Clark <[email protected]>
>
> Previously the (non-fd) fence returned from submit ioctl was a raw
> seqno, which is scoped to the ring. But from UABI standpoint, the
> ioctls related to seqno fences all specify a submitqueue. We can
> take advantage of that to replace the seqno fences with a cyclic idr
> handle.
>
> This is in preparation for moving to drm scheduler, at which point
> the submit ioctl will return after queuing the submit job to the
> scheduler, but before the submit is written into the ring (and
> therefore before a ring seqno has been assigned). Which means we
> need to replace the dma_fence that userspace may need to wait on
> with a scheduler fence.
>
> Signed-off-by: Rob Clark <[email protected]>
> Acked-by: Christian König <[email protected]>
> ---
> drivers/gpu/drm/msm/msm_drv.c | 30 +++++++++++++++++--
> drivers/gpu/drm/msm/msm_fence.c | 42 ---------------------------
> drivers/gpu/drm/msm/msm_fence.h | 3 --
> drivers/gpu/drm/msm/msm_gem.h | 1 +
> drivers/gpu/drm/msm/msm_gem_submit.c | 23 ++++++++++++++-
> drivers/gpu/drm/msm/msm_gpu.h | 5 ++++
> drivers/gpu/drm/msm/msm_submitqueue.c | 5 ++++
> 7 files changed, 61 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> index 9b8fa2ad0d84..1594ae39d54f 100644
> --- a/drivers/gpu/drm/msm/msm_drv.c
> +++ b/drivers/gpu/drm/msm/msm_drv.c
> @@ -911,6 +911,7 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
> ktime_t timeout = to_ktime(args->timeout);
> struct msm_gpu_submitqueue *queue;
> struct msm_gpu *gpu = priv->gpu;
> + struct dma_fence *fence;
> int ret;
>
> if (args->pad) {
> @@ -925,10 +926,35 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
> if (!queue)
> return -ENOENT;
>
> - ret = msm_wait_fence(gpu->rb[queue->prio]->fctx, args->fence, &timeout,
> - true);
> + /*
> + * Map submitqueue scoped "seqno" (which is actually an idr key)
> + * back to underlying dma-fence
> + *
> + * The fence is removed from the fence_idr when the submit is
> + * retired, so if the fence is not found it means there is nothing
> + * to wait for
> + */
> + ret = mutex_lock_interruptible(&queue->lock);
> + if (ret)
> + return ret;
> + fence = idr_find(&queue->fence_idr, args->fence);
> + if (fence)
> + fence = dma_fence_get_rcu(fence);
> + mutex_unlock(&queue->lock);
> +
> + if (!fence)
> + return 0;
>
> + ret = dma_fence_wait_timeout(fence, true, timeout_to_jiffies(&timeout));
> + if (ret == 0) {
> + ret = -ETIMEDOUT;
> + } else if (ret != -ERESTARTSYS) {
> + ret = 0;
> + }
> +
> + dma_fence_put(fence);
> msm_submitqueue_put(queue);
> +
> return ret;
> }
>
> diff --git a/drivers/gpu/drm/msm/msm_fence.c b/drivers/gpu/drm/msm/msm_fence.c
> index b92a9091a1e2..f2cece542c3f 100644
> --- a/drivers/gpu/drm/msm/msm_fence.c
> +++ b/drivers/gpu/drm/msm/msm_fence.c
> @@ -24,7 +24,6 @@ msm_fence_context_alloc(struct drm_device *dev, volatile uint32_t *fenceptr,
> strncpy(fctx->name, name, sizeof(fctx->name));
> fctx->context = dma_fence_context_alloc(1);
> fctx->fenceptr = fenceptr;
> - init_waitqueue_head(&fctx->event);
> spin_lock_init(&fctx->spinlock);
>
> return fctx;
> @@ -45,53 +44,12 @@ static inline bool fence_completed(struct msm_fence_context *fctx, uint32_t fenc
> (int32_t)(*fctx->fenceptr - fence) >= 0;
> }
>
> -/* legacy path for WAIT_FENCE ioctl: */
> -int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
> - ktime_t *timeout, bool interruptible)
> -{
> - int ret;
> -
> - if (fence > fctx->last_fence) {
> - DRM_ERROR_RATELIMITED("%s: waiting on invalid fence: %u (of %u)\n",
> - fctx->name, fence, fctx->last_fence);
> - return -EINVAL;

Rob, we changed this pre-existing behaviour in this patch. Now, when
userspace tries to wait on a future fence, we don't return an error.

I just want to check if this was accidental or not?

-Akhil.
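
For illustration, the difference being asked about can be modelled
with a small userspace sketch. It is not driver code: the toy_* names
are hypothetical, the idr is approximated by a fixed array, and the
timeout is treated as already expired. It assumes only what the quoted
hunks show, namely that the old path rejected a seqno beyond
last_fence with -EINVAL, while the new path returns 0 whenever the id
is not found in fence_idr.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-ring fence context (old scheme). */
struct toy_fctx {
	uint32_t last_fence;      /* last seqno handed out on the ring */
	uint32_t completed_fence; /* last seqno the hw has finished */
};

/* Old behaviour, no-wait case: a seqno from the future is an error. */
static int toy_old_wait(const struct toy_fctx *fctx, uint32_t fence)
{
	assert(fctx != NULL);
	if (fence > fctx->last_fence)
		return -EINVAL;
	/* Signed difference keeps the comparison wraparound-safe. */
	return (int32_t)(fctx->completed_fence - fence) >= 0 ? 0 : -EBUSY;
}

#define TOY_MAX_IDS 8

/* Hypothetical stand-in for queue->fence_idr (new scheme). */
struct toy_fence_table {
	bool present[TOY_MAX_IDS];  /* id currently allocated to a submit */
	bool signaled[TOY_MAX_IDS]; /* fence behind that id has signaled */
};

/* New behaviour: an unknown id means "nothing to wait for". */
static int toy_new_wait(const struct toy_fence_table *t, uint32_t id)
{
	assert(t != NULL);
	/* Retired ids and never-allocated ids are indistinguishable here. */
	if (id >= TOY_MAX_IDS || !t->present[id])
		return 0;
	return t->signaled[id] ? 0 : -ETIMEDOUT;
}
```

With this model, toy_old_wait() on a seqno larger than last_fence
gives -EINVAL, whereas toy_new_wait() on an id that was never
allocated gives 0, because a missing entry cannot be told apart from
a retired one.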

> - }
> -
> - if (!timeout) {
> - /* no-wait: */
> - ret = fence_completed(fctx, fence) ? 0 : -EBUSY;
> - } else {
> - unsigned long remaining_jiffies = timeout_to_jiffies(timeout);
> -
> - if (interruptible)
> - ret = wait_event_interruptible_timeout(fctx->event,
> - fence_completed(fctx, fence),
> - remaining_jiffies);
> - else
> - ret = wait_event_timeout(fctx->event,
> - fence_completed(fctx, fence),
> - remaining_jiffies);
> -
> - if (ret == 0) {
> - DBG("timeout waiting for fence: %u (completed: %u)",
> - fence, fctx->completed_fence);
> - ret = -ETIMEDOUT;
> - } else if (ret != -ERESTARTSYS) {
> - ret = 0;
> - }
> - }
> -
> - return ret;
> -}
> -
> /* called from workqueue */
> void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence)
> {
> spin_lock(&fctx->spinlock);
> fctx->completed_fence = max(fence, fctx->completed_fence);
> spin_unlock(&fctx->spinlock);
> -
> - wake_up_all(&fctx->event);
> }
>
> struct msm_fence {
> diff --git a/drivers/gpu/drm/msm/msm_fence.h b/drivers/gpu/drm/msm/msm_fence.h
> index 6ab97062ff1a..4783db528bcc 100644
> --- a/drivers/gpu/drm/msm/msm_fence.h
> +++ b/drivers/gpu/drm/msm/msm_fence.h
> @@ -49,7 +49,6 @@ struct msm_fence_context {
> */
> volatile uint32_t *fenceptr;
>
> - wait_queue_head_t event;
> spinlock_t spinlock;
> };
>
> @@ -57,8 +56,6 @@ struct msm_fence_context * msm_fence_context_alloc(struct drm_device *dev,
> volatile uint32_t *fenceptr, const char *name);
> void msm_fence_context_free(struct msm_fence_context *fctx);
>
> -int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
> - ktime_t *timeout, bool interruptible);
> void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence);
>
> struct dma_fence * msm_fence_alloc(struct msm_fence_context *fctx);
> diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> index da3af702a6c8..e0579abda5b9 100644
> --- a/drivers/gpu/drm/msm/msm_gem.h
> +++ b/drivers/gpu/drm/msm/msm_gem.h
> @@ -320,6 +320,7 @@ struct msm_gem_submit {
> struct ww_acquire_ctx ticket;
> uint32_t seqno; /* Sequence number of the submit on the ring */
> struct dma_fence *fence;
> + int fence_id; /* key into queue->fence_idr */
> struct msm_gpu_submitqueue *queue;
> struct pid *pid; /* submitting process */
> bool fault_dumped; /* Limit devcoredump dumping to one per submit */
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 4f02fa3c78f9..f6f595aae2c5 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -68,7 +68,14 @@ void __msm_gem_submit_destroy(struct kref *kref)
> container_of(kref, struct msm_gem_submit, ref);
> unsigned i;
>
> + if (submit->fence_id) {
> + mutex_lock(&submit->queue->lock);
> + idr_remove(&submit->queue->fence_idr, submit->fence_id);
> + mutex_unlock(&submit->queue->lock);
> + }
> +
> dma_fence_put(submit->fence);
> +
> put_pid(submit->pid);
> msm_submitqueue_put(submit->queue);
>
> @@ -872,6 +879,20 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> goto out;
> }
>
> + /*
> + * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> + * to the underlying fence.
> + */
> + mutex_lock(&queue->lock);
> + submit->fence_id = idr_alloc_cyclic(&queue->fence_idr,
> + submit->fence, 0, INT_MAX, GFP_KERNEL);
> + mutex_unlock(&queue->lock);
> + if (submit->fence_id < 0) {
> + ret = submit->fence_id = 0;
> + submit->fence_id = 0;
> + goto out;
> + }
> +
> if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> struct sync_file *sync_file = sync_file_create(submit->fence);
> if (!sync_file) {
> @@ -886,7 +907,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>
> msm_gpu_submit(gpu, submit);
>
> - args->fence = submit->fence->seqno;
> + args->fence = submit->fence_id;
>
> msm_reset_syncobjs(syncobjs_to_reset, args->nr_in_syncobjs);
> msm_process_post_deps(post_deps, args->nr_out_syncobjs,
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index 96efcb31e502..579627252540 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -263,6 +263,9 @@ struct msm_gpu_perfcntr {
> * which set of pgtables do submits jobs associated with the
> * submitqueue use)
> * @node: node in the context's list of submitqueues
> + * @fence_idr: maps fence-id to dma_fence for userspace visible fence
> + * seqno, protected by submitqueue lock
> + * @lock: submitqueue lock
> * @ref: reference count
> */
> struct msm_gpu_submitqueue {
> @@ -272,6 +275,8 @@ struct msm_gpu_submitqueue {
> int faults;
> struct msm_file_private *ctx;
> struct list_head node;
> + struct idr fence_idr;
> + struct mutex lock;
> struct kref ref;
> };
>
> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> index 9e9fec61d629..66f8d0fb38b0 100644
> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> @@ -12,6 +12,8 @@ void msm_submitqueue_destroy(struct kref *kref)
> struct msm_gpu_submitqueue *queue = container_of(kref,
> struct msm_gpu_submitqueue, ref);
>
> + idr_destroy(&queue->fence_idr);
> +
> msm_file_private_put(queue->ctx);
>
> kfree(queue);
> @@ -89,6 +91,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> if (id)
> *id = queue->id;
>
> + idr_init(&queue->fence_idr);
> + mutex_init(&queue->lock);
> +
> list_add_tail(&queue->node, &ctx->submitqueues);
>
> write_unlock(&ctx->queuelock);
>

2021-11-10 16:50:12

by Rob Clark

Subject: Re: [PATCH v4 07/13] drm/msm: Track "seqno" fences by idr

On Wed, Nov 10, 2021 at 7:28 AM Akhil P Oommen <[email protected]> wrote:
>
> On 7/28/2021 6:36 AM, Rob Clark wrote:
> > From: Rob Clark <[email protected]>
> >
> > Previously the (non-fd) fence returned from submit ioctl was a raw
> > seqno, which is scoped to the ring. But from UABI standpoint, the
> > ioctls related to seqno fences all specify a submitqueue. We can
> > take advantage of that to replace the seqno fences with a cyclic idr
> > handle.
> >
> > This is in preparation for moving to drm scheduler, at which point
> > the submit ioctl will return after queuing the submit job to the
> > scheduler, but before the submit is written into the ring (and
> > therefore before a ring seqno has been assigned). Which means we
> > need to replace the dma_fence that userspace may need to wait on
> > with a scheduler fence.
> >
> > Signed-off-by: Rob Clark <[email protected]>
> > Acked-by: Christian König <[email protected]>
> > ---
> > drivers/gpu/drm/msm/msm_drv.c | 30 +++++++++++++++++--
> > drivers/gpu/drm/msm/msm_fence.c | 42 ---------------------------
> > drivers/gpu/drm/msm/msm_fence.h | 3 --
> > drivers/gpu/drm/msm/msm_gem.h | 1 +
> > drivers/gpu/drm/msm/msm_gem_submit.c | 23 ++++++++++++++-
> > drivers/gpu/drm/msm/msm_gpu.h | 5 ++++
> > drivers/gpu/drm/msm/msm_submitqueue.c | 5 ++++
> > 7 files changed, 61 insertions(+), 48 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> > index 9b8fa2ad0d84..1594ae39d54f 100644
> > --- a/drivers/gpu/drm/msm/msm_drv.c
> > +++ b/drivers/gpu/drm/msm/msm_drv.c
> > @@ -911,6 +911,7 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
> > ktime_t timeout = to_ktime(args->timeout);
> > struct msm_gpu_submitqueue *queue;
> > struct msm_gpu *gpu = priv->gpu;
> > + struct dma_fence *fence;
> > int ret;
> >
> > if (args->pad) {
> > @@ -925,10 +926,35 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
> > if (!queue)
> > return -ENOENT;
> >
> > - ret = msm_wait_fence(gpu->rb[queue->prio]->fctx, args->fence, &timeout,
> > - true);
> > + /*
> > + * Map submitqueue scoped "seqno" (which is actually an idr key)
> > + * back to underlying dma-fence
> > + *
> > + * The fence is removed from the fence_idr when the submit is
> > + * retired, so if the fence is not found it means there is nothing
> > + * to wait for
> > + */
> > + ret = mutex_lock_interruptible(&queue->lock);
> > + if (ret)
> > + return ret;
> > + fence = idr_find(&queue->fence_idr, args->fence);
> > + if (fence)
> > + fence = dma_fence_get_rcu(fence);
> > + mutex_unlock(&queue->lock);
> > +
> > + if (!fence)
> > + return 0;
> >
> > + ret = dma_fence_wait_timeout(fence, true, timeout_to_jiffies(&timeout));
> > + if (ret == 0) {
> > + ret = -ETIMEDOUT;
> > + } else if (ret != -ERESTARTSYS) {
> > + ret = 0;
> > + }
> > +
> > + dma_fence_put(fence);
> > msm_submitqueue_put(queue);
> > +
> > return ret;
> > }
> >
> > diff --git a/drivers/gpu/drm/msm/msm_fence.c b/drivers/gpu/drm/msm/msm_fence.c
> > index b92a9091a1e2..f2cece542c3f 100644
> > --- a/drivers/gpu/drm/msm/msm_fence.c
> > +++ b/drivers/gpu/drm/msm/msm_fence.c
> > @@ -24,7 +24,6 @@ msm_fence_context_alloc(struct drm_device *dev, volatile uint32_t *fenceptr,
> > strncpy(fctx->name, name, sizeof(fctx->name));
> > fctx->context = dma_fence_context_alloc(1);
> > fctx->fenceptr = fenceptr;
> > - init_waitqueue_head(&fctx->event);
> > spin_lock_init(&fctx->spinlock);
> >
> > return fctx;
> > @@ -45,53 +44,12 @@ static inline bool fence_completed(struct msm_fence_context *fctx, uint32_t fenc
> > (int32_t)(*fctx->fenceptr - fence) >= 0;
> > }
> >
> > -/* legacy path for WAIT_FENCE ioctl: */
> > -int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
> > - ktime_t *timeout, bool interruptible)
> > -{
> > - int ret;
> > -
> > - if (fence > fctx->last_fence) {
> > - DRM_ERROR_RATELIMITED("%s: waiting on invalid fence: %u (of %u)\n",
> > - fctx->name, fence, fctx->last_fence);
> > - return -EINVAL;
>
> Rob, we changed this pre-existing behaviour in this patch. Now, when
> userspace tries to wait on a future fence, we don't return an error.
>
> I just want to check if this was accidental or not?

Hmm, perhaps we should do this to restore the previous behavior:

-------------
diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 73e827641024..3dd6da56eae6 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -1000,8 +1000,12 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
fence = dma_fence_get_rcu(fence);
mutex_unlock(&queue->lock);

- if (!fence)
- return 0;
+ if (!fence) {
+ struct msm_fence_context *fctx = gpu->rb[queue->ring_nr]->fctx;
+ DRM_ERROR_RATELIMITED("%s: waiting on invalid fence: %u (of %u)\n",
+ fctx->name, fence, fctx->last_fence);
+ return -EINVAL;
+ }

ret = dma_fence_wait_timeout(fence, true, timeout_to_jiffies(&timeout));
if (ret == 0) {
-------------

BR,
-R

> -Akhil.
>
> > - }
> > -
> > - if (!timeout) {
> > - /* no-wait: */
> > - ret = fence_completed(fctx, fence) ? 0 : -EBUSY;
> > - } else {
> > - unsigned long remaining_jiffies = timeout_to_jiffies(timeout);
> > -
> > - if (interruptible)
> > - ret = wait_event_interruptible_timeout(fctx->event,
> > - fence_completed(fctx, fence),
> > - remaining_jiffies);
> > - else
> > - ret = wait_event_timeout(fctx->event,
> > - fence_completed(fctx, fence),
> > - remaining_jiffies);
> > -
> > - if (ret == 0) {
> > - DBG("timeout waiting for fence: %u (completed: %u)",
> > - fence, fctx->completed_fence);
> > - ret = -ETIMEDOUT;
> > - } else if (ret != -ERESTARTSYS) {
> > - ret = 0;
> > - }
> > - }
> > -
> > - return ret;
> > -}
> > -
> > /* called from workqueue */
> > void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence)
> > {
> > spin_lock(&fctx->spinlock);
> > fctx->completed_fence = max(fence, fctx->completed_fence);
> > spin_unlock(&fctx->spinlock);
> > -
> > - wake_up_all(&fctx->event);
> > }
> >
> > struct msm_fence {
> > diff --git a/drivers/gpu/drm/msm/msm_fence.h b/drivers/gpu/drm/msm/msm_fence.h
> > index 6ab97062ff1a..4783db528bcc 100644
> > --- a/drivers/gpu/drm/msm/msm_fence.h
> > +++ b/drivers/gpu/drm/msm/msm_fence.h
> > @@ -49,7 +49,6 @@ struct msm_fence_context {
> > */
> > volatile uint32_t *fenceptr;
> >
> > - wait_queue_head_t event;
> > spinlock_t spinlock;
> > };
> >
> > @@ -57,8 +56,6 @@ struct msm_fence_context * msm_fence_context_alloc(struct drm_device *dev,
> > volatile uint32_t *fenceptr, const char *name);
> > void msm_fence_context_free(struct msm_fence_context *fctx);
> >
> > -int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
> > - ktime_t *timeout, bool interruptible);
> > void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence);
> >
> > struct dma_fence * msm_fence_alloc(struct msm_fence_context *fctx);
> > diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> > index da3af702a6c8..e0579abda5b9 100644
> > --- a/drivers/gpu/drm/msm/msm_gem.h
> > +++ b/drivers/gpu/drm/msm/msm_gem.h
> > @@ -320,6 +320,7 @@ struct msm_gem_submit {
> > struct ww_acquire_ctx ticket;
> > uint32_t seqno; /* Sequence number of the submit on the ring */
> > struct dma_fence *fence;
> > + int fence_id; /* key into queue->fence_idr */
> > struct msm_gpu_submitqueue *queue;
> > struct pid *pid; /* submitting process */
> > bool fault_dumped; /* Limit devcoredump dumping to one per submit */
> > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > index 4f02fa3c78f9..f6f595aae2c5 100644
> > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > @@ -68,7 +68,14 @@ void __msm_gem_submit_destroy(struct kref *kref)
> > container_of(kref, struct msm_gem_submit, ref);
> > unsigned i;
> >
> > + if (submit->fence_id) {
> > + mutex_lock(&submit->queue->lock);
> > + idr_remove(&submit->queue->fence_idr, submit->fence_id);
> > + mutex_unlock(&submit->queue->lock);
> > + }
> > +
> > dma_fence_put(submit->fence);
> > +
> > put_pid(submit->pid);
> > msm_submitqueue_put(submit->queue);
> >
> > @@ -872,6 +879,20 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > goto out;
> > }
> >
> > + /*
> > + * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> > + * to the underlying fence.
> > + */
> > + mutex_lock(&queue->lock);
> > + submit->fence_id = idr_alloc_cyclic(&queue->fence_idr,
> > + submit->fence, 0, INT_MAX, GFP_KERNEL);
> > + mutex_unlock(&queue->lock);
> > + if (submit->fence_id < 0) {
> > + ret = submit->fence_id = 0;
> > + submit->fence_id = 0;
> > + goto out;
> > + }
> > +
> > if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> > struct sync_file *sync_file = sync_file_create(submit->fence);
> > if (!sync_file) {
> > @@ -886,7 +907,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >
> > msm_gpu_submit(gpu, submit);
> >
> > - args->fence = submit->fence->seqno;
> > + args->fence = submit->fence_id;
> >
> > msm_reset_syncobjs(syncobjs_to_reset, args->nr_in_syncobjs);
> > msm_process_post_deps(post_deps, args->nr_out_syncobjs,
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index 96efcb31e502..579627252540 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -263,6 +263,9 @@ struct msm_gpu_perfcntr {
> > * which set of pgtables do submits jobs associated with the
> > * submitqueue use)
> > * @node: node in the context's list of submitqueues
> > + * @fence_idr: maps fence-id to dma_fence for userspace visible fence
> > + * seqno, protected by submitqueue lock
> > + * @lock: submitqueue lock
> > * @ref: reference count
> > */
> > struct msm_gpu_submitqueue {
> > @@ -272,6 +275,8 @@ struct msm_gpu_submitqueue {
> > int faults;
> > struct msm_file_private *ctx;
> > struct list_head node;
> > + struct idr fence_idr;
> > + struct mutex lock;
> > struct kref ref;
> > };
> >
> > diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> > index 9e9fec61d629..66f8d0fb38b0 100644
> > --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> > +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> > @@ -12,6 +12,8 @@ void msm_submitqueue_destroy(struct kref *kref)
> > struct msm_gpu_submitqueue *queue = container_of(kref,
> > struct msm_gpu_submitqueue, ref);
> >
> > + idr_destroy(&queue->fence_idr);
> > +
> > msm_file_private_put(queue->ctx);
> >
> > kfree(queue);
> > @@ -89,6 +91,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> > if (id)
> > *id = queue->id;
> >
> > + idr_init(&queue->fence_idr);
> > + mutex_init(&queue->lock);
> > +
> > list_add_tail(&queue->node, &ctx->submitqueues);
> >
> > write_unlock(&ctx->queuelock);
> >
>
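
A minimal userspace sketch of the lookup behaviour discussed above, assuming a tiny fixed-size table in place of the kernel idr. All names here (fence_table, table_alloc_cyclic, wait_fence_model, TABLE_SIZE) are hypothetical and only model the idea; they are not the driver's code. The point it illustrates is that once an entry is removed on retire, an id that was retired and an id that was never issued both hit the same "not found" path.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8	/* hypothetical; tiny so the cyclic search is easy to follow */

enum wait_result {
	NOTHING_TO_WAIT_FOR,	/* models the "if (!fence) return 0" path */
	WOULD_BLOCK,		/* a real implementation would wait on the fence here */
};

struct fence_table {
	const void *slot[TABLE_SIZE];	/* NULL means no fence is stored under this id */
	uint32_t next;			/* where the next cyclic search starts */
};

/* Store fence under the next free id, searching cyclically; -1 if the table is full. */
static int table_alloc_cyclic(struct fence_table *t, const void *fence)
{
	for (uint32_t i = 0; i < TABLE_SIZE; i++) {
		uint32_t id = (t->next + i) % TABLE_SIZE;

		if (!t->slot[id]) {
			t->slot[id] = fence;
			t->next = id + 1;
			return (int)id;
		}
	}
	return -1;
}

/* Called when the submit is retired: the id no longer maps to a fence. */
static void table_remove(struct fence_table *t, int id)
{
	t->slot[id] = NULL;
}

/* Look the id up; a missing entry is reported as "nothing to wait for". */
static enum wait_result wait_fence_model(const struct fence_table *t, uint32_t id)
{
	if (id >= TABLE_SIZE || !t->slot[id])
		return NOTHING_TO_WAIT_FOR;
	return WOULD_BLOCK;
}
```

In this model a lookup alone cannot tell "already retired" from "not yet submitted", which is why returning success for every missing id also hides waits on future fences.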

2021-11-11 15:54:15

by Akhil P Oommen

[permalink] [raw]
Subject: Re: [PATCH v4 07/13] drm/msm: Track "seqno" fences by idr

On 11/10/2021 10:25 PM, Rob Clark wrote:
> On Wed, Nov 10, 2021 at 7:28 AM Akhil P Oommen <[email protected]> wrote:
>>
>> On 7/28/2021 6:36 AM, Rob Clark wrote:
>>> From: Rob Clark <[email protected]>
>>>
>>> Previously the (non-fd) fence returned from submit ioctl was a raw
>>> seqno, which is scoped to the ring. But from UABI standpoint, the
>>> ioctls related to seqno fences all specify a submitqueue. We can
>>> take advantage of that to replace the seqno fences with a cyclic idr
>>> handle.
>>>
>>> This is in preparation for moving to drm scheduler, at which point
>>> the submit ioctl will return after queuing the submit job to the
>>> scheduler, but before the submit is written into the ring (and
>>> therefore before a ring seqno has been assigned). Which means we
>>> need to replace the dma_fence that userspace may need to wait on
>>> with a scheduler fence.
>>>
>>> Signed-off-by: Rob Clark <[email protected]>
>>> Acked-by: Christian König <[email protected]>
>>> ---
>>> drivers/gpu/drm/msm/msm_drv.c | 30 +++++++++++++++++--
>>> drivers/gpu/drm/msm/msm_fence.c | 42 ---------------------------
>>> drivers/gpu/drm/msm/msm_fence.h | 3 --
>>> drivers/gpu/drm/msm/msm_gem.h | 1 +
>>> drivers/gpu/drm/msm/msm_gem_submit.c | 23 ++++++++++++++-
>>> drivers/gpu/drm/msm/msm_gpu.h | 5 ++++
>>> drivers/gpu/drm/msm/msm_submitqueue.c | 5 ++++
>>> 7 files changed, 61 insertions(+), 48 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
>>> index 9b8fa2ad0d84..1594ae39d54f 100644
>>> --- a/drivers/gpu/drm/msm/msm_drv.c
>>> +++ b/drivers/gpu/drm/msm/msm_drv.c
>>> @@ -911,6 +911,7 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
>>> ktime_t timeout = to_ktime(args->timeout);
>>> struct msm_gpu_submitqueue *queue;
>>> struct msm_gpu *gpu = priv->gpu;
>>> + struct dma_fence *fence;
>>> int ret;
>>>
>>> if (args->pad) {
>>> @@ -925,10 +926,35 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
>>> if (!queue)
>>> return -ENOENT;
>>>
>>> - ret = msm_wait_fence(gpu->rb[queue->prio]->fctx, args->fence, &timeout,
>>> - true);
>>> + /*
>>> + * Map submitqueue scoped "seqno" (which is actually an idr key)
>>> + * back to underlying dma-fence
>>> + *
>>> + * The fence is removed from the fence_idr when the submit is
>>> + * retired, so if the fence is not found it means there is nothing
>>> + * to wait for
>>> + */
>>> + ret = mutex_lock_interruptible(&queue->lock);
>>> + if (ret)
>>> + return ret;
>>> + fence = idr_find(&queue->fence_idr, args->fence);
>>> + if (fence)
>>> + fence = dma_fence_get_rcu(fence);
>>> + mutex_unlock(&queue->lock);
>>> +
>>> + if (!fence)
>>> + return 0;
>>>
>>> + ret = dma_fence_wait_timeout(fence, true, timeout_to_jiffies(&timeout));
>>> + if (ret == 0) {
>>> + ret = -ETIMEDOUT;
>>> + } else if (ret != -ERESTARTSYS) {
>>> + ret = 0;
>>> + }
>>> +
>>> + dma_fence_put(fence);
>>> msm_submitqueue_put(queue);
>>> +
>>> return ret;
>>> }
>>>
>>> diff --git a/drivers/gpu/drm/msm/msm_fence.c b/drivers/gpu/drm/msm/msm_fence.c
>>> index b92a9091a1e2..f2cece542c3f 100644
>>> --- a/drivers/gpu/drm/msm/msm_fence.c
>>> +++ b/drivers/gpu/drm/msm/msm_fence.c
>>> @@ -24,7 +24,6 @@ msm_fence_context_alloc(struct drm_device *dev, volatile uint32_t *fenceptr,
>>> strncpy(fctx->name, name, sizeof(fctx->name));
>>> fctx->context = dma_fence_context_alloc(1);
>>> fctx->fenceptr = fenceptr;
>>> - init_waitqueue_head(&fctx->event);
>>> spin_lock_init(&fctx->spinlock);
>>>
>>> return fctx;
>>> @@ -45,53 +44,12 @@ static inline bool fence_completed(struct msm_fence_context *fctx, uint32_t fenc
>>> (int32_t)(*fctx->fenceptr - fence) >= 0;
>>> }
>>>
>>> -/* legacy path for WAIT_FENCE ioctl: */
>>> -int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
>>> - ktime_t *timeout, bool interruptible)
>>> -{
>>> - int ret;
>>> -
>>> - if (fence > fctx->last_fence) {
>>> - DRM_ERROR_RATELIMITED("%s: waiting on invalid fence: %u (of %u)\n",
>>> - fctx->name, fence, fctx->last_fence);
>>> - return -EINVAL;
>>
>> Rob, we changed this pre-existing behaviour in this patch. Now, when
>> userspace tries to wait on a future fence, we don't return an error.
>>
>> I just want to check if this was accidental or not?
>
> Hmm, perhaps we should do this to restore the previous behavior:
>
> -------------
> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> index 73e827641024..3dd6da56eae6 100644
> --- a/drivers/gpu/drm/msm/msm_drv.c
> +++ b/drivers/gpu/drm/msm/msm_drv.c
> @@ -1000,8 +1000,12 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
> fence = dma_fence_get_rcu(fence);
> mutex_unlock(&queue->lock);
>
> - if (!fence)
> - return 0;
> + if (!fence) {
> + struct msm_fence_context *fctx = gpu->rb[queue->ring_nr]->fctx;
> + DRM_ERROR_RATELIMITED("%s: waiting on invalid fence: %u (of %u)\n",
> + fctx->name, fence, fctx->last_fence);
> + return -EINVAL;
> + }

With this, when userspace tries to wait on a fence which is already
retired, it gets -EINVAL instead of success. Will this break userspace?

-Akhil.

>
> ret = dma_fence_wait_timeout(fence, true, timeout_to_jiffies(&timeout));
> if (ret == 0) {
> -------------
>
> BR,
> -R
>
>> -Akhil.
>>
>>> - }
>>> -
>>> - if (!timeout) {
>>> - /* no-wait: */
>>> - ret = fence_completed(fctx, fence) ? 0 : -EBUSY;
>>> - } else {
>>> - unsigned long remaining_jiffies = timeout_to_jiffies(timeout);
>>> -
>>> - if (interruptible)
>>> - ret = wait_event_interruptible_timeout(fctx->event,
>>> - fence_completed(fctx, fence),
>>> - remaining_jiffies);
>>> - else
>>> - ret = wait_event_timeout(fctx->event,
>>> - fence_completed(fctx, fence),
>>> - remaining_jiffies);
>>> -
>>> - if (ret == 0) {
>>> - DBG("timeout waiting for fence: %u (completed: %u)",
>>> - fence, fctx->completed_fence);
>>> - ret = -ETIMEDOUT;
>>> - } else if (ret != -ERESTARTSYS) {
>>> - ret = 0;
>>> - }
>>> - }
>>> -
>>> - return ret;
>>> -}
>>> -
>>> /* called from workqueue */
>>> void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence)
>>> {
>>> spin_lock(&fctx->spinlock);
>>> fctx->completed_fence = max(fence, fctx->completed_fence);
>>> spin_unlock(&fctx->spinlock);
>>> -
>>> - wake_up_all(&fctx->event);
>>> }
>>>
>>> struct msm_fence {
>>> diff --git a/drivers/gpu/drm/msm/msm_fence.h b/drivers/gpu/drm/msm/msm_fence.h
>>> index 6ab97062ff1a..4783db528bcc 100644
>>> --- a/drivers/gpu/drm/msm/msm_fence.h
>>> +++ b/drivers/gpu/drm/msm/msm_fence.h
>>> @@ -49,7 +49,6 @@ struct msm_fence_context {
>>> */
>>> volatile uint32_t *fenceptr;
>>>
>>> - wait_queue_head_t event;
>>> spinlock_t spinlock;
>>> };
>>>
>>> @@ -57,8 +56,6 @@ struct msm_fence_context * msm_fence_context_alloc(struct drm_device *dev,
>>> volatile uint32_t *fenceptr, const char *name);
>>> void msm_fence_context_free(struct msm_fence_context *fctx);
>>>
>>> -int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
>>> - ktime_t *timeout, bool interruptible);
>>> void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence);
>>>
>>> struct dma_fence * msm_fence_alloc(struct msm_fence_context *fctx);
>>> diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
>>> index da3af702a6c8..e0579abda5b9 100644
>>> --- a/drivers/gpu/drm/msm/msm_gem.h
>>> +++ b/drivers/gpu/drm/msm/msm_gem.h
>>> @@ -320,6 +320,7 @@ struct msm_gem_submit {
>>> struct ww_acquire_ctx ticket;
>>> uint32_t seqno; /* Sequence number of the submit on the ring */
>>> struct dma_fence *fence;
>>> + int fence_id; /* key into queue->fence_idr */
>>> struct msm_gpu_submitqueue *queue;
>>> struct pid *pid; /* submitting process */
>>> bool fault_dumped; /* Limit devcoredump dumping to one per submit */
>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> index 4f02fa3c78f9..f6f595aae2c5 100644
>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> @@ -68,7 +68,14 @@ void __msm_gem_submit_destroy(struct kref *kref)
>>> container_of(kref, struct msm_gem_submit, ref);
>>> unsigned i;
>>>
>>> + if (submit->fence_id) {
>>> + mutex_lock(&submit->queue->lock);
>>> + idr_remove(&submit->queue->fence_idr, submit->fence_id);
>>> + mutex_unlock(&submit->queue->lock);
>>> + }
>>> +
>>> dma_fence_put(submit->fence);
>>> +
>>> put_pid(submit->pid);
>>> msm_submitqueue_put(submit->queue);
>>>
>>> @@ -872,6 +879,20 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>>> goto out;
>>> }
>>>
>>> + /*
>>> + * Allocate an id which can be used by WAIT_FENCE ioctl to map back
>>> + * to the underlying fence.
>>> + */
>>> + mutex_lock(&queue->lock);
>>> + submit->fence_id = idr_alloc_cyclic(&queue->fence_idr,
>>> + submit->fence, 0, INT_MAX, GFP_KERNEL);
>>> + mutex_unlock(&queue->lock);
>>> + if (submit->fence_id < 0) {
>>> + ret = submit->fence_id = 0;
>>> + submit->fence_id = 0;
>>> + goto out;
>>> + }
>>> +
>>> if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
>>> struct sync_file *sync_file = sync_file_create(submit->fence);
>>> if (!sync_file) {
>>> @@ -886,7 +907,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>>>
>>> msm_gpu_submit(gpu, submit);
>>>
>>> - args->fence = submit->fence->seqno;
>>> + args->fence = submit->fence_id;
>>>
>>> msm_reset_syncobjs(syncobjs_to_reset, args->nr_in_syncobjs);
>>> msm_process_post_deps(post_deps, args->nr_out_syncobjs,
>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>> index 96efcb31e502..579627252540 100644
>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>> @@ -263,6 +263,9 @@ struct msm_gpu_perfcntr {
>>> * which set of pgtables do submits jobs associated with the
>>> * submitqueue use)
>>> * @node: node in the context's list of submitqueues
>>> + * @fence_idr: maps fence-id to dma_fence for userspace visible fence
>>> + * seqno, protected by submitqueue lock
>>> + * @lock: submitqueue lock
>>> * @ref: reference count
>>> */
>>> struct msm_gpu_submitqueue {
>>> @@ -272,6 +275,8 @@ struct msm_gpu_submitqueue {
>>> int faults;
>>> struct msm_file_private *ctx;
>>> struct list_head node;
>>> + struct idr fence_idr;
>>> + struct mutex lock;
>>> struct kref ref;
>>> };
>>>
>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
>>> index 9e9fec61d629..66f8d0fb38b0 100644
>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>> @@ -12,6 +12,8 @@ void msm_submitqueue_destroy(struct kref *kref)
>>> struct msm_gpu_submitqueue *queue = container_of(kref,
>>> struct msm_gpu_submitqueue, ref);
>>>
>>> + idr_destroy(&queue->fence_idr);
>>> +
>>> msm_file_private_put(queue->ctx);
>>>
>>> kfree(queue);
>>> @@ -89,6 +91,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>> if (id)
>>> *id = queue->id;
>>>
>>> + idr_init(&queue->fence_idr);
>>> + mutex_init(&queue->lock);
>>> +
>>> list_add_tail(&queue->node, &ctx->submitqueues);
>>>
>>> write_unlock(&ctx->queuelock);
>>>
>>


2021-11-11 17:25:09

by Rob Clark

[permalink] [raw]
Subject: Re: [Freedreno] [PATCH v4 07/13] drm/msm: Track "seqno" fences by idr

On Thu, Nov 11, 2021 at 7:54 AM Akhil P Oommen <[email protected]> wrote:
>
> On 11/10/2021 10:25 PM, Rob Clark wrote:
> > On Wed, Nov 10, 2021 at 7:28 AM Akhil P Oommen <[email protected]> wrote:
> >>
> >> On 7/28/2021 6:36 AM, Rob Clark wrote:
> >>> From: Rob Clark <[email protected]>
> >>>
> >>> Previously the (non-fd) fence returned from submit ioctl was a raw
> >>> seqno, which is scoped to the ring. But from UABI standpoint, the
> >>> ioctls related to seqno fences all specify a submitqueue. We can
> >>> take advantage of that to replace the seqno fences with a cyclic idr
> >>> handle.
> >>>
> >>> This is in preparation for moving to drm scheduler, at which point
> >>> the submit ioctl will return after queuing the submit job to the
> >>> scheduler, but before the submit is written into the ring (and
> >>> therefore before a ring seqno has been assigned). Which means we
> >>> need to replace the dma_fence that userspace may need to wait on
> >>> with a scheduler fence.
> >>>
> >>> Signed-off-by: Rob Clark <[email protected]>
> >>> Acked-by: Christian König <[email protected]>
> >>> ---
> >>> drivers/gpu/drm/msm/msm_drv.c | 30 +++++++++++++++++--
> >>> drivers/gpu/drm/msm/msm_fence.c | 42 ---------------------------
> >>> drivers/gpu/drm/msm/msm_fence.h | 3 --
> >>> drivers/gpu/drm/msm/msm_gem.h | 1 +
> >>> drivers/gpu/drm/msm/msm_gem_submit.c | 23 ++++++++++++++-
> >>> drivers/gpu/drm/msm/msm_gpu.h | 5 ++++
> >>> drivers/gpu/drm/msm/msm_submitqueue.c | 5 ++++
> >>> 7 files changed, 61 insertions(+), 48 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> >>> index 9b8fa2ad0d84..1594ae39d54f 100644
> >>> --- a/drivers/gpu/drm/msm/msm_drv.c
> >>> +++ b/drivers/gpu/drm/msm/msm_drv.c
> >>> @@ -911,6 +911,7 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
> >>> ktime_t timeout = to_ktime(args->timeout);
> >>> struct msm_gpu_submitqueue *queue;
> >>> struct msm_gpu *gpu = priv->gpu;
> >>> + struct dma_fence *fence;
> >>> int ret;
> >>>
> >>> if (args->pad) {
> >>> @@ -925,10 +926,35 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
> >>> if (!queue)
> >>> return -ENOENT;
> >>>
> >>> - ret = msm_wait_fence(gpu->rb[queue->prio]->fctx, args->fence, &timeout,
> >>> - true);
> >>> + /*
> >>> + * Map submitqueue scoped "seqno" (which is actually an idr key)
> >>> + * back to underlying dma-fence
> >>> + *
> >>> + * The fence is removed from the fence_idr when the submit is
> >>> + * retired, so if the fence is not found it means there is nothing
> >>> + * to wait for
> >>> + */
> >>> + ret = mutex_lock_interruptible(&queue->lock);
> >>> + if (ret)
> >>> + return ret;
> >>> + fence = idr_find(&queue->fence_idr, args->fence);
> >>> + if (fence)
> >>> + fence = dma_fence_get_rcu(fence);
> >>> + mutex_unlock(&queue->lock);
> >>> +
> >>> + if (!fence)
> >>> + return 0;
> >>>
> >>> + ret = dma_fence_wait_timeout(fence, true, timeout_to_jiffies(&timeout));
> >>> + if (ret == 0) {
> >>> + ret = -ETIMEDOUT;
> >>> + } else if (ret != -ERESTARTSYS) {
> >>> + ret = 0;
> >>> + }
> >>> +
> >>> + dma_fence_put(fence);
> >>> msm_submitqueue_put(queue);
> >>> +
> >>> return ret;
> >>> }
> >>>
> >>> diff --git a/drivers/gpu/drm/msm/msm_fence.c b/drivers/gpu/drm/msm/msm_fence.c
> >>> index b92a9091a1e2..f2cece542c3f 100644
> >>> --- a/drivers/gpu/drm/msm/msm_fence.c
> >>> +++ b/drivers/gpu/drm/msm/msm_fence.c
> >>> @@ -24,7 +24,6 @@ msm_fence_context_alloc(struct drm_device *dev, volatile uint32_t *fenceptr,
> >>> strncpy(fctx->name, name, sizeof(fctx->name));
> >>> fctx->context = dma_fence_context_alloc(1);
> >>> fctx->fenceptr = fenceptr;
> >>> - init_waitqueue_head(&fctx->event);
> >>> spin_lock_init(&fctx->spinlock);
> >>>
> >>> return fctx;
> >>> @@ -45,53 +44,12 @@ static inline bool fence_completed(struct msm_fence_context *fctx, uint32_t fenc
> >>> (int32_t)(*fctx->fenceptr - fence) >= 0;
> >>> }
> >>>
> >>> -/* legacy path for WAIT_FENCE ioctl: */
> >>> -int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
> >>> - ktime_t *timeout, bool interruptible)
> >>> -{
> >>> - int ret;
> >>> -
> >>> - if (fence > fctx->last_fence) {
> >>> - DRM_ERROR_RATELIMITED("%s: waiting on invalid fence: %u (of %u)\n",
> >>> - fctx->name, fence, fctx->last_fence);
> >>> - return -EINVAL;
> >>
> >> Rob, we changed this pre-existing behaviour in this patch. Now, when
> >> userspace tries to wait on a future fence, we don't return an error.
> >>
> >> I just want to check if this was accidental or not?
> >
> > Hmm, perhaps we should do this to restore the previous behavior:
> >
> > -------------
> > diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> > index 73e827641024..3dd6da56eae6 100644
> > --- a/drivers/gpu/drm/msm/msm_drv.c
> > +++ b/drivers/gpu/drm/msm/msm_drv.c
> > @@ -1000,8 +1000,12 @@ static int msm_ioctl_wait_fence(struct drm_device *dev, void *data,
> > fence = dma_fence_get_rcu(fence);
> > mutex_unlock(&queue->lock);
> >
> > - if (!fence)
> > - return 0;
> > + if (!fence) {
> > + struct msm_fence_context *fctx = gpu->rb[queue->ring_nr]->fctx;
> > + DRM_ERROR_RATELIMITED("%s: waiting on invalid fence: %u (of %u)\n",
> > + fctx->name, fence, fctx->last_fence);
> > + return -EINVAL;
> > + }
>
> With this, when userspace tries to wait on a fence which is already
> retired, it gets -EINVAL instead of success. Will this break userspace?

Oh, right, we definitely don't want that.. I guess that was the reason
for the original logic.

I have a different idea.. will send a patch in a bit.

BR,
-R

> -Akhil.
>
> >
> > ret = dma_fence_wait_timeout(fence, true, timeout_to_jiffies(&timeout));
> > if (ret == 0) {
> > -------------
> >
> > BR,
> > -R
> >
> >> -Akhil.
> >>
> >>> - }
> >>> -
> >>> - if (!timeout) {
> >>> - /* no-wait: */
> >>> - ret = fence_completed(fctx, fence) ? 0 : -EBUSY;
> >>> - } else {
> >>> - unsigned long remaining_jiffies = timeout_to_jiffies(timeout);
> >>> -
> >>> - if (interruptible)
> >>> - ret = wait_event_interruptible_timeout(fctx->event,
> >>> - fence_completed(fctx, fence),
> >>> - remaining_jiffies);
> >>> - else
> >>> - ret = wait_event_timeout(fctx->event,
> >>> - fence_completed(fctx, fence),
> >>> - remaining_jiffies);
> >>> -
> >>> - if (ret == 0) {
> >>> - DBG("timeout waiting for fence: %u (completed: %u)",
> >>> - fence, fctx->completed_fence);
> >>> - ret = -ETIMEDOUT;
> >>> - } else if (ret != -ERESTARTSYS) {
> >>> - ret = 0;
> >>> - }
> >>> - }
> >>> -
> >>> - return ret;
> >>> -}
> >>> -
> >>> /* called from workqueue */
> >>> void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence)
> >>> {
> >>> spin_lock(&fctx->spinlock);
> >>> fctx->completed_fence = max(fence, fctx->completed_fence);
> >>> spin_unlock(&fctx->spinlock);
> >>> -
> >>> - wake_up_all(&fctx->event);
> >>> }
> >>>
> >>> struct msm_fence {
> >>> diff --git a/drivers/gpu/drm/msm/msm_fence.h b/drivers/gpu/drm/msm/msm_fence.h
> >>> index 6ab97062ff1a..4783db528bcc 100644
> >>> --- a/drivers/gpu/drm/msm/msm_fence.h
> >>> +++ b/drivers/gpu/drm/msm/msm_fence.h
> >>> @@ -49,7 +49,6 @@ struct msm_fence_context {
> >>> */
> >>> volatile uint32_t *fenceptr;
> >>>
> >>> - wait_queue_head_t event;
> >>> spinlock_t spinlock;
> >>> };
> >>>
> >>> @@ -57,8 +56,6 @@ struct msm_fence_context * msm_fence_context_alloc(struct drm_device *dev,
> >>> volatile uint32_t *fenceptr, const char *name);
> >>> void msm_fence_context_free(struct msm_fence_context *fctx);
> >>>
> >>> -int msm_wait_fence(struct msm_fence_context *fctx, uint32_t fence,
> >>> - ktime_t *timeout, bool interruptible);
> >>> void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence);
> >>>
> >>> struct dma_fence * msm_fence_alloc(struct msm_fence_context *fctx);
> >>> diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> >>> index da3af702a6c8..e0579abda5b9 100644
> >>> --- a/drivers/gpu/drm/msm/msm_gem.h
> >>> +++ b/drivers/gpu/drm/msm/msm_gem.h
> >>> @@ -320,6 +320,7 @@ struct msm_gem_submit {
> >>> struct ww_acquire_ctx ticket;
> >>> uint32_t seqno; /* Sequence number of the submit on the ring */
> >>> struct dma_fence *fence;
> >>> + int fence_id; /* key into queue->fence_idr */
> >>> struct msm_gpu_submitqueue *queue;
> >>> struct pid *pid; /* submitting process */
> >>> bool fault_dumped; /* Limit devcoredump dumping to one per submit */
> >>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>> index 4f02fa3c78f9..f6f595aae2c5 100644
> >>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> >>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>> @@ -68,7 +68,14 @@ void __msm_gem_submit_destroy(struct kref *kref)
> >>> container_of(kref, struct msm_gem_submit, ref);
> >>> unsigned i;
> >>>
> >>> + if (submit->fence_id) {
> >>> + mutex_lock(&submit->queue->lock);
> >>> + idr_remove(&submit->queue->fence_idr, submit->fence_id);
> >>> + mutex_unlock(&submit->queue->lock);
> >>> + }
> >>> +
> >>> dma_fence_put(submit->fence);
> >>> +
> >>> put_pid(submit->pid);
> >>> msm_submitqueue_put(submit->queue);
> >>>
> >>> @@ -872,6 +879,20 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >>> goto out;
> >>> }
> >>>
> >>> + /*
> >>> + * Allocate an id which can be used by WAIT_FENCE ioctl to map back
> >>> + * to the underlying fence.
> >>> + */
> >>> + mutex_lock(&queue->lock);
> >>> + submit->fence_id = idr_alloc_cyclic(&queue->fence_idr,
> >>> + submit->fence, 0, INT_MAX, GFP_KERNEL);
> >>> + mutex_unlock(&queue->lock);
> >>> + if (submit->fence_id < 0) {
> >>> + ret = submit->fence_id = 0;
> >>> + submit->fence_id = 0;
> >>> + goto out;
> >>> + }
> >>> +
> >>> if (args->flags & MSM_SUBMIT_FENCE_FD_OUT) {
> >>> struct sync_file *sync_file = sync_file_create(submit->fence);
> >>> if (!sync_file) {
> >>> @@ -886,7 +907,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >>>
> >>> msm_gpu_submit(gpu, submit);
> >>>
> >>> - args->fence = submit->fence->seqno;
> >>> + args->fence = submit->fence_id;
> >>>
> >>> msm_reset_syncobjs(syncobjs_to_reset, args->nr_in_syncobjs);
> >>> msm_process_post_deps(post_deps, args->nr_out_syncobjs,
> >>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> >>> index 96efcb31e502..579627252540 100644
> >>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >>> @@ -263,6 +263,9 @@ struct msm_gpu_perfcntr {
> >>> * which set of pgtables do submits jobs associated with the
> >>> * submitqueue use)
> >>> * @node: node in the context's list of submitqueues
> >>> + * @fence_idr: maps fence-id to dma_fence for userspace visible fence
> >>> + * seqno, protected by submitqueue lock
> >>> + * @lock: submitqueue lock
> >>> * @ref: reference count
> >>> */
> >>> struct msm_gpu_submitqueue {
> >>> @@ -272,6 +275,8 @@ struct msm_gpu_submitqueue {
> >>> int faults;
> >>> struct msm_file_private *ctx;
> >>> struct list_head node;
> >>> + struct idr fence_idr;
> >>> + struct mutex lock;
> >>> struct kref ref;
> >>> };
> >>>
> >>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> >>> index 9e9fec61d629..66f8d0fb38b0 100644
> >>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> >>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> >>> @@ -12,6 +12,8 @@ void msm_submitqueue_destroy(struct kref *kref)
> >>> struct msm_gpu_submitqueue *queue = container_of(kref,
> >>> struct msm_gpu_submitqueue, ref);
> >>>
> >>> + idr_destroy(&queue->fence_idr);
> >>> +
> >>> msm_file_private_put(queue->ctx);
> >>>
> >>> kfree(queue);
> >>> @@ -89,6 +91,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> >>> if (id)
> >>> *id = queue->id;
> >>>
> >>> + idr_init(&queue->fence_idr);
> >>> + mutex_init(&queue->lock);
> >>> +
> >>> list_add_tail(&queue->node, &ctx->submitqueues);
> >>>
> >>> write_unlock(&ctx->queuelock);
> >>>
> >>
>

2022-05-23 14:45:54

by Tvrtko Ursulin

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities


Hi Rob,

On 28/07/2021 02:06, Rob Clark wrote:
> From: Rob Clark <[email protected]>
>
> The drm/scheduler provides additional prioritization on top of that
> provided by however many ringbuffers (each with its own priority
> level) are supported on a given generation. Expose the additional
> levels of priority to userspace and map the userspace priority back
> to ring (first level of priority) and scheduler priority (additional
> priority levels within the ring).
>
> Signed-off-by: Rob Clark <[email protected]>
> Acked-by: Christian König <[email protected]>
> ---
> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
> include/uapi/drm/msm_drm.h | 14 +++++-
> 5 files changed, 88 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index bad4809b68ef..748665232d29 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
> return ret;
> }
> return -EINVAL;
> - case MSM_PARAM_NR_RINGS:
> - *value = gpu->nr_rings;
> + case MSM_PARAM_PRIORITIES:
> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
> return 0;
> case MSM_PARAM_PP_PGTABLE:
> *value = 0;
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 450efe59abb5..c2ecec5b11c4 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> submit->gpu = gpu;
> submit->cmd = (void *)&submit->bos[nr_bos];
> submit->queue = queue;
> - submit->ring = gpu->rb[queue->prio];
> + submit->ring = gpu->rb[queue->ring_nr];
> submit->fault_dumped = false;
>
> INIT_LIST_HEAD(&submit->node);
> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> /* Get a unique identifier for the submission for logging purposes */
> submitid = atomic_inc_return(&ident) - 1;
>
> - ring = gpu->rb[queue->prio];
> + ring = gpu->rb[queue->ring_nr];
> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
> args->nr_bos, args->nr_cmds);
>
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index b912cacaecc0..0e4b45bff2e6 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
> const char *name;
> };
>
> +/*
> + * The number of priority levels provided by drm gpu scheduler. The
> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
> + * cases, so we don't use it (no need for kernel generated jobs).
> + */
> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
> +
> +/**
> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
> + *
> + * @gpu: the gpu instance
> + * @prio: the userspace priority level
> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
> + * priority maps to
> + *
> + * With drm/scheduler providing its own level of prioritization, our total
> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
> + * Each ring is associated with its own scheduler instance. However, our
> + * UABI is that lower numerical values are higher priority. So mapping the
> + * single userspace priority level into ring_nr and sched_prio takes some
> + * care. The userspace provided priority (when a submitqueue is created)
> + * is mapped to ring nr and scheduler priority as such:
> + *
> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
> + * sched_prio = NR_SCHED_PRIORITIES -
> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
> + *
> + * This allows generations without preemption (nr_rings==1) to have some
> + * amount of prioritization, and provides more priority levels for gens
> + * that do have preemption.

I am exploring how different drivers handle priority levels and this
caught my eye.

Is the implication of the last paragraphs that on hw with nr_rings > 1,
ring + 1 preempts ring? If so, I am wondering whether the "spreading" of
user-visible priorities by NR_SCHED_PRIORITIES creates non-preemptable
levels within every "bucket", or how does that work?

Regards,

Tvrtko

> + */
> +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio,
> + unsigned *ring_nr, enum drm_sched_priority *sched_prio)
> +{
> + unsigned rn, sp;
> +
> + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp);
> +
> + /* invert sched priority to map to higher-numeric-is-higher-
> + * priority convention
> + */
> + sp = NR_SCHED_PRIORITIES - sp - 1;
> +
> + if (rn >= gpu->nr_rings)
> + return -EINVAL;
> +
> + *ring_nr = rn;
> + *sched_prio = sp;
> +
> + return 0;
> +}
> +
> /**
> * A submitqueue is associated with a gl context or vk queue (or equiv)
> * in userspace.
> @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr {
> * @id: userspace id for the submitqueue, unique within the drm_file
> * @flags: userspace flags for the submitqueue, specified at creation
> * (currently unusued)
> - * @prio: the submitqueue priority
> + * @ring_nr: the ringbuffer used by this submitqueue, which is determined
> + * by the submitqueue's priority
> * @faults: the number of GPU hangs associated with this submitqueue
> * @ctx: the per-drm_file context associated with the submitqueue (ie.
> * which set of pgtables do submits jobs associated with the
> @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr {
> struct msm_gpu_submitqueue {
> int id;
> u32 flags;
> - u32 prio;
> + u32 ring_nr;
> int faults;
> struct msm_file_private *ctx;
> struct list_head node;
> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> index 682ba2a7c0ec..32a55d81b58b 100644
> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> struct msm_gpu_submitqueue *queue;
> struct msm_ringbuffer *ring;
> struct drm_gpu_scheduler *sched;
> + enum drm_sched_priority sched_prio;
> + unsigned ring_nr;
> int ret;
>
> if (!ctx)
> @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> if (!priv->gpu)
> return -ENODEV;
>
> - if (prio >= priv->gpu->nr_rings)
> - return -EINVAL;
> + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio);
> + if (ret)
> + return ret;
>
> queue = kzalloc(sizeof(*queue), GFP_KERNEL);
>
> @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>
> kref_init(&queue->ref);
> queue->flags = flags;
> - queue->prio = prio;
> + queue->ring_nr = ring_nr;
>
> - ring = priv->gpu->rb[prio];
> + ring = priv->gpu->rb[ring_nr];
> sched = &ring->sched;
>
> - /*
> - * TODO we can allow more priorities than we have ringbuffers by
> - * mapping:
> - *
> - * ring = prio / 3;
> - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3);
> - *
> - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is
> - * treated specially in places.
> - */
> ret = drm_sched_entity_init(&queue->entity,
> - DRM_SCHED_PRIORITY_NORMAL,
> - &sched, 1, NULL);
> + sched_prio, &sched, 1, NULL);
> if (ret) {
> kfree(queue);
> return ret;
> @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
> {
> struct msm_drm_private *priv = drm->dev_private;
> - int default_prio;
> + int default_prio, max_priority;
>
> if (!priv->gpu)
> return -ENODEV;
>
> + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1;
> +
> /*
> - * Select priority 2 as the "default priority" unless nr_rings is less
> - * than 2 and then pick the lowest priority
> + * Pick a medium priority level as default. Lower numeric value is
> + * higher priority, so round-up to pick a priority that is not higher
> + * than the middle priority level.
> */
> - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1);
> + default_prio = DIV_ROUND_UP(max_priority, 2);
>
> INIT_LIST_HEAD(&ctx->submitqueues);
>
> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> index f075851021c3..6b8fffc28a50 100644
> --- a/include/uapi/drm/msm_drm.h
> +++ b/include/uapi/drm/msm_drm.h
> @@ -73,11 +73,19 @@ struct drm_msm_timespec {
> #define MSM_PARAM_MAX_FREQ 0x04
> #define MSM_PARAM_TIMESTAMP 0x05
> #define MSM_PARAM_GMEM_BASE 0x06
> -#define MSM_PARAM_NR_RINGS 0x07
> +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */
> #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */
> #define MSM_PARAM_FAULTS 0x09
> #define MSM_PARAM_SUSPENDS 0x0a
>
> +/* For backwards compat. The original support for preemption was based on
> + * a single ring per priority level so # of priority levels equals the #
> + * of rings. With drm/scheduler providing additional levels of priority,
> + * the number of priorities is greater than the # of rings. The param is
> + * renamed to better reflect this.
> + */
> +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES
> +
> struct drm_msm_param {
> __u32 pipe; /* in, MSM_PIPE_x */
> __u32 param; /* in, MSM_PARAM_x */
> @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise {
>
> #define MSM_SUBMITQUEUE_FLAGS (0)
>
> +/*
> + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1,
> + * a lower numeric value is higher priority.
> + */
> struct drm_msm_submitqueue {
> __u32 flags; /* in, MSM_SUBMITQUEUE_x */
> __u32 prio; /* in, Priority level */
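
The default-priority choice in the msm_submitqueue_init() hunk quoted above can be checked by hand. The following is a minimal userspace sketch, not the driver code: it assumes NR_SCHED_PRIORITIES is 3 (as implied by the removed "ring = prio / 3" TODO), and default_prio() is a hypothetical name used only for illustration.

```c
/* Assumption: three scheduler levels per ring, as implied by the removed
 * "ring = prio / 3" TODO in the quoted patch. */
#define NR_SCHED_PRIORITIES 3

/* Same rounding as the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical userspace restatement of the default-priority choice. */
static unsigned default_prio(unsigned nr_rings)
{
	unsigned max_priority = (nr_rings * NR_SCHED_PRIORITIES) - 1;

	/* Lower value is higher priority, so rounding up never picks a
	 * level above the middle one. */
	return DIV_ROUND_UP(max_priority, 2);
}
```

Under these assumptions a single-ring GPU defaults to priority 1 (ring 0, middle scheduler level), and a four-ring GPU defaults to priority 6, which falls on ring 2.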

2022-05-23 23:52:20

by Rob Clark

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> Hi Rob,
>
> On 28/07/2021 02:06, Rob Clark wrote:
> > From: Rob Clark <[email protected]>
> >
> > The drm/scheduler provides additional prioritization on top of that
> > provided by however many ringbuffers (each with its own priority
> > level) are supported on a given generation. Expose the additional
> > levels of priority to userspace and map the userspace priority back
> > to ring (first level of priority) and scheduler priority (additional
> > priority levels within the ring).
> >
> > Signed-off-by: Rob Clark <[email protected]>
> > Acked-by: Christian König <[email protected]>
> > ---
> > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
> > drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
> > drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
> > drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
> > include/uapi/drm/msm_drm.h | 14 +++++-
> > 5 files changed, 88 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > index bad4809b68ef..748665232d29 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
> > return ret;
> > }
> > return -EINVAL;
> > - case MSM_PARAM_NR_RINGS:
> > - *value = gpu->nr_rings;
> > + case MSM_PARAM_PRIORITIES:
> > + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
> > return 0;
> > case MSM_PARAM_PP_PGTABLE:
> > *value = 0;
> > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > index 450efe59abb5..c2ecec5b11c4 100644
> > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> > submit->gpu = gpu;
> > submit->cmd = (void *)&submit->bos[nr_bos];
> > submit->queue = queue;
> > - submit->ring = gpu->rb[queue->prio];
> > + submit->ring = gpu->rb[queue->ring_nr];
> > submit->fault_dumped = false;
> >
> > INIT_LIST_HEAD(&submit->node);
> > @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > /* Get a unique identifier for the submission for logging purposes */
> > submitid = atomic_inc_return(&ident) - 1;
> >
> > - ring = gpu->rb[queue->prio];
> > + ring = gpu->rb[queue->ring_nr];
> > trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
> > args->nr_bos, args->nr_cmds);
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index b912cacaecc0..0e4b45bff2e6 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
> > const char *name;
> > };
> >
> > +/*
> > + * The number of priority levels provided by drm gpu scheduler. The
> > + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
> > + * cases, so we don't use it (no need for kernel generated jobs).
> > + */
> > +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
> > +
> > +/**
> > + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
> > + *
> > + * @gpu: the gpu instance
> > + * @prio: the userspace priority level
> > + * @ring_nr: [out] the ringbuffer the userspace priority maps to
> > + * @sched_prio: [out] the gpu scheduler priority level which the userspace
> > + * priority maps to
> > + *
> > + * With drm/scheduler providing its own level of prioritization, our total
> > + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
> > + * Each ring is associated with its own scheduler instance. However, our
> > + * UABI is that lower numerical values are higher priority. So mapping the
> > + * single userspace priority level into ring_nr and sched_prio takes some
> > + * care. The userspace provided priority (when a submitqueue is created)
> > + * is mapped to ring nr and scheduler priority as such:
> > + *
> > + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
> > + * sched_prio = NR_SCHED_PRIORITIES -
> > + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
> > + *
> > + * This allows generations without preemption (nr_rings==1) to have some
> > + * amount of prioritization, and provides more priority levels for gens
> > + * that do have preemption.
>
> I am exploring how different drivers handle priority levels and this
> caught my eye.
>
> Is the implication of the last paragraphs that on hw with nr_rings > 1,
> ring + 1 preempts ring?

Other way around, at least from the uabi standpoint. Ie. ring[0]
preempts ring[1]

> If so, I am wondering whether the "spreading" of
> user-visible priorities by NR_SCHED_PRIORITIES creates non-preemptable
> levels within every "bucket", or how does that work?

So, preemption is possible between any priority level before run_job()
gets called, which writes the job into the ringbuffer. After that
point, you only have "bucket" level preemption, because
NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
ringbuffer.
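
To make the "bucket" idea concrete, here is a minimal userspace sketch of the mapping described in the quoted kerneldoc. It is an illustration only: NR_SCHED_PRIORITIES is assumed to be 3 (MIN, NORMAL, HIGH), and struct prio_map and convert_priority() are hypothetical names, not the driver's API.

```c
#include <errno.h>

/* Assumption: three scheduler levels (MIN, NORMAL, HIGH) per ring. */
#define NR_SCHED_PRIORITIES 3

struct prio_map {
	unsigned ring_nr;	/* 0 is the highest-priority ring */
	unsigned sched_prio;	/* higher value is higher priority */
};

/* Hypothetical userspace restatement of msm_gpu_convert_priority(). */
static int convert_priority(unsigned nr_rings, unsigned prio,
			    struct prio_map *out)
{
	unsigned rn = prio / NR_SCHED_PRIORITIES;
	/* Invert the remainder: the UABI treats lower values as higher
	 * priority, the scheduler treats higher values as higher. */
	unsigned sp = NR_SCHED_PRIORITIES - (prio % NR_SCHED_PRIORITIES) - 1;

	if (rn >= nr_rings)
		return -EINVAL;

	out->ring_nr = rn;
	out->sched_prio = sp;
	return 0;
}
```

With nr_rings = 4, userspace priorities 3, 4 and 5 all land on ring 1 with scheduler priorities 2, 1 and 0. They are ordered only while still queued in the scheduler; once written into the ringbuffer they share one FIFO.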

-----

btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
trying to add an igt test to stress shrinker/eviction, similar to the
existing tests/i915/gem_shrink.c. But we hit an unfortunate
combination of circumstances:
1. Pinning memory happens in the synchronous part of the submit ioctl,
before enqueuing the job for the kthread to handle.
2. The first run_job() callback incurs a slight delay (~1.5ms) while
resuming the GPU
3. Because of that delay, userspace has a chance to queue up enough
more jobs to require locking/pinning more than the available system
RAM..

I'm not sure if we want a way to prevent userspace from getting *too*
far ahead of the kthread. Or maybe at some point the shrinker should
sleep on non-idle buffers?

BR,
-R

>
> Regards,
>
> Tvrtko
>
> > + */
> > +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio,
> > + unsigned *ring_nr, enum drm_sched_priority *sched_prio)
> > +{
> > + unsigned rn, sp;
> > +
> > + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp);
> > +
> > + /* invert sched priority to map to higher-numeric-is-higher-
> > + * priority convention
> > + */
> > + sp = NR_SCHED_PRIORITIES - sp - 1;
> > +
> > + if (rn >= gpu->nr_rings)
> > + return -EINVAL;
> > +
> > + *ring_nr = rn;
> > + *sched_prio = sp;
> > +
> > + return 0;
> > +}
> > +
> > /**
> > * A submitqueue is associated with a gl context or vk queue (or equiv)
> > * in userspace.
> > @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr {
> > * @id: userspace id for the submitqueue, unique within the drm_file
> > * @flags: userspace flags for the submitqueue, specified at creation
> > * (currently unusued)
> > - * @prio: the submitqueue priority
> > + * @ring_nr: the ringbuffer used by this submitqueue, which is determined
> > + * by the submitqueue's priority
> > * @faults: the number of GPU hangs associated with this submitqueue
> > * @ctx: the per-drm_file context associated with the submitqueue (ie.
> > * which set of pgtables do submits jobs associated with the
> > @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr {
> > struct msm_gpu_submitqueue {
> > int id;
> > u32 flags;
> > - u32 prio;
> > + u32 ring_nr;
> > int faults;
> > struct msm_file_private *ctx;
> > struct list_head node;
> > diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> > index 682ba2a7c0ec..32a55d81b58b 100644
> > --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> > +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> > @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> > struct msm_gpu_submitqueue *queue;
> > struct msm_ringbuffer *ring;
> > struct drm_gpu_scheduler *sched;
> > + enum drm_sched_priority sched_prio;
> > + unsigned ring_nr;
> > int ret;
> >
> > if (!ctx)
> > @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> > if (!priv->gpu)
> > return -ENODEV;
> >
> > - if (prio >= priv->gpu->nr_rings)
> > - return -EINVAL;
> > + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio);
> > + if (ret)
> > + return ret;
> >
> > queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> >
> > @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> >
> > kref_init(&queue->ref);
> > queue->flags = flags;
> > - queue->prio = prio;
> > + queue->ring_nr = ring_nr;
> >
> > - ring = priv->gpu->rb[prio];
> > + ring = priv->gpu->rb[ring_nr];
> > sched = &ring->sched;
> >
> > - /*
> > - * TODO we can allow more priorities than we have ringbuffers by
> > - * mapping:
> > - *
> > - * ring = prio / 3;
> > - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3);
> > - *
> > - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is
> > - * treated specially in places.
> > - */
> > ret = drm_sched_entity_init(&queue->entity,
> > - DRM_SCHED_PRIORITY_NORMAL,
> > - &sched, 1, NULL);
> > + sched_prio, &sched, 1, NULL);
> > if (ret) {
> > kfree(queue);
> > return ret;
> > @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> > int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
> > {
> > struct msm_drm_private *priv = drm->dev_private;
> > - int default_prio;
> > + int default_prio, max_priority;
> >
> > if (!priv->gpu)
> > return -ENODEV;
> >
> > + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1;
> > +
> > /*
> > - * Select priority 2 as the "default priority" unless nr_rings is less
> > - * than 2 and then pick the lowest priority
> > + * Pick a medium priority level as default. Lower numeric value is
> > + * higher priority, so round-up to pick a priority that is not higher
> > + * than the middle priority level.
> > */
> > - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1);
> > + default_prio = DIV_ROUND_UP(max_priority, 2);
> >
> > INIT_LIST_HEAD(&ctx->submitqueues);
> >
> > diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> > index f075851021c3..6b8fffc28a50 100644
> > --- a/include/uapi/drm/msm_drm.h
> > +++ b/include/uapi/drm/msm_drm.h
> > @@ -73,11 +73,19 @@ struct drm_msm_timespec {
> > #define MSM_PARAM_MAX_FREQ 0x04
> > #define MSM_PARAM_TIMESTAMP 0x05
> > #define MSM_PARAM_GMEM_BASE 0x06
> > -#define MSM_PARAM_NR_RINGS 0x07
> > +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */
> > #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */
> > #define MSM_PARAM_FAULTS 0x09
> > #define MSM_PARAM_SUSPENDS 0x0a
> >
> > +/* For backwards compat. The original support for preemption was based on
> > + * a single ring per priority level so # of priority levels equals the #
> > + * of rings. With drm/scheduler providing additional levels of priority,
> > + * the number of priorities is greater than the # of rings. The param is
> > + * renamed to better reflect this.
> > + */
> > +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES
> > +
> > struct drm_msm_param {
> > __u32 pipe; /* in, MSM_PIPE_x */
> > __u32 param; /* in, MSM_PARAM_x */
> > @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise {
> >
> > #define MSM_SUBMITQUEUE_FLAGS (0)
> >
> > +/*
> > + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1,
> > + * a lower numeric value is higher priority.
> > + */
> > struct drm_msm_submitqueue {
> > __u32 flags; /* in, MSM_SUBMITQUEUE_x */
> > __u32 prio; /* in, Priority level */

2022-05-24 16:57:36

by Tvrtko Ursulin

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities


On 23/05/2022 23:53, Rob Clark wrote:
> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> Hi Rob,
>>
>> On 28/07/2021 02:06, Rob Clark wrote:
>>> From: Rob Clark <[email protected]>
>>>
>>> The drm/scheduler provides additional prioritization on top of that
>>> provided by however many ringbuffers (each with its own priority
>>> level) are supported on a given generation. Expose the additional
>>> levels of priority to userspace and map the userspace priority back
>>> to ring (first level of priority) and scheduler priority (additional
>>> priority levels within the ring).
>>>
>>> Signed-off-by: Rob Clark <[email protected]>
>>> Acked-by: Christian König <[email protected]>
>>> ---
>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
>>> include/uapi/drm/msm_drm.h | 14 +++++-
>>> 5 files changed, 88 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>> index bad4809b68ef..748665232d29 100644
>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
>>> return ret;
>>> }
>>> return -EINVAL;
>>> - case MSM_PARAM_NR_RINGS:
>>> - *value = gpu->nr_rings;
>>> + case MSM_PARAM_PRIORITIES:
>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
>>> return 0;
>>> case MSM_PARAM_PP_PGTABLE:
>>> *value = 0;
>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> index 450efe59abb5..c2ecec5b11c4 100644
>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>> submit->gpu = gpu;
>>> submit->cmd = (void *)&submit->bos[nr_bos];
>>> submit->queue = queue;
>>> - submit->ring = gpu->rb[queue->prio];
>>> + submit->ring = gpu->rb[queue->ring_nr];
>>> submit->fault_dumped = false;
>>>
>>> INIT_LIST_HEAD(&submit->node);
>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>>> /* Get a unique identifier for the submission for logging purposes */
>>> submitid = atomic_inc_return(&ident) - 1;
>>>
>>> - ring = gpu->rb[queue->prio];
>>> + ring = gpu->rb[queue->ring_nr];
>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
>>> args->nr_bos, args->nr_cmds);
>>>
>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>> index b912cacaecc0..0e4b45bff2e6 100644
>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
>>> const char *name;
>>> };
>>>
>>> +/*
>>> + * The number of priority levels provided by drm gpu scheduler. The
>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
>>> + * cases, so we don't use it (no need for kernel generated jobs).
>>> + */
>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
>>> +
>>> +/**
>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
>>> + *
>>> + * @gpu: the gpu instance
>>> + * @prio: the userspace priority level
>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
>>> + * priority maps to
>>> + *
>>> + * With drm/scheduler providing its own level of prioritization, our total
>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
>>> + * Each ring is associated with its own scheduler instance. However, our
>>> + * UABI is that lower numerical values are higher priority. So mapping the
>>> + * single userspace priority level into ring_nr and sched_prio takes some
>>> + * care. The userspace provided priority (when a submitqueue is created)
>>> + * is mapped to ring nr and scheduler priority as such:
>>> + *
>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
>>> + * sched_prio = NR_SCHED_PRIORITIES -
>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
>>> + *
>>> + * This allows generations without preemption (nr_rings==1) to have some
>>> + * amount of prioritization, and provides more priority levels for gens
>>> + * that do have preemption.
>>
>> I am exploring how different drivers handle priority levels and this
>> caught my eye.
>>
>> Is the implication of the last paragraphs that on hw with nr_rings > 1,
>> ring + 1 preempts ring?
>
> Other way around, at least from the uabi standpoint. Ie. ring[0]
> preempts ring[1]

Ah yes, I figured it out from the comments but then confused myself when
writing the email.

>> If so, I am wondering whether the "spreading" of
>> user-visible priorities by NR_SCHED_PRIORITIES creates non-preemptable
>> levels within every "bucket", or how does that work?
>
> So, preemption is possible between any priority level before run_job()
> gets called, which writes the job into the ringbuffer. After that

Hmm how? Before run_job() the jobs are not runnable, sitting in the
scheduler queues, right?

> point, you only have "bucket" level preemption, because
> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
> ringbuffer.

Right, and you have one GPU with four rings, which means you expose 12
priority levels to userspace, did I get that right?

If so, how do you convey in the ABI that not all these priority levels
are equal? For example, userspace can submit at prio 4 and expect prio 3
to preempt it, just as prio 2 would preempt prio 3. The actual behaviour
will not match: 3 will not preempt 4.
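
The mismatch can be spelled out with a small userspace sketch. It assumes NR_SCHED_PRIORITIES is 3 (so four rings give 12 levels) and that a lower-numbered ring preempts a higher-numbered one, as stated earlier in the thread; prio_to_ring() and hw_can_preempt() are hypothetical helpers for illustration, not driver functions.

```c
#include <stdbool.h>

/* Assumption: three scheduler levels per ring (four rings -> 12 levels). */
#define NR_SCHED_PRIORITIES 3

/* Hypothetical helper: ring index a userspace priority maps to. */
static unsigned prio_to_ring(unsigned prio)
{
	return prio / NR_SCHED_PRIORITIES;
}

/*
 * Hypothetical helper: true if a job at userspace priority 'hi' could
 * preempt a job at priority 'lo' that is already in a ringbuffer, i.e.
 * the two map to different rings and 'hi' is on the lower-numbered one.
 */
static bool hw_can_preempt(unsigned hi, unsigned lo)
{
	return prio_to_ring(hi) < prio_to_ring(lo);
}
```

Here hw_can_preempt(2, 3) is true because prio 2 is on ring 0 and prio 3 on ring 1, while hw_can_preempt(3, 4) is false because both map to ring 1.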

Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
quick peek in Mesa but did not spot it - although I am not really at
home there yet so maybe I missed it.

> -----
>
> btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
> trying to add an igt test to stress shrinker/eviction, similar to the
> existing tests/i915/gem_shrink.c. But we hit an unfortunate
> combination of circumstances:
> 1. Pinning memory happens in the synchronous part of the submit ioctl,
> before enqueuing the job for the kthread to handle.
> 2. The first run_job() callback incurs a slight delay (~1.5ms) while
> resuming the GPU
> 3. Because of that delay, userspace has a chance to queue up enough
> more jobs to require locking/pinning more than the available system
> RAM..

Is that one or multiple threads submitting jobs?

> I'm not sure if we want a way to prevent userspace from getting *too*
> far ahead of the kthread. Or maybe at some point the shrinker should
> sleep on non-idle buffers?

On the direct reclaim path when invoked from the submit ioctl? In i915
we only shrink idle objects on direct reclaim and leave active ones for
the swapper. Whether you could do them depends on what your locking looks
like, and on whether there would be coupling of locks and fs-reclaim context.

Regards,

Tvrtko

> BR,
> -R
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> + */
>>> +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio,
>>> + unsigned *ring_nr, enum drm_sched_priority *sched_prio)
>>> +{
>>> + unsigned rn, sp;
>>> +
>>> + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp);
>>> +
>>> + /* invert sched priority to map to higher-numeric-is-higher-
>>> + * priority convention
>>> + */
>>> + sp = NR_SCHED_PRIORITIES - sp - 1;
>>> +
>>> + if (rn >= gpu->nr_rings)
>>> + return -EINVAL;
>>> +
>>> + *ring_nr = rn;
>>> + *sched_prio = sp;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> /**
>>> * A submitqueue is associated with a gl context or vk queue (or equiv)
>>> * in userspace.
>>> @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr {
>>> * @id: userspace id for the submitqueue, unique within the drm_file
>>> * @flags: userspace flags for the submitqueue, specified at creation
>>> * (currently unusued)
>>> - * @prio: the submitqueue priority
>>> + * @ring_nr: the ringbuffer used by this submitqueue, which is determined
>>> + * by the submitqueue's priority
>>> * @faults: the number of GPU hangs associated with this submitqueue
>>> * @ctx: the per-drm_file context associated with the submitqueue (ie.
>>> * which set of pgtables do submits jobs associated with the
>>> @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr {
>>> struct msm_gpu_submitqueue {
>>> int id;
>>> u32 flags;
>>> - u32 prio;
>>> + u32 ring_nr;
>>> int faults;
>>> struct msm_file_private *ctx;
>>> struct list_head node;
>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
>>> index 682ba2a7c0ec..32a55d81b58b 100644
>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>> @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>> struct msm_gpu_submitqueue *queue;
>>> struct msm_ringbuffer *ring;
>>> struct drm_gpu_scheduler *sched;
>>> + enum drm_sched_priority sched_prio;
>>> + unsigned ring_nr;
>>> int ret;
>>>
>>> if (!ctx)
>>> @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>> if (!priv->gpu)
>>> return -ENODEV;
>>>
>>> - if (prio >= priv->gpu->nr_rings)
>>> - return -EINVAL;
>>> + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio);
>>> + if (ret)
>>> + return ret;
>>>
>>> queue = kzalloc(sizeof(*queue), GFP_KERNEL);
>>>
>>> @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>>
>>> kref_init(&queue->ref);
>>> queue->flags = flags;
>>> - queue->prio = prio;
>>> + queue->ring_nr = ring_nr;
>>>
>>> - ring = priv->gpu->rb[prio];
>>> + ring = priv->gpu->rb[ring_nr];
>>> sched = &ring->sched;
>>>
>>> - /*
>>> - * TODO we can allow more priorities than we have ringbuffers by
>>> - * mapping:
>>> - *
>>> - * ring = prio / 3;
>>> - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3);
>>> - *
>>> - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is
>>> - * treated specially in places.
>>> - */
>>> ret = drm_sched_entity_init(&queue->entity,
>>> - DRM_SCHED_PRIORITY_NORMAL,
>>> - &sched, 1, NULL);
>>> + sched_prio, &sched, 1, NULL);
>>> if (ret) {
>>> kfree(queue);
>>> return ret;
>>> @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>> int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
>>> {
>>> struct msm_drm_private *priv = drm->dev_private;
>>> - int default_prio;
>>> + int default_prio, max_priority;
>>>
>>> if (!priv->gpu)
>>> return -ENODEV;
>>>
>>> + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1;
>>> +
>>> /*
>>> - * Select priority 2 as the "default priority" unless nr_rings is less
>>> - * than 2 and then pick the lowest priority
>>> + * Pick a medium priority level as default. Lower numeric value is
>>> + * higher priority, so round-up to pick a priority that is not higher
>>> + * than the middle priority level.
>>> */
>>> - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1);
>>> + default_prio = DIV_ROUND_UP(max_priority, 2);
>>>
>>> INIT_LIST_HEAD(&ctx->submitqueues);
>>>
>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>>> index f075851021c3..6b8fffc28a50 100644
>>> --- a/include/uapi/drm/msm_drm.h
>>> +++ b/include/uapi/drm/msm_drm.h
>>> @@ -73,11 +73,19 @@ struct drm_msm_timespec {
>>> #define MSM_PARAM_MAX_FREQ 0x04
>>> #define MSM_PARAM_TIMESTAMP 0x05
>>> #define MSM_PARAM_GMEM_BASE 0x06
>>> -#define MSM_PARAM_NR_RINGS 0x07
>>> +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */
>>> #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */
>>> #define MSM_PARAM_FAULTS 0x09
>>> #define MSM_PARAM_SUSPENDS 0x0a
>>>
>>> +/* For backwards compat. The original support for preemption was based on
>>> + * a single ring per priority level so # of priority levels equals the #
>>> + * of rings. With drm/scheduler providing additional levels of priority,
>>> + * the number of priorities is greater than the # of rings. The param is
>>> + * renamed to better reflect this.
>>> + */
>>> +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES
>>> +
>>> struct drm_msm_param {
>>> __u32 pipe; /* in, MSM_PIPE_x */
>>> __u32 param; /* in, MSM_PARAM_x */
>>> @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise {
>>>
>>> #define MSM_SUBMITQUEUE_FLAGS (0)
>>>
>>> +/*
>>> + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1,
>>> + * a lower numeric value is higher priority.
>>> + */
>>> struct drm_msm_submitqueue {
>>> __u32 flags; /* in, MSM_SUBMITQUEUE_x */
>>> __u32 prio; /* in, Priority level */
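
The default-priority change in msm_submitqueue_init() quoted above can be checked with a little arithmetic. The following is a minimal userspace sketch, not the driver code: it assumes NR_SCHED_PRIORITIES is 3 (the MIN, NORMAL and HIGH scheduler levels), re-defines DIV_ROUND_UP locally, and default_priority() is a hypothetical helper name.

```c
#include <assert.h>

/* Assumption: three usable drm/scheduler levels, as in the quoted patch. */
#define NR_SCHED_PRIORITIES 3

/* Local stand-in for the kernel macro of the same name. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper mirroring the arithmetic in msm_submitqueue_init(). */
static int default_priority(int nr_rings)
{
	int max_priority;

	assert(nr_rings > 0);

	/* Highest valid userspace value, which is the lowest priority. */
	max_priority = (nr_rings * NR_SCHED_PRIORITIES) - 1;

	/* Round up so the default is never above the middle level. */
	return DIV_ROUND_UP(max_priority, 2);
}
```

Under these assumptions a single-ring GPU (levels 0..2) defaults to 1, and a four-ring GPU (levels 0..11) defaults to 6.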

2022-05-25 00:34:12

by Rob Clark

[permalink] [raw]
Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
<[email protected]> wrote:
>
> On 23/05/2022 23:53, Rob Clark wrote:
> >
> > btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
> > trying to add an igt test to stress shrinker/eviction, similar to the
> > existing tests/i915/gem_shrink.c. But we hit an unfortunate
> > combination of circumstances:
> > 1. Pinning memory happens in the synchronous part of the submit ioctl,
> > before enqueuing the job for the kthread to handle.
> > 2. The first run_job() callback incurs a slight delay (~1.5ms) while
> > resuming the GPU
> > 3. Because of that delay, userspace has a chance to queue up enough
> > more jobs to require locking/pinning more than the available system
> > RAM..
>
> Is that one or multiple threads submitting jobs?

In this case multiple.. but I think it could also happen with a single
thread (provided it didn't stall on a fence, directly or indirectly,
from an earlier submit), because of how resume and actual job
submission happen from the scheduler kthread.

> > I'm not sure if we want a way to prevent userspace from getting *too*
> > far ahead of the kthread. Or maybe at some point the shrinker should
> > sleep on non-idle buffers?
>
> On the direct reclaim path when invoked from the submit ioctl? In i915
> we only shrink idle objects on direct reclaim and leave active ones for
> the swapper. It depends on how your locking looks like whether you could
> do them, whether there would be coupling of locks and fs-reclaim context.

I think the locking is more or less ok, although lockdep is unhappy
about one thing[1] which is I think a false warning (ie. not
recognizing that we'd already successfully acquired the obj lock via
trylock). We can already reclaim idle bo's in this path. But the
problem with a bunch of submits queued up in the scheduler, is that
they are already considered pinned and active. So at some point we
need to sleep (hopefully interruptibly) until they are no longer
active, ie. to throttle userspace trying to shove in more submits
until some of the enqueued ones have a chance to run and complete.

BR,
-R

[1] https://gitlab.freedesktop.org/drm/msm/-/issues/14


2022-05-25 12:41:12

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities


On 24/05/2022 15:50, Rob Clark wrote:
> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> On 23/05/2022 23:53, Rob Clark wrote:
>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> Hi Rob,
>>>>
>>>> On 28/07/2021 02:06, Rob Clark wrote:
>>>>> From: Rob Clark <[email protected]>
>>>>>
>>>>> The drm/scheduler provides additional prioritization on top of that
>>>>> provided by however many number of ringbuffers (each with their own
>>>>> priority level) is supported on a given generation. Expose the
>>>>> additional levels of priority to userspace and map the userspace
>>>>> priority back to ring (first level of priority) and schedular priority
>>>>> (additional priority levels within the ring).
>>>>>
>>>>> Signed-off-by: Rob Clark <[email protected]>
>>>>> Acked-by: Christian König <[email protected]>
>>>>> ---
>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
>>>>> include/uapi/drm/msm_drm.h | 14 +++++-
>>>>> 5 files changed, 88 insertions(+), 27 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>> index bad4809b68ef..748665232d29 100644
>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
>>>>> return ret;
>>>>> }
>>>>> return -EINVAL;
>>>>> - case MSM_PARAM_NR_RINGS:
>>>>> - *value = gpu->nr_rings;
>>>>> + case MSM_PARAM_PRIORITIES:
>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
>>>>> return 0;
>>>>> case MSM_PARAM_PP_PGTABLE:
>>>>> *value = 0;
>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>> index 450efe59abb5..c2ecec5b11c4 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>>>> submit->gpu = gpu;
>>>>> submit->cmd = (void *)&submit->bos[nr_bos];
>>>>> submit->queue = queue;
>>>>> - submit->ring = gpu->rb[queue->prio];
>>>>> + submit->ring = gpu->rb[queue->ring_nr];
>>>>> submit->fault_dumped = false;
>>>>>
>>>>> INIT_LIST_HEAD(&submit->node);
>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>>>>> /* Get a unique identifier for the submission for logging purposes */
>>>>> submitid = atomic_inc_return(&ident) - 1;
>>>>>
>>>>> - ring = gpu->rb[queue->prio];
>>>>> + ring = gpu->rb[queue->ring_nr];
>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
>>>>> args->nr_bos, args->nr_cmds);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>>>> index b912cacaecc0..0e4b45bff2e6 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
>>>>> const char *name;
>>>>> };
>>>>>
>>>>> +/*
>>>>> + * The number of priority levels provided by drm gpu scheduler. The
>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
>>>>> + * cases, so we don't use it (no need for kernel generated jobs).
>>>>> + */
>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
>>>>> +
>>>>> +/**
>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
>>>>> + *
>>>>> + * @gpu: the gpu instance
>>>>> + * @prio: the userspace priority level
>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
>>>>> + * priority maps to
>>>>> + *
>>>>> + * With drm/scheduler providing it's own level of prioritization, our total
>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
>>>>> + * Each ring is associated with it's own scheduler instance. However, our
>>>>> + * UABI is that lower numerical values are higher priority. So mapping the
>>>>> + * single userspace priority level into ring_nr and sched_prio takes some
>>>>> + * care. The userspace provided priority (when a submitqueue is created)
>>>>> + * is mapped to ring nr and scheduler priority as such:
>>>>> + *
>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
>>>>> + * sched_prio = NR_SCHED_PRIORITIES -
>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
>>>>> + *
>>>>> + * This allows generations without preemption (nr_rings==1) to have some
>>>>> + * amount of prioritization, and provides more priority levels for gens
>>>>> + * that do have preemption.
>>>>
>>>> I am exploring how different drivers handle priority levels and this
>>>> caught my eye.
>>>>
>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1,
>>>> ring + 1 preempts ring?
>>>
>>> Other way around, at least from the uabi standpoint. Ie. ring[0]
>>> preempts ring[1]
>>
>> Ah yes, I figure it out from the comments but then confused myself when
>> writing the email.
>>
>>>> If so I am wondering does the "spreading" of
>>>> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable
>>>> levels within every "bucket" or how does that work?
>>>
>>> So, preemption is possible between any priority level before run_job()
>>> gets called, which writes the job into the ringbuffer. After that
>>
>> Hmm how? Before run_job() the jobs are not runnable, sitting in the
>> scheduler queues, right?
>
> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on
> prio[1] could be executed after submit B on prio[2] provided that
> run_job(submitA) hasn't happened yet. So I guess it isn't "really"
> preemption because the submit hasn't started running on the GPU yet.
> But rather just scheduling according to priority.
>
>>> point, you only have "bucket" level preemption, because
>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
>>> ringbuffer.
>>
>> Right, and you have one GPU with four rings, which means you expose 12
>> priority levels to userspace, did I get that right?
>
> Correct
>
>> If so how do you convey in the ABI that not all there priority levels
>> are equal? Like userspace can submit at prio 4 and expect prio 3 to
>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not
>> match - 3 will not preempt 4.
>
> It isn't really exposed to userspace, but perhaps it should be..
> Userspace just knows that, to the extent possible, the kernel will try
> to execute prio 3 before prio 4.
>
>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
>> quick peek in Mesa but did not spot it - although I am not really at
>> home there yet so maybe I missed it.
>
> Yes, there is an EGL extension:
>
> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
>
> It is pretty limited, it only exposes three priority levels.

Right, is that wired up on msm? And if it is, or could be, how do/would
you map the three priority levels for GPUs which expose 3 priority
levels versus the one which exposes 12?

Is it doable properly without leaking the drm/sched internal
implementation detail of three priority levels? Or if you went the other
way and only exposed up to max 3 levels, then you lose one priority
level your hardware supports, which is also not good.
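
One possible shape for such a mapping, as a minimal sketch only and not a description of what Mesa or any other userspace actually does: pin the high level to 0, the low level to the last value, and the medium level to the midpoint. The enum ctx_priority and the helper ctx_priority_to_kernel() are hypothetical names; nr_priorities stands for the value reported by MSM_PARAM_PRIORITIES.

```c
#include <assert.h>

/* Hypothetical three-level context priority, as an EGL-style API offers. */
enum ctx_priority {
	CTX_PRIORITY_HIGH,
	CTX_PRIORITY_MEDIUM,
	CTX_PRIORITY_LOW,
};

/*
 * Map a three-level priority onto the range 0..nr_priorities-1, where a
 * lower numeric value is higher priority (the msm UAPI convention).
 */
static unsigned ctx_priority_to_kernel(enum ctx_priority p,
				       unsigned nr_priorities)
{
	assert(nr_priorities > 0);

	switch (p) {
	case CTX_PRIORITY_HIGH:
		return 0;
	case CTX_PRIORITY_LOW:
		return nr_priorities - 1;
	case CTX_PRIORITY_MEDIUM:
	default:
		return nr_priorities / 2;
	}
}
```

With 3 levels this yields 0, 1 and 2, all on one ring, so the levels differ only in scheduler ordering. With 12 levels it yields 0, 6 and 11, which land on rings 0, 2 and 3 when each ring covers three consecutive values. Userspace written this way never needs to know how many scheduler levels sit behind each ring.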

It is all quite interesting because your hardware is completely
different from ours in this respect. In our case i915 decides when to
preempt, hardware has no concept of priority (*).

Regards,

Tvrtko

(*) Almost no concept of priority in hardware - we do have it on new
GPUs and only on a subset of engine classes where render and compute
share the EUs. But I think it's way different from Adrenos.

2022-05-25 14:01:21

by Rob Clark

[permalink] [raw]
Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Tue, May 24, 2022 at 7:57 AM Rob Clark <[email protected]> wrote:
>
> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
> <[email protected]> wrote:
> >
> > On 23/05/2022 23:53, Rob Clark wrote:
> > >
> > > btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
> > > trying to add an igt test to stress shrinker/eviction, similar to the
> > > existing tests/i915/gem_shrink.c. But we hit an unfortunate
> > > combination of circumstances:
> > > 1. Pinning memory happens in the synchronous part of the submit ioctl,
> > > before enqueuing the job for the kthread to handle.
> > > 2. The first run_job() callback incurs a slight delay (~1.5ms) while
> > > resuming the GPU
> > > 3. Because of that delay, userspace has a chance to queue up enough
> > > more jobs to require locking/pinning more than the available system
> > > RAM..
> >
> > Is that one or multiple threads submitting jobs?
>
> In this case multiple.. but I think it could also happen with a single
> thread (provided it didn't stall on a fence, directly or indirectly,
> from an earlier submit), because of how resume and actual job
> submission happens from scheduler kthread.
>
> > > I'm not sure if we want a way to prevent userspace from getting *too*
> > > far ahead of the kthread. Or maybe at some point the shrinker should
> > > sleep on non-idle buffers?
> >
> > On the direct reclaim path when invoked from the submit ioctl? In i915
> > we only shrink idle objects on direct reclaim and leave active ones for
> > the swapper. It depends on how your locking looks like whether you could
> > do them, whether there would be coupling of locks and fs-reclaim context.
>
> I think the locking is more or less ok, although lockdep is unhappy
> about one thing[1] which is I think a false warning (ie. not
> recognizing that we'd already successfully acquired the obj lock via
> trylock). We can already reclaim idle bo's in this path. But the
> problem with a bunch of submits queued up in the scheduler, is that
> they are already considered pinned and active. So at some point we
> need to sleep (hopefully interruptibly) until they are no longer
> active, ie. to throttle userspace trying to shove in more submits
> until some of the enqueued ones have a chance to run and complete.
>
> BR,
> -R
>
> [1] https://gitlab.freedesktop.org/drm/msm/-/issues/14
>

btw, one thing I'm thinking about is __GFP_RETRY_MAYFAIL for gem
bo's.. I'd need to think about the various code paths that could
trigger us to need to allocate pages, but short-circuiting the
out_of_memory() path deep in drm_gem_get_pages() ->
shmem_read_mapping_page() -> ... -> __alloc_pages_may_oom() and
letting the driver decide itself if there is queued work worth waiting
on (and if not, calling out_of_memory() directly itself) seems like a
possible solution.. that also untangles the interrupted-syscall case
so we don't end up having to block in a non-interruptible way. Seems
like it might work?

BR,
-R

2022-05-25 18:17:09

by Rob Clark

[permalink] [raw]
Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 23/05/2022 23:53, Rob Clark wrote:
> > On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >>
> >> Hi Rob,
> >>
> >> On 28/07/2021 02:06, Rob Clark wrote:
> >>> From: Rob Clark <[email protected]>
> >>>
> >>> The drm/scheduler provides additional prioritization on top of that
> >>> provided by however many number of ringbuffers (each with their own
> >>> priority level) is supported on a given generation. Expose the
> >>> additional levels of priority to userspace and map the userspace
> >>> priority back to ring (first level of priority) and schedular priority
> >>> (additional priority levels within the ring).
> >>>
> >>> Signed-off-by: Rob Clark <[email protected]>
> >>> Acked-by: Christian König <[email protected]>
> >>> ---
> >>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
> >>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
> >>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
> >>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
> >>> include/uapi/drm/msm_drm.h | 14 +++++-
> >>> 5 files changed, 88 insertions(+), 27 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>> index bad4809b68ef..748665232d29 100644
> >>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
> >>> return ret;
> >>> }
> >>> return -EINVAL;
> >>> - case MSM_PARAM_NR_RINGS:
> >>> - *value = gpu->nr_rings;
> >>> + case MSM_PARAM_PRIORITIES:
> >>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
> >>> return 0;
> >>> case MSM_PARAM_PP_PGTABLE:
> >>> *value = 0;
> >>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>> index 450efe59abb5..c2ecec5b11c4 100644
> >>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> >>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> >>> submit->gpu = gpu;
> >>> submit->cmd = (void *)&submit->bos[nr_bos];
> >>> submit->queue = queue;
> >>> - submit->ring = gpu->rb[queue->prio];
> >>> + submit->ring = gpu->rb[queue->ring_nr];
> >>> submit->fault_dumped = false;
> >>>
> >>> INIT_LIST_HEAD(&submit->node);
> >>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >>> /* Get a unique identifier for the submission for logging purposes */
> >>> submitid = atomic_inc_return(&ident) - 1;
> >>>
> >>> - ring = gpu->rb[queue->prio];
> >>> + ring = gpu->rb[queue->ring_nr];
> >>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
> >>> args->nr_bos, args->nr_cmds);
> >>>
> >>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> >>> index b912cacaecc0..0e4b45bff2e6 100644
> >>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
> >>> const char *name;
> >>> };
> >>>
> >>> +/*
> >>> + * The number of priority levels provided by drm gpu scheduler. The
> >>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
> >>> + * cases, so we don't use it (no need for kernel generated jobs).
> >>> + */
> >>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
> >>> +
> >>> +/**
> >>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
> >>> + *
> >>> + * @gpu: the gpu instance
> >>> + * @prio: the userspace priority level
> >>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
> >>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
> >>> + * priority maps to
> >>> + *
> >>> + * With drm/scheduler providing it's own level of prioritization, our total
> >>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
> >>> + * Each ring is associated with it's own scheduler instance. However, our
> >>> + * UABI is that lower numerical values are higher priority. So mapping the
> >>> + * single userspace priority level into ring_nr and sched_prio takes some
> >>> + * care. The userspace provided priority (when a submitqueue is created)
> >>> + * is mapped to ring nr and scheduler priority as such:
> >>> + *
> >>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
> >>> + * sched_prio = NR_SCHED_PRIORITIES -
> >>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
> >>> + *
> >>> + * This allows generations without preemption (nr_rings==1) to have some
> >>> + * amount of prioritization, and provides more priority levels for gens
> >>> + * that do have preemption.
> >>
> >> I am exploring how different drivers handle priority levels and this
> >> caught my eye.
> >>
> >> Is the implication of the last paragraphs that on hw with nr_rings > 1,
> >> ring + 1 preempts ring?
> >
> > Other way around, at least from the uabi standpoint. Ie. ring[0]
> > preempts ring[1]
>
> Ah yes, I figure it out from the comments but then confused myself when
> writing the email.
>
> >> If so I am wondering does the "spreading" of
> >> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable
> >> levels within every "bucket" or how does that work?
> >
> > So, preemption is possible between any priority level before run_job()
> > gets called, which writes the job into the ringbuffer. After that
>
> Hmm how? Before run_job() the jobs are not runnable, sitting in the
> scheduler queues, right?

I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on
prio[1] could be executed after submit B on prio[2] provided that
run_job(submitA) hasn't happened yet. So I guess it isn't "really"
preemption because the submit hasn't started running on the GPU yet.
But rather just scheduling according to priority.
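
The mapping being discussed can be restated in userspace to see which priorities end up on the same ring. This is a sketch under assumptions, not the kernel function: NR_SCHED_PRIORITIES is taken to be 3, convert_priority() is a hypothetical name, and sched_prio is returned as an offset from the lowest scheduler level rather than as enum drm_sched_priority.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: three usable drm/scheduler levels (MIN, NORMAL, HIGH). */
#define NR_SCHED_PRIORITIES 3

/*
 * Userspace re-statement of the arithmetic in msm_gpu_convert_priority().
 * On success, *sched_prio is 0 for the lowest scheduler level and
 * NR_SCHED_PRIORITIES - 1 for the highest.
 */
static int convert_priority(unsigned nr_rings, unsigned prio,
			    unsigned *ring_nr, unsigned *sched_prio)
{
	unsigned rn = prio / NR_SCHED_PRIORITIES;
	unsigned sp = prio % NR_SCHED_PRIORITIES;

	assert(nr_rings > 0);

	if (rn >= nr_rings)
		return -EINVAL;

	*ring_nr = rn;
	/* UAPI: lower number is higher priority; the scheduler is inverted. */
	*sched_prio = NR_SCHED_PRIORITIES - sp - 1;

	return 0;
}
```

Userspace priorities 0, 1 and 2 all come back with ring_nr 0 and differ only in sched_prio, which is the single-ring case described above.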

> > point, you only have "bucket" level preemption, because
> > NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
> > ringbuffer.
>
> Right, and you have one GPU with four rings, which means you expose 12
> priority levels to userspace, did I get that right?

Correct

> If so how do you convey in the ABI that not all there priority levels
> are equal? Like userspace can submit at prio 4 and expect prio 3 to
> preempt, as would prio 2 preempt prio 3. While actual behaviour will not
> match - 3 will not preempt 4.

It isn't really exposed to userspace, but perhaps it should be..
Userspace just knows that, to the extent possible, the kernel will try
to execute prio 3 before prio 4.
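
To make the prio 2/3/4 example concrete, here is a small sketch that separates the two cases: priorities on different rings, where real preemption is possible, and priorities on the same ring, where the scheduler can only reorder jobs that have not yet reached run_job(). It assumes NR_SCHED_PRIORITIES is 3; ring_of(), can_preempt() and only_reordered() are hypothetical helpers, not part of the driver or the UAPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: three usable drm/scheduler levels per ring. */
#define NR_SCHED_PRIORITIES 3

static_assert(NR_SCHED_PRIORITIES > 0, "need at least one scheduler level");

/* Ring that a userspace priority maps to. */
static unsigned ring_of(unsigned prio)
{
	return prio / NR_SCHED_PRIORITIES;
}

/* True if a job at priority a can preempt one at b that is already in
 * its ringbuffer: a must be on a lower-numbered ring. */
static bool can_preempt(unsigned a, unsigned b)
{
	return ring_of(a) < ring_of(b);
}

/* True if a is merely scheduled ahead of b while both are still queued:
 * higher priority, but sharing one FIFO ringbuffer. */
static bool only_reordered(unsigned a, unsigned b)
{
	return a < b && ring_of(a) == ring_of(b);
}
```

Under these assumptions can_preempt(2, 3) is true because prio 2 is on ring 0 and prio 3 is on ring 1, while can_preempt(3, 4) is false and only_reordered(3, 4) is true because both sit on ring 1.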

> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
> quick peek in Mesa but did not spot it - although I am not really at
> home there yet so maybe I missed it.

Yes, there is an EGL extension:

https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt

It is pretty limited, it only exposes three priority levels.

BR,
-R

> > -----
> >
> > btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
> > trying to add an igt test to stress shrinker/eviction, similar to the
> > existing tests/i915/gem_shrink.c. But we hit an unfortunate
> > combination of circumstances:
> > 1. Pinning memory happens in the synchronous part of the submit ioctl,
> > before enqueuing the job for the kthread to handle.
> > 2. The first run_job() callback incurs a slight delay (~1.5ms) while
> > resuming the GPU
> > 3. Because of that delay, userspace has a chance to queue up enough
> > more jobs to require locking/pinning more than the available system
> > RAM..
>
> Is that one or multiple threads submitting jobs?
>
> > I'm not sure if we want a way to prevent userspace from getting *too*
> > far ahead of the kthread. Or maybe at some point the shrinker should
> > sleep on non-idle buffers?
>
> On the direct reclaim path when invoked from the submit ioctl? In i915
> we only shrink idle objects on direct reclaim and leave active ones for
> the swapper. It depends on how your locking looks like whether you could
> do them, whether there would be coupling of locks and fs-reclaim context.
>
> Regards,
>
> Tvrtko
>
> > BR,
> > -R
> >
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>> + */
> >>> +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio,
> >>> + unsigned *ring_nr, enum drm_sched_priority *sched_prio)
> >>> +{
> >>> + unsigned rn, sp;
> >>> +
> >>> + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp);
> >>> +
> >>> + /* invert sched priority to map to higher-numeric-is-higher-
> >>> + * priority convention
> >>> + */
> >>> + sp = NR_SCHED_PRIORITIES - sp - 1;
> >>> +
> >>> + if (rn >= gpu->nr_rings)
> >>> + return -EINVAL;
> >>> +
> >>> + *ring_nr = rn;
> >>> + *sched_prio = sp;
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> /**
> >>> * A submitqueue is associated with a gl context or vk queue (or equiv)
> >>> * in userspace.
> >>> @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr {
> >>> * @id: userspace id for the submitqueue, unique within the drm_file
> >>> * @flags: userspace flags for the submitqueue, specified at creation
> >>> * (currently unusued)
> >>> - * @prio: the submitqueue priority
> >>> + * @ring_nr: the ringbuffer used by this submitqueue, which is determined
> >>> + * by the submitqueue's priority
> >>> * @faults: the number of GPU hangs associated with this submitqueue
> >>> * @ctx: the per-drm_file context associated with the submitqueue (ie.
> >>> * which set of pgtables do submits jobs associated with the
> >>> @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr {
> >>> struct msm_gpu_submitqueue {
> >>> int id;
> >>> u32 flags;
> >>> - u32 prio;
> >>> + u32 ring_nr;
> >>> int faults;
> >>> struct msm_file_private *ctx;
> >>> struct list_head node;
> >>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
> >>> index 682ba2a7c0ec..32a55d81b58b 100644
> >>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
> >>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
> >>> @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> >>> struct msm_gpu_submitqueue *queue;
> >>> struct msm_ringbuffer *ring;
> >>> struct drm_gpu_scheduler *sched;
> >>> + enum drm_sched_priority sched_prio;
> >>> + unsigned ring_nr;
> >>> int ret;
> >>>
> >>> if (!ctx)
> >>> @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> >>> if (!priv->gpu)
> >>> return -ENODEV;
> >>>
> >>> - if (prio >= priv->gpu->nr_rings)
> >>> - return -EINVAL;
> >>> + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio);
> >>> + if (ret)
> >>> + return ret;
> >>>
> >>> queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> >>>
> >>> @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> >>>
> >>> kref_init(&queue->ref);
> >>> queue->flags = flags;
> >>> - queue->prio = prio;
> >>> + queue->ring_nr = ring_nr;
> >>>
> >>> - ring = priv->gpu->rb[prio];
> >>> + ring = priv->gpu->rb[ring_nr];
> >>> sched = &ring->sched;
> >>>
> >>> - /*
> >>> - * TODO we can allow more priorities than we have ringbuffers by
> >>> - * mapping:
> >>> - *
> >>> - * ring = prio / 3;
> >>> - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3);
> >>> - *
> >>> - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is
> >>> - * treated specially in places.
> >>> - */
> >>> ret = drm_sched_entity_init(&queue->entity,
> >>> - DRM_SCHED_PRIORITY_NORMAL,
> >>> - &sched, 1, NULL);
> >>> + sched_prio, &sched, 1, NULL);
> >>> if (ret) {
> >>> kfree(queue);
> >>> return ret;
> >>> @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
> >>> int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
> >>> {
> >>> struct msm_drm_private *priv = drm->dev_private;
> >>> - int default_prio;
> >>> + int default_prio, max_priority;
> >>>
> >>> if (!priv->gpu)
> >>> return -ENODEV;
> >>>
> >>> + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1;
> >>> +
> >>> /*
> >>> - * Select priority 2 as the "default priority" unless nr_rings is less
> >>> - * than 2 and then pick the lowest priority
> >>> + * Pick a medium priority level as default. Lower numeric value is
> >>> + * higher priority, so round-up to pick a priority that is not higher
> >>> + * than the middle priority level.
> >>> */
> >>> - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1);
> >>> + default_prio = DIV_ROUND_UP(max_priority, 2);
> >>>
> >>> INIT_LIST_HEAD(&ctx->submitqueues);
> >>>
> >>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
> >>> index f075851021c3..6b8fffc28a50 100644
> >>> --- a/include/uapi/drm/msm_drm.h
> >>> +++ b/include/uapi/drm/msm_drm.h
> >>> @@ -73,11 +73,19 @@ struct drm_msm_timespec {
> >>> #define MSM_PARAM_MAX_FREQ 0x04
> >>> #define MSM_PARAM_TIMESTAMP 0x05
> >>> #define MSM_PARAM_GMEM_BASE 0x06
> >>> -#define MSM_PARAM_NR_RINGS 0x07
> >>> +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */
> >>> #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */
> >>> #define MSM_PARAM_FAULTS 0x09
> >>> #define MSM_PARAM_SUSPENDS 0x0a
> >>>
> >>> +/* For backwards compat. The original support for preemption was based on
> >>> + * a single ring per priority level so # of priority levels equals the #
> >>> + * of rings. With drm/scheduler providing additional levels of priority,
> >>> + * the number of priorities is greater than the # of rings. The param is
> >>> + * renamed to better reflect this.
> >>> + */
> >>> +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES
> >>> +
> >>> struct drm_msm_param {
> >>> __u32 pipe; /* in, MSM_PIPE_x */
> >>> __u32 param; /* in, MSM_PARAM_x */
> >>> @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise {
> >>>
> >>> #define MSM_SUBMITQUEUE_FLAGS (0)
> >>>
> >>> +/*
> >>> + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1,
> >>> + * a lower numeric value is higher priority.
> >>> + */
> >>> struct drm_msm_submitqueue {
> >>> __u32 flags; /* in, MSM_SUBMITQUEUE_x */
> >>> __u32 prio; /* in, Priority level */

2022-05-25 21:04:01

by Tvrtko Ursulin

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities


On 24/05/2022 15:57, Rob Clark wrote:
> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>> On 23/05/2022 23:53, Rob Clark wrote:
>>>
>>> btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
>>> trying to add an igt test to stress shrinker/eviction, similar to the
>>> existing tests/i915/gem_shrink.c. But we hit an unfortunate
>>> combination of circumstances:
>>> 1. Pinning memory happens in the synchronous part of the submit ioctl,
>>> before enqueuing the job for the kthread to handle.
>>> 2. The first run_job() callback incurs a slight delay (~1.5ms) while
>>> resuming the GPU
>>> 3. Because of that delay, userspace has a chance to queue up enough
>>> more jobs to require locking/pinning more than the available system
>>> RAM..
>>
>> Is that one or multiple threads submitting jobs?
>
> In this case multiple.. but I think it could also happen with a single
> thread (provided it didn't stall on a fence, directly or indirectly,
> from an earlier submit), because of how resume and actual job
> submission happens from scheduler kthread.
>
>>> I'm not sure if we want a way to prevent userspace from getting *too*
>>> far ahead of the kthread. Or maybe at some point the shrinker should
>>> sleep on non-idle buffers?
>>
>> On the direct reclaim path when invoked from the submit ioctl? In i915
>> we only shrink idle objects on direct reclaim and leave active ones for
>> the swapper. It depends on how your locking looks like whether you could
>> do them, whether there would be coupling of locks and fs-reclaim context.
>
> I think the locking is more or less ok, although lockdep is unhappy
> about one thing[1] which is I think a false warning (ie. not
> recognizing that we'd already successfully acquired the obj lock via
> trylock). We can already reclaim idle bo's in this path. But the
> problem with a bunch of submits queued up in the scheduler, is that
> they are already considered pinned and active. So at some point we
> need to sleep (hopefully interruptabley) until they are no longer
> active, ie. to throttle userspace trying to shove in more submits
> until some of the enqueued ones have a chance to run and complete.

Odd, I did not think trylock could trigger that. Looking at your code, it
does indeed seem to be two trylocks. I am pretty sure we use the same
trylock trick to avoid it. I am confused..

Otherwise, if you can afford to sleep, you can of course throttle
organically via direct reclaim. Unless I am forgetting some key gotcha -
it's been a while since I've been active in this area.

Regards,

Tvrtko

>
> BR,
> -R
>
> [1] https://gitlab.freedesktop.org/drm/msm/-/issues/14
>
>> Regards,
>>
>> Tvrtko
>>
>>> BR,
>>> -R
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> + */
>>>>> +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio,
>>>>> + unsigned *ring_nr, enum drm_sched_priority *sched_prio)
>>>>> +{
>>>>> + unsigned rn, sp;
>>>>> +
>>>>> + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp);
>>>>> +
>>>>> + /* invert sched priority to map to higher-numeric-is-higher-
>>>>> + * priority convention
>>>>> + */
>>>>> + sp = NR_SCHED_PRIORITIES - sp - 1;
>>>>> +
>>>>> + if (rn >= gpu->nr_rings)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + *ring_nr = rn;
>>>>> + *sched_prio = sp;
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> /**
>>>>> * A submitqueue is associated with a gl context or vk queue (or equiv)
>>>>> * in userspace.
>>>>> @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr {
>>>>> * @id: userspace id for the submitqueue, unique within the drm_file
>>>>> * @flags: userspace flags for the submitqueue, specified at creation
>>>>> * (currently unusued)
>>>>> - * @prio: the submitqueue priority
>>>>> + * @ring_nr: the ringbuffer used by this submitqueue, which is determined
>>>>> + * by the submitqueue's priority
>>>>> * @faults: the number of GPU hangs associated with this submitqueue
>>>>> * @ctx: the per-drm_file context associated with the submitqueue (ie.
>>>>> * which set of pgtables do submits jobs associated with the
>>>>> @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr {
>>>>> struct msm_gpu_submitqueue {
>>>>> int id;
>>>>> u32 flags;
>>>>> - u32 prio;
>>>>> + u32 ring_nr;
>>>>> int faults;
>>>>> struct msm_file_private *ctx;
>>>>> struct list_head node;
>>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>> index 682ba2a7c0ec..32a55d81b58b 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c
>>>>> @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>>>> struct msm_gpu_submitqueue *queue;
>>>>> struct msm_ringbuffer *ring;
>>>>> struct drm_gpu_scheduler *sched;
>>>>> + enum drm_sched_priority sched_prio;
>>>>> + unsigned ring_nr;
>>>>> int ret;
>>>>>
>>>>> if (!ctx)
>>>>> @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>>>> if (!priv->gpu)
>>>>> return -ENODEV;
>>>>>
>>>>> - if (prio >= priv->gpu->nr_rings)
>>>>> - return -EINVAL;
>>>>> + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio);
>>>>> + if (ret)
>>>>> + return ret;
>>>>>
>>>>> queue = kzalloc(sizeof(*queue), GFP_KERNEL);
>>>>>
>>>>> @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>>>>
>>>>> kref_init(&queue->ref);
>>>>> queue->flags = flags;
>>>>> - queue->prio = prio;
>>>>> + queue->ring_nr = ring_nr;
>>>>>
>>>>> - ring = priv->gpu->rb[prio];
>>>>> + ring = priv->gpu->rb[ring_nr];
>>>>> sched = &ring->sched;
>>>>>
>>>>> - /*
>>>>> - * TODO we can allow more priorities than we have ringbuffers by
>>>>> - * mapping:
>>>>> - *
>>>>> - * ring = prio / 3;
>>>>> - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3);
>>>>> - *
>>>>> - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is
>>>>> - * treated specially in places.
>>>>> - */
>>>>> ret = drm_sched_entity_init(&queue->entity,
>>>>> - DRM_SCHED_PRIORITY_NORMAL,
>>>>> - &sched, 1, NULL);
>>>>> + sched_prio, &sched, 1, NULL);
>>>>> if (ret) {
>>>>> kfree(queue);
>>>>> return ret;
>>>>> @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
>>>>> int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
>>>>> {
>>>>> struct msm_drm_private *priv = drm->dev_private;
>>>>> - int default_prio;
>>>>> + int default_prio, max_priority;
>>>>>
>>>>> if (!priv->gpu)
>>>>> return -ENODEV;
>>>>>
>>>>> + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1;
>>>>> +
>>>>> /*
>>>>> - * Select priority 2 as the "default priority" unless nr_rings is less
>>>>> - * than 2 and then pick the lowest priority
>>>>> + * Pick a medium priority level as default. Lower numeric value is
>>>>> + * higher priority, so round-up to pick a priority that is not higher
>>>>> + * than the middle priority level.
>>>>> */
>>>>> - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1);
>>>>> + default_prio = DIV_ROUND_UP(max_priority, 2);
>>>>>
>>>>> INIT_LIST_HEAD(&ctx->submitqueues);
>>>>>
>>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
>>>>> index f075851021c3..6b8fffc28a50 100644
>>>>> --- a/include/uapi/drm/msm_drm.h
>>>>> +++ b/include/uapi/drm/msm_drm.h
>>>>> @@ -73,11 +73,19 @@ struct drm_msm_timespec {
>>>>> #define MSM_PARAM_MAX_FREQ 0x04
>>>>> #define MSM_PARAM_TIMESTAMP 0x05
>>>>> #define MSM_PARAM_GMEM_BASE 0x06
>>>>> -#define MSM_PARAM_NR_RINGS 0x07
>>>>> +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */
>>>>> #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */
>>>>> #define MSM_PARAM_FAULTS 0x09
>>>>> #define MSM_PARAM_SUSPENDS 0x0a
>>>>>
>>>>> +/* For backwards compat. The original support for preemption was based on
>>>>> + * a single ring per priority level so # of priority levels equals the #
>>>>> + * of rings. With drm/scheduler providing additional levels of priority,
>>>>> + * the number of priorities is greater than the # of rings. The param is
>>>>> + * renamed to better reflect this.
>>>>> + */
>>>>> +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES
>>>>> +
>>>>> struct drm_msm_param {
>>>>> __u32 pipe; /* in, MSM_PIPE_x */
>>>>> __u32 param; /* in, MSM_PARAM_x */
>>>>> @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise {
>>>>>
>>>>> #define MSM_SUBMITQUEUE_FLAGS (0)
>>>>>
>>>>> +/*
>>>>> + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1,
>>>>> + * a lower numeric value is higher priority.
>>>>> + */
>>>>> struct drm_msm_submitqueue {
>>>>> __u32 flags; /* in, MSM_SUBMITQUEUE_x */
>>>>> __u32 prio; /* in, Priority level */

2022-05-26 04:00:35

by Rob Clark

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 24/05/2022 15:50, Rob Clark wrote:
> > On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >>
> >> On 23/05/2022 23:53, Rob Clark wrote:
> >>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>> Hi Rob,
> >>>>
> >>>> On 28/07/2021 02:06, Rob Clark wrote:
> >>>>> From: Rob Clark <[email protected]>
> >>>>>
> >>>>> The drm/scheduler provides additional prioritization on top of that
> >>>>> provided by however many number of ringbuffers (each with their own
> >>>>> priority level) is supported on a given generation. Expose the
> >>>>> additional levels of priority to userspace and map the userspace
> >>>>> priority back to ring (first level of priority) and schedular priority
> >>>>> (additional priority levels within the ring).
> >>>>>
> >>>>> Signed-off-by: Rob Clark <[email protected]>
> >>>>> Acked-by: Christian König <[email protected]>
> >>>>> ---
> >>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
> >>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
> >>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
> >>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
> >>>>> include/uapi/drm/msm_drm.h | 14 +++++-
> >>>>> 5 files changed, 88 insertions(+), 27 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>> index bad4809b68ef..748665232d29 100644
> >>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
> >>>>> return ret;
> >>>>> }
> >>>>> return -EINVAL;
> >>>>> - case MSM_PARAM_NR_RINGS:
> >>>>> - *value = gpu->nr_rings;
> >>>>> + case MSM_PARAM_PRIORITIES:
> >>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
> >>>>> return 0;
> >>>>> case MSM_PARAM_PP_PGTABLE:
> >>>>> *value = 0;
> >>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>> index 450efe59abb5..c2ecec5b11c4 100644
> >>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> >>>>> submit->gpu = gpu;
> >>>>> submit->cmd = (void *)&submit->bos[nr_bos];
> >>>>> submit->queue = queue;
> >>>>> - submit->ring = gpu->rb[queue->prio];
> >>>>> + submit->ring = gpu->rb[queue->ring_nr];
> >>>>> submit->fault_dumped = false;
> >>>>>
> >>>>> INIT_LIST_HEAD(&submit->node);
> >>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >>>>> /* Get a unique identifier for the submission for logging purposes */
> >>>>> submitid = atomic_inc_return(&ident) - 1;
> >>>>>
> >>>>> - ring = gpu->rb[queue->prio];
> >>>>> + ring = gpu->rb[queue->ring_nr];
> >>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
> >>>>> args->nr_bos, args->nr_cmds);
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> >>>>> index b912cacaecc0..0e4b45bff2e6 100644
> >>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
> >>>>> const char *name;
> >>>>> };
> >>>>>
> >>>>> +/*
> >>>>> + * The number of priority levels provided by drm gpu scheduler. The
> >>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
> >>>>> + * cases, so we don't use it (no need for kernel generated jobs).
> >>>>> + */
> >>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
> >>>>> +
> >>>>> +/**
> >>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
> >>>>> + *
> >>>>> + * @gpu: the gpu instance
> >>>>> + * @prio: the userspace priority level
> >>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
> >>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
> >>>>> + * priority maps to
> >>>>> + *
> >>>>> + * With drm/scheduler providing it's own level of prioritization, our total
> >>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
> >>>>> + * Each ring is associated with it's own scheduler instance. However, our
> >>>>> + * UABI is that lower numerical values are higher priority. So mapping the
> >>>>> + * single userspace priority level into ring_nr and sched_prio takes some
> >>>>> + * care. The userspace provided priority (when a submitqueue is created)
> >>>>> + * is mapped to ring nr and scheduler priority as such:
> >>>>> + *
> >>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
> >>>>> + * sched_prio = NR_SCHED_PRIORITIES -
> >>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
> >>>>> + *
> >>>>> + * This allows generations without preemption (nr_rings==1) to have some
> >>>>> + * amount of prioritization, and provides more priority levels for gens
> >>>>> + * that do have preemption.
> >>>>
> >>>> I am exploring how different drivers handle priority levels and this
> >>>> caught my eye.
> >>>>
> >>>> Is the implication of the last paragraphs that on hw with nr_rings > 1,
> >>>> ring + 1 preempts ring?
> >>>
> >>> Other way around, at least from the uabi standpoint. Ie. ring[0]
> >>> preempts ring[1]
> >>
> >> Ah yes, I figure it out from the comments but then confused myself when
> >> writing the email.
> >>
> >>>> If so I am wondering does the "spreading" of
> >>>> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable
> >>>> levels within every "bucket" or how does that work?
> >>>
> >>> So, preemption is possible between any priority level before run_job()
> >>> gets called, which writes the job into the ringbuffer. After that
> >>
> >> Hmm how? Before run_job() the jobs are not runnable, sitting in the
> >> scheduler queues, right?
> >
> > I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on
> > prio[1] could be executed after submit B on prio[2] provided that
> > run_job(submitA) hasn't happened yet. So I guess it isn't "really"
> > preemption because the submit hasn't started running on the GPU yet.
> > But rather just scheduling according to priority.
> >
> >>> point, you only have "bucket" level preemption, because
> >>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
> >>> ringbuffer.
> >>
> >> Right, and you have one GPU with four rings, which means you expose 12
> >> priority levels to userspace, did I get that right?
> >
> > Correct
> >
> >> If so how do you convey in the ABI that not all there priority levels
> >> are equal? Like userspace can submit at prio 4 and expect prio 3 to
> >> preempt, as would prio 2 preempt prio 3. While actual behaviour will not
> >> match - 3 will not preempt 4.
> >
> > It isn't really exposed to userspace, but perhaps it should be..
> > Userspace just knows that, to the extent possible, the kernel will try
> > to execute prio 3 before prio 4.
> >
> >> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
> >> quick peek in Mesa but did not spot it - although I am not really at
> >> home there yet so maybe I missed it.
> >
> > Yes, there is an EGL extension:
> >
> > https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
> >
> > It is pretty limited, it only exposes three priority levels.
>
> Right, is that wired up on msm? And if it is, or could be, how do/would
> you map the three priority levels for GPUs which expose 3 priority
> levels versus the one which exposes 12?

We don't yet, but probably should, expose a cap to indicate to
userspace the # of hw rings vs # of levels of sched priority.

> Is it doable properly without leaking the fact drm/sched internal
> implementation detail of three priority levels? Or if you went the other
> way and only exposed up to max 3 levels, then you lose one priority
> level your hardware suppose which is also not good.
>
> It is all quite interesting because your hardware is completely
> different from ours in this respect. In our case i915 decides when to
> preempt, hardware has no concept of priority (*).

It is really pretty much all in firmware.. a6xx is the first gen that
could do actual (non-cooperative) preemption (but that isn't
implemented yet in the upstream driver).

BR,
-R

> Regards,
>
> Tvrtko
>
> (*) Almost no concept of priority in hardware - we do have it on new
> GPUs and only on a subset of engine classes where render and compute
> share the EUs. But I think it's way different from Ardenos.

2022-05-26 20:55:10

by Tvrtko Ursulin

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities


On 26/05/2022 04:15, Rob Clark wrote:
> On Wed, May 25, 2022 at 9:11 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> On 24/05/2022 15:57, Rob Clark wrote:
>>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
>>> <[email protected]> wrote:
>>>>
>>>> On 23/05/2022 23:53, Rob Clark wrote:
>>>>>
>>>>> btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
>>>>> trying to add an igt test to stress shrinker/eviction, similar to the
>>>>> existing tests/i915/gem_shrink.c. But we hit an unfortunate
>>>>> combination of circumstances:
>>>>> 1. Pinning memory happens in the synchronous part of the submit ioctl,
>>>>> before enqueuing the job for the kthread to handle.
>>>>> 2. The first run_job() callback incurs a slight delay (~1.5ms) while
>>>>> resuming the GPU
>>>>> 3. Because of that delay, userspace has a chance to queue up enough
>>>>> more jobs to require locking/pinning more than the available system
>>>>> RAM..
>>>>
>>>> Is that one or multiple threads submitting jobs?
>>>
>>> In this case multiple.. but I think it could also happen with a single
>>> thread (provided it didn't stall on a fence, directly or indirectly,
>>> from an earlier submit), because of how resume and actual job
>>> submission happens from scheduler kthread.
>>>
>>>>> I'm not sure if we want a way to prevent userspace from getting *too*
>>>>> far ahead of the kthread. Or maybe at some point the shrinker should
>>>>> sleep on non-idle buffers?
>>>>
>>>> On the direct reclaim path when invoked from the submit ioctl? In i915
>>>> we only shrink idle objects on direct reclaim and leave active ones for
>>>> the swapper. It depends on how your locking looks like whether you could
>>>> do them, whether there would be coupling of locks and fs-reclaim context.
>>>
>>> I think the locking is more or less ok, although lockdep is unhappy
>>> about one thing[1] which is I think a false warning (ie. not
>>> recognizing that we'd already successfully acquired the obj lock via
>>> trylock). We can already reclaim idle bo's in this path. But the
>>> problem with a bunch of submits queued up in the scheduler, is that
>>> they are already considered pinned and active. So at some point we
>>> need to sleep (hopefully interruptabley) until they are no longer
>>> active, ie. to throttle userspace trying to shove in more submits
>>> until some of the enqueued ones have a chance to run and complete.
>>
>> Odd I did not think trylock could trigger that. Looking at your code it
>> indeed seems two trylocks. I am pretty sure we use the same trylock
>> trick to avoid it. I am confused..
>
> The sequence is,
>
> 1. kref_get_unless_zero()
> 2. trylock, which succeeds
> 3. attempt to evict or purge (which may or may not have succeeded)
> 4. unlock
>
> ... meanwhile this has raced with submit (aka execbuf) finishing and
> retiring and dropping *other* remaining reference to bo...
>
> 5. drm_gem_object_put() which triggers drm_gem_object_free()
> 6. in our free path we acquire the obj lock again and then drop it.
> Which arguably is unnecessary and only serves to satisfy some
> GEM_WARN_ON(!msm_gem_is_locked(obj)) in code paths that are also used
> elsewhere
>
> lockdep doesn't realize the previously successful trylock+unlock
> sequence so it assumes that the code that triggered recursion into
> shrinker could be holding the objects lock.

Ah yes, I missed that lock after trylock in msm_gem_shrinker/scan(). Well,
i915 has the same sequence in our shrinker, but the difference is we use
delayed work to actually free, _and_ use trylock in the delayed worker.
It does feel a bit inelegant (objects with no reference count which
cannot be trylocked?!), but as this code was recently refactored by
Maarten, I think it is best to sync with him for the full story.

>> Otherwise if you can afford to sleep you can of course throttle
>> organically via direct reclaim. Unless I am forgetting some key gotcha -
>> it's been a while I've been active in this area.
>
> So, one thing that is awkward about sleeping in this path is that
> there is no way to propagate back -EINTR, so we end up doing an
> uninterruptible sleep in something that could be called indirectly
> from userspace syscall.. i915 seems to deal with this by limiting it
> to shrinker being called from kswapd. I think in the shrinker we want
> to know whether it is ok to sleep (ie. not syscall trigggered
> codepath, and whether we are under enough memory pressure to justify
> sleeping). For the syscall path, I'm playing with something that lets
> me pass __GFP_RETRY_MAYFAIL | __GFP_NOWARN to
> shmem_read_mapping_page_gfp(), and then stall after the shrinker has
> failed, somewhere where we can make it interruptable. Ofc, that
> doesn't help with all the other random memory allocations which can
> fail, so not sure if it will turn out to be a good approach or not.
> But I guess pinning the GEM bo's is the single biggest potential
> consumer of pages in the submit path, so maybe it will be better than
> nothing.

We play similar games, although from a quick look I am not sure we quite
manage to honour/propagate signals. This has certainly been a
historically fiddly area. If you first ask for no-reclaim allocations
and invoke the shrinker manually, and only then fall back to a bigger
hammer, you should be able to do it.
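
As a rough illustration of that escalation order, here is a small userspace model. Everything in it is hypothetical: struct pool, try_alloc_noreclaim(), shrink_idle(), wait_for_retire() and pin_pages() are stand-ins invented for the sketch and are not kernel or driver functions. The assumption is that a stall can be made interruptible at the point where the caller waits for an active submit to retire.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of memory state; not a kernel structure. */
struct pool {
    unsigned long free_pages;    /* immediately allocatable */
    unsigned long idle_pages;    /* held by idle objects, reclaimable */
    unsigned long active_pages;  /* held by queued or running submits */
    bool signal_pending;         /* models a signal arriving while stalled */
};

/* Step 1: an allocation attempt that never enters reclaim. */
static bool try_alloc_noreclaim(struct pool *p, unsigned long n)
{
    if (p->free_pages < n)
        return false;
    p->free_pages -= n;
    return true;
}

/* Step 2: manual shrinker pass over idle objects; returns pages freed. */
static unsigned long shrink_idle(struct pool *p)
{
    unsigned long freed = p->idle_pages;

    p->free_pages += freed;
    p->idle_pages = 0;
    return freed;
}

/*
 * Step 3, the bigger hammer: stall until some active work retires.
 * The stall is interruptible, so a pending signal is reported as -EINTR.
 */
static int wait_for_retire(struct pool *p, unsigned long chunk)
{
    unsigned long moved;

    if (p->signal_pending)
        return -EINTR;
    if (p->active_pages == 0 || chunk == 0)
        return -ENOMEM;   /* nothing left that could ever become idle */

    moved = p->active_pages < chunk ? p->active_pages : chunk;
    p->active_pages -= moved;
    p->idle_pages += moved;
    return 0;
}

/* Escalate: no-reclaim attempt, then manual shrink, then stall. */
static int pin_pages(struct pool *p, unsigned long n, unsigned long chunk)
{
    for (;;) {
        int ret;

        if (try_alloc_noreclaim(p, n))
            return 0;
        if (shrink_idle(p) > 0)
            continue;
        ret = wait_for_retire(p, chunk);
        if (ret)
            return ret;
    }
}
```

The loop terminates because every pass either succeeds, moves idle pages to free, moves active pages to idle, or returns an error. The point of the ordering is that the only place the caller sleeps is wait_for_retire(), where -EINTR can be returned to the syscall.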

Regards,

Tvrtko

2022-05-26 23:55:18

by Rob Clark

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Wed, May 25, 2022 at 9:11 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 24/05/2022 15:57, Rob Clark wrote:
> > On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >> On 23/05/2022 23:53, Rob Clark wrote:
> >>>
> >>> btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
> >>> trying to add an igt test to stress shrinker/eviction, similar to the
> >>> existing tests/i915/gem_shrink.c. But we hit an unfortunate
> >>> combination of circumstances:
> >>> 1. Pinning memory happens in the synchronous part of the submit ioctl,
> >>> before enqueuing the job for the kthread to handle.
> >>> 2. The first run_job() callback incurs a slight delay (~1.5ms) while
> >>> resuming the GPU
> >>> 3. Because of that delay, userspace has a chance to queue up enough
> >>> more jobs to require locking/pinning more than the available system
> >>> RAM..
> >>
> >> Is that one or multiple threads submitting jobs?
> >
> > In this case multiple.. but I think it could also happen with a single
> > thread (provided it didn't stall on a fence, directly or indirectly,
> > from an earlier submit), because of how resume and actual job
> > submission happens from scheduler kthread.
> >
> >>> I'm not sure if we want a way to prevent userspace from getting *too*
> >>> far ahead of the kthread. Or maybe at some point the shrinker should
> >>> sleep on non-idle buffers?
> >>
> >> On the direct reclaim path when invoked from the submit ioctl? In i915
> >> we only shrink idle objects on direct reclaim and leave active ones for
> >> the swapper. It depends on how your locking looks like whether you could
> >> do them, whether there would be coupling of locks and fs-reclaim context.
> >
> > I think the locking is more or less ok, although lockdep is unhappy
> > about one thing[1] which is I think a false warning (ie. not
> > recognizing that we'd already successfully acquired the obj lock via
> > trylock). We can already reclaim idle bo's in this path. But the
> > problem with a bunch of submits queued up in the scheduler, is that
> > they are already considered pinned and active. So at some point we
> > need to sleep (hopefully interruptabley) until they are no longer
> > active, ie. to throttle userspace trying to shove in more submits
> > until some of the enqueued ones have a chance to run and complete.
>
> Odd I did not think trylock could trigger that. Looking at your code it
> indeed seems two trylocks. I am pretty sure we use the same trylock
> trick to avoid it. I am confused..

The sequence is,

1. kref_get_unless_zero()
2. trylock, which succeeds
3. attempt to evict or purge (which may or may not have succeeded)
4. unlock

... meanwhile this has raced with submit (aka execbuf) finishing and
retiring and dropping the *other* remaining reference to the bo...

5. drm_gem_object_put() which triggers drm_gem_object_free()
6. in our free path we acquire the obj lock again and then drop it.
Which arguably is unnecessary and only serves to satisfy some
GEM_WARN_ON(!msm_gem_is_locked(obj)) in code paths that are also used
elsewhere

lockdep doesn't take the earlier successful trylock+unlock sequence
into account, so it assumes that the code that triggered recursion
into the shrinker could still be holding the object's lock.
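
To make the ordering concrete, here is a minimal userspace model of
the six steps above. It is only a sketch: struct obj, shrink_one() and
the other helpers are hypothetical stand-ins, not the msm or drm/gem
functions, and the "lock" is a plain flag so that a real recursive
acquisition would trip an assert instead of deadlocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a GEM object: a refcount plus one lock flag. */
struct obj {
	int refcount;
	bool locked;
	bool freed;
};

static bool ref_get_unless_zero(struct obj *o)
{
	if (o->refcount == 0)
		return false;
	o->refcount++;
	return true;
}

static bool obj_trylock(struct obj *o)
{
	if (o->locked)
		return false;
	o->locked = true;
	return true;
}

/* A blocking lock: in this model, taking it while held is a hard failure. */
static void obj_lock(struct obj *o)
{
	assert(!o->locked);
	o->locked = true;
}

static void obj_unlock(struct obj *o)
{
	assert(o->locked);
	o->locked = false;
}

/* Step 6: the free path takes the object lock again, then drops it. */
static void obj_free(struct obj *o)
{
	obj_lock(o);
	obj_unlock(o);
	o->freed = true;
}

void obj_put(struct obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		obj_free(o);
}

/* Steps 1-5 as seen from the shrinker; racing_retire models the submit
 * retiring and dropping the other reference in the meantime. */
bool shrink_one(struct obj *o, void (*racing_retire)(struct obj *))
{
	if (!ref_get_unless_zero(o))		/* step 1 */
		return false;
	if (obj_trylock(o)) {			/* step 2 */
		/* step 3: evict or purge would happen here */
		obj_unlock(o);			/* step 4 */
	}
	if (racing_retire != NULL)
		racing_retire(o);
	obj_put(o);				/* step 5, may reach step 6 */
	return true;
}
```

In this model the lock taken in the free path is always acquired after
the unlock in step 4, which is why I consider the lockdep report a
false positive.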

>
> Otherwise if you can afford to sleep you can of course throttle
> organically via direct reclaim. Unless I am forgetting some key gotcha -
> it's been a while I've been active in this area.

So, one thing that is awkward about sleeping in this path is that
there is no way to propagate -EINTR back, so we end up doing an
uninterruptible sleep in something that could be called indirectly
from a userspace syscall. i915 seems to deal with this by limiting it
to the shrinker being called from kswapd. I think in the shrinker we
want to know whether it is ok to sleep (ie. we are not in a
syscall-triggered codepath, and we are under enough memory pressure
to justify sleeping). For the syscall path, I'm playing with something
that lets me pass __GFP_RETRY_MAYFAIL | __GFP_NOWARN to
shmem_read_mapping_page_gfp(), and then stall after the shrinker has
failed, somewhere where we can make it interruptible. Ofc, that
doesn't help with all the other random memory allocations which can
fail, so I'm not sure if it will turn out to be a good approach or not.
But I guess pinning the GEM bo's is the single biggest potential
consumer of pages in the submit path, so maybe it will be better than
nothing.
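
Roughly, the shape of what I'm playing with looks like the userspace
sketch below. Everything in it is a hypothetical model: struct sim,
try_pin() and wait_retire_interruptible() stand in for the allocation
that is allowed to fail and for an interruptible wait on queued
submits, they are not existing kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of memory state seen by the submit path. */
struct sim {
	int free_pages;		/* pages available right now */
	int inflight;		/* queued submits still pinning pages */
	int pages_per_submit;	/* pages released when one submit retires */
	bool signal_pending;	/* models a signal arriving while we wait */
};

/* Stand-in for an allocation that may fail quietly instead of sleeping. */
static int try_pin(struct sim *s, int npages)
{
	if (s->free_pages < npages)
		return -ENOMEM;
	s->free_pages -= npages;
	return 0;
}

/* Stand-in for an interruptible wait until one queued submit retires. */
static int wait_retire_interruptible(struct sim *s)
{
	if (s->signal_pending)
		return -EINTR;
	if (s->inflight == 0)
		return -ENOMEM;	/* nothing left to wait for */
	s->inflight--;
	s->free_pages += s->pages_per_submit;
	return 0;
}

/* Try to pin; on failure stall where the error can reach the caller. */
int pin_with_throttle(struct sim *s, int npages)
{
	assert(s != 0 && npages > 0);
	for (;;) {
		int ret = try_pin(s, npages);

		if (ret != -ENOMEM)
			return ret;
		ret = wait_retire_interruptible(s);
		if (ret)
			return ret;
	}
}
```

The point of structuring it this way is that the stall happens in the
ioctl path, where the error can be returned to userspace, rather than
inside the shrinker.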

BR,
-R

>
> Regards,
>
> Tvrtko
>
> >
> > BR,
> > -R
> >
> > [1] https://gitlab.freedesktop.org/drm/msm/-/issues/14
> >

2022-05-27 00:56:54

by Tvrtko Ursulin

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities


On 25/05/2022 14:41, Rob Clark wrote:
> On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> On 24/05/2022 15:50, Rob Clark wrote:
>>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> On 23/05/2022 23:53, Rob Clark wrote:
>>>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Rob,
>>>>>>
>>>>>> On 28/07/2021 02:06, Rob Clark wrote:
>>>>>>> From: Rob Clark <[email protected]>
>>>>>>>
>>>>>>> The drm/scheduler provides additional prioritization on top of that
>>>>>>> provided by however many number of ringbuffers (each with their own
>>>>>>> priority level) is supported on a given generation. Expose the
>>>>>>> additional levels of priority to userspace and map the userspace
>>>>>>> priority back to ring (first level of priority) and schedular priority
>>>>>>> (additional priority levels within the ring).
>>>>>>>
>>>>>>> Signed-off-by: Rob Clark <[email protected]>
>>>>>>> Acked-by: Christian König <[email protected]>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
>>>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
>>>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
>>>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
>>>>>>> include/uapi/drm/msm_drm.h | 14 +++++-
>>>>>>> 5 files changed, 88 insertions(+), 27 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>> index bad4809b68ef..748665232d29 100644
>>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
>>>>>>> return ret;
>>>>>>> }
>>>>>>> return -EINVAL;
>>>>>>> - case MSM_PARAM_NR_RINGS:
>>>>>>> - *value = gpu->nr_rings;
>>>>>>> + case MSM_PARAM_PRIORITIES:
>>>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
>>>>>>> return 0;
>>>>>>> case MSM_PARAM_PP_PGTABLE:
>>>>>>> *value = 0;
>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>> index 450efe59abb5..c2ecec5b11c4 100644
>>>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>>>>>> submit->gpu = gpu;
>>>>>>> submit->cmd = (void *)&submit->bos[nr_bos];
>>>>>>> submit->queue = queue;
>>>>>>> - submit->ring = gpu->rb[queue->prio];
>>>>>>> + submit->ring = gpu->rb[queue->ring_nr];
>>>>>>> submit->fault_dumped = false;
>>>>>>>
>>>>>>> INIT_LIST_HEAD(&submit->node);
>>>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>>>>>>> /* Get a unique identifier for the submission for logging purposes */
>>>>>>> submitid = atomic_inc_return(&ident) - 1;
>>>>>>>
>>>>>>> - ring = gpu->rb[queue->prio];
>>>>>>> + ring = gpu->rb[queue->ring_nr];
>>>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
>>>>>>> args->nr_bos, args->nr_cmds);
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>> index b912cacaecc0..0e4b45bff2e6 100644
>>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
>>>>>>> const char *name;
>>>>>>> };
>>>>>>>
>>>>>>> +/*
>>>>>>> + * The number of priority levels provided by drm gpu scheduler. The
>>>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
>>>>>>> + * cases, so we don't use it (no need for kernel generated jobs).
>>>>>>> + */
>>>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
>>>>>>> + *
>>>>>>> + * @gpu: the gpu instance
>>>>>>> + * @prio: the userspace priority level
>>>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
>>>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
>>>>>>> + * priority maps to
>>>>>>> + *
>>>>>>> + * With drm/scheduler providing it's own level of prioritization, our total
>>>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
>>>>>>> + * Each ring is associated with it's own scheduler instance. However, our
>>>>>>> + * UABI is that lower numerical values are higher priority. So mapping the
>>>>>>> + * single userspace priority level into ring_nr and sched_prio takes some
>>>>>>> + * care. The userspace provided priority (when a submitqueue is created)
>>>>>>> + * is mapped to ring nr and scheduler priority as such:
>>>>>>> + *
>>>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
>>>>>>> + * sched_prio = NR_SCHED_PRIORITIES -
>>>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
>>>>>>> + *
>>>>>>> + * This allows generations without preemption (nr_rings==1) to have some
>>>>>>> + * amount of prioritization, and provides more priority levels for gens
>>>>>>> + * that do have preemption.
>>>>>>
>>>>>> I am exploring how different drivers handle priority levels and this
>>>>>> caught my eye.
>>>>>>
>>>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1,
>>>>>> ring + 1 preempts ring?
>>>>>
>>>>> Other way around, at least from the uabi standpoint. Ie. ring[0]
>>>>> preempts ring[1]
>>>>
>>>> Ah yes, I figure it out from the comments but then confused myself when
>>>> writing the email.
>>>>
>>>>>> If so I am wondering does the "spreading" of
>>>>>> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable
>>>>>> levels within every "bucket" or how does that work?
>>>>>
>>>>> So, preemption is possible between any priority level before run_job()
>>>>> gets called, which writes the job into the ringbuffer. After that
>>>>
>>>> Hmm how? Before run_job() the jobs are not runnable, sitting in the
>>>> scheduler queues, right?
>>>
>>> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on
>>> prio[1] could be executed after submit B on prio[2] provided that
>>> run_job(submitA) hasn't happened yet. So I guess it isn't "really"
>>> preemption because the submit hasn't started running on the GPU yet.
>>> But rather just scheduling according to priority.
>>>
>>>>> point, you only have "bucket" level preemption, because
>>>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
>>>>> ringbuffer.
>>>>
>>>> Right, and you have one GPU with four rings, which means you expose 12
>>>> priority levels to userspace, did I get that right?
>>>
>>> Correct
>>>
>>>> If so how do you convey in the ABI that not all there priority levels
>>>> are equal? Like userspace can submit at prio 4 and expect prio 3 to
>>>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not
>>>> match - 3 will not preempt 4.
>>>
>>> It isn't really exposed to userspace, but perhaps it should be..
>>> Userspace just knows that, to the extent possible, the kernel will try
>>> to execute prio 3 before prio 4.
>>>
>>>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
>>>> quick peek in Mesa but did not spot it - although I am not really at
>>>> home there yet so maybe I missed it.
>>>
>>> Yes, there is an EGL extension:
>>>
>>> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
>>>
>>> It is pretty limited, it only exposes three priority levels.
>>
>> Right, is that wired up on msm? And if it is, or could be, how do/would
>> you map the three priority levels for GPUs which expose 3 priority
>> levels versus the one which exposes 12?
>
> We don't yet, but probably should, expose a cap to indicate to
> userspace the # of hw rings vs # of levels of sched priority

What bothers me is the question of whether this setup provides a
consistent benefit. Why would userspace use other than "real" (hardware)
priority levels on chips where they are available?

For instance if you exposed 4 instead of 12 on a respective platform,
would that be better or worse? Yes, you could only map three directly
to drm/sched and one would have to be "fake". Like:

hw prio 0 -> drm/sched 2
hw prio 1 -> drm/sched 1
hw prio 2 -> drm/sched 0
hw prio 3 -> drm/sched 0

Not saying that's nice either. Perhaps the answer is that drm/sched
needs more flexibility if it wants to be widely used.
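
To compare the two options side by side, here is a small sketch. The
first function follows the formula from the quoted patch comment; the
second is the hypothetical four-level table above. The function names,
the 0-based scheduler level (higher means more urgent) and the -1
error return are assumptions made for illustration, not the driver's
actual code.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: three usable drm/sched levels, as in the quoted patch. */
#define NR_SCHED_PRIORITIES 3

/* Mapping from the patch comment: lower userspace value is higher
 * priority. Returns 0 on success, -1 if prio is out of range. */
int convert_priority(unsigned nr_rings, unsigned prio,
		     unsigned *ring_nr, unsigned *sched_prio)
{
	assert(ring_nr != NULL && sched_prio != NULL);

	unsigned rn = prio / NR_SCHED_PRIORITIES;

	if (rn >= nr_rings)
		return -1;
	*ring_nr = rn;
	*sched_prio = NR_SCHED_PRIORITIES - (prio % NR_SCHED_PRIORITIES) - 1;
	return 0;
}

/* Hypothetical alternative: one userspace level per hw ring, with the
 * levels beyond what drm/sched offers sharing the lowest one. */
unsigned hw_prio_to_sched(unsigned hw_prio)
{
	if (hw_prio >= NR_SCHED_PRIORITIES)
		return 0;
	return NR_SCHED_PRIORITIES - 1 - hw_prio;
}

/* Print the full userspace -> (ring, sched level) table for nr_rings. */
void dump_mapping(unsigned nr_rings)
{
	unsigned ring, sched;

	for (unsigned p = 0; p < nr_rings * NR_SCHED_PRIORITIES; p++) {
		if (convert_priority(nr_rings, p, &ring, &sched) == 0)
			printf("prio %2u -> ring %u, sched %u\n",
			       p, ring, sched);
	}
}
```

With four rings, userspace prio 3 and prio 4 both land on ring 1 and
differ only in scheduler level, while prio 2 lands on ring 0. That is
the "3 will not preempt 4" case from earlier in the thread.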

For instance in the i915 uapi we have priority as an int in the range
-1023 to +1023, with a matching implementation on some platforms. On
the new GuC firmware based ones we need to squash that to
low/normal/high.

So the thinking was that drm/sched happens to align with GuC. But then
we have your hw where it doesn't seem to.

Regards,

Tvrtko

>> Is it doable properly without leaking the fact drm/sched internal
>> implementation detail of three priority levels? Or if you went the other
>> way and only exposed up to max 3 levels, then you lose one priority
>> level your hardware supports, which is also not good.
>>
>> It is all quite interesting because your hardware is completely
>> different from ours in this respect. In our case i915 decides when to
>> preempt, hardware has no concept of priority (*).
>
> It is really pretty much all in firmware.. a6xx is the first gen that
> could do actual (non-cooperative) preemption (but that isn't
> implemented yet in upstream driver)
>
> BR,
> -R
>
>> Regards,
>>
>> Tvrtko
>>
>> (*) Almost no concept of priority in hardware - we do have it on new
>> GPUs and only on a subset of engine classes where render and compute
>> share the EUs. But I think it's way different from Adrenos.

2022-05-27 12:07:37

by Tvrtko Ursulin

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities


On 26/05/2022 04:37, Rob Clark wrote:
> On Wed, May 25, 2022 at 9:22 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> On 25/05/2022 14:41, Rob Clark wrote:
>>> On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> On 24/05/2022 15:50, Rob Clark wrote:
>>>>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 23/05/2022 23:53, Rob Clark wrote:
>>>>>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Rob,
>>>>>>>>
>>>>>>>> On 28/07/2021 02:06, Rob Clark wrote:
>>>>>>>>> From: Rob Clark <[email protected]>
>>>>>>>>>
>>>>>>>>> The drm/scheduler provides additional prioritization on top of that
>>>>>>>>> provided by however many number of ringbuffers (each with their own
>>>>>>>>> priority level) is supported on a given generation. Expose the
>>>>>>>>> additional levels of priority to userspace and map the userspace
>>>>>>>>> priority back to ring (first level of priority) and schedular priority
>>>>>>>>> (additional priority levels within the ring).
>>>>>>>>>
>>>>>>>>> Signed-off-by: Rob Clark <[email protected]>
>>>>>>>>> Acked-by: Christian König <[email protected]>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
>>>>>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
>>>>>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
>>>>>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
>>>>>>>>> include/uapi/drm/msm_drm.h | 14 +++++-
>>>>>>>>> 5 files changed, 88 insertions(+), 27 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>>>> index bad4809b68ef..748665232d29 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
>>>>>>>>> return ret;
>>>>>>>>> }
>>>>>>>>> return -EINVAL;
>>>>>>>>> - case MSM_PARAM_NR_RINGS:
>>>>>>>>> - *value = gpu->nr_rings;
>>>>>>>>> + case MSM_PARAM_PRIORITIES:
>>>>>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
>>>>>>>>> return 0;
>>>>>>>>> case MSM_PARAM_PP_PGTABLE:
>>>>>>>>> *value = 0;
>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>>>> index 450efe59abb5..c2ecec5b11c4 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>>>>>>>> submit->gpu = gpu;
>>>>>>>>> submit->cmd = (void *)&submit->bos[nr_bos];
>>>>>>>>> submit->queue = queue;
>>>>>>>>> - submit->ring = gpu->rb[queue->prio];
>>>>>>>>> + submit->ring = gpu->rb[queue->ring_nr];
>>>>>>>>> submit->fault_dumped = false;
>>>>>>>>>
>>>>>>>>> INIT_LIST_HEAD(&submit->node);
>>>>>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>>>>>>>>> /* Get a unique identifier for the submission for logging purposes */
>>>>>>>>> submitid = atomic_inc_return(&ident) - 1;
>>>>>>>>>
>>>>>>>>> - ring = gpu->rb[queue->prio];
>>>>>>>>> + ring = gpu->rb[queue->ring_nr];
>>>>>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
>>>>>>>>> args->nr_bos, args->nr_cmds);
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>>>> index b912cacaecc0..0e4b45bff2e6 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
>>>>>>>>> const char *name;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> +/*
>>>>>>>>> + * The number of priority levels provided by drm gpu scheduler. The
>>>>>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
>>>>>>>>> + * cases, so we don't use it (no need for kernel generated jobs).
>>>>>>>>> + */
>>>>>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
>>>>>>>>> + *
>>>>>>>>> + * @gpu: the gpu instance
>>>>>>>>> + * @prio: the userspace priority level
>>>>>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
>>>>>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
>>>>>>>>> + * priority maps to
>>>>>>>>> + *
>>>>>>>>> + * With drm/scheduler providing it's own level of prioritization, our total
>>>>>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
>>>>>>>>> + * Each ring is associated with it's own scheduler instance. However, our
>>>>>>>>> + * UABI is that lower numerical values are higher priority. So mapping the
>>>>>>>>> + * single userspace priority level into ring_nr and sched_prio takes some
>>>>>>>>> + * care. The userspace provided priority (when a submitqueue is created)
>>>>>>>>> + * is mapped to ring nr and scheduler priority as such:
>>>>>>>>> + *
>>>>>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
>>>>>>>>> + * sched_prio = NR_SCHED_PRIORITIES -
>>>>>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
>>>>>>>>> + *
>>>>>>>>> + * This allows generations without preemption (nr_rings==1) to have some
>>>>>>>>> + * amount of prioritization, and provides more priority levels for gens
>>>>>>>>> + * that do have preemption.
>>>>>>>>
>>>>>>>> I am exploring how different drivers handle priority levels and this
>>>>>>>> caught my eye.
>>>>>>>>
>>>>>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1,
>>>>>>>> ring + 1 preempts ring?
>>>>>>>
>>>>>>> Other way around, at least from the uabi standpoint. Ie. ring[0]
>>>>>>> preempts ring[1]
>>>>>>
>>>>>> Ah yes, I figure it out from the comments but then confused myself when
>>>>>> writing the email.
>>>>>>
>>>>>>>> If so I am wondering does the "spreading" of
>>>>>>>> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable
>>>>>>>> levels within every "bucket" or how does that work?
>>>>>>>
>>>>>>> So, preemption is possible between any priority level before run_job()
>>>>>>> gets called, which writes the job into the ringbuffer. After that
>>>>>>
>>>>>> Hmm how? Before run_job() the jobs are not runnable, sitting in the
>>>>>> scheduler queues, right?
>>>>>
>>>>> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on
>>>>> prio[1] could be executed after submit B on prio[2] provided that
>>>>> run_job(submitA) hasn't happened yet. So I guess it isn't "really"
>>>>> preemption because the submit hasn't started running on the GPU yet.
>>>>> But rather just scheduling according to priority.
>>>>>
>>>>>>> point, you only have "bucket" level preemption, because
>>>>>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
>>>>>>> ringbuffer.
>>>>>>
>>>>>> Right, and you have one GPU with four rings, which means you expose 12
>>>>>> priority levels to userspace, did I get that right?
>>>>>
>>>>> Correct
>>>>>
>>>>>> If so how do you convey in the ABI that not all there priority levels
>>>>>> are equal? Like userspace can submit at prio 4 and expect prio 3 to
>>>>>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not
>>>>>> match - 3 will not preempt 4.
>>>>>
>>>>> It isn't really exposed to userspace, but perhaps it should be..
>>>>> Userspace just knows that, to the extent possible, the kernel will try
>>>>> to execute prio 3 before prio 4.
>>>>>
>>>>>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
>>>>>> quick peek in Mesa but did not spot it - although I am not really at
>>>>>> home there yet so maybe I missed it.
>>>>>
>>>>> Yes, there is an EGL extension:
>>>>>
>>>>> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
>>>>>
>>>>> It is pretty limited, it only exposes three priority levels.
>>>>
>>>> Right, is that wired up on msm? And if it is, or could be, how do/would
>>>> you map the three priority levels for GPUs which expose 3 priority
>>>> levels versus the one which exposes 12?
>>>
>>> We don't yet, but probably should, expose a cap to indicate to
>>> userspace the # of hw rings vs # of levels of sched priority
>>
>> What bothers me is the question of whether this setup provides a
>> consistent benefit. Why would userspace use other than "real" (hardware)
>> priority levels on chips where they are available?
>
> yeah, perhaps we could decide that userspace doesn't really need more
> than 3 prio levels, and that on generations which have better
> preemption than what drm/sched provides, *only* expose those priority
> levels. I've avoided that so far because it seems wrong for the
> kernel to assume that a single EGL extension is all there is when it
> comes to userspace context priority.. the other option is to expose
> more information to userspace and let it decide.

Maybe in msm you could reserve 0 for kernel submissions (if you have
such use cases) and expose levels 1-3 via drm/sched? That is, if you
could wire that up, and if four levels is the most your hardware will
have.

Although with that option it seems drm/sched could starve lower
priorities, I mean not give anything to the hw/fw scheduler on higher
rings as long as there is work on lower ones. If those chips have some
smarter algorithm, that would defeat it.
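
A toy model of the starvation I have in mind, assuming strict
highest-first selection between levels. pick_strict() and simulate()
are hypothetical names for illustration and are not drm/sched
functions.

```c
#include <assert.h>
#include <stddef.h>

#define NR_LEVELS 3

/* Return the highest level with pending work (index NR_LEVELS - 1 is
 * the most urgent), or -1 if every level is empty. */
int pick_strict(const int pending[NR_LEVELS])
{
	assert(pending != NULL);
	for (int lvl = NR_LEVELS - 1; lvl >= 0; lvl--) {
		if (pending[lvl] > 0)
			return lvl;
	}
	return -1;
}

/* Make `slots` scheduling decisions. Before each one, refill_high new
 * jobs arrive at the most urgent level. ran[] counts jobs run per level. */
void simulate(int pending[NR_LEVELS], int slots, int refill_high,
	      int ran[NR_LEVELS])
{
	for (int i = 0; i < NR_LEVELS; i++)
		ran[i] = 0;

	for (int s = 0; s < slots; s++) {
		pending[NR_LEVELS - 1] += refill_high;

		int lvl = pick_strict(pending);

		if (lvl < 0)
			continue;
		pending[lvl]--;
		ran[lvl]++;
	}
}
```

As long as one job per slot keeps arriving at the top level, the jobs
queued at the bottom level never run in this model, so a smarter hw/fw
scheduler behind it would never even see them.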

So perhaps there is no way around improving drm/sched: a
backend-controlled number of priorities, and backend control over
whether the "in flight" jobs limit is global or per priority level
(per run queue).

Btw my motivation for looking into all this is that we have CPU nice
and ionice supporting more levels and I'd like to tie that all together
into one consistent user friendly story (see
https://patchwork.freedesktop.org/series/102348/). In a world of
heterogeneous compute pipelines I think that is the way forward. I even
demonstrated this from within ChromeOS: since the compositor uses nice
-5, it automatically gets more GPU bandwidth compared to, for instance,
the Android VM.

I know of other hardware supporting more than three levels, but I need
to study more drm drivers to gain a complete picture. I only started
with msm since it looked simple. :)

> Honestly, the combination of the fact that a6xx is the first gen
> shipping in consumer products with upstream driver (using drm/sched),
> and not having had time yet to implement hw preemption for a6xx yet,
> means not a whole lot of thought has gone into the current arrangement
> ;-)

:)

What kind of scheduling algorithm does your hardware have between those
priority levels?

>> For instance if you exposed 4 instead of 12 on a respective platform,
>> would that be better or worse? Yes you could only map three directly
>> drm/sched and one would have to be "fake". Like:
>>
>> hw prio 0 -> drm/sched 2
>> hw prio 1 -> drm/sched 1
>> hw prio 2 -> drm/sched 0
>> hw prio 3 -> drm/sched 0
>>
>> Not saying that's nice either. Perhaps the answer is that drm/sched
>> needs more flexibility for instance if it wants to be widely used.
>
> I'm not sure what I'd add to drm/sched.. once it calls run_job()
> things are out of its hands, so really all it can do is re-order
> things prior to calling run_job() according to it's internal priority
> levels. And that is still better than no re-ordering so it adds some
> value, even if not complete.

Not sure about the value there - as mentioned before I see problems on
the uapi front with not all priorities being equal.

Besides, priority order scheduling is kind of meh to me. Especially if
it only applies in the scheduling frontend. If frontend and backend
algorithms do not even match then it's even more weird.

IMO sooner or later GPU scheduling will have to catch up with the state
of the art from the CPU world and use priority as a hint for time
sharing decisions.

>> For instance in i915 uapi we have priority as int -1023 - +1023. And
>> matching implementation on some platforms, until the new ones which are
>> GuC firmware based, where we need to squash that to low/normal/high.
>
> hmm, that is a more awkward problem, since it sounds like you are
> mapping many more priority levels into a much smaller set of hw
> priority levels. Do you have separate drm_sched instances per hw
> priority level? If so you can do the same thing of using drm_sched
> priority levels to multiply # of hw priority levels, but ofc that is
> not perfect (and won't get you to 2k).

We don't use drm/sched yet, I was just mentioning what we have in uapi.
But yes, our current scheduling backend can handle more than three levels.

> But is there anything that actually *uses* that many levels of priority?

From userspace no, there are only a few internal priority levels for
things like heartbeats the driver is sending to check engine health and
page flip priority boosts.

Regards,

Tvrtko

2022-05-27 12:39:33

by Rob Clark

Subject: Re: [Freedreno] [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Thu, May 26, 2022 at 4:38 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 26/05/2022 04:37, Rob Clark wrote:
> > On Wed, May 25, 2022 at 9:22 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >>
> >> On 25/05/2022 14:41, Rob Clark wrote:
> >>> On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>> On 24/05/2022 15:50, Rob Clark wrote:
> >>>>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 23/05/2022 23:53, Rob Clark wrote:
> >>>>>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
> >>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Rob,
> >>>>>>>>
> >>>>>>>> On 28/07/2021 02:06, Rob Clark wrote:
> >>>>>>>>> From: Rob Clark <[email protected]>
> >>>>>>>>>
> >>>>>>>>> The drm/scheduler provides additional prioritization on top of that
> >>>>>>>>> provided by however many number of ringbuffers (each with their own
> >>>>>>>>> priority level) is supported on a given generation. Expose the
> >>>>>>>>> additional levels of priority to userspace and map the userspace
> >>>>>>>>> priority back to ring (first level of priority) and schedular priority
> >>>>>>>>> (additional priority levels within the ring).
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Rob Clark <[email protected]>
> >>>>>>>>> Acked-by: Christian König <[email protected]>
> >>>>>>>>> ---
> >>>>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
> >>>>>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
> >>>>>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
> >>>>>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
> >>>>>>>>> include/uapi/drm/msm_drm.h | 14 +++++-
> >>>>>>>>> 5 files changed, 88 insertions(+), 27 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>>>>>> index bad4809b68ef..748665232d29 100644
> >>>>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
> >>>>>>>>> return ret;
> >>>>>>>>> }
> >>>>>>>>> return -EINVAL;
> >>>>>>>>> - case MSM_PARAM_NR_RINGS:
> >>>>>>>>> - *value = gpu->nr_rings;
> >>>>>>>>> + case MSM_PARAM_PRIORITIES:
> >>>>>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
> >>>>>>>>> return 0;
> >>>>>>>>> case MSM_PARAM_PP_PGTABLE:
> >>>>>>>>> *value = 0;
> >>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>>>>>> index 450efe59abb5..c2ecec5b11c4 100644
> >>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> >>>>>>>>> submit->gpu = gpu;
> >>>>>>>>> submit->cmd = (void *)&submit->bos[nr_bos];
> >>>>>>>>> submit->queue = queue;
> >>>>>>>>> - submit->ring = gpu->rb[queue->prio];
> >>>>>>>>> + submit->ring = gpu->rb[queue->ring_nr];
> >>>>>>>>> submit->fault_dumped = false;
> >>>>>>>>>
> >>>>>>>>> INIT_LIST_HEAD(&submit->node);
> >>>>>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >>>>>>>>> /* Get a unique identifier for the submission for logging purposes */
> >>>>>>>>> submitid = atomic_inc_return(&ident) - 1;
> >>>>>>>>>
> >>>>>>>>> - ring = gpu->rb[queue->prio];
> >>>>>>>>> + ring = gpu->rb[queue->ring_nr];
> >>>>>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
> >>>>>>>>> args->nr_bos, args->nr_cmds);
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> >>>>>>>>> index b912cacaecc0..0e4b45bff2e6 100644
> >>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >>>>>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
> >>>>>>>>> const char *name;
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> +/*
> >>>>>>>>> + * The number of priority levels provided by drm gpu scheduler. The
> >>>>>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
> >>>>>>>>> + * cases, so we don't use it (no need for kernel generated jobs).
> >>>>>>>>> + */
> >>>>>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
> >>>>>>>>> +
> >>>>>>>>> +/**
> >>>>>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
> >>>>>>>>> + *
> >>>>>>>>> + * @gpu: the gpu instance
> >>>>>>>>> + * @prio: the userspace priority level
> >>>>>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
> >>>>>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
> >>>>>>>>> + * priority maps to
> >>>>>>>>> + *
> >>>>>>>>> + * With drm/scheduler providing it's own level of prioritization, our total
> >>>>>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
> >>>>>>>>> + * Each ring is associated with it's own scheduler instance. However, our
> >>>>>>>>> + * UABI is that lower numerical values are higher priority. So mapping the
> >>>>>>>>> + * single userspace priority level into ring_nr and sched_prio takes some
> >>>>>>>>> + * care. The userspace provided priority (when a submitqueue is created)
> >>>>>>>>> + * is mapped to ring nr and scheduler priority as such:
> >>>>>>>>> + *
> >>>>>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
> >>>>>>>>> + * sched_prio = NR_SCHED_PRIORITIES -
> >>>>>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
> >>>>>>>>> + *
> >>>>>>>>> + * This allows generations without preemption (nr_rings==1) to have some
> >>>>>>>>> + * amount of prioritization, and provides more priority levels for gens
> >>>>>>>>> + * that do have preemption.
> >>>>>>>>
> >>>>>>>> I am exploring how different drivers handle priority levels and this
> >>>>>>>> caught my eye.
> >>>>>>>>
> >>>>>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1,
> >>>>>>>> ring + 1 preempts ring?
> >>>>>>>
> >>>>>>> Other way around, at least from the uabi standpoint. Ie. ring[0]
> >>>>>>> preempts ring[1]
> >>>>>>
> >>>>>> Ah yes, I figure it out from the comments but then confused myself when
> >>>>>> writing the email.
> >>>>>>
> >>>>>>>> If so I am wondering does the "spreading" of
> >>>>>>>> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable
> >>>>>>>> levels within every "bucket" or how does that work?
> >>>>>>>
> >>>>>>> So, preemption is possible between any priority level before run_job()
> >>>>>>> gets called, which writes the job into the ringbuffer. After that
> >>>>>>
> >>>>>> Hmm how? Before run_job() the jobs are not runnable, sitting in the
> >>>>>> scheduler queues, right?
> >>>>>
> >>>>> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on
> >>>>> prio[1] could be executed after submit B on prio[2] provided that
> >>>>> run_job(submitA) hasn't happened yet. So I guess it isn't "really"
> >>>>> preemption because the submit hasn't started running on the GPU yet.
> >>>>> But rather just scheduling according to priority.
> >>>>>
> >>>>>>> point, you only have "bucket" level preemption, because
> >>>>>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
> >>>>>>> ringbuffer.
> >>>>>>
> >>>>>> Right, and you have one GPU with four rings, which means you expose 12
> >>>>>> priority levels to userspace, did I get that right?
> >>>>>
> >>>>> Correct
> >>>>>
> >>>>>> If so how do you convey in the ABI that not all these priority levels
> >>>>>> are equal? Like userspace can submit at prio 4 and expect prio 3 to
> >>>>>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not
> >>>>>> match - 3 will not preempt 4.
> >>>>>
> >>>>> It isn't really exposed to userspace, but perhaps it should be..
> >>>>> Userspace just knows that, to the extent possible, the kernel will try
> >>>>> to execute prio 3 before prio 4.
> >>>>>
> >>>>>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
> >>>>>> quick peek in Mesa but did not spot it - although I am not really at
> >>>>>> home there yet so maybe I missed it.
> >>>>>
> >>>>> Yes, there is an EGL extension:
> >>>>>
> >>>>> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
> >>>>>
> >>>>> It is pretty limited, it only exposes three priority levels.
> >>>>
> >>>> Right, is that wired up on msm? And if it is, or could be, how do/would
> >>>> you map the three priority levels for GPUs which expose 3 priority
> >>>> levels versus the one which exposes 12?
> >>>
> >>> We don't yet, but probably should, expose a cap to indicate to
> >>> userspace the # of hw rings vs # of levels of sched priority
> >>
> >> What bothers me is the question of whether this setup provides a
> >> consistent benefit. Why would userspace use other than "real" (hardware)
> >> priority levels on chips where they are available?
> >
> > yeah, perhaps we could decide that userspace doesn't really need more
> > than 3 prio levels, and that on generations which have better
> > preemption than what drm/sched provides, *only* expose those priority
> > levels. I've avoided that so far because it seems wrong for the
> > kernel to assume that a single EGL extension is all there is when it
> > comes to userspace context priority.. the other option is to expose
> > more information to userspace and let it decide.
>
> Maybe in msm you could reserve 0 for kernel submissions (if you have
> such use cases) and expose levels 1-3 via drm/sched? If you could wire
> that up, and if four levels is most your hardware will have.

We fortunately don't need kernel submission for anything... that said,
the limited # of priorities for drm/sched seems a bit arbitrary
(although perhaps catering to the existing EGL extension).

> Although with that option it seems drm/sched could starve lower
> priorities, I mean not give anything to the hw/fw scheduler on higher
> rings as long as there is work on lower. Which if those chips have some
> smarter algorithm would defeat it.

So the thing is the (existing) gpu scheduling is strictly priority
based, and not "nice" based like CPU scheduling. Those two schemes
are completely different paradigms, the latter giving some boost to
processes that have been blocked on I/O (which, I'm not sure there is
an equiv thing for GPU) or otherwise haven't had a chance to run for a
while.

> So perhaps there is no way but improving drm/sched. Backend controlled
> number of priorities and backend control for whether "in flight" jobs
> limit is global vs per priority level (per run queue).
>
> Btw my motivation looking into all this is that we have CPU nice and
> ionice supporting more levels and I'd like to tie that all together into
> one consistent user friendly story (see
> https://patchwork.freedesktop.org/series/102348/). In a world of
> heterogeneous compute pipelines I think that is the way forward. I even
> demonstrated this from within ChromeOS, since the compositor uses nice
> -5 it automatically gives it more GPU bandwidth compared to, for instance,
> the Android VM.

But this can be achieved with a simple priority based scheme, ie.
compositor is higher priority than app.

The situation changes a bit, and becomes more CPU-like perhaps, when
you add long running compute and CPU-offload stuff.

> I know of other hardware supporting more than three levels, but I need
> to study more drm drivers to gain a complete picture. I only started
> with msm since it looked simple. :)

Even in msm the # of priority levels is somewhat arbitrary.. but
roughly it is that we tell the hw there is something higher priority
to run, it waits a bit for a cooperative yield point (since forced
preemption is rather expensive for 3d, ie. there is a lot of state to
save/restore, not just a few cpu registers), and then eventually if a
cooperative yield point isn't hit it triggers a forced preemption.
(Only on newer things, older gens only had cooperative yield points to
work with.)
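
The ordering described above can be sketched as a small decision function.
This is only an illustrative model in plain C: the real decision is made by
the hardware/firmware, and every name here (preempt_state, preempt_decide,
grace_us) is hypothetical rather than taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

enum preempt_action {
	PREEMPT_NONE,        /* nothing higher priority is pending */
	PREEMPT_WAIT_YIELD,  /* keep waiting for a cooperative yield point */
	PREEMPT_SWITCH,      /* yield point reached, switch cheaply */
	PREEMPT_FORCE,       /* grace period expired, pay for full save/restore */
};

struct preempt_state {
	bool higher_prio_pending;  /* a higher priority ring has work queued */
	bool at_yield_point;       /* current work reached a yield point */
	bool can_force;            /* false on gens with cooperative yield only */
	unsigned waited_us;        /* time since preemption was requested */
	unsigned grace_us;         /* hypothetical grace period before forcing */
};

/* Pick the cheapest action that still honours the priority request. */
static enum preempt_action preempt_decide(const struct preempt_state *st)
{
	assert(st);

	if (!st->higher_prio_pending)
		return PREEMPT_NONE;
	if (st->at_yield_point)
		return PREEMPT_SWITCH;
	if (st->can_force && st->waited_us >= st->grace_us)
		return PREEMPT_FORCE;
	return PREEMPT_WAIT_YIELD;
}
```

On a generation where can_force is false the function never returns
PREEMPT_FORCE, which matches the "cooperative yield points only" case.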

> > Honestly, the combination of the fact that a6xx is the first gen
> > shipping in consumer products with upstream driver (using drm/sched),
> > and not having had time yet to implement hw preemption for a6xx yet,
> > means not a whole lot of thought has gone into the current arrangement
> > ;-)
>
> :)
>
> What kind of scheduling algorithm does your hardware have between those
> priority levels?

Like I said, it is strictly "thing A is higher priority than thing
B".. there is no CFS or io-nice type thing. I guess since it is still
the kernel that initiates the preemption, we could in theory implement
something more clever. But I'm not entirely sure something more
clever makes sense given the relatively high cost of forced preemption
compared to CPU. Ofc I could be wrong, I've not given a lot of
thought to it other than more limited scenarios (ie. compositor should
be higher priority than app)

BR,
-R

> >> For instance if you exposed 4 instead of 12 on a respective platform,
> >> would that be better or worse? Yes you could only map three directly
> >> drm/sched and one would have to be "fake". Like:
> >>
> >> hw prio 0 -> drm/sched 2
> >> hw prio 1 -> drm/sched 1
> >> hw prio 2 -> drm/sched 0
> >> hw prio 3 -> drm/sched 0
> >>
> >> Not saying that's nice either. Perhaps the answer is that drm/sched
> >> needs more flexibility for instance if it wants to be widely used.
> >
> > I'm not sure what I'd add to drm/sched.. once it calls run_job()
> > things are out of its hands, so really all it can do is re-order
> > things prior to calling run_job() according to its internal priority
> > levels. And that is still better than no re-ordering so it adds some
> > value, even if not complete.
>
> Not sure about the value there - as mentioned before I see problems on
> the uapi front with not all priorities being equal.
>
> Besides, priority order scheduling is kind of meh to me. Especially if
> it only applies in the scheduling frontend. If frontend and backend
> algorithms do not even match then it's even more weird.
>
> IMO sooner or later GPU scheduling will have to catchup with state of
> the art from the CPU world and use priority as a hint for time sharing
> decisions.

Maybe.. that is a lot more sophisticated than the current situation of
"queue A should have higher priority than queue B"

OTOH actual preemption of GPU work is a lot more expensive than
preempting a CPU thread, so not even sure if we should try and look at
GPU and CPU scheduling the same way. (But so far I've only looked at
it as "compositor should have higher priority than app")

BR,
-R

> >> For instance in i915 uapi we have priority as int -1023 - +1023. And
> >> matching implementation on some platforms, until the new ones which are
> >> GuC firmware based, where we need to squash that to low/normal/high.
> >
> > hmm, that is a more awkward problem, since it sounds like you are
> > mapping many more priority levels into a much smaller set of hw
> > priority levels. Do you have separate drm_sched instances per hw
> > priority level? If so you can do the same thing of using drm_sched
> > priority levels to multiply # of hw priority levels, but ofc that is
> > not perfect (and won't get you to 2k).
>
> We don't use drm/sched yet, I was just mentioning what we have in uapi.
> But yes, our current scheduling backend can handle more than three levels.
>
> > But is there anything that actually *uses* that many levels of priority?
>
> From userspace no, there are only a few internal priority levels for
> things like heartbeats the driver is sending to check engine health and
> page flip priority boosts.
>
> Regards,
>
> Tvrtko

2022-05-27 19:41:01

by Rob Clark

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Wed, May 25, 2022 at 9:22 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 25/05/2022 14:41, Rob Clark wrote:
> > On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >>
> >> On 24/05/2022 15:50, Rob Clark wrote:
> >>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>> On 23/05/2022 23:53, Rob Clark wrote:
> >>>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Rob,
> >>>>>>
> >>>>>> On 28/07/2021 02:06, Rob Clark wrote:
> >>>>>>> From: Rob Clark <[email protected]>
> >>>>>>>
> >>>>>>> The drm/scheduler provides additional prioritization on top of that
> >>>>>>> provided by however many number of ringbuffers (each with their own
> >>>>>>> priority level) is supported on a given generation. Expose the
> >>>>>>> additional levels of priority to userspace and map the userspace
> >>>>>>> priority back to ring (first level of priority) and scheduler priority
> >>>>>>> (additional priority levels within the ring).
> >>>>>>>
> >>>>>>> Signed-off-by: Rob Clark <[email protected]>
> >>>>>>> Acked-by: Christian König <[email protected]>
> >>>>>>> ---
> >>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
> >>>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
> >>>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
> >>>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
> >>>>>>> include/uapi/drm/msm_drm.h | 14 +++++-
> >>>>>>> 5 files changed, 88 insertions(+), 27 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>>>> index bad4809b68ef..748665232d29 100644
> >>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> >>>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
> >>>>>>> return ret;
> >>>>>>> }
> >>>>>>> return -EINVAL;
> >>>>>>> - case MSM_PARAM_NR_RINGS:
> >>>>>>> - *value = gpu->nr_rings;
> >>>>>>> + case MSM_PARAM_PRIORITIES:
> >>>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
> >>>>>>> return 0;
> >>>>>>> case MSM_PARAM_PP_PGTABLE:
> >>>>>>> *value = 0;
> >>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>>>> index 450efe59abb5..c2ecec5b11c4 100644
> >>>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> >>>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> >>>>>>> submit->gpu = gpu;
> >>>>>>> submit->cmd = (void *)&submit->bos[nr_bos];
> >>>>>>> submit->queue = queue;
> >>>>>>> - submit->ring = gpu->rb[queue->prio];
> >>>>>>> + submit->ring = gpu->rb[queue->ring_nr];
> >>>>>>> submit->fault_dumped = false;
> >>>>>>>
> >>>>>>> INIT_LIST_HEAD(&submit->node);
> >>>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> >>>>>>> /* Get a unique identifier for the submission for logging purposes */
> >>>>>>> submitid = atomic_inc_return(&ident) - 1;
> >>>>>>>
> >>>>>>> - ring = gpu->rb[queue->prio];
> >>>>>>> + ring = gpu->rb[queue->ring_nr];
> >>>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
> >>>>>>> args->nr_bos, args->nr_cmds);
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> >>>>>>> index b912cacaecc0..0e4b45bff2e6 100644
> >>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> >>>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
> >>>>>>> const char *name;
> >>>>>>> };
> >>>>>>>
> >>>>>>> +/*
> >>>>>>> + * The number of priority levels provided by drm gpu scheduler. The
> >>>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
> >>>>>>> + * cases, so we don't use it (no need for kernel generated jobs).
> >>>>>>> + */
> >>>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
> >>>>>>> +
> >>>>>>> +/**
> >>>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
> >>>>>>> + *
> >>>>>>> + * @gpu: the gpu instance
> >>>>>>> + * @prio: the userspace priority level
> >>>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
> >>>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
> >>>>>>> + * priority maps to
> >>>>>>> + *
> >>>>>>> + * With drm/scheduler providing its own level of prioritization, our total
> >>>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
> >>>>>>> + * Each ring is associated with its own scheduler instance. However, our
> >>>>>>> + * UABI is that lower numerical values are higher priority. So mapping the
> >>>>>>> + * single userspace priority level into ring_nr and sched_prio takes some
> >>>>>>> + * care. The userspace provided priority (when a submitqueue is created)
> >>>>>>> + * is mapped to ring nr and scheduler priority as such:
> >>>>>>> + *
> >>>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
> >>>>>>> + * sched_prio = NR_SCHED_PRIORITIES -
> >>>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
> >>>>>>> + *
> >>>>>>> + * This allows generations without preemption (nr_rings==1) to have some
> >>>>>>> + * amount of prioritization, and provides more priority levels for gens
> >>>>>>> + * that do have preemption.
> >>>>>>
> >>>>>> I am exploring how different drivers handle priority levels and this
> >>>>>> caught my eye.
> >>>>>>
> >>>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1,
> >>>>>> ring + 1 preempts ring?
> >>>>>
> >>>>> Other way around, at least from the uabi standpoint. Ie. ring[0]
> >>>>> preempts ring[1]
> >>>>
> >>>> Ah yes, I figured it out from the comments but then confused myself when
> >>>> writing the email.
> >>>>
> >>>>>> If so I am wondering whether the "spreading" of
> >>>>>> user visible priorities by NR_SCHED_PRIORITIES creates non-preemptable
> >>>>>> levels within every "bucket", or how does that work?
> >>>>>
> >>>>> So, preemption is possible between any priority level before run_job()
> >>>>> gets called, which writes the job into the ringbuffer. After that
> >>>>
> >>>> Hmm how? Before run_job() the jobs are not runnable, sitting in the
> >>>> scheduler queues, right?
> >>>
> >>> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on
> >>> prio[1] could be executed after submit B on prio[2] provided that
> >>> run_job(submitA) hasn't happened yet. So I guess it isn't "really"
> >>> preemption because the submit hasn't started running on the GPU yet.
> >>> But rather just scheduling according to priority.
> >>>
> >>>>> point, you only have "bucket" level preemption, because
> >>>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
> >>>>> ringbuffer.
> >>>>
> >>>> Right, and you have one GPU with four rings, which means you expose 12
> >>>> priority levels to userspace, did I get that right?
> >>>
> >>> Correct
> >>>
> >>>> If so how do you convey in the ABI that not all these priority levels
> >>>> are equal? Like userspace can submit at prio 4 and expect prio 3 to
> >>>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not
> >>>> match - 3 will not preempt 4.
> >>>
> >>> It isn't really exposed to userspace, but perhaps it should be..
> >>> Userspace just knows that, to the extent possible, the kernel will try
> >>> to execute prio 3 before prio 4.
> >>>
> >>>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
> >>>> quick peek in Mesa but did not spot it - although I am not really at
> >>>> home there yet so maybe I missed it.
> >>>
> >>> Yes, there is an EGL extension:
> >>>
> >>> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
> >>>
> >>> It is pretty limited, it only exposes three priority levels.
> >>
> >> Right, is that wired up on msm? And if it is, or could be, how do/would
> >> you map the three priority levels for GPUs which expose 3 priority
> >> levels versus the one which exposes 12?
> >
> > We don't yet, but probably should, expose a cap to indicate to
> > userspace the # of hw rings vs # of levels of sched priority
>
> What bothers me is the question of whether this setup provides a
> consistent benefit. Why would userspace use other than "real" (hardware)
> priority levels on chips where they are available?

yeah, perhaps we could decide that userspace doesn't really need more
than 3 prio levels, and that on generations which have better
preemption than what drm/sched provides, *only* expose those priority
levels. I've avoided that so far because it seems wrong for the
kernel to assume that a single EGL extension is all there is when it
comes to userspace context priority.. the other option is to expose
more information to userspace and let it decide.
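
For reference, the mapping under discussion can be sketched in standalone C
roughly as below. This is an illustration, not the driver code:
NR_SCHED_PRIORITIES is hard-coded to 3 (assuming the three non-kernel
drm/sched levels), and convert_priority() is a hypothetical stand-in for
msm_gpu_convert_priority().

```c
#include <assert.h>

/* Assumption: three usable drm/sched levels (low, normal, high). */
#define NR_SCHED_PRIORITIES 3

/*
 * Hypothetical stand-in for msm_gpu_convert_priority(). Lower userspace
 * values mean higher priority, while drm/sched uses the inverse order
 * (per the v2 note in the cover letter), hence the inversion within
 * each ring.
 * Returns 0 on success, -1 if prio is out of range.
 */
static int convert_priority(unsigned nr_rings, unsigned prio,
			    unsigned *ring_nr, unsigned *sched_prio)
{
	assert(ring_nr && sched_prio);

	if (prio >= nr_rings * NR_SCHED_PRIORITIES)
		return -1;

	*ring_nr = prio / NR_SCHED_PRIORITIES;
	*sched_prio = NR_SCHED_PRIORITIES - (prio % NR_SCHED_PRIORITIES) - 1;
	return 0;
}
```

With four rings, userspace prio 2 lands on ring 0 while prio 3 and prio 4
both land on ring 1, so 2 vs 3 is a ring-level (hw) distinction but 3 vs 4
is only a scheduler-level ordering within one ring.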

Honestly, the combination of the fact that a6xx is the first gen
shipping in consumer products with upstream driver (using drm/sched),
and not having had time yet to implement hw preemption for a6xx yet,
means not a whole lot of thought has gone into the current arrangement
;-)

> For instance if you exposed 4 instead of 12 on a respective platform,
> would that be better or worse? Yes you could only map three directly
> drm/sched and one would have to be "fake". Like:
>
> hw prio 0 -> drm/sched 2
> hw prio 1 -> drm/sched 1
> hw prio 2 -> drm/sched 0
> hw prio 3 -> drm/sched 0
>
> Not saying that's nice either. Perhaps the answer is that drm/sched
> needs more flexibility for instance if it wants to be widely used.

I'm not sure what I'd add to drm/sched.. once it calls run_job()
things are out of its hands, so really all it can do is re-order
things prior to calling run_job() according to its internal priority
levels. And that is still better than no re-ordering so it adds some
value, even if not complete.
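
To make that concrete, here is a toy model (plain C, not drm/sched code) of
one scheduler instance feeding one FIFO ring. All names (toy_sched,
toy_push, toy_run_next) are made up for illustration, and things like the
limit on in-flight jobs are left out. Jobs wait in per-priority queues;
picking the next job always takes the highest non-empty priority, but once
a job has been written into the ring its position is fixed.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PRIO   3   /* assumed number of sched priority levels */
#define QUEUE_MAX 8   /* arbitrary capacity for the toy model */

struct toy_queue { int jobs[QUEUE_MAX]; size_t head, tail; };

struct toy_sched {
	struct toy_queue rq[NR_PRIO];   /* rq[NR_PRIO - 1] is highest priority */
	int ring[NR_PRIO * QUEUE_MAX];  /* FIFO ring, order here is final */
	size_t ring_len;
};

/* Queue a job id at the given priority; returns -1 if the queue is full. */
static int toy_push(struct toy_sched *s, unsigned prio, int job)
{
	struct toy_queue *q;

	assert(prio < NR_PRIO);
	q = &s->rq[prio];
	if (q->tail == QUEUE_MAX)
		return -1;
	q->jobs[q->tail++] = job;
	return 0;
}

/*
 * Stand-in for the point where run_job() is called: move the next job
 * from the highest non-empty priority queue into the ring.
 * Returns the job id, or -1 if nothing is queued.
 */
static int toy_run_next(struct toy_sched *s)
{
	for (int prio = NR_PRIO - 1; prio >= 0; prio--) {
		struct toy_queue *q = &s->rq[prio];

		if (q->head < q->tail) {
			int job = q->jobs[q->head++];

			s->ring[s->ring_len++] = job;
			return job;
		}
	}
	return -1;
}
```

A high priority job queued after a normal priority one overtakes it only
if the normal one has not yet been handed to toy_run_next(); anything
already in ring[] stays ahead of it.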

> For instance in i915 uapi we have priority as int -1023 - +1023. And
> matching implementation on some platforms, until the new ones which are
> GuC firmware based, where we need to squash that to low/normal/high.

hmm, that is a more awkward problem, since it sounds like you are
mapping many more priority levels into a much smaller set of hw
priority levels. Do you have separate drm_sched instances per hw
priority level? If so you can do the same thing of using drm_sched
priority levels to multiply # of hw priority levels, but ofc that is
not perfect (and won't get you to 2k).

But is there anything that actually *uses* that many levels of priority?

BR,
-R

> So thinking was drm/sched happens to align with GuC. But then we have
> your hw where it doesn't seem to.
>
> Regards,
>
> Tvrtko
>
> >> Is it doable properly without leaking the fact drm/sched internal
> >> implementation detail of three priority levels? Or if you went the other
> >> way and only exposed up to max 3 levels, then you lose one priority
> >> level your hardware supports which is also not good.
> >>
> >> It is all quite interesting because your hardware is completely
> >> different from ours in this respect. In our case i915 decides when to
> >> preempt, hardware has no concept of priority (*).
> >
> > It is really pretty much all in firmware.. a6xx is the first gen that
> > could do actual (non-cooperative) preemption (but that isn't
> > implemented yet in upstream driver)
> >
> > BR,
> > -R
> >
> >> Regards,
> >>
> >> Tvrtko
> >>
> >> (*) Almost no concept of priority in hardware - we do have it on new
> >> GPUs and only on a subset of engine classes where render and compute
> >> share the EUs. But I think it's way different from Adrenos.

2022-05-28 18:57:44

by Rob Clark

Subject: Re: [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities

On Thu, May 26, 2022 at 6:29 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 26/05/2022 04:15, Rob Clark wrote:
> > On Wed, May 25, 2022 at 9:11 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >>
> >> On 24/05/2022 15:57, Rob Clark wrote:
> >>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
> >>> <[email protected]> wrote:
> >>>>
> >>>> On 23/05/2022 23:53, Rob Clark wrote:
> >>>>>
> >>>>> btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm
> >>>>> trying to add an igt test to stress shrinker/eviction, similar to the
> >>>>> existing tests/i915/gem_shrink.c. But we hit an unfortunate
> >>>>> combination of circumstances:
> >>>>> 1. Pinning memory happens in the synchronous part of the submit ioctl,
> >>>>> before enqueuing the job for the kthread to handle.
> >>>>> 2. The first run_job() callback incurs a slight delay (~1.5ms) while
> >>>>> resuming the GPU
> >>>>> 3. Because of that delay, userspace has a chance to queue up enough
> >>>>> more jobs to require locking/pinning more than the available system
> >>>>> RAM..
> >>>>
> >>>> Is that one or multiple threads submitting jobs?
> >>>
> >>> In this case multiple.. but I think it could also happen with a single
> >>> thread (provided it didn't stall on a fence, directly or indirectly,
> >>> from an earlier submit), because of how resume and actual job
> >>> submission happens from scheduler kthread.
> >>>
> >>>>> I'm not sure if we want a way to prevent userspace from getting *too*
> >>>>> far ahead of the kthread. Or maybe at some point the shrinker should
> >>>>> sleep on non-idle buffers?
> >>>>
> >>>> On the direct reclaim path when invoked from the submit ioctl? In i915
> >>>> we only shrink idle objects on direct reclaim and leave active ones for
> >>>> the swapper. It depends on how your locking looks like whether you could
> >>>> do them, whether there would be coupling of locks and fs-reclaim context.
> >>>
> >>> I think the locking is more or less ok, although lockdep is unhappy
> >>> about one thing[1] which is I think a false warning (ie. not
> >>> recognizing that we'd already successfully acquired the obj lock via
> >>> trylock). We can already reclaim idle bo's in this path. But the
> >>> problem with a bunch of submits queued up in the scheduler, is that
> >>> they are already considered pinned and active. So at some point we
> >>> need to sleep (hopefully interruptibly) until they are no longer
> >>> active, ie. to throttle userspace trying to shove in more submits
> >>> until some of the enqueued ones have a chance to run and complete.
> >>
> >> Odd I did not think trylock could trigger that. Looking at your code it
> >> indeed seems two trylocks. I am pretty sure we use the same trylock
> >> trick to avoid it. I am confused..
> >
> > The sequence is,
> >
> > 1. kref_get_unless_zero()
> > 2. trylock, which succeeds
> > 3. attempt to evict or purge (which may or may not have succeeded)
> > 4. unlock
> >
> > ... meanwhile this has raced with submit (aka execbuf) finishing and
> > retiring and dropping *other* remaining reference to bo...
> >
> > 5. drm_gem_object_put() which triggers drm_gem_object_free()
> > 6. in our free path we acquire the obj lock again and then drop it.
> > Which arguably is unnecessary and only serves to satisfy some
> > GEM_WARN_ON(!msm_gem_is_locked(obj)) in code paths that are also used
> > elsewhere
> >
> > lockdep doesn't realize the previously successful trylock+unlock
> > sequence so it assumes that the code that triggered recursion into
> > shrinker could be holding the objects lock.
>
> Ah yes, missed that lock after trylock in msm_gem_shrinker/scan(). Well
> i915 has the same sequence in our shrinker, but the difference is we use
> delayed work to actually free, _and_ use trylock in the delayed worker.
> It does feel a bit inelegant (objects with no reference count which
> cannot be trylocked?!), but as this is code recently refactored by
> Maarten I think it is best to try and sync with him for the full story.

ahh, we used to use delayed work for free, but realized that was
causing janks where we'd get a bunch of bo's queued up to free and at
some point that would cause us to miss deadlines.

I suppose we could instead have used an unbound wq for free, rather than
the same one we used (at the time, since transitioned to kthread
worker to avoid being preempted by RT SF threads) for retiring submits.

> >> Otherwise if you can afford to sleep you can of course throttle
> >> organically via direct reclaim. Unless I am forgetting some key gotcha -
> >> it's been a while since I've been active in this area.
> >
> > So, one thing that is awkward about sleeping in this path is that
> > there is no way to propagate back -EINTR, so we end up doing an
> > uninterruptible sleep in something that could be called indirectly
> > from userspace syscall.. i915 seems to deal with this by limiting it
> > to shrinker being called from kswapd. I think in the shrinker we want
> > to know whether it is ok to sleep (ie. not syscall triggered
> > codepath, and whether we are under enough memory pressure to justify
> > sleeping). For the syscall path, I'm playing with something that lets
> > me pass __GFP_RETRY_MAYFAIL | __GFP_NOWARN to
> > shmem_read_mapping_page_gfp(), and then stall after the shrinker has
> > failed, somewhere where we can make it interruptible. Ofc, that
> > doesn't help with all the other random memory allocations which can
> > fail, so not sure if it will turn out to be a good approach or not.
> > But I guess pinning the GEM bo's is the single biggest potential
> > consumer of pages in the submit path, so maybe it will be better than
> > nothing.
>
> We play similar games, although by a quick look I am not sure we quite
> manage to honour/propagate signals. This has certainly been a
> historically fiddly area. If you first ask for no reclaim allocations
> and invoke the shrinker manually first, then falling back to a bigger
> hammer, you should be able to do it.

yeah, I think it should.. but I've been fighting a bit today with the
fact that the state of bo wrt. shrinkable state has grown a bit
complicated (ie. is it purgeable, evictable, evictable if we are
willing to wait a short amount of time, vs things that are pinned for
scanout and we shouldn't bother waiting on, etc.. plus I managed to
make it a bit worse recently with fenced un-pin of the vma for dealing
with the case that userspace notices that, for userspace allocated
iova, it can release the virtual address before the kernel has a
chance to retire the submit) ;-)
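
As a rough illustration of the kinds of state listed above, a classification
helper might look something like the sketch below. It is a simplified,
hypothetical model: the struct fields, the enum and the ordering of the
checks are assumptions made for the example, not the driver's actual
bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>

enum bo_reclaim_class {
	BO_PURGEABLE,       /* contents may simply be discarded */
	BO_EVICTABLE,       /* idle, can be evicted right away */
	BO_EVICTABLE_WAIT,  /* active, evictable after a short wait */
	BO_UNRECLAIMABLE,   /* e.g. pinned for scanout, not worth waiting on */
};

/* Hypothetical per-bo flags for the example. */
struct bo_info {
	bool dontneed;        /* userspace marked the contents as disposable */
	bool active;          /* still referenced by an unretired submit */
	bool scanout_pinned;  /* pinned for display */
};

/* Assumed policy: scanout pin wins, then activity, then purgeability. */
static enum bo_reclaim_class bo_classify(const struct bo_info *bo)
{
	assert(bo);

	if (bo->scanout_pinned)
		return BO_UNRECLAIMABLE;
	if (bo->active)
		return BO_EVICTABLE_WAIT;
	if (bo->dontneed)
		return BO_PURGEABLE;
	return BO_EVICTABLE;
}
```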

BR,
-R

> Regards,
>
> Tvrtko

2022-06-08 02:14:05

by Tvrtko Ursulin

Subject: Re: [Freedreno] [PATCH v4 12/13] drm/msm: Utilize gpu scheduler priorities


On 27/05/2022 05:25, Rob Clark wrote:
> On Thu, May 26, 2022 at 4:38 AM Tvrtko Ursulin
> <[email protected]> wrote:
>> On 26/05/2022 04:37, Rob Clark wrote:
>>> On Wed, May 25, 2022 at 9:22 AM Tvrtko Ursulin
>>> <[email protected]> wrote:
>>>> On 25/05/2022 14:41, Rob Clark wrote:
>>>>> On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 24/05/2022 15:50, Rob Clark wrote:
>>>>>>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 23/05/2022 23:53, Rob Clark wrote:
>>>>>>>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Rob,
>>>>>>>>>>
>>>>>>>>>> On 28/07/2021 02:06, Rob Clark wrote:
>>>>>>>>>>> From: Rob Clark <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>> The drm/scheduler provides additional prioritization on top of that
>>>>>>>>>>> provided by however many number of ringbuffers (each with their own
>>>>>>>>>>> priority level) is supported on a given generation. Expose the
>>>>>>>>>>> additional levels of priority to userspace and map the userspace
>>>>>>>>>>> priority back to ring (first level of priority) and scheduler priority
>>>>>>>>>>> (additional priority levels within the ring).
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Rob Clark <[email protected]>
>>>>>>>>>>> Acked-by: Christian König <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +-
>>>>>>>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +-
>>>>>>>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++-
>>>>>>>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++--------
>>>>>>>>>>> include/uapi/drm/msm_drm.h | 14 +++++-
>>>>>>>>>>> 5 files changed, 88 insertions(+), 27 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>>>>>> index bad4809b68ef..748665232d29 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
>>>>>>>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
>>>>>>>>>>> return ret;
>>>>>>>>>>> }
>>>>>>>>>>> return -EINVAL;
>>>>>>>>>>> - case MSM_PARAM_NR_RINGS:
>>>>>>>>>>> - *value = gpu->nr_rings;
>>>>>>>>>>> + case MSM_PARAM_PRIORITIES:
>>>>>>>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES;
>>>>>>>>>>> return 0;
>>>>>>>>>>> case MSM_PARAM_PP_PGTABLE:
>>>>>>>>>>> *value = 0;
>>>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>>>>>> index 450efe59abb5..c2ecec5b11c4 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
>>>>>>>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>>>>>>>>>>> submit->gpu = gpu;
>>>>>>>>>>> submit->cmd = (void *)&submit->bos[nr_bos];
>>>>>>>>>>> submit->queue = queue;
>>>>>>>>>>> - submit->ring = gpu->rb[queue->prio];
>>>>>>>>>>> + submit->ring = gpu->rb[queue->ring_nr];
>>>>>>>>>>> submit->fault_dumped = false;
>>>>>>>>>>>
>>>>>>>>>>> INIT_LIST_HEAD(&submit->node);
>>>>>>>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>>>>>>>>>>> /* Get a unique identifier for the submission for logging purposes */
>>>>>>>>>>> submitid = atomic_inc_return(&ident) - 1;
>>>>>>>>>>>
>>>>>>>>>>> - ring = gpu->rb[queue->prio];
>>>>>>>>>>> + ring = gpu->rb[queue->ring_nr];
>>>>>>>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
>>>>>>>>>>> args->nr_bos, args->nr_cmds);
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>>>>>> index b912cacaecc0..0e4b45bff2e6 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>>>>>>>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
>>>>>>>>>>> const char *name;
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> +/*
>>>>>>>>>>> + * The number of priority levels provided by drm gpu scheduler. The
>>>>>>>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
>>>>>>>>>>> + * cases, so we don't use it (no need for kernel generated jobs).
>>>>>>>>>>> + */
>>>>>>>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
>>>>>>>>>>> +
>>>>>>>>>>> +/**
>>>>>>>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
>>>>>>>>>>> + *
>>>>>>>>>>> + * @gpu: the gpu instance
>>>>>>>>>>> + * @prio: the userspace priority level
>>>>>>>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to
>>>>>>>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace
>>>>>>>>>>> + * priority maps to
>>>>>>>>>>> + *
>>>>>>>>>>> + * With drm/scheduler providing its own level of prioritization, our total
>>>>>>>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
>>>>>>>>>>> + * Each ring is associated with its own scheduler instance. However, our
>>>>>>>>>>> + * UABI is that lower numerical values are higher priority. So mapping the
>>>>>>>>>>> + * single userspace priority level into ring_nr and sched_prio takes some
>>>>>>>>>>> + * care. The userspace provided priority (when a submitqueue is created)
>>>>>>>>>>> + * is mapped to ring nr and scheduler priority as such:
>>>>>>>>>>> + *
>>>>>>>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES
>>>>>>>>>>> + * sched_prio = NR_SCHED_PRIORITIES -
>>>>>>>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1
>>>>>>>>>>> + *
>>>>>>>>>>> + * This allows generations without preemption (nr_rings==1) to have some
>>>>>>>>>>> + * amount of prioritization, and provides more priority levels for gens
>>>>>>>>>>> + * that do have preemption.
>>>>>>>>>>
>>>>>>>>>> I am exploring how different drivers handle priority levels and this
>>>>>>>>>> caught my eye.
>>>>>>>>>>
>>>>>>>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1,
>>>>>>>>>> ring + 1 preempts ring?
>>>>>>>>>
>>>>>>>>> Other way around, at least from the uabi standpoint. Ie. ring[0]
>>>>>>>>> preempts ring[1]
>>>>>>>>
>>>>>>>> Ah yes, I figured it out from the comments but then confused myself when
>>>>>>>> writing the email.
>>>>>>>>
>>>>>>>>>> If so I am wondering does the "spreading" of
>>>>>>>>>> user visible priorities by NR_SCHED_PRIORITIES create non-preemptable
>>>>>>>>>> levels within every "bucket" or how does that work?
>>>>>>>>>
>>>>>>>>> So, preemption is possible between any priority level before run_job()
>>>>>>>>> gets called, which writes the job into the ringbuffer. After that
>>>>>>>>
>>>>>>>> Hmm how? Before run_job() the jobs are not runnable, sitting in the
>>>>>>>> scheduler queues, right?
>>>>>>>
>>>>>>> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on
>>>>>>> prio[1] could be executed after submit B on prio[2] provided that
>>>>>>> run_job(submitA) hasn't happened yet. So I guess it isn't "really"
>>>>>>> preemption because the submit hasn't started running on the GPU yet.
>>>>>>> But rather just scheduling according to priority.
>>>>>>>
>>>>>>>>> point, you only have "bucket" level preemption, because
>>>>>>>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO
>>>>>>>>> ringbuffer.
>>>>>>>>
>>>>>>>> Right, and you have one GPU with four rings, which means you expose 12
>>>>>>>> priority levels to userspace, did I get that right?
>>>>>>>
>>>>>>> Correct
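
As a worked illustration of the mapping formula quoted from the patch comment, here is a minimal standalone sketch. It assumes three usable drm/sched levels (MIN through HIGH, so NR_SCHED_PRIORITIES is 3) and four rings, matching the 12 levels confirmed above. The helper name convert_priority() and its -1 error return are hypothetical stand-ins for illustration, not the driver's actual code.

```c
/* Assumption: drm/sched offers three usable levels (MIN, NORMAL, HIGH). */
#define NR_SCHED_PRIORITIES 3

/*
 * Hypothetical helper mirroring the formula in the patch comment:
 *   ring_nr    = prio / NR_SCHED_PRIORITIES
 *   sched_prio = NR_SCHED_PRIORITIES - (prio % NR_SCHED_PRIORITIES) - 1
 * Lower userspace values mean higher priority, so the remainder is
 * inverted to obtain the drm/sched level (where higher means more urgent).
 * Returns 0 on success, -1 if prio exceeds what nr_rings can represent.
 */
static int convert_priority(unsigned nr_rings, unsigned prio,
                            unsigned *ring_nr, unsigned *sched_prio)
{
	unsigned rn = prio / NR_SCHED_PRIORITIES;

	if (rn >= nr_rings)
		return -1;

	*ring_nr = rn;
	*sched_prio = NR_SCHED_PRIORITIES - (prio % NR_SCHED_PRIORITIES) - 1;
	return 0;
}
```

With four rings, userspace prio 0 lands on ring 0 at sched level 2 (the highest), prio 2 on ring 0 at sched level 0, and prio 3 on ring 1 at sched level 2. Prio 3 and prio 4 therefore share ring 1 and are only reordered before run_job(), which is the "not all levels are equal" concern raised below.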
>>>>>>>
>>>>>>>> If so how do you convey in the ABI that not all these priority levels
>>>>>>>> are equal? Like userspace can submit at prio 4 and expect prio 3 to
>>>>>>>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not
>>>>>>>> match - 3 will not preempt 4.
>>>>>>>
>>>>>>> It isn't really exposed to userspace, but perhaps it should be..
>>>>>>> Userspace just knows that, to the extent possible, the kernel will try
>>>>>>> to execute prio 3 before prio 4.
>>>>>>>
>>>>>>>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a
>>>>>>>> quick peek in Mesa but did not spot it - although I am not really at
>>>>>>>> home there yet so maybe I missed it.
>>>>>>>
>>>>>>> Yes, there is an EGL extension:
>>>>>>>
>>>>>>> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt
>>>>>>>
>>>>>>> It is pretty limited, it only exposes three priority levels.
>>>>>>
>>>>>> Right, is that wired up on msm? And if it is, or could be, how do/would
>>>>>> you map the three priority levels for GPUs which expose 3 priority
>>>>>> levels versus the one which exposes 12?
>>>>>
>>>>> We don't yet, but probably should, expose a cap to indicate to
>>>>> userspace the # of hw rings vs # of levels of sched priority
>>>>
>>>> What bothers me is the question of whether this setup provides a
>>>> consistent benefit. Why would userspace use other than "real" (hardware)
>>>> priority levels on chips where they are available?
>>>
>>> yeah, perhaps we could decide that userspace doesn't really need more
>>> than 3 prio levels, and that on generations which have better
>>> preemption than what drm/sched provides, *only* expose those priority
>>> levels. I've avoided that so far because it seems wrong for the
>>> kernel to assume that a single EGL extension is all there is when it
>>> comes to userspace context priority.. the other option is to expose
>>> more information to userspace and let it decide.
>>
>> Maybe in msm you could reserve 0 for kernel submissions (if you have
>> such use cases) and expose levels 1-3 via drm/sched? If you could wire
>> that up, and if four levels is most your hardware will have.
>
> we fortunately don't need kernel submission for anything... that said,
> the limited # of priorities for drm/sched seems a bit arbitrary
> (although perhaps catering to the existing egl extension)

I don't know the history there. But I am noticing Vulkan has at least four priorities.

First of all there is a "within process" priority which is expressed as [0.0f - 1.0f]. That does not seem to be implemented on the ANV side, which is perhaps understandable for now, since we don't have a scheduler smart enough to distinguish clients; it's all just contexts, regardless of which client they belong to.

Then there is VK_EXT_global_priority which has four discrete levels (*). This one is implemented in ANV and maps to -512, 0, +512 and +1023 in i915 context priority uapi terms.

(*) An interesting fact is that, despite there being four discrete levels, the Vulkan enum values are curiously spaced - as if they wanted to allow for more fine-grained control in the future.

typedef enum VkQueueGlobalPriorityKHR {
VK_QUEUE_GLOBAL_PRIORITY_LOW_KHR = 128,
VK_QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR = 256,
VK_QUEUE_GLOBAL_PRIORITY_HIGH_KHR = 512,
VK_QUEUE_GLOBAL_PRIORITY_REALTIME_KHR = 1024,
VK_QUEUE_GLOBAL_PRIORITY_LOW_EXT = VK_QUEUE_GLOBAL_PRIORITY_LOW_KHR,
VK_QUEUE_GLOBAL_PRIORITY_MEDIUM_EXT = VK_QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR,
VK_QUEUE_GLOBAL_PRIORITY_HIGH_EXT = VK_QUEUE_GLOBAL_PRIORITY_HIGH_KHR,
VK_QUEUE_GLOBAL_PRIORITY_REALTIME_EXT = VK_QUEUE_GLOBAL_PRIORITY_REALTIME_KHR,
} VkQueueGlobalPriorityKHR;
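
The mapping described above (four global priority levels onto -512, 0, +512 and +1023 in i915 context priority terms) can be sketched as a simple lookup. This is only an illustration under stated assumptions: the enum below is a local stand-in that reuses the numeric values from the Vulkan enum, and the function name global_priority_to_i915() is hypothetical, not ANV's actual code.

```c
/* Local stand-ins carrying the same numeric values as the Vulkan enum. */
enum global_priority {
	GLOBAL_PRIORITY_LOW      = 128,
	GLOBAL_PRIORITY_MEDIUM   = 256,
	GLOBAL_PRIORITY_HIGH     = 512,
	GLOBAL_PRIORITY_REALTIME = 1024,
};

/*
 * Hypothetical helper: translate a global priority into the i915 context
 * priority values mentioned in the text. Returns 0 on success and -1 for
 * a value that is not one of the four discrete levels.
 */
static int global_priority_to_i915(int vk_prio, int *i915_prio)
{
	switch (vk_prio) {
	case GLOBAL_PRIORITY_LOW:
		*i915_prio = -512;
		return 0;
	case GLOBAL_PRIORITY_MEDIUM:
		*i915_prio = 0;
		return 0;
	case GLOBAL_PRIORITY_HIGH:
		*i915_prio = 512;
		return 0;
	case GLOBAL_PRIORITY_REALTIME:
		*i915_prio = 1023;
		return 0;
	default:
		return -1; /* in-between values are not defined levels today */
	}
}
```

Note how the sketch rejects in-between values such as 192: the enum spacing leaves room for them, but only the four named levels are defined.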

The AMD side seemed almost the same as i915, right up until the point where I ended up at the kernel-side amdgpu_to_sched_priority, which does not seem to be called from anywhere. So I am not sure how those four levels actually map to amdgpu and drm/sched.

Christian, are you reading this, and could you perhaps help answer?

>> Although with that option it seems drm/sched could starve lower
>> priorities, I mean not give anything to the hw/fw scheduler on higher
>> rings as long as there is work on lower. Which if those chips have some
>> smarter algorithm would defeat it.
>
> So the thing is the (existing) gpu scheduling is strictly priority
> based, and not "nice" based like CPU scheduling. Those two schemes
> are completely different paradigms, the latter giving some boost to
> processes that have been blocked on I/O (which, I'm not sure there is
> an equiv thing for GPU) or otherwise haven't had a chance to run for a
> while.

I lost you here - I don't think the paradigms are different, nor that CPU nice is somehow tied to being blocked on I/O.

There is a default inherit from CPU nice to I/O nice, but _only_ for default I/O priority - where it hasn't been explicitly set via respective system calls. Which is exactly what I am proposing for GPU.
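
For reference, the default inheritance mentioned above can be sketched as follows. The formula (nice + 20) / 5 is the one documented for ioprio_set(2) for tasks that never set an I/O priority explicitly. The struct and function names are hypothetical and exist only for illustration; this is a sketch of the rule, not kernel code.

```c
#include <stdbool.h>

/* Hypothetical per-task record, for illustration only. */
struct task_prio {
	int  nice;          /* CPU nice value, -20..19 */
	bool ioprio_set;    /* true if an I/O priority was set explicitly */
	int  ioprio_level;  /* explicit I/O priority level, 0..7 */
};

/* Documented default: derive the I/O level from CPU nice. */
static int nice_to_ioprio_level(int nice)
{
	return (nice + 20) / 5;
}

/* An explicit setting wins; otherwise fall back to the nice-derived level. */
static int effective_ioprio_level(const struct task_prio *t)
{
	return t->ioprio_set ? t->ioprio_level : nice_to_ioprio_level(t->nice);
}
```

A task at nice 5 with no explicit I/O priority ends up at level 5, while a task that has set level 7 explicitly keeps 7 regardless of its nice value. The proposal for GPU priority follows the same "inherit only when not explicitly set" shape.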

>> So perhaps there is no way but improving drm/sched. Backend controlled
>> number of priorities and backend control for whether "in flight" jobs
>> limit is global vs per priority level (per run queue).
>>
>> Btw my motivation looking into all this is that we have CPU nice and
>> ionice supporting more levels and I'd like to tie that all together into
>> one consistent user friendly story (see
>> https://patchwork.freedesktop.org/series/102348/). In a world of
>> heterogeneous compute pipelines I think that is the way forward. I even
>> demonstrated this from within ChromeOS, since the compositor uses nice
>> -5, which automatically gives it more GPU bandwidth compared to for instance
>> Android VM.
>
> But this can be achieved with a simple priority based scheme, ie.
> compositor is higher priority than app.

Of course. Where it gets more complicated is when you have multiple userspace libraries processing the same data set in turn.

> The situation changes a bit, and becomes more cpu like perhaps, when
> you add long running compute and cpu-offload stuff

Exactly. And not only long running compute but high queue depth in general.

Consider for instance a user working in, say, Handbrake - they fire away a transcode in the background while continuing to fiddle with previews for the next video. The background transcode can queue up multiple frames' worth of work into the GPU, where each frame can be several milliseconds' worth of GPU time. Even in this scenario you get into context scheduling territory rather than interactive bursty / vsynced workloads.

Not only that, but to process a single frame we can have a buffer first be decoded on the GPU, processed on the CPU and then encoded on the GPU again. In the future a VPU block will appear in this chain as well. So you need a way to control the priority of all components in the chain in a consistent and usable manner.

You can make all the libraries and applications aware of course.. but realistically that will be hard and fragile. What I propose is for an operation like "nice 5 ffmpeg-transcode.sh ..." to just work in the background for the complete pipeline, just as the user intended.

>> I know of other hardware supporting more than three levels, but I need
>> to study more drm drivers to gain a complete picture. I only started
>> with msm since it looked simple. :)
>
> even in msm the # of priority levels is somewhat arbitrary.. but
> roughly it is that we tell the hw there is something higher priority
> to run, it waits a bit for a cooperative yield point (since force
> preemption is rather expensive for 3d, ie. there is a lot of state to
> save/restore, not just a few cpu registers), and then eventually if a
> cooperative yield point isn't hit it triggers a forced preemption.
> (Only on newer things, older gens only had cooperative yield points to
> work with.)
>
>>> Honestly, the combination of the fact that a6xx is the first gen
>>> shipping in consumer products with upstream driver (using drm/sched),
>>> and not having had time yet to implement hw preemption for a6xx yet,
>>> means not a whole lot of thought has gone into the current arrangement
>>> ;-)
>>
>> :)
>>
>> What kind of scheduling algorithm does your hardware have between those
>> priority levels?
>
> Like I said, it is strictly "thing A is higher priority than thing
> B".. there is no CSF or io-nice type thing. I guess since it is still
> the kernel that initiates the preemption, we could in theory implement

By this you mean you don't just fill the hw priority queues at will but make sure only a small amount of work is in each?

> something more clever. But I'm not entirely sure something more
> clever makes sense given the relatively high cost of forced preemption
> compared to CPU. Ofc I could be wrong, I've not given a lot of
> thought to it other than more limited scenarios (ie. compositor should
> be higher priority than app)

Dealing with a varying cost of preemption is a matter of selecting the right timeslice. But yes, smart(er) scheduling is a somewhat orthogonal problem to priority control.

Regards,

Tvrtko

>
> BR,
> -R
>
>>>> For instance if you exposed 4 instead of 12 on a respective platform,
>>>> would that be better or worse? Yes you could only map three directly
>>>> drm/sched and one would have to be "fake". Like:
>>>>
>>>> hw prio 0 -> drm/sched 2
>>>> hw prio 1 -> drm/sched 1
>>>> hw prio 2 -> drm/sched 0
>>>> hw prio 3 -> drm/sched 0
>>>>
>>>> Not saying that's nice either. Perhaps the answer is that drm/sched
>>>> needs more flexibility for instance if it wants to be widely used.
>>>
>>> I'm not sure what I'd add to drm/sched.. once it calls run_job()
>>> things are out of its hands, so really all it can do is re-order
>>> things prior to calling run_job() according to its internal priority
>>> levels. And that is still better than no re-ordering so it adds some
>>> value, even if not complete.
>>
>> Not sure about the value there - as mentioned before I see problems on
>> the uapi front with not all priorities being equal.
>>
>> Besides, priority order scheduling is kind of meh to me. Especially if
>> it only applies in the scheduling frontend. If frontend and backend
>> algorithms do not even match then it's even more weird.
>>
>> IMO sooner or later GPU scheduling will have to catch up with state of
>> the art from the CPU world and use priority as a hint for time sharing
>> decisions.
>
> Maybe.. that is a lot more sophisticated than the current situation of
> "queue A should have higher priority than queue B"
>
> OTOH actual preemption of GPU work is a lot more expensive than
> preempting a CPU thread, so not even sure if we should try and look at
> GPU and CPU scheduling the same way. (But so far I've only looked at
> it as "compositor should have higher priority than app")
>
> BR,
> -R
>
>>>> For instance in i915 uapi we have priority as int -1023 - +1023. And
>>>> matching implementation on some platforms, until the new ones which are
>>>> GuC firmware based, where we need to squash that to low/normal/high.
>>>
>>> hmm, that is a more awkward problem, since it sounds like you are
>>> mapping many more priority levels into a much smaller set of hw
>>> priority levels. Do you have separate drm_sched instances per hw
>>> priority level? If so you can do the same thing of using drm_sched
>>> priority levels to multiply # of hw priority levels, but ofc that is
>>> not perfect (and won't get you to 2k).
>>
>> We don't use drm/sched yet, I was just mentioning what we have in uapi.
>> But yes, our current scheduling backend can handle more than three levels.
>>
>>> But is there anything that actually *uses* that many levels of priority?
>>
>> From userspace no, there are only a few internal priority levels for
>> things like heartbeats the driver is sending to check engine health and
>> page flip priority boosts.
>>
>> Regards,
>>
>> Tvrtko