2024-04-23 21:33:59

by Adrián Larumbe

[permalink] [raw]
Subject: [PATCH v2 0/3] Support fdinfo runtime and memory stats on Panthor

This patch series enables userspace utilities like gputop and nvtop to
query a render context's fdinfo file and figure out rates of engine
and memory utilisation.

Changelog:
v2:
- Split original first patch in two, one for FW CS cycle and timestamp
calculations and job accounting memory management, and a second one
that enables fdinfo.
- Moved NUM_INSTRS_PER_SLOT to the file prelude
- Removed nelem variable from the group's struct definition.
- Precompute size of group's syncobj BO to avoid code duplication.
- Some minor nits.

Adrián Larumbe (3):
drm/panthor: introduce job cycle and timestamp accounting
drm/panthor: Add DRM fdinfo support
drm/panthor: Enable fdinfo for memory stats

drivers/gpu/drm/panthor/panthor_devfreq.c | 10 ++
drivers/gpu/drm/panthor/panthor_device.h | 11 ++
drivers/gpu/drm/panthor/panthor_drv.c | 31 ++++
drivers/gpu/drm/panthor/panthor_gem.c | 12 ++
drivers/gpu/drm/panthor/panthor_sched.c | 204 +++++++++++++++++++---
5 files changed, 244 insertions(+), 24 deletions(-)


base-commit: a6325ad47bc808aeb4c69ae36e0236c2c6d400b5
--
2.44.0



2024-04-23 21:34:07

by Adrián Larumbe

[permalink] [raw]
Subject: [PATCH v2 3/3] drm/panthor: Enable fdinfo for memory stats

When vm-binding an already-created BO, the entirety of its virtual size is
then backed by system memory, so its RSS is always the same as its virtual
size.

Also, we consider a PRIME imported BO to be resident if its matching
dma_buf has an open attachment, which means its backing storage had already
been allocated.

Signed-off-by: Adrián Larumbe <[email protected]>
---
drivers/gpu/drm/panthor/panthor_gem.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
index d6483266d0c2..386c0dfeeb5f 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.c
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -143,6 +143,17 @@ panthor_gem_prime_export(struct drm_gem_object *obj, int flags)
return drm_gem_prime_export(obj, flags);
}

+static enum drm_gem_object_status panthor_gem_status(struct drm_gem_object *obj)
+{
+ struct panthor_gem_object *bo = to_panthor_bo(obj);
+ enum drm_gem_object_status res = 0;
+
+ if (bo->base.base.import_attach || bo->base.pages)
+ res |= DRM_GEM_OBJECT_RESIDENT;
+
+ return res;
+}
+
static const struct drm_gem_object_funcs panthor_gem_funcs = {
.free = panthor_gem_free_object,
.print_info = drm_gem_shmem_object_print_info,
@@ -152,6 +163,7 @@ static const struct drm_gem_object_funcs panthor_gem_funcs = {
.vmap = drm_gem_shmem_object_vmap,
.vunmap = drm_gem_shmem_object_vunmap,
.mmap = panthor_gem_mmap,
+ .status = panthor_gem_status,
.export = panthor_gem_prime_export,
.vm_ops = &drm_gem_shmem_vm_ops,
};
--
2.44.0


2024-04-23 21:34:07

by Adrián Larumbe

[permalink] [raw]
Subject: [PATCH v2 2/3] drm/panthor: Add DRM fdinfo support

Drawing from the FW-calculated values in the previous commit, we can
increase the numbers for an open file by collecting them from finished jobs
when updating their group synchronisation objects.

Signed-off-by: Adrián Larumbe <[email protected]>
---
drivers/gpu/drm/panthor/panthor_devfreq.c | 10 +++++
drivers/gpu/drm/panthor/panthor_device.h | 11 ++++++
drivers/gpu/drm/panthor/panthor_drv.c | 31 +++++++++++++++
drivers/gpu/drm/panthor/panthor_sched.c | 46 +++++++++++++++++++++++
4 files changed, 98 insertions(+)

diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
index c6d3c327cc24..5eededaeade7 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.c
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
@@ -91,6 +91,7 @@ static int panthor_devfreq_get_dev_status(struct device *dev,
spin_lock_irqsave(&pdevfreq->lock, irqflags);

panthor_devfreq_update_utilization(pdevfreq);
+ ptdev->current_frequency = status->current_frequency;

status->total_time = ktime_to_ns(ktime_add(pdevfreq->busy_time,
pdevfreq->idle_time));
@@ -130,6 +131,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
struct panthor_devfreq *pdevfreq;
struct dev_pm_opp *opp;
unsigned long cur_freq;
+ unsigned long freq = ULONG_MAX;
int ret;

pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), GFP_KERNEL);
@@ -204,6 +206,14 @@ int panthor_devfreq_init(struct panthor_device *ptdev)

dev_pm_opp_put(opp);

+ /* Find the fastest defined rate */
+ opp = dev_pm_opp_find_freq_floor(dev, &freq);
+ if (IS_ERR(opp))
+ return PTR_ERR(opp);
+ ptdev->fast_rate = freq;
+
+ dev_pm_opp_put(opp);
+
/*
* Setup default thresholds for the simple_ondemand governor.
* The values are chosen based on experiments.
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 2fdd671b38fd..b5b5dfe3cafe 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -162,6 +162,14 @@ struct panthor_device {
*/
struct page *dummy_latest_flush;
} pm;
+
+ unsigned long current_frequency;
+ unsigned long fast_rate;
+};
+
+struct panthor_gpu_usage {
+ u64 time;
+ u64 cycles;
};

/**
@@ -176,6 +184,9 @@ struct panthor_file {

/** @groups: Scheduling group pool attached to this file. */
struct panthor_group_pool *groups;
+
+ /** @stats: cycle and timestamp measures for job execution. */
+ struct panthor_gpu_usage stats;
};

int panthor_device_init(struct panthor_device *ptdev);
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index b8a84f26b3ef..6d25385e02a1 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -3,12 +3,17 @@
/* Copyright 2019 Linaro, Ltd., Rob Herring <[email protected]> */
/* Copyright 2019 Collabora ltd. */

+#ifdef CONFIG_ARM_ARCH_TIMER
+#include <asm/arch_timer.h>
+#endif
+
#include <linux/list.h>
#include <linux/module.h>
#include <linux/of_platform.h>
#include <linux/pagemap.h>
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
+#include <linux/time64.h>

#include <drm/drm_debugfs.h>
#include <drm/drm_drv.h>
@@ -1351,6 +1356,30 @@ static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
return ret;
}

+static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
+ struct panthor_file *pfile,
+ struct drm_printer *p)
+{
+#ifdef CONFIG_ARM_ARCH_TIMER
+ drm_printf(p, "drm-engine-panthor:\t%llu ns\n",
+ DIV_ROUND_UP_ULL((pfile->stats.time * NSEC_PER_SEC),
+ arch_timer_get_cntfrq()));
+#endif
+ drm_printf(p, "drm-cycles-panthor:\t%llu\n", pfile->stats.cycles);
+ drm_printf(p, "drm-maxfreq-panthor:\t%lu Hz\n", ptdev->fast_rate);
+ drm_printf(p, "drm-curfreq-panthor:\t%lu Hz\n", ptdev->current_frequency);
+}
+
+static void panthor_show_fdinfo(struct drm_printer *p, struct drm_file *file)
+{
+ struct drm_device *dev = file->minor->dev;
+ struct panthor_device *ptdev = container_of(dev, struct panthor_device, base);
+
+ panthor_gpu_show_fdinfo(ptdev, file->driver_priv, p);
+
+ drm_show_memory_stats(p, file);
+}
+
static const struct file_operations panthor_drm_driver_fops = {
.open = drm_open,
.release = drm_release,
@@ -1360,6 +1389,7 @@ static const struct file_operations panthor_drm_driver_fops = {
.read = drm_read,
.llseek = noop_llseek,
.mmap = panthor_mmap,
+ .show_fdinfo = drm_show_fdinfo,
};

#ifdef CONFIG_DEBUG_FS
@@ -1378,6 +1408,7 @@ static const struct drm_driver panthor_drm_driver = {
DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
.open = panthor_open,
.postclose = panthor_postclose,
+ .show_fdinfo = panthor_show_fdinfo,
.ioctls = panthor_drm_driver_ioctls,
.num_ioctls = ARRAY_SIZE(panthor_drm_driver_ioctls),
.fops = &panthor_drm_driver_fops,
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 320dfa0388ba..9f1810f5cf4b 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -598,6 +598,18 @@ struct panthor_group {
size_t times_offset;
} syncobjs;

+ /** @fdinfo: Per-file total cycle and timestamp values reference. */
+ struct {
+ /** @data: Pointer to actual per-file sample data. */
+ struct panthor_gpu_usage *data;
+
+ /**
+ * @lock: Mutex to govern concurrent access from drm file's fdinfo callback
+ * and job post-completion processing function
+ */
+ struct mutex lock;
+ } fdinfo;
+
/** @state: Group state. */
enum panthor_group_state state;

@@ -859,6 +871,8 @@ static void group_release_work(struct work_struct *work)
struct panthor_device *ptdev = group->ptdev;
u32 i;

+ mutex_destroy(&group->fdinfo.lock);
+
for (i = 0; i < group->queue_count; i++)
group_free_queue(group, group->queues[i]);

@@ -2741,6 +2755,30 @@ void panthor_sched_post_reset(struct panthor_device *ptdev)
sched_queue_work(sched, sync_upd);
}

+static void update_fdinfo_stats(struct panthor_job *job)
+{
+ struct panthor_group *group = job->group;
+ struct panthor_queue *queue = group->queues[job->queue_idx];
+ struct panthor_device *ptdev = group->ptdev;
+ struct panthor_gpu_usage *fdinfo;
+ struct panthor_job_times *times;
+
+ drm_WARN_ON(&ptdev->base, job->ringbuf_idx >=
+ panthor_kernel_bo_size(queue->ringbuf) / (SLOTSIZE));
+
+ times = (struct panthor_job_times *)
+ ((unsigned long)group->syncobjs.bo->kmap + queue->time_offset +
+ (job->ringbuf_idx * sizeof(struct panthor_job_times)));
+
+ mutex_lock(&group->fdinfo.lock);
+ if ((group->fdinfo.data)) {
+ fdinfo = group->fdinfo.data;
+ fdinfo->cycles += times->cycles.after - times->cycles.before;
+ fdinfo->time += times->time.after - times->time.before;
+ }
+ mutex_unlock(&group->fdinfo.lock);
+}
+
static void group_sync_upd_work(struct work_struct *work)
{
struct panthor_group *group =
@@ -2776,6 +2814,7 @@ static void group_sync_upd_work(struct work_struct *work)
dma_fence_end_signalling(cookie);

list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
+ update_fdinfo_stats(job);
list_del_init(&job->node);
panthor_job_put(&job->base);
}
@@ -3240,6 +3279,9 @@ int panthor_group_create(struct panthor_file *pfile,
}
mutex_unlock(&sched->reset.lock);

+ group->fdinfo.data = &pfile->stats;
+ mutex_init(&group->fdinfo.lock);
+
return gid;

err_put_group:
@@ -3279,6 +3321,10 @@ int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle)
mutex_unlock(&sched->lock);
mutex_unlock(&sched->reset.lock);

+ mutex_lock(&group->fdinfo.lock);
+ group->fdinfo.data = NULL;
+ mutex_unlock(&group->fdinfo.lock);
+
group_put(group);
return 0;
}
--
2.44.0


2024-04-23 21:34:16

by Adrián Larumbe

[permalink] [raw]
Subject: [PATCH v2 1/3] drm/panthor: introduce job cycle and timestamp accounting

Enable calculations of job submission times in clock cycles and wall
time. This is done by expanding the boilerplate command stream when running
a job to include instructions that compute said times right before an after
a user CS.

Those numbers are stored in the queue's group's sync objects BO, right
after them. Because the queues in a group might have a different number of
slots, one must keep track of the overall slot tally when reckoning the
offset of a queue's time sample structs, one for each slot.

NUM_INSTRS_PER_SLOT had to be increased to 32 because of adding new FW
instructions for storing and subtracting the cycle counter and timestamp
register, and it must always remain a power of two.

This commit is done in preparation for enabling DRM fdinfo support in the
Panthor driver, which depends on the numbers calculated herein.

Signed-off-by: Adrián Larumbe <[email protected]>
---
drivers/gpu/drm/panthor/panthor_sched.c | 158 ++++++++++++++++++++----
1 file changed, 134 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index b3a51a6de523..320dfa0388ba 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -93,6 +93,9 @@
#define MIN_CSGS 3
#define MAX_CSG_PRIO 0xf

+#define NUM_INSTRS_PER_SLOT 32
+#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
+
struct panthor_group;

/**
@@ -466,6 +469,9 @@ struct panthor_queue {
*/
struct list_head in_flight_jobs;
} fence_ctx;
+
+ /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
+ unsigned long time_offset;
};

/**
@@ -580,7 +586,17 @@ struct panthor_group {
* One sync object per queue. The position of the sync object is
* determined by the queue index.
*/
- struct panthor_kernel_bo *syncobjs;
+
+ struct {
+ /** @bo: Kernel BO holding the sync objects. */
+ struct panthor_kernel_bo *bo;
+
+ /**
+ * @times_offset: Beginning of panthor_job_times struct samples after
+ * the group's array of sync objects.
+ */
+ size_t times_offset;
+ } syncobjs;

/** @state: Group state. */
enum panthor_group_state state;
@@ -639,6 +655,18 @@ struct panthor_group {
struct list_head wait_node;
};

+struct panthor_job_times {
+ struct {
+ u64 before;
+ u64 after;
+ } cycles;
+
+ struct {
+ u64 before;
+ u64 after;
+ } time;
+};
+
/**
* group_queue_work() - Queue a group work
* @group: Group to queue the work for.
@@ -718,6 +746,9 @@ struct panthor_job {
/** @queue_idx: Index of the queue inside @group. */
u32 queue_idx;

+ /** @ringbuf_idx: Index of the ringbuffer inside @queue. */
+ u32 ringbuf_idx;
+
/** @call_info: Information about the userspace command stream call. */
struct {
/** @start: GPU address of the userspace command stream. */
@@ -833,7 +864,7 @@ static void group_release_work(struct work_struct *work)

panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), group->suspend_buf);
panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), group->protm_suspend_buf);
- panthor_kernel_bo_destroy(group->vm, group->syncobjs);
+ panthor_kernel_bo_destroy(group->vm, group->syncobjs.bo);

panthor_vm_put(group->vm);
kfree(group);
@@ -1924,8 +1955,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
}
}

-#define NUM_INSTRS_PER_SLOT 16
-
static void
group_term_post_processing(struct panthor_group *group)
{
@@ -1962,7 +1991,7 @@ group_term_post_processing(struct panthor_group *group)
spin_unlock(&queue->fence_ctx.lock);

/* Manually update the syncobj seqno to unblock waiters. */
- syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
+ syncobj = group->syncobjs.bo->kmap + (i * sizeof(*syncobj));
syncobj->status = ~0;
syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
sched_queue_work(group->ptdev->scheduler, sync_upd);
@@ -2729,7 +2758,7 @@ static void group_sync_upd_work(struct work_struct *work)
if (!queue)
continue;

- syncobj = group->syncobjs->kmap + (queue_idx * sizeof(*syncobj));
+ syncobj = group->syncobjs.bo->kmap + (queue_idx * sizeof(*syncobj));

spin_lock(&queue->fence_ctx.lock);
list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
@@ -2764,15 +2793,23 @@ queue_run_job(struct drm_sched_job *sched_job)
struct panthor_scheduler *sched = ptdev->scheduler;
u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
+ u32 ringbuf_index = ringbuf_insert / (SLOTSIZE);
u64 addr_reg = ptdev->csif_info.cs_reg_count -
ptdev->csif_info.unpreserved_cs_reg_count;
u64 val_reg = addr_reg + 2;
- u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
- job->queue_idx * sizeof(struct panthor_syncobj_64b);
+ u64 cycle_reg = addr_reg;
+ u64 time_reg = val_reg;
+ u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) +
+ job->queue_idx * sizeof(struct panthor_syncobj_64b);
+ u64 times_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) + queue->time_offset +
+ (ringbuf_index * sizeof(struct panthor_job_times));
+
u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
struct dma_fence *done_fence;
int ret;

+ drm_WARN_ON(&ptdev->base, ringbuf_insert >= ringbuf_size);
+
u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
/* MOV32 rX+2, cs.latest_flush */
(2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
@@ -2780,6 +2817,18 @@ queue_run_job(struct drm_sched_job *sched_job)
/* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
(36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,

+ /* MOV48 rX:rX+1, cycles_offset */
+ (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.before)),
+
+ /* MOV48 rX:rX+1, time_offset */
+ (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.before)),
+
+ /* STORE_STATE cycles */
+ (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
+
+ /* STORE_STATE timer */
+ (40ull << 56) | (time_reg << 40) | (0ll << 32),
+
/* MOV48 rX:rX+1, cs.start */
(1ull << 56) | (addr_reg << 48) | job->call_info.start,

@@ -2792,6 +2841,18 @@ queue_run_job(struct drm_sched_job *sched_job)
/* CALL rX:rX+1, rX+2 */
(32ull << 56) | (addr_reg << 40) | (val_reg << 32),

+ /* MOV48 rX:rX+1, cycles_offset */
+ (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.after)),
+
+ /* MOV48 rX:rX+1, time_offset */
+ (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.after)),
+
+ /* STORE_STATE cycles */
+ (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
+
+ /* STORE_STATE timer */
+ (40ull << 56) | (time_reg << 40) | (0ll << 32),
+
/* MOV48 rX:rX+1, sync_addr */
(1ull << 56) | (addr_reg << 48) | sync_addr,

@@ -2846,6 +2907,7 @@ queue_run_job(struct drm_sched_job *sched_job)

job->ringbuf.start = queue->iface.input->insert;
job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
+ job->ringbuf_idx = ringbuf_index;

/* Make sure the ring buffer is updated before the INSERT
* register.
@@ -2936,7 +2998,8 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {

static struct panthor_queue *
group_create_queue(struct panthor_group *group,
- const struct drm_panthor_queue_create *args)
+ const struct drm_panthor_queue_create *args,
+ unsigned int slots_so_far)
{
struct drm_gpu_scheduler *drm_sched;
struct panthor_queue *queue;
@@ -2987,9 +3050,12 @@ group_create_queue(struct panthor_group *group,
goto err_free_queue;
}

+ queue->time_offset = group->syncobjs.times_offset +
+ (slots_so_far * sizeof(struct panthor_job_times));
+
ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
group->ptdev->scheduler->wq, 1,
- args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
+ args->ringbuf_size / SLOTSIZE,
0, msecs_to_jiffies(JOB_TIMEOUT_MS),
group->ptdev->reset.wq,
NULL, "panthor-queue", group->ptdev->base.dev);
@@ -3017,7 +3083,9 @@ int panthor_group_create(struct panthor_file *pfile,
struct panthor_scheduler *sched = ptdev->scheduler;
struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
struct panthor_group *group = NULL;
+ unsigned int total_slots;
u32 gid, i, suspend_size;
+ size_t syncobj_bo_size;
int ret;

if (group_args->pad)
@@ -3083,33 +3151,75 @@ int panthor_group_create(struct panthor_file *pfile,
goto err_put_group;
}

- group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
- group_args->queues.count *
- sizeof(struct panthor_syncobj_64b),
- DRM_PANTHOR_BO_NO_MMAP,
- DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
- DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
- PANTHOR_VM_KERNEL_AUTO_VA);
- if (IS_ERR(group->syncobjs)) {
- ret = PTR_ERR(group->syncobjs);
+ /*
+ * Need to add size for the panthor_job_times structs, as many as the sum
+ * of the number of job slots for every single queue ringbuffer.
+ */
+ for (i = 0, total_slots = 0; i < group_args->queues.count; i++)
+ total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
+
+ syncobj_bo_size = (group_args->queues.count * sizeof(struct panthor_syncobj_64b))
+ + (total_slots * sizeof(struct panthor_job_times));
+
+ /*
+ * Memory layout of group's syncobjs BO
+ * group->syncobjs.bo {
+ * struct panthor_syncobj_64b sync1;
+ * struct panthor_syncobj_64b sync2;
+ * ...
+ * As many as group_args->queues.count
+ * ...
+ * struct panthor_syncobj_64b syncn;
+ * struct panthor_job_times queue1_slot1
+ * struct panthor_job_times queue1_slot2
+ * ...
+ * As many as queue[i].ringbuf_size / SLOTSIZE
+ * ...
+ * struct panthor_job_times queue1_slotP
+ * ...
+ * As many as group_args->queues.count
+ * ...
+ * struct panthor_job_times queueN_slot1
+ * struct panthor_job_times queueN_slot2
+ * ...
+ * As many as queue[n].ringbuf_size / SLOTSIZE
+ * struct panthor_job_times queueN_slotQ
+ *
+ * Linearly, group->syncobjs.bo = {syncojb1,..,syncobjN,
+ * {queue1 = {js1,..,jsP},..,queueN = {js1,..,jsQ}}}
+ * }
+ *
+ */
+
+ group->syncobjs.bo = panthor_kernel_bo_create(ptdev, group->vm,
+ syncobj_bo_size,
+ DRM_PANTHOR_BO_NO_MMAP,
+ DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
+ DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
+ PANTHOR_VM_KERNEL_AUTO_VA);
+ if (IS_ERR(group->syncobjs.bo)) {
+ ret = PTR_ERR(group->syncobjs.bo);
goto err_put_group;
}

- ret = panthor_kernel_bo_vmap(group->syncobjs);
+ ret = panthor_kernel_bo_vmap(group->syncobjs.bo);
if (ret)
goto err_put_group;

- memset(group->syncobjs->kmap, 0,
- group_args->queues.count * sizeof(struct panthor_syncobj_64b));
+ memset(group->syncobjs.bo->kmap, 0, syncobj_bo_size);
+
+ group->syncobjs.times_offset =
+ group_args->queues.count * sizeof(struct panthor_syncobj_64b);

- for (i = 0; i < group_args->queues.count; i++) {
- group->queues[i] = group_create_queue(group, &queue_args[i]);
+ for (i = 0, total_slots = 0; i < group_args->queues.count; i++) {
+ group->queues[i] = group_create_queue(group, &queue_args[i], total_slots);
if (IS_ERR(group->queues[i])) {
ret = PTR_ERR(group->queues[i]);
group->queues[i] = NULL;
goto err_put_group;
}

+ total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
group->queue_count++;
}

--
2.44.0


2024-04-24 06:49:04

by Boris Brezillon

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] drm/panthor: introduce job cycle and timestamp accounting

On Tue, 23 Apr 2024 22:32:34 +0100
Adrián Larumbe <[email protected]> wrote:

> Enable calculations of job submission times in clock cycles and wall
> time. This is done by expanding the boilerplate command stream when running
> a job to include instructions that compute said times right before an after
> a user CS.
>
> Those numbers are stored in the queue's group's sync objects BO, right
> after them. Because the queues in a group might have a different number of
> slots, one must keep track of the overall slot tally when reckoning the
> offset of a queue's time sample structs, one for each slot.
>
> NUM_INSTRS_PER_SLOT had to be increased to 32 because of adding new FW
> instructions for storing and subtracting the cycle counter and timestamp
> register, and it must always remain a power of two.
>
> This commit is done in preparation for enabling DRM fdinfo support in the
> Panthor driver, which depends on the numbers calculated herein.
>
> Signed-off-by: Adrián Larumbe <[email protected]>
> ---
> drivers/gpu/drm/panthor/panthor_sched.c | 158 ++++++++++++++++++++----
> 1 file changed, 134 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index b3a51a6de523..320dfa0388ba 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -93,6 +93,9 @@
> #define MIN_CSGS 3
> #define MAX_CSG_PRIO 0xf
>
> +#define NUM_INSTRS_PER_SLOT 32
> +#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))

Given everyone agreed on the profiling sysfs knob for Panfrost, I'm
tempted to make the profiling optional here as well, so we can save
space on the CS ring buffers when profiling is disabled. This means
adjusting the 'credits' parameter we pass to drm_sched_job_init()
accordingly, with one credit counting for an instruction (or a block of
16 instructions to keep things naturally cache-line aligned). You'll
also need to change the 'credit_limit' passed to drm_sched_init().
>


2024-04-24 15:49:09

by Liviu Dudau

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] drm/panthor: introduce job cycle and timestamp accounting

Hi Adrián,

On Tue, Apr 23, 2024 at 10:32:34PM +0100, Adrián Larumbe wrote:
> Enable calculations of job submission times in clock cycles and wall
> time. This is done by expanding the boilerplate command stream when running
> a job to include instructions that compute said times right before an after
> a user CS.
>
> Those numbers are stored in the queue's group's sync objects BO, right
> after them. Because the queues in a group might have a different number of
> slots, one must keep track of the overall slot tally when reckoning the
> offset of a queue's time sample structs, one for each slot.
>
> NUM_INSTRS_PER_SLOT had to be increased to 32 because of adding new FW
> instructions for storing and subtracting the cycle counter and timestamp
> register, and it must always remain a power of two.
>
> This commit is done in preparation for enabling DRM fdinfo support in the
> Panthor driver, which depends on the numbers calculated herein.
>
> Signed-off-by: Adrián Larumbe <[email protected]>
> ---
> drivers/gpu/drm/panthor/panthor_sched.c | 158 ++++++++++++++++++++----
> 1 file changed, 134 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index b3a51a6de523..320dfa0388ba 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -93,6 +93,9 @@
> #define MIN_CSGS 3
> #define MAX_CSG_PRIO 0xf
>
> +#define NUM_INSTRS_PER_SLOT 32
> +#define SLOTSIZE (NUM_INSTRS_PER_SLOT * sizeof(u64))
> +
> struct panthor_group;
>
> /**
> @@ -466,6 +469,9 @@ struct panthor_queue {
> */
> struct list_head in_flight_jobs;
> } fence_ctx;
> +
> + /** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
> + unsigned long time_offset;

Minor nitpick: I would call it job_times_offset. Or stats_offset. One letter difference
versus syncobjs' times_offset gets harder to read sometimes.

> };
>
> /**
> @@ -580,7 +586,17 @@ struct panthor_group {
> * One sync object per queue. The position of the sync object is
> * determined by the queue index.
> */
> - struct panthor_kernel_bo *syncobjs;
> +
> + struct {
> + /** @bo: Kernel BO holding the sync objects. */
> + struct panthor_kernel_bo *bo;
> +
> + /**
> + * @times_offset: Beginning of panthor_job_times struct samples after
> + * the group's array of sync objects.
> + */
> + size_t times_offset;
> + } syncobjs;
>
> /** @state: Group state. */
> enum panthor_group_state state;
> @@ -639,6 +655,18 @@ struct panthor_group {
> struct list_head wait_node;
> };
>
> +struct panthor_job_times {
> + struct {
> + u64 before;
> + u64 after;
> + } cycles;
> +
> + struct {
> + u64 before;
> + u64 after;
> + } time;
> +};
> +
> /**
> * group_queue_work() - Queue a group work
> * @group: Group to queue the work for.
> @@ -718,6 +746,9 @@ struct panthor_job {
> /** @queue_idx: Index of the queue inside @group. */
> u32 queue_idx;
>
> + /** @ringbuf_idx: Index of the ringbuffer inside @queue. */
> + u32 ringbuf_idx;
> +
> /** @call_info: Information about the userspace command stream call. */
> struct {
> /** @start: GPU address of the userspace command stream. */
> @@ -833,7 +864,7 @@ static void group_release_work(struct work_struct *work)
>
> panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), group->suspend_buf);
> panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), group->protm_suspend_buf);
> - panthor_kernel_bo_destroy(group->vm, group->syncobjs);
> + panthor_kernel_bo_destroy(group->vm, group->syncobjs.bo);
>
> panthor_vm_put(group->vm);
> kfree(group);
> @@ -1924,8 +1955,6 @@ tick_ctx_init(struct panthor_scheduler *sched,
> }
> }
>
> -#define NUM_INSTRS_PER_SLOT 16
> -
> static void
> group_term_post_processing(struct panthor_group *group)
> {
> @@ -1962,7 +1991,7 @@ group_term_post_processing(struct panthor_group *group)
> spin_unlock(&queue->fence_ctx.lock);
>
> /* Manually update the syncobj seqno to unblock waiters. */
> - syncobj = group->syncobjs->kmap + (i * sizeof(*syncobj));
> + syncobj = group->syncobjs.bo->kmap + (i * sizeof(*syncobj));
> syncobj->status = ~0;
> syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
> sched_queue_work(group->ptdev->scheduler, sync_upd);
> @@ -2729,7 +2758,7 @@ static void group_sync_upd_work(struct work_struct *work)
> if (!queue)
> continue;
>
> - syncobj = group->syncobjs->kmap + (queue_idx * sizeof(*syncobj));
> + syncobj = group->syncobjs.bo->kmap + (queue_idx * sizeof(*syncobj));
>
> spin_lock(&queue->fence_ctx.lock);
> list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
> @@ -2764,15 +2793,23 @@ queue_run_job(struct drm_sched_job *sched_job)
> struct panthor_scheduler *sched = ptdev->scheduler;
> u32 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
> u32 ringbuf_insert = queue->iface.input->insert & (ringbuf_size - 1);
> + u32 ringbuf_index = ringbuf_insert / (SLOTSIZE);
> u64 addr_reg = ptdev->csif_info.cs_reg_count -
> ptdev->csif_info.unpreserved_cs_reg_count;
> u64 val_reg = addr_reg + 2;
> - u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs) +
> - job->queue_idx * sizeof(struct panthor_syncobj_64b);
> + u64 cycle_reg = addr_reg;
> + u64 time_reg = val_reg;
> + u64 sync_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) +
> + job->queue_idx * sizeof(struct panthor_syncobj_64b);
> + u64 times_addr = panthor_kernel_bo_gpuva(group->syncobjs.bo) + queue->time_offset +
> + (ringbuf_index * sizeof(struct panthor_job_times));
> +
> u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> struct dma_fence *done_fence;
> int ret;
>
> + drm_WARN_ON(&ptdev->base, ringbuf_insert >= ringbuf_size);

I think we should return an error here, rather than warn. What would be the point to continue?

> +
> u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> /* MOV32 rX+2, cs.latest_flush */
> (2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> @@ -2780,6 +2817,18 @@ queue_run_job(struct drm_sched_job *sched_job)
> /* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> (36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
>
> + /* MOV48 rX:rX+1, cycles_offset */
> + (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.before)),
> +
> + /* MOV48 rX:rX+1, time_offset */
> + (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.before)),
> +
> + /* STORE_STATE cycles */
> + (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
> +
> + /* STORE_STATE timer */
> + (40ull << 56) | (time_reg << 40) | (0ll << 32),
> +
> /* MOV48 rX:rX+1, cs.start */
> (1ull << 56) | (addr_reg << 48) | job->call_info.start,
>
> @@ -2792,6 +2841,18 @@ queue_run_job(struct drm_sched_job *sched_job)
> /* CALL rX:rX+1, rX+2 */
> (32ull << 56) | (addr_reg << 40) | (val_reg << 32),
>
> + /* MOV48 rX:rX+1, cycles_offset */
> + (1ull << 56) | (cycle_reg << 48) | (times_addr + offsetof(struct panthor_job_times, cycles.after)),
> +
> + /* MOV48 rX:rX+1, time_offset */
> + (1ull << 56) | (time_reg << 48) | (times_addr + offsetof(struct panthor_job_times, time.after)),
> +
> + /* STORE_STATE cycles */
> + (40ull << 56) | (cycle_reg << 40) | (1ll << 32),
> +
> + /* STORE_STATE timer */
> + (40ull << 56) | (time_reg << 40) | (0ll << 32),
> +
> /* MOV48 rX:rX+1, sync_addr */
> (1ull << 56) | (addr_reg << 48) | sync_addr,
>
> @@ -2846,6 +2907,7 @@ queue_run_job(struct drm_sched_job *sched_job)
>
> job->ringbuf.start = queue->iface.input->insert;
> job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> + job->ringbuf_idx = ringbuf_index;
>
> /* Make sure the ring buffer is updated before the INSERT
> * register.
> @@ -2936,7 +2998,8 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>
> static struct panthor_queue *
> group_create_queue(struct panthor_group *group,
> - const struct drm_panthor_queue_create *args)
> + const struct drm_panthor_queue_create *args,
> + unsigned int slots_so_far)
> {
> struct drm_gpu_scheduler *drm_sched;
> struct panthor_queue *queue;
> @@ -2987,9 +3050,12 @@ group_create_queue(struct panthor_group *group,
> goto err_free_queue;
> }
>
> + queue->time_offset = group->syncobjs.times_offset +
> + (slots_so_far * sizeof(struct panthor_job_times));
> +
> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
> group->ptdev->scheduler->wq, 1,
> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> + args->ringbuf_size / SLOTSIZE,
> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> group->ptdev->reset.wq,
> NULL, "panthor-queue", group->ptdev->base.dev);
> @@ -3017,7 +3083,9 @@ int panthor_group_create(struct panthor_file *pfile,
> struct panthor_scheduler *sched = ptdev->scheduler;
> struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
> struct panthor_group *group = NULL;
> + unsigned int total_slots;
> u32 gid, i, suspend_size;
> + size_t syncobj_bo_size;
> int ret;
>
> if (group_args->pad)
> @@ -3083,33 +3151,75 @@ int panthor_group_create(struct panthor_file *pfile,
> goto err_put_group;
> }
>
> - group->syncobjs = panthor_kernel_bo_create(ptdev, group->vm,
> - group_args->queues.count *
> - sizeof(struct panthor_syncobj_64b),
> - DRM_PANTHOR_BO_NO_MMAP,
> - DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> - DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> - PANTHOR_VM_KERNEL_AUTO_VA);
> - if (IS_ERR(group->syncobjs)) {
> - ret = PTR_ERR(group->syncobjs);
> + /*
> + * Need to add size for the panthor_job_times structs, as many as the sum
> + * of the number of job slots for every single queue ringbuffer.
> + */
> + for (i = 0, total_slots = 0; i < group_args->queues.count; i++)
> + total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
> +
> + syncobj_bo_size = (group_args->queues.count * sizeof(struct panthor_syncobj_64b))
> + + (total_slots * sizeof(struct panthor_job_times));
> +
> + /*
> + * Memory layout of group's syncobjs BO
> + * group->syncobjs.bo {
> + * struct panthor_syncobj_64b sync1;
> + * struct panthor_syncobj_64b sync2;
> + * ...
> + * As many as group_args->queues.count
> + * ...
> + * struct panthor_syncobj_64b syncn;
> + * struct panthor_job_times queue1_slot1
> + * struct panthor_job_times queue1_slot2
> + * ...
> + * As many as queue[i].ringbuf_size / SLOTSIZE
> + * ...
> + * struct panthor_job_times queue1_slotP
> + * ...
> + * As many as group_args->queues.count
> + * ...
> + * struct panthor_job_times queueN_slot1
> + * struct panthor_job_times queueN_slot2
> + * ...
> + * As many as queue[n].ringbuf_size / SLOTSIZE
> + * struct panthor_job_times queueN_slotQ
> + *
> + * Linearly, group->syncobjs.bo = {syncojb1,..,syncobjN,
> + * {queue1 = {js1,..,jsP},..,queueN = {js1,..,jsQ}}}
> + * }
> + *
> + */
> +
> + group->syncobjs.bo = panthor_kernel_bo_create(ptdev, group->vm,
> + syncobj_bo_size,
> + DRM_PANTHOR_BO_NO_MMAP,
> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> + PANTHOR_VM_KERNEL_AUTO_VA);
> + if (IS_ERR(group->syncobjs.bo)) {
> + ret = PTR_ERR(group->syncobjs.bo);
> goto err_put_group;
> }
>
> - ret = panthor_kernel_bo_vmap(group->syncobjs);
> + ret = panthor_kernel_bo_vmap(group->syncobjs.bo);
> if (ret)
> goto err_put_group;
>
> - memset(group->syncobjs->kmap, 0,
> - group_args->queues.count * sizeof(struct panthor_syncobj_64b));
> + memset(group->syncobjs.bo->kmap, 0, syncobj_bo_size);
> +
> + group->syncobjs.times_offset =
> + group_args->queues.count * sizeof(struct panthor_syncobj_64b);
>
> - for (i = 0; i < group_args->queues.count; i++) {
> - group->queues[i] = group_create_queue(group, &queue_args[i]);
> + for (i = 0, total_slots = 0; i < group_args->queues.count; i++) {
> + group->queues[i] = group_create_queue(group, &queue_args[i], total_slots);
> if (IS_ERR(group->queues[i])) {
> ret = PTR_ERR(group->queues[i]);
> group->queues[i] = NULL;
> goto err_put_group;
> }
>
> + total_slots += (queue_args[i].ringbuf_size / (SLOTSIZE));
> group->queue_count++;
> }
>
> --
> 2.44.0
>

With those comments addressed,
Reviewed-by: Liviu Dudau <[email protected]>

Best regards,
Liviu


--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯

2024-04-24 18:03:27

by Liviu Dudau

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] drm/panthor: Enable fdinfo for memory stats

Hello,

On Tue, Apr 23, 2024 at 10:32:36PM +0100, Adrián Larumbe wrote:
> When vm-binding an already-created BO, the entirety of its virtual size is
> then backed by system memory, so its RSS is always the same as its virtual
> size.

How is that relevant to this patch? Or to put it differently: how are your
words describing your code change here?

>
> Also, we consider a PRIME imported BO to be resident if its matching
> dma_buf has an open attachment, which means its backing storage had already
> been allocated.

Reviewed-by: Liviu Dudau <[email protected]>

Best regards,
Liviu

>
> Signed-off-by: Adrián Larumbe <[email protected]>
> ---
> drivers/gpu/drm/panthor/panthor_gem.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> index d6483266d0c2..386c0dfeeb5f 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -143,6 +143,17 @@ panthor_gem_prime_export(struct drm_gem_object *obj, int flags)
> return drm_gem_prime_export(obj, flags);
> }
>
> +static enum drm_gem_object_status panthor_gem_status(struct drm_gem_object *obj)
> +{
> + struct panthor_gem_object *bo = to_panthor_bo(obj);
> + enum drm_gem_object_status res = 0;
> +
> + if (bo->base.base.import_attach || bo->base.pages)
> + res |= DRM_GEM_OBJECT_RESIDENT;
> +
> + return res;
> +}
> +
> static const struct drm_gem_object_funcs panthor_gem_funcs = {
> .free = panthor_gem_free_object,
> .print_info = drm_gem_shmem_object_print_info,
> @@ -152,6 +163,7 @@ static const struct drm_gem_object_funcs panthor_gem_funcs = {
> .vmap = drm_gem_shmem_object_vmap,
> .vunmap = drm_gem_shmem_object_vunmap,
> .mmap = panthor_gem_mmap,
> + .status = panthor_gem_status,
> .export = panthor_gem_prime_export,
> .vm_ops = &drm_gem_shmem_vm_ops,
> };
> --
> 2.44.0
>

--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯