2023-01-12 17:45:14

by Tvrtko Ursulin

Subject: [RFC v3 00/12] DRM scheduling cgroup controller

From: Tvrtko Ursulin <[email protected]>

This series contains a proposal for a DRM scheduling cgroup controller which
implements weight based hierarchical GPU time budgeting, similar in concept
to some of the existing controllers.

Motivation mostly comes from my earlier proposal where I identified that GPU
scheduling lags significantly behind what is available for CPU and IO. Whereas
back then I was proposing to somehow tie this in with process nice, the
feedback was mostly that people wanted cgroups. So here it is - in the world of
heterogeneous computing pipelines I think it is time to do something about this
gap.

Code is not finished but should survive some light experimentation. I am
sharing it early since the topic has been controversial in the past. I hope to
demonstrate there are gains to be had in real world usage(*), today, and that
the concepts the proposal relies on are well enough established and stable.

*) Specifically under ChromeOS which uses cgroups to control CPU bandwidth for
VMs based on the window focus status. It can be demonstrated how GPU
scheduling control can easily be integrated into that setup.

*) Another real world example later in the cover letter.

There should be no conflict between this proposal and any efforts to implement
a memory usage based controller. The skeleton DRM cgroup controller is
deliberately a pure skeleton patch to which any further functionality can be
added with no real conflicts. [In fact, perhaps scheduling is even easier to
deal with than memory accounting.]

Structure of the series is as follows:

1-2) Improve client ownership tracking in DRM core.
3) Add a skeleton DRM cgroup controller with no functionality.
4-9) Lay down some infrastructure to enable the controller.
10) The controller itself.
11-12) i915 support for the controller.

The proposal defines a delegation of duties between the three parties: cgroup
controller, DRM core and individual drivers. Two-way communication interfaces
are then defined to enable the delegation to work.

DRM scheduling soft limits
~~~~~~~~~~~~~~~~~~~~~~~~~~

Because of the heterogeneous hardware and driver DRM capabilities, soft limits
are implemented as a loose co-operative (bi-directional) interface between the
controller and DRM core.

The controller configures the GPU time allowed per group and periodically scans
the tasks belonging to each group to detect the over budget condition, at which point it
invokes a callback notifying the DRM core of the condition.

DRM core provides an API to query per process GPU utilization and a second API to
receive notification from the cgroup controller when the group enters or exits
the over budget condition.

Individual DRM drivers which implement the interface are expected to act on this
in a best-effort manner only. There are no guarantees that the soft limits
will be respected.
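
A minimal sketch of this pair of core entry points, with the query signature
taken from patch 6 and the notification signature inferred from its call site
in patch 10 (the return type of the latter is an assumption here):

/* Controller -> DRM core: total GPU time consumed by all DRM clients
 * belonging to the process identified by pid. (Patch 6.) */
u64 drm_pid_get_active_time_us(struct pid *pid);

/* Controller -> DRM core: notify that the group this process belongs to
 * went over (usage > budget), or came back under, its GPU time budget.
 * (Patch 7; return type assumed void.) */
void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget);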

DRM scheduling soft limits interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

drm.weight
Standard cgroup weight based control [1, 10000] used to configure the
relative distribution of GPU time between the sibling groups.

drm.period_us (Most probably only a debugging aid during RFC phase.)
An integer representing the period with which the controller should look
at the GPU usage by the group and potentially send the over/under budget
signal.
Value of zero (default) disables the soft limit checking.
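
As an example of the weight semantics: assuming a parent group with three
children configured with weights 100 (the default), 100 and 200, under
sustained contention the children would be allotted 25%, 25% and 50% of the
parent's GPU time budget respectively, with each child's budget in turn split
between its own children in the same proportional manner.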

This builds upon the per client GPU utilisation work which landed recently for a
few drivers. My thinking is that, in principle, the intersection of drivers
which support both that and some sort of scheduling control, like priorities,
could also support this controller.

Another really interesting angle for this controller is that it mimics the same
control method used by the CPU scheduler, that is proportional/weight based
GPU time budgeting. This makes it easy to configure and does not require a new
mental model.

However, as the introduction mentions, GPUs are much more heterogeneous and
therefore the controller uses very "soft" wording as to what it promises. The
general statement is that it can define budgets, notify clients when they are
over them, and let individual drivers implement best effort handling of those
conditions.

Delegation of duties in the implementation goes like this (with the driver
side opt-in sketched in code after the list):

* DRM cgroup controller implements the control files and the scanning loop.
* DRM core is required to track all DRM clients belonging to processes so it
can answer how much GPU time a process is using when asked.
* DRM core also provides a callback which the controller will call when a
certain process is over budget.
* Individual drivers need to implement two similar hooks which work on a
single DRM client: an over budget callback and a GPU utilisation query.
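
For illustration, here is the driver side opt-in as it appears verbatim in the
i915 patches later in the series (patches 11 and 12):

#ifdef CONFIG_CGROUP_DRM
static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
	.active_time_us = i915_drm_cgroup_get_active_time_us,
	.signal_budget	= i915_drm_cgroup_signal_budget,
};
#endif

Drivers which do not provide drm_cgroup_ops are simply not tracked by the
controller (see patch 8).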

What I have demonstrated in practice is that when wired to i915, in a really
primitive way where the over-budget condition simply lowers the scheduling
priority, the concept can be almost as effective as the static priority
control. I say almost because the design where budget control depends on the
periodic usage scanning has a fundamental delay, so responsiveness will depend
on the scanning period, which may or may not be a problem for a particular use
case.

There are also interesting conversations to be had around the mental model of
GPU usage as a single number when faced with GPUs which have different
execution engines. To an extent this is similar to the multi-core and cgroup
CPU controller problems, but definitely goes further than that.

I deliberately did not want to include any such complications in the controller
itself and left the individual drivers to handle it. For instance, the i915
over-budget callback does nothing unless the client's GPU usage is on a
physical engine which is oversubscribed. This enables multiple clients to be
harmlessly over budget, as long as they are not competing for the same GPU
resource.

Example usage from within a Linux desktop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard Linux distributions like Ubuntu already use cgroups heavily for
session management and that could easily be extended with the DRM controller.

After logging into the system graphically we can enable the DRM controller
throughout the cgroups hierarchy:

echo +drm > /sys/fs/cgroup/cgroup.subtree_control
echo +drm > /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo +drm > /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.subtree_control

Next we will open two SSH sessions, just so separate cgroups are handily created
by systemd for this experiment.

Roughly simultaneously we run the following two benchmarks in each session
respectively:

1)
./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

2)
vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

(The only reason for vsync off here is that I struggled to find an easily
runnable and demanding enough benchmark, or a screen large enough to make
even simpler ones demanding.)

With this test we get 252fps from GpuTest and 96fps from GfxBenchmark.

Premise here is that one of these GPU intensive benchmarks is intended to be run
by the user with lower priority. Imagine kicking off some background compute
processing and continuing to use the UI for other tasks. Hence the user will now
re-run the test by first lowering the weight control of the first session (DRM
cgroup):

1)
echo 50 | sudo tee /sys/fs/cgroup/`cut -d':' -f3 /proc/self/cgroup`/drm.weight
./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

2)
vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

In this case we will see that GpuTest has recorded 208fps (~18% down) and
GfxBenchmark 114fps (~19% up), demonstrating that even a very simple approach of
wiring up i915 to the DRM cgroup controller can enable external GPU scheduling
control.

* Note here that default weight is 100, so setting 50 for the background session
is asking the controller to give it half as much GPU bandwidth.
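
In other words a strictly proportional split would give the background session
50 / (50 + 100), or one third, of the GPU time versus two thirds for the
foreground one. The measured shift is smaller than that, as expected given
that the enforcement here is best-effort priority adjustment rather than a
hard budget.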

* Also note that in the RFC stage the DRM controller itself boots in a disabled
state and needs to be explicitly enabled by setting a scanning period such as:
echo 1000000 | sudo tee /sys/fs/cgroup/drm.period_us.

v2:
* Prefaced the series with some core DRM work as suggested by Christian.
* Dropped the priority based controller for now.
* Dropped the introspection cgroup controller file.
* Implemented unused budget sharing/propagation.
* Some small fixes/tweak as per review feedback and in general.

v3:
* Dropped one upstreamed patch.
* Logging cleanup (use DRM macros where available).

Tvrtko Ursulin (12):
drm: Track clients by tgid and not tid
drm: Update file owner during use
cgroup: Add the DRM cgroup controller
drm/cgroup: Track clients per owning process
drm/cgroup: Allow safe external access to file_priv
drm/cgroup: Add ability to query drm cgroup GPU time
drm/cgroup: Add over budget signalling callback
drm/cgroup: Only track clients which are providing drm_cgroup_ops
cgroup/drm: Client exit hook
cgroup/drm: Introduce weight based drm cgroup control
drm/i915: Wire up with drm controller GPU time query
drm/i915: Implement cgroup controller over budget throttling

Documentation/admin-guide/cgroup-v2.rst | 37 ++
drivers/gpu/drm/Kconfig | 1 +
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 6 +-
drivers/gpu/drm/drm_auth.c | 3 +-
drivers/gpu/drm/drm_cgroup.c | 203 +++++++
drivers/gpu/drm/drm_debugfs.c | 12 +-
drivers/gpu/drm/drm_file.c | 60 +-
drivers/gpu/drm/drm_ioctl.c | 3 +
.../gpu/drm/i915/gem/i915_gem_execbuffer.c | 38 +-
drivers/gpu/drm/i915/i915_driver.c | 11 +
drivers/gpu/drm/i915/i915_drm_client.c | 209 ++++++-
drivers/gpu/drm/i915/i915_drm_client.h | 13 +
drivers/gpu/drm/nouveau/nouveau_drm.c | 5 +-
drivers/gpu/drm/vmwgfx/vmwgfx_gem.c | 6 +-
include/drm/drm_clients.h | 47 ++
include/drm/drm_drv.h | 36 ++
include/drm/drm_file.h | 17 +-
include/linux/cgroup_drm.h | 13 +
include/linux/cgroup_subsys.h | 4 +
init/Kconfig | 8 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/drm.c | 570 ++++++++++++++++++
23 files changed, 1273 insertions(+), 31 deletions(-)
create mode 100644 drivers/gpu/drm/drm_cgroup.c
create mode 100644 include/drm/drm_clients.h
create mode 100644 include/linux/cgroup_drm.h
create mode 100644 kernel/cgroup/drm.c

--
2.34.1


2023-01-12 17:48:46

by Tvrtko Ursulin

Subject: [RFC 05/12] drm/cgroup: Allow safe external access to file_priv

From: Tvrtko Ursulin <[email protected]>

Entry points from the cgroup subsystem into the drm cgroup controller will
need to walk the file_priv structures associated with registered clients and,
since those are not RCU protected, let's add a hack for now to make this
safe.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/drm_cgroup.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index d91512a560ff..46b012d2be42 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -18,6 +18,13 @@ __del_clients(struct drm_pid_clients *clients,
if (atomic_dec_and_test(&clients->num)) {
xa_erase(&drm_pid_clients, pid);
kfree_rcu(clients, rcu);
+
+ /*
+ * FIXME: file_priv is not RCU protected so we add this hack
+ * to avoid any races with code which walks clients->file_list
+ * and accesses file_priv.
+ */
+ synchronize_rcu();
}
}

--
2.34.1

2023-01-12 17:48:55

by Tvrtko Ursulin

Subject: [RFC 09/12] cgroup/drm: Client exit hook

From: Tvrtko Ursulin <[email protected]>

We need the ability for DRM core to inform the cgroup controller when a
client has closed a DRM file descriptor. This will allow us to avoid keeping
state relating to GPU time usage by task sets in the cgroup controller
itself.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/drm_cgroup.c | 9 +++++++++
include/linux/cgroup_drm.h | 4 ++++
kernel/cgroup/drm.c | 13 +++++++++++++
3 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index 09249f795af3..ea1479d05355 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -3,6 +3,8 @@
* Copyright © 2022 Intel Corporation
*/

+#include <linux/cgroup_drm.h>
+
#include <drm/drm_drv.h>
#include <drm/drm_clients.h>
#include <drm/drm_print.h>
@@ -32,6 +34,7 @@ void drm_clients_close(struct drm_file *file_priv)
{
struct drm_device *dev = file_priv->minor->dev;
struct drm_pid_clients *clients;
+ struct task_struct *task;
struct pid *pid;

lockdep_assert_held(&dev->filelist_mutex);
@@ -44,6 +47,12 @@ void drm_clients_close(struct drm_file *file_priv)
if (drm_WARN_ON_ONCE(dev, !clients))
return;

+ task = get_pid_task(pid, PIDTYPE_PID);
+ if (task) {
+ drmcgroup_client_exited(task);
+ put_task_struct(task);
+ }
+
__del_clients(clients, file_priv, (unsigned long)pid);
}

diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
index bf8abc6b8ebf..2f755b896136 100644
--- a/include/linux/cgroup_drm.h
+++ b/include/linux/cgroup_drm.h
@@ -6,4 +6,8 @@
#ifndef _CGROUP_DRM_H
#define _CGROUP_DRM_H

+struct task_struct;
+
+void drmcgroup_client_exited(struct task_struct *task);
+
#endif /* _CGROUP_DRM_H */
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 48a8d646a094..3e7f165806de 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -22,6 +22,11 @@ css_to_drmcs(struct cgroup_subsys_state *css)
return container_of(css, struct drm_cgroup_state, css);
}

+static inline struct drm_cgroup_state *get_task_drmcs(struct task_struct *task)
+{
+ return css_to_drmcs(task_get_css(task, drm_cgrp_id));
+}
+
static struct drm_root_cgroup_state root_drmcs;

static void drmcs_free(struct cgroup_subsys_state *css)
@@ -46,6 +51,14 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
return &drmcs->css;
}

+void drmcgroup_client_exited(struct task_struct *task)
+{
+ struct drm_cgroup_state *drmcs = get_task_drmcs(task);
+
+ css_put(&drmcs->css);
+}
+EXPORT_SYMBOL_GPL(drmcgroup_client_exited);
+
struct cftype files[] = {
{ } /* Zero entry terminates. */
};
--
2.34.1

2023-01-12 17:48:56

by Tvrtko Ursulin

Subject: [RFC 12/12] drm/i915: Implement cgroup controller over budget throttling

From: Tvrtko Ursulin <[email protected]>

When notified by the drm core that we are over our allotted time budget, the
i915 instance will check if any of the GPU engines it is responsible for is
fully saturated. If it is, and the client in question is using that engine,
it will throttle it.

For now throttling is done simplistically by lowering the scheduling
priority while the client is throttled.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
.../gpu/drm/i915/gem/i915_gem_execbuffer.c | 38 ++++-
drivers/gpu/drm/i915/i915_driver.c | 1 +
drivers/gpu/drm/i915/i915_drm_client.c | 133 ++++++++++++++++++
drivers/gpu/drm/i915/i915_drm_client.h | 11 ++
4 files changed, 182 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 94d86ee24693..c3e57b51a106 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3065,6 +3065,42 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
break;
}

+#ifdef CONFIG_CGROUP_DRM
+static unsigned int
+__get_class(struct drm_i915_file_private *fpriv, const struct i915_request *rq)
+{
+ unsigned int class;
+
+ class = rq->context->engine->uabi_class;
+
+ if (WARN_ON_ONCE(class >= ARRAY_SIZE(fpriv->client->throttle)))
+ class = 0;
+
+ return class;
+}
+
+static void copy_priority(struct i915_sched_attr *attr,
+ const struct i915_execbuffer *eb,
+ const struct i915_request *rq)
+{
+ struct drm_i915_file_private *file_priv = eb->file->driver_priv;
+ int prio;
+
+ *attr = eb->gem_context->sched;
+
+ prio = file_priv->client->throttle[__get_class(file_priv, rq)];
+ if (prio)
+ attr->priority = prio;
+}
+#else
+static void copy_priority(struct i915_sched_attr *attr,
+ const struct i915_execbuffer *eb,
+ const struct i915_request *rq)
+{
+ *attr = eb->gem_context->sched;
+}
+#endif
+
static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq,
int err, bool last_parallel)
{
@@ -3081,7 +3117,7 @@ static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq,

/* Check that the context wasn't destroyed before submission */
if (likely(!intel_context_is_closed(eb->context))) {
- attr = eb->gem_context->sched;
+ copy_priority(&attr, eb, rq);
} else {
/* Serialise with context_close via the add_to_timeline */
i915_request_set_error_once(rq, -ENOENT);
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index 50935cdb3a93..11c0c6f45e57 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1900,6 +1900,7 @@ static const struct drm_ioctl_desc i915_ioctls[] = {
#ifdef CONFIG_CGROUP_DRM
static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
.active_time_us = i915_drm_cgroup_get_active_time_us,
+ .signal_budget = i915_drm_cgroup_signal_budget,
};
#endif

diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index c9754cb0277f..72d92ee292ae 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -4,6 +4,7 @@
*/

#include <linux/kernel.h>
+#include <linux/ktime.h>
#include <linux/slab.h>
#include <linux/types.h>

@@ -159,6 +160,138 @@ u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file)

return busy;
}
+
+int i915_drm_cgroup_signal_budget(struct drm_file *file, u64 usage, u64 budget)
+{
+ struct drm_i915_file_private *fpriv = file->driver_priv;
+ u64 class_usage[I915_LAST_UABI_ENGINE_CLASS + 1];
+ u64 class_last[I915_LAST_UABI_ENGINE_CLASS + 1];
+ struct drm_i915_private *i915 = fpriv->dev_priv;
+ struct i915_drm_client *client = fpriv->client;
+ struct intel_engine_cs *engine;
+ bool over = usage > budget;
+ struct task_struct *task;
+ struct pid *pid;
+ unsigned int i;
+ ktime_t unused;
+ int ret = 0;
+ u64 t;
+
+ if (!supports_stats(i915))
+ return -EINVAL;
+
+ if (usage == 0 && budget == 0)
+ return 0;
+
+ rcu_read_lock();
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ if (over) {
+ client->over_budget++;
+ if (!client->over_budget)
+ client->over_budget = 2;
+
+ drm_dbg(&i915->drm, "%s[%u] over budget (%llu/%llu)\n",
+ task ? task->comm : "<unknown>", pid_vnr(pid),
+ usage, budget);
+ } else {
+ client->over_budget = 0;
+ memset(client->class_last, 0, sizeof(client->class_last));
+ memset(client->throttle, 0, sizeof(client->throttle));
+
+ drm_dbg(&i915->drm, "%s[%u] un-throttled; under budget\n",
+ task ? task->comm : "<unknown>", pid_vnr(pid));
+
+ rcu_read_unlock();
+ return 0;
+ }
+ rcu_read_unlock();
+
+ memset(class_usage, 0, sizeof(class_usage));
+ for_each_uabi_engine(engine, i915)
+ class_usage[engine->uabi_class] +=
+ ktime_to_ns(intel_engine_get_busy_time(engine, &unused));
+
+ memcpy(class_last, client->class_last, sizeof(class_last));
+ memcpy(client->class_last, class_usage, sizeof(class_last));
+
+ for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++)
+ class_usage[i] -= class_last[i];
+
+ t = client->last;
+ client->last = ktime_get_raw_ns();
+ t = client->last - t;
+
+ if (client->over_budget == 1)
+ return 0;
+
+ for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++) {
+ u64 client_class_usage[I915_LAST_UABI_ENGINE_CLASS + 1];
+ unsigned int capacity, rel_usage;
+
+ if (!i915->engine_uabi_class_count[i])
+ continue;
+
+ t = DIV_ROUND_UP_ULL(t, 1000);
+ class_usage[i] = DIV_ROUND_CLOSEST_ULL(class_usage[i], 1000);
+ rel_usage = DIV_ROUND_CLOSEST_ULL(class_usage[i] * 100ULL,
+ t *
+ i915->engine_uabi_class_count[i]);
+ if (rel_usage < 95) {
+ /* Physical class not oversubscribed. */
+ if (client->throttle[i]) {
+ client->throttle[i] = 0;
+
+ rcu_read_lock();
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ drm_dbg(&i915->drm,
+ "%s[%u] un-throttled; physical class %s utilisation %u%%\n",
+ task ? task->comm : "<unknown>",
+ pid_vnr(pid),
+ uabi_class_names[i],
+ rel_usage);
+ rcu_read_unlock();
+ }
+ continue;
+ }
+
+ client_class_usage[i] =
+ get_class_active_ns(client, i, &capacity);
+ if (client_class_usage[i]) {
+ int permille;
+
+ ret |= 1;
+
+ permille = DIV_ROUND_CLOSEST_ULL((usage - budget) *
+ 1000,
+ budget);
+ client->throttle[i] =
+ DIV_ROUND_CLOSEST(permille *
+ I915_CONTEXT_MIN_USER_PRIORITY,
+ 1000);
+ if (client->throttle[i] <
+ I915_CONTEXT_MIN_USER_PRIORITY)
+ client->throttle[i] =
+ I915_CONTEXT_MIN_USER_PRIORITY;
+
+ rcu_read_lock();
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ drm_dbg(&i915->drm,
+ "%s[%u] %d‰ over budget, throttled to priority %d; physical class %s utilisation %u%%\n",
+ task ? task->comm : "<unknown>",
+ pid_vnr(pid),
+ permille,
+ client->throttle[i],
+ uabi_class_names[i],
+ rel_usage);
+ rcu_read_unlock();
+ }
+ }
+
+ return ret;
+}
#endif

#ifdef CONFIG_PROC_FS
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index c8439eaa89be..092a7952a67b 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -15,6 +15,8 @@

#define I915_LAST_UABI_ENGINE_CLASS I915_ENGINE_CLASS_COMPUTE

+struct drm_file;
+
struct drm_i915_private;

struct i915_drm_clients {
@@ -38,6 +40,13 @@ struct i915_drm_client {
* @past_runtime: Accumulation of pphwsp runtimes from closed contexts.
*/
atomic64_t past_runtime[I915_LAST_UABI_ENGINE_CLASS + 1];
+
+#ifdef CONFIG_CGROUP_DRM
+ int throttle[I915_LAST_UABI_ENGINE_CLASS + 1];
+ unsigned int over_budget;
+ u64 last;
+ u64 class_last[I915_LAST_UABI_ENGINE_CLASS + 1];
+#endif
};

void i915_drm_clients_init(struct i915_drm_clients *clients,
@@ -66,5 +75,7 @@ void i915_drm_client_fdinfo(struct seq_file *m, struct file *f);
void i915_drm_clients_fini(struct i915_drm_clients *clients);

u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file);
+int i915_drm_cgroup_signal_budget(struct drm_file *file,
+ u64 usage, u64 budget);

#endif /* !__I915_DRM_CLIENT_H__ */
--
2.34.1

2023-01-12 17:49:07

by Tvrtko Ursulin

Subject: [RFC 03/12] cgroup: Add the DRM cgroup controller

From: Tvrtko Ursulin <[email protected]>

Skeleton controller without any functionality.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
include/linux/cgroup_drm.h | 9 ++++++
include/linux/cgroup_subsys.h | 4 +++
init/Kconfig | 7 +++++
kernel/cgroup/Makefile | 1 +
kernel/cgroup/drm.c | 59 +++++++++++++++++++++++++++++++++++
5 files changed, 80 insertions(+)
create mode 100644 include/linux/cgroup_drm.h
create mode 100644 kernel/cgroup/drm.c

diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
new file mode 100644
index 000000000000..bf8abc6b8ebf
--- /dev/null
+++ b/include/linux/cgroup_drm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#ifndef _CGROUP_DRM_H
+#define _CGROUP_DRM_H
+
+#endif /* _CGROUP_DRM_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 445235487230..49460494a010 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
SUBSYS(misc)
#endif

+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+SUBSYS(drm)
+#endif
+
/*
* The following subsystems are not supported on the default hierarchy.
*/
diff --git a/init/Kconfig b/init/Kconfig
index 7e5c3ddc341d..c5ace0d57007 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1089,6 +1089,13 @@ config CGROUP_RDMA
Attaching processes with active RDMA resources to the cgroup
hierarchy is allowed even if can cross the hierarchy's limit.

+config CGROUP_DRM
+ bool "DRM controller"
+ help
+ Provides the DRM subsystem controller.
+
+ ...
+
config CGROUP_FREEZER
bool "Freezer controller"
help
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 12f8457ad1f9..849bd2917477 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -6,4 +6,5 @@ obj-$(CONFIG_CGROUP_PIDS) += pids.o
obj-$(CONFIG_CGROUP_RDMA) += rdma.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CGROUP_MISC) += misc.o
+obj-$(CONFIG_CGROUP_DRM) += drm.o
obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
new file mode 100644
index 000000000000..48a8d646a094
--- /dev/null
+++ b/kernel/cgroup/drm.c
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#include <linux/slab.h>
+#include <linux/cgroup.h>
+#include <linux/cgroup_drm.h>
+#include <linux/sched.h>
+
+struct drm_cgroup_state {
+ struct cgroup_subsys_state css;
+};
+
+struct drm_root_cgroup_state {
+ struct drm_cgroup_state drmcs;
+};
+
+static inline struct drm_cgroup_state *
+css_to_drmcs(struct cgroup_subsys_state *css)
+{
+ return container_of(css, struct drm_cgroup_state, css);
+}
+
+static struct drm_root_cgroup_state root_drmcs;
+
+static void drmcs_free(struct cgroup_subsys_state *css)
+{
+ if (css != &root_drmcs.drmcs.css)
+ kfree(css_to_drmcs(css));
+}
+
+static struct cgroup_subsys_state *
+drmcs_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct drm_cgroup_state *drmcs;
+
+ if (!parent_css) {
+ drmcs = &root_drmcs.drmcs;
+ } else {
+ drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
+ if (!drmcs)
+ return ERR_PTR(-ENOMEM);
+ }
+
+ return &drmcs->css;
+}
+
+struct cftype files[] = {
+ { } /* Zero entry terminates. */
+};
+
+struct cgroup_subsys drm_cgrp_subsys = {
+ .css_alloc = drmcs_alloc,
+ .css_free = drmcs_free,
+ .early_init = false,
+ .legacy_cftypes = files,
+ .dfl_cftypes = files,
+};
--
2.34.1

2023-01-12 17:49:29

by Tvrtko Ursulin

Subject: [RFC 01/12] drm: Track clients by tgid and not tid

From: Tvrtko Ursulin <[email protected]>

Thread group id (aka pid from userspace point of view) is a more
interesting thing to show as an owner of a DRM fd, so track and show that
instead of the thread id.

In the next patch we will make the owner be updated after file descriptor
handover, which will also be tgid based to avoid ping-pong when multiple
threads access the fd.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: [email protected]
Cc: Alex Deucher <[email protected]>
Cc: "Christian König" <[email protected]>
Reviewed-by: Zack Rusin <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
drivers/gpu/drm/drm_debugfs.c | 4 ++--
drivers/gpu/drm/drm_file.c | 2 +-
drivers/gpu/drm/vmwgfx/vmwgfx_gem.c | 2 +-
4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 00edc7002ee8..ca0181332578 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -969,7 +969,7 @@ static int amdgpu_debugfs_gem_info_show(struct seq_file *m, void *unused)
* Therefore, we need to protect this ->comm access using RCU.
*/
rcu_read_lock();
- task = pid_task(file->pid, PIDTYPE_PID);
+ task = pid_task(file->pid, PIDTYPE_TGID);
seq_printf(m, "pid %8d command %s:\n", pid_nr(file->pid),
task ? task->comm : "<unknown>");
rcu_read_unlock();
diff --git a/drivers/gpu/drm/drm_debugfs.c b/drivers/gpu/drm/drm_debugfs.c
index 4f643a490dc3..4855230ba2c6 100644
--- a/drivers/gpu/drm/drm_debugfs.c
+++ b/drivers/gpu/drm/drm_debugfs.c
@@ -80,7 +80,7 @@ static int drm_clients_info(struct seq_file *m, void *data)
seq_printf(m,
"%20s %5s %3s master a %5s %10s\n",
"command",
- "pid",
+ "tgid",
"dev",
"uid",
"magic");
@@ -94,7 +94,7 @@ static int drm_clients_info(struct seq_file *m, void *data)
bool is_current_master = drm_is_current_master(priv);

rcu_read_lock(); /* locks pid_task()->comm */
- task = pid_task(priv->pid, PIDTYPE_PID);
+ task = pid_task(priv->pid, PIDTYPE_TGID);
uid = task ? __task_cred(task)->euid : GLOBAL_ROOT_UID;
seq_printf(m, "%20s %5d %3d %c %c %5d %10u\n",
task ? task->comm : "<unknown>",
diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index a51ff8cee049..c1018c470047 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -156,7 +156,7 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)
if (!file)
return ERR_PTR(-ENOMEM);

- file->pid = get_pid(task_pid(current));
+ file->pid = get_pid(task_tgid(current));
file->minor = minor;

/* for compatibility root is always authenticated */
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
index ce609e7d758f..f2985337aa53 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
@@ -260,7 +260,7 @@ static int vmw_debugfs_gem_info_show(struct seq_file *m, void *unused)
* Therefore, we need to protect this ->comm access using RCU.
*/
rcu_read_lock();
- task = pid_task(file->pid, PIDTYPE_PID);
+ task = pid_task(file->pid, PIDTYPE_TGID);
seq_printf(m, "pid %8d command %s:\n", pid_nr(file->pid),
task ? task->comm : "<unknown>");
rcu_read_unlock();
--
2.34.1

2023-01-12 17:51:24

by Tvrtko Ursulin

Subject: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

From: Tvrtko Ursulin <[email protected]>

Similar to CPU scheduling, implement a concept of weight in the drm cgroup
controller.

Uses the same range and default as the CPU controller - CGROUP_WEIGHT_MIN,
CGROUP_WEIGHT_DFL and CGROUP_WEIGHT_MAX.

Later each cgroup is assigned a time budget proportionally based on the
relative weights of its siblings. This time budget is in turn split by
the group's children and so on.

This will be used to implement a soft, or best effort, signal from the drm
cgroup controller to drm core notifying about groups which are over their
allotted budget.

No guarantees that the limit can be enforced are provided or implied.
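
As an illustrative example, given a parent budget of one second and three
children with weights 100, 100 and 200, the children would initially be
assigned budgets of 250ms, 250ms and 500ms respectively. Any budget left
unused by under budget children is then redistributed between over budget
siblings in proportion to their weights.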

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 37 ++
drivers/gpu/drm/Kconfig | 1 +
init/Kconfig | 1 +
kernel/cgroup/drm.c | 506 +++++++++++++++++++++++-
4 files changed, 541 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c8ae7c897f14..9894dd59e4c5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2407,6 +2407,43 @@ HugeTLB Interface Files
hugetlb pages of <hugepagesize> in this cgroup. Only active in
use hugetlb pages are included. The per-node values are in bytes.

+DRM
+---
+
+The DRM controller allows configuring scheduling soft limits.
+
+DRM scheduling soft limits
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because of the heterogeneous hardware and driver DRM capabilities, soft limits
+are implemented as a loose co-operative (bi-directional) interface between the
+controller and DRM core.
+
+The controller configures the GPU time allowed per group and periodically scans
+the tasks belonging to each group to detect the over budget condition, at which point it
+invokes a callback notifying the DRM core of the condition.
+
+DRM core provides an API to query per process GPU utilization and a second API to
+receive notification from the cgroup controller when the group enters or exits
+the over budget condition.
+
+Individual DRM drivers which implement the interface are expected to act on this
+in a best-effort manner only. There are no guarantees that the soft limits
+will be respected.
+
+DRM scheduling soft limits interface files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ drm.weight
+ Standard cgroup weight based control [1, 10000] used to configure the
+ relative distribution of GPU time between the sibling groups.
+
+ drm.period_us (debugging aid during RFC only)
+ An integer representing the period with which the controller should look
+ at the GPU usage by the group and potentially send the over/under budget
+ signal.
+ Value of zero (default) disables the soft limit checking.
+
Misc
----

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index b56b9f2fe8e6..0fbfd3026b71 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -7,6 +7,7 @@
#
menuconfig DRM
tristate "Direct Rendering Manager (XFree86 4.1.0 and higher DRI support)"
+ default y if CGROUP_DRM=y
depends on (AGP || AGP=n) && !EMULATED_CMPXCHG && HAS_DMA
select DRM_PANEL_ORIENTATION_QUIRKS
select HDMI
diff --git a/init/Kconfig b/init/Kconfig
index c5ace0d57007..304418674097 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1091,6 +1091,7 @@ config CGROUP_RDMA

config CGROUP_DRM
bool "DRM controller"
+ select DRM
help
Provides the DRM subsystem controller.

diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 3e7f165806de..527b7bf8c576 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -8,14 +8,43 @@
#include <linux/cgroup_drm.h>
#include <linux/sched.h>

+#include <drm/drm_clients.h>
+
struct drm_cgroup_state {
struct cgroup_subsys_state css;
+
+ unsigned int weight;
+
+ /*
+ * Below fields are owned and updated by the scan worker. Either the
+ * worker accesses them, or worker needs to be suspended and synced
+ * before they can be touched from the outside.
+ */
+ unsigned int sum_children_weights;
+
+ bool over;
+ bool over_budget;
+
+ u64 per_s_budget_us;
+ u64 prev_active_us;
+ u64 active_us;
};

struct drm_root_cgroup_state {
struct drm_cgroup_state drmcs;
+
+ unsigned int period_us;
+
+ ktime_t prev_timestamp;
+
+ bool scanning_suspended;
+ unsigned int suspended_period_us;
+
+ struct delayed_work scan_work;
};

+static DEFINE_MUTEX(drmcg_mutex);
+
static inline struct drm_cgroup_state *
css_to_drmcs(struct cgroup_subsys_state *css)
{
@@ -29,10 +58,355 @@ static inline struct drm_cgroup_state *get_task_drmcs(struct task_struct *task)

static struct drm_root_cgroup_state root_drmcs;

+static u64 drmcs_get_active_time_us(struct drm_cgroup_state *drmcs)
+{
+ struct cgroup *cgrp = drmcs->css.cgroup;
+ struct task_struct *task;
+ struct css_task_iter it;
+ u64 total = 0;
+
+ css_task_iter_start(&cgrp->self,
+ CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
+ &it);
+ while ((task = css_task_iter_next(&it))) {
+ u64 time;
+
+ /* Ignore kernel threads here. */
+ if (task->flags & PF_KTHREAD)
+ continue;
+
+ time = drm_pid_get_active_time_us(task_pid(task));
+ total += time;
+ }
+ css_task_iter_end(&it);
+
+ return total;
+}
+
+static u64
+drmcs_read_weight(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+ return drmcs->weight;
+}
+
+static int
+drmcs_write_weight(struct cgroup_subsys_state *css, struct cftype *cftype,
+ u64 weight)
+{
+ struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+ int ret;
+
+ if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ return -ERANGE;
+
+ ret = mutex_lock_interruptible(&drmcg_mutex);
+ if (ret)
+ return ret;
+ drmcs->weight = weight;
+ mutex_unlock(&drmcg_mutex);
+
+ return 0;
+}
+
+static void
+signal_drm_budget(struct drm_cgroup_state *drmcs, u64 usage, u64 budget)
+{
+ struct cgroup *cgrp = drmcs->css.cgroup;
+ struct task_struct *task;
+ struct css_task_iter it;
+
+ css_task_iter_start(&cgrp->self,
+ CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED,
+ &it);
+ while ((task = css_task_iter_next(&it))) {
+ /* Ignore kernel threads here. */
+ if (task->flags & PF_KTHREAD)
+ continue;
+
+ drm_pid_signal_budget(task_pid(task), usage, budget);
+ }
+ css_task_iter_end(&it);
+}
+
+static bool
+__start_scanning(struct drm_cgroup_state *root, unsigned int period_us)
+{
+ struct cgroup_subsys_state *node;
+ bool ok = false;
+
+ rcu_read_lock();
+
+ css_for_each_descendant_post(node, &root->css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+
+ if (!css_tryget_online(node))
+ goto out;
+
+ drmcs->active_us = 0;
+ drmcs->sum_children_weights = 0;
+
+ if (period_us && node == &root->css)
+ drmcs->per_s_budget_us =
+ DIV_ROUND_UP_ULL((u64)period_us * USEC_PER_SEC,
+ USEC_PER_SEC);
+ else
+ drmcs->per_s_budget_us = 0;
+
+ css_put(node);
+ }
+
+ css_for_each_descendant_post(node, &root->css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+ struct drm_cgroup_state *parent;
+ u64 active;
+
+ if (!css_tryget_online(node))
+ goto out;
+ if (!node->parent) {
+ css_put(node);
+ continue;
+ }
+ if (!css_tryget_online(node->parent)) {
+ css_put(node);
+ goto out;
+ }
+ parent = css_to_drmcs(node->parent);
+
+ active = drmcs_get_active_time_us(drmcs);
+ if (period_us && active > drmcs->prev_active_us)
+ drmcs->active_us += active - drmcs->prev_active_us;
+ drmcs->prev_active_us = active;
+
+ parent->active_us += drmcs->active_us;
+ parent->sum_children_weights += drmcs->weight;
+
+ css_put(node);
+ css_put(&parent->css);
+ }
+
+ ok = true;
+
+out:
+ rcu_read_unlock();
+
+ return ok;
+}
+
+static void scan_worker(struct work_struct *work)
+{
+ struct drm_cgroup_state *root = &root_drmcs.drmcs;
+ struct cgroup_subsys_state *node;
+ unsigned int period_us;
+ ktime_t now;
+
+ rcu_read_lock();
+
+ if (WARN_ON_ONCE(!css_tryget_online(&root->css))) {
+ rcu_read_unlock();
+ return;
+ }
+
+ now = ktime_get();
+ period_us = ktime_to_us(ktime_sub(now, root_drmcs.prev_timestamp));
+ root_drmcs.prev_timestamp = now;
+
+ /*
+ * 1st pass - reset working values and update hierarchical weights and
+ * GPU utilisation.
+ */
+ if (!__start_scanning(root, period_us))
+ goto out_retry; /*
+ * Always come back later if scanner races with
+ * core cgroup management. (Repeated pattern.)
+ */
+
+ css_for_each_descendant_pre(node, &root->css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+ struct cgroup_subsys_state *css;
+ unsigned int over_weights = 0;
+ u64 unused_us = 0;
+
+ if (!css_tryget_online(node))
+ goto out_retry;
+
+ /*
+ * 2nd pass - calculate initial budgets, mark over budget
+ * siblings and add up unused budget for the group.
+ */
+ css_for_each_child(css, &drmcs->css) {
+ struct drm_cgroup_state *sibling = css_to_drmcs(css);
+
+ if (!css_tryget_online(css)) {
+ css_put(node);
+ goto out_retry;
+ }
+
+ sibling->per_s_budget_us =
+ DIV_ROUND_UP_ULL(drmcs->per_s_budget_us *
+ sibling->weight,
+ drmcs->sum_children_weights);
+
+ sibling->over = sibling->active_us >
+ sibling->per_s_budget_us;
+ if (sibling->over)
+ over_weights += sibling->weight;
+ else
+ unused_us += sibling->per_s_budget_us -
+ sibling->active_us;
+
+ css_put(css);
+ }
+
+ /*
+ * 3rd pass - spread unused budget according to relative weights
+ * of over budget siblings.
+ */
+ css_for_each_child(css, &drmcs->css) {
+ struct drm_cgroup_state *sibling = css_to_drmcs(css);
+
+ if (!css_tryget_online(css)) {
+ css_put(node);
+ goto out_retry;
+ }
+
+ if (sibling->over) {
+ u64 budget_us =
+ DIV_ROUND_UP_ULL(unused_us *
+ sibling->weight,
+ over_weights);
+ sibling->per_s_budget_us += budget_us;
+ sibling->over = sibling->active_us >
+ sibling->per_s_budget_us;
+ }
+
+ css_put(css);
+ }
+
+ css_put(node);
+ }
+
+ /*
+ * 4th pass - send out over/under budget notifications.
+ */
+ css_for_each_descendant_post(node, &root->css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+
+ if (!css_tryget_online(node))
+ goto out_retry;
+
+ if (drmcs->over || drmcs->over_budget)
+ signal_drm_budget(drmcs,
+ drmcs->active_us,
+ drmcs->per_s_budget_us);
+ drmcs->over_budget = drmcs->over;
+
+ css_put(node);
+ }
+
+out_retry:
+ rcu_read_unlock();
+
+ period_us = READ_ONCE(root_drmcs.period_us);
+ if (period_us)
+ schedule_delayed_work(&root_drmcs.scan_work,
+ usecs_to_jiffies(period_us));
+
+ css_put(&root->css);
+}
+
+static void start_scanning(u64 period_us)
+{
+ lockdep_assert_held(&drmcg_mutex);
+
+ root_drmcs.period_us = (unsigned int)period_us;
+ WARN_ON_ONCE(!__start_scanning(&root_drmcs.drmcs, 0));
+ root_drmcs.prev_timestamp = ktime_get();
+ mod_delayed_work(system_wq, &root_drmcs.scan_work,
+ usecs_to_jiffies(period_us));
+}
+
+static void stop_scanning(struct drm_cgroup_state *drmcs)
+{
+ if (drmcs == &root_drmcs.drmcs) {
+ root_drmcs.period_us = 0;
+ cancel_delayed_work_sync(&root_drmcs.scan_work);
+ }
+
+ if (drmcs->over_budget) {
+ /*
+ * Signal under budget when scanning goes off so drivers
+ * correctly update their state.
+ */
+ signal_drm_budget(drmcs, 0, USEC_PER_SEC);
+ drmcs->over_budget = false;
+ }
+}
+
+static void signal_stop_scanning(void)
+{
+ struct cgroup_subsys_state *node;
+
+ lockdep_assert_held(&drmcg_mutex);
+
+ stop_scanning(&root_drmcs.drmcs); /* Handle outside RCU lock. */
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(node, &root_drmcs.drmcs.css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+
+ if (drmcs == &root_drmcs.drmcs)
+ continue;
+
+ if (css_tryget_online(node)) {
+ stop_scanning(drmcs);
+ css_put(node);
+ }
+ }
+ rcu_read_unlock();
+}
+
+static void start_suspend_scanning(void)
+{
+ lockdep_assert_held(&drmcg_mutex);
+
+ if (root_drmcs.scanning_suspended)
+ return;
+
+ root_drmcs.scanning_suspended = true;
+ root_drmcs.suspended_period_us = root_drmcs.period_us;
+ root_drmcs.period_us = 0;
+}
+
+static void finish_suspend_scanning(void)
+{
+ if (root_drmcs.suspended_period_us)
+ cancel_delayed_work_sync(&root_drmcs.scan_work);
+}
+
+static void resume_scanning(void)
+{
+ lockdep_assert_held(&drmcg_mutex);
+
+ if (!root_drmcs.scanning_suspended)
+ return;
+
+ root_drmcs.scanning_suspended = false;
+ if (root_drmcs.suspended_period_us) {
+ start_scanning(root_drmcs.suspended_period_us);
+ root_drmcs.suspended_period_us = 0;
+ }
+}
+
static void drmcs_free(struct cgroup_subsys_state *css)
{
- if (css != &root_drmcs.drmcs.css)
- kfree(css_to_drmcs(css));
+ struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+ stop_scanning(drmcs);
+
+ if (drmcs != &root_drmcs.drmcs)
+ kfree(drmcs);
}

static struct cgroup_subsys_state *
@@ -42,30 +416,154 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)

if (!parent_css) {
drmcs = &root_drmcs.drmcs;
+ INIT_DELAYED_WORK(&root_drmcs.scan_work, scan_worker);
} else {
drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
if (!drmcs)
return ERR_PTR(-ENOMEM);
}

+ drmcs->weight = CGROUP_WEIGHT_DFL;
+
return &drmcs->css;
}

+static int drmcs_can_attach(struct cgroup_taskset *tset)
+{
+ int ret;
+
+ /*
+ * As processes are getting moved between groups we need to ensure
+ * both that the old group does not see a sudden downward jump in the
+ * GPU utilisation, and that the new group does not see a sudden jump
+ * up with all the GPU time clients belonging to the migrated process
+ * have accumulated.
+ *
+ * To achieve that we suspend the scanner until the migration is
+ * completed where the resume at the end ensures both groups start
+ * observing GPU utilisation from a reset state.
+ */
+
+ ret = mutex_lock_interruptible(&drmcg_mutex);
+ if (ret)
+ return ret;
+ start_suspend_scanning();
+ mutex_unlock(&drmcg_mutex);
+
+ finish_suspend_scanning();
+
+ return 0;
+}
+
+static void tset_resume_scanning(struct cgroup_taskset *tset)
+{
+ mutex_lock(&drmcg_mutex);
+ resume_scanning();
+ mutex_unlock(&drmcg_mutex);
+}
+
+static void drmcs_attach(struct cgroup_taskset *tset)
+{
+ tset_resume_scanning(tset);
+}
+
+static void drmcs_cancel_attach(struct cgroup_taskset *tset)
+{
+ tset_resume_scanning(tset);
+}
+
+static u64
+drmcs_read_period_us(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ return root_drmcs.period_us;
+}
+
+static int
+drmcs_write_period_us(struct cgroup_subsys_state *css, struct cftype *cftype,
+ u64 period_us)
+{
+ int ret;
+
+ if (css->cgroup->level)
+ return -EINVAL;
+ if ((period_us && period_us < 500000) || period_us > USEC_PER_SEC * 60)
+ return -EINVAL;
+
+ ret = mutex_lock_interruptible(&drmcg_mutex);
+ if (ret)
+ return ret;
+
+ if (!root_drmcs.scanning_suspended) {
+ if (period_us)
+ start_scanning(period_us);
+ else
+ signal_stop_scanning();
+ } else {
+ /*
+ * If scanning is temporarily suspended just update the period
+ * which will apply once resumed, or simply skip resuming in
+ * case of disabling.
+ */
+ root_drmcs.suspended_period_us = period_us;
+ if (!period_us)
+ root_drmcs.scanning_suspended = false;
+ }
+
+ mutex_unlock(&drmcg_mutex);
+
+ return 0;
+}
+
void drmcgroup_client_exited(struct task_struct *task)
{
- struct drm_cgroup_state *drmcs = get_task_drmcs(task);
+ /*
+ * QQQ/TODO - Skip if task is not a member of a cgroup which has a
+ * DRM controller enabled?
+ */
+
+ /*
+ * Since we are not tracking accumulated GPU time for each cgroup,
+ * avoid jumps in group observed GPU usage by re-setting the scanner
+ * at a point when GPU usage can suddenly jump down.
+ *
+ * Downside is clients can influence the effectiveness of the over-
+ * budget scanning by continuously closing DRM file descriptors but for
+ * now we do not worry about it.
+ */
+
+ mutex_lock(&drmcg_mutex);
+ start_suspend_scanning();
+ mutex_unlock(&drmcg_mutex);
+
+ finish_suspend_scanning();

- css_put(&drmcs->css);
+ mutex_lock(&drmcg_mutex);
+ resume_scanning();
+ mutex_unlock(&drmcg_mutex);
}
EXPORT_SYMBOL_GPL(drmcgroup_client_exited);

struct cftype files[] = {
+ {
+ .name = "weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = drmcs_read_weight,
+ .write_u64 = drmcs_write_weight,
+ },
+ {
+ .name = "period_us",
+ .read_u64 = drmcs_read_period_us,
+ .write_u64 = drmcs_write_period_us,
+ },
{ } /* Zero entry terminates. */
};

struct cgroup_subsys drm_cgrp_subsys = {
.css_alloc = drmcs_alloc,
.css_free = drmcs_free,
+ .can_attach = drmcs_can_attach,
+ .attach = drmcs_attach,
+ .cancel_attach = drmcs_cancel_attach,
.early_init = false,
.legacy_cftypes = files,
.dfl_cftypes = files,
--
2.34.1

2023-01-12 17:52:11

by Tvrtko Ursulin

Subject: [RFC 06/12] drm/cgroup: Add ability to query drm cgroup GPU time

From: Tvrtko Ursulin <[email protected]>

Add a driver callback and core helper which allow querying the time spent
on GPUs for processes belonging to a group.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/drm_cgroup.c | 24 ++++++++++++++++++++++++
include/drm/drm_clients.h | 2 ++
include/drm/drm_drv.h | 28 ++++++++++++++++++++++++++++
3 files changed, 54 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index 46b012d2be42..bc1e34f1a552 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -138,3 +138,27 @@ drm_clients_migrate(struct drm_file *file_priv,
pid_nr(old), pid_nr(new));
rcu_read_unlock();
}
+
+u64 drm_pid_get_active_time_us(struct pid *pid)
+{
+ struct drm_pid_clients *clients;
+ u64 total = 0;
+
+ rcu_read_lock();
+ clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+ if (clients) {
+ struct drm_file *fpriv;
+
+ list_for_each_entry_rcu(fpriv, &clients->file_list, clink) {
+ const struct drm_cgroup_ops *cg_ops =
+ fpriv->minor->dev->driver->cg_ops;
+
+ if (cg_ops && cg_ops->active_time_us)
+ total += cg_ops->active_time_us(fpriv);
+ }
+ }
+ rcu_read_unlock();
+
+ return total;
+}
+EXPORT_SYMBOL_GPL(drm_pid_get_active_time_us);
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
index 2732fffab3f0..7e0c0cf14f25 100644
--- a/include/drm/drm_clients.h
+++ b/include/drm/drm_clients.h
@@ -41,4 +41,6 @@ drm_clients_migrate(struct drm_file *file_priv,
}
#endif

+u64 drm_pid_get_active_time_us(struct pid *pid);
+
#endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index d7c521e8860f..f5f0e088e1fe 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -158,6 +158,24 @@ enum drm_driver_feature {
DRIVER_KMS_LEGACY_CONTEXT = BIT(31),
};

+/**
+ * struct drm_cgroup_ops
+ *
+ * This structure contains a number of callbacks that drivers can provide if
+ * they are able to support one or more of the functionalities implemented by
+ * the DRM cgroup controller.
+ */
+struct drm_cgroup_ops {
+ /**
+ * @active_time_us:
+ *
+ * Optional callback for reporting the GPU time consumed by this client.
+ *
+ * Used by the DRM core when queried by the DRM cgroup controller.
+ */
+ u64 (*active_time_us) (struct drm_file *);
+};
+
/**
* struct drm_driver - DRM driver structure
*
@@ -469,6 +487,16 @@ struct drm_driver {
*/
const struct file_operations *fops;

+#ifdef CONFIG_CGROUP_DRM
+ /**
+ * @cg_ops:
+ *
+ * Optional pointer to driver callbacks facilitating integration with
+ * the DRM cgroup controller.
+ */
+ const struct drm_cgroup_ops *cg_ops;
+#endif
+
#ifdef CONFIG_DRM_LEGACY
/* Everything below here is for legacy driver, never use! */
/* private: */
--
2.34.1

2023-01-12 17:54:03

by Tvrtko Ursulin

Subject: [RFC 11/12] drm/i915: Wire up with drm controller GPU time query

From: Tvrtko Ursulin <[email protected]>

Implement the drm_cgroup_ops->active_time_us callback.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/i915/i915_driver.c | 10 ++++
drivers/gpu/drm/i915/i915_drm_client.c | 76 ++++++++++++++++++++++----
drivers/gpu/drm/i915/i915_drm_client.h | 2 +
3 files changed, 78 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index c1e427ba57ae..50935cdb3a93 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1897,6 +1897,12 @@ static const struct drm_ioctl_desc i915_ioctls[] = {
DRM_IOCTL_DEF_DRV(I915_GEM_VM_DESTROY, i915_gem_vm_destroy_ioctl, DRM_RENDER_ALLOW),
};

+#ifdef CONFIG_CGROUP_DRM
+static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
+ .active_time_us = i915_drm_cgroup_get_active_time_us,
+};
+#endif
+
/*
* Interface history:
*
@@ -1925,6 +1931,10 @@ static const struct drm_driver i915_drm_driver = {
.lastclose = i915_driver_lastclose,
.postclose = i915_driver_postclose,

+#ifdef CONFIG_CGROUP_DRM
+ .cg_ops = &i915_drm_cgroup_ops,
+#endif
+
.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
.gem_prime_import = i915_gem_prime_import,
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index b09d1d386574..c9754cb0277f 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -75,7 +75,7 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients)
xa_destroy(&clients->xarray);
}

-#ifdef CONFIG_PROC_FS
+#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
static const char * const uabi_class_names[] = {
[I915_ENGINE_CLASS_RENDER] = "render",
[I915_ENGINE_CLASS_COPY] = "copy",
@@ -100,22 +100,78 @@ static u64 busy_add(struct i915_gem_context *ctx, unsigned int class)
return total;
}

-static void
-show_client_class(struct seq_file *m,
- struct i915_drm_client *client,
- unsigned int class)
+static u64 get_class_active_ns(struct i915_drm_client *client,
+ unsigned int class,
+ unsigned int *capacity)
{
- const struct list_head *list = &client->ctx_list;
- u64 total = atomic64_read(&client->past_runtime[class]);
- const unsigned int capacity =
- client->clients->i915->engine_uabi_class_count[class];
struct i915_gem_context *ctx;
+ u64 total;
+
+ *capacity =
+ client->clients->i915->engine_uabi_class_count[class];
+ if (!*capacity)
+ return 0;
+
+ total = atomic64_read(&client->past_runtime[class]);

rcu_read_lock();
- list_for_each_entry_rcu(ctx, list, client_link)
+ list_for_each_entry_rcu(ctx, &client->ctx_list, client_link)
total += busy_add(ctx, class);
rcu_read_unlock();

+ return total;
+}
+#endif
+
+#ifdef CONFIG_CGROUP_DRM
+static bool supports_stats(struct drm_i915_private *i915)
+{
+ if (GRAPHICS_VER(i915) < 8)
+ return false;
+
+ /* temporary... */
+ if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+ return false;
+
+ return true;
+}
+
+u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file)
+{
+ struct drm_i915_file_private *fpriv = file->driver_priv;
+ struct i915_drm_client *client = fpriv->client;
+ unsigned int i;
+ u64 busy = 0;
+
+ if (!supports_stats(client->clients->i915))
+ return 0;
+
+ for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++) {
+ unsigned int capacity;
+ u64 b;
+
+ b = get_class_active_ns(client, i, &capacity);
+ if (capacity) {
+ b = DIV_ROUND_UP_ULL(b, capacity * 1000);
+ busy += b;
+ }
+ }
+
+ return busy;
+}
+#endif
+
+#ifdef CONFIG_PROC_FS
+static void
+show_client_class(struct seq_file *m,
+ struct i915_drm_client *client,
+ unsigned int class)
+{
+ unsigned int capacity;
+ u64 total;
+
+ total = get_class_active_ns(client, class, &capacity);
+
if (capacity)
seq_printf(m, "drm-engine-%s:\t%llu ns\n",
uabi_class_names[class], total);
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 69496af996d9..c8439eaa89be 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -65,4 +65,6 @@ void i915_drm_client_fdinfo(struct seq_file *m, struct file *f);

void i915_drm_clients_fini(struct i915_drm_clients *clients);

+u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file);
+
#endif /* !__I915_DRM_CLIENT_H__ */
--
2.34.1

2023-01-12 18:23:18

by Tvrtko Ursulin

Subject: [RFC 08/12] drm/cgroup: Only track clients which are providing drm_cgroup_ops

From: Tvrtko Ursulin <[email protected]>

To reduce the amount of tracking going on, especially with drivers which
will not support any sort of control from the drm cgroup controller side,
let's express the functionality as opt-in and use the presence of
drm_cgroup_ops as the activation criteria.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/drm_cgroup.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index ef951421bba6..09249f795af3 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -36,6 +36,9 @@ void drm_clients_close(struct drm_file *file_priv)

lockdep_assert_held(&dev->filelist_mutex);

+ if (!dev->driver->cg_ops)
+ return;
+
pid = rcu_access_pointer(file_priv->pid);
clients = xa_load(&drm_pid_clients, (unsigned long)pid);
if (drm_WARN_ON_ONCE(dev, !clients))
@@ -67,6 +70,9 @@ int drm_clients_open(struct drm_file *file_priv)

lockdep_assert_held(&dev->filelist_mutex);

+ if (!dev->driver->cg_ops)
+ return 0;
+
pid = (unsigned long)rcu_access_pointer(file_priv->pid);
clients = xa_load(&drm_pid_clients, pid);
if (!clients) {
@@ -102,6 +108,9 @@ drm_clients_migrate(struct drm_file *file_priv,

lockdep_assert_held(&dev->filelist_mutex);

+ if (!dev->driver->cg_ops)
+ return;
+
existing_clients = xa_load(&drm_pid_clients, (unsigned long)new);
clients = xa_load(&drm_pid_clients, (unsigned long)old);

--
2.34.1

2023-01-12 18:23:43

by Tvrtko Ursulin

Subject: [RFC 02/12] drm: Update file owner during use

From: Tvrtko Ursulin <[email protected]>

With the typical model, where the display server opens the file descriptor
and then hands it over to the client, we were showing stale data in
debugfs.

Fix it by updating the drm_file->pid on ioctl access from a different
process.

The field is also made RCU protected to allow for lockless readers. Update
side is protected with dev->filelist_mutex.

Before:

$ cat /sys/kernel/debug/dri/0/clients
command pid dev master a uid magic
Xorg 2344 0 y y 0 0
Xorg 2344 0 n y 0 2
Xorg 2344 0 n y 0 3
Xorg 2344 0 n y 0 4

After:

$ cat /sys/kernel/debug/dri/0/clients
command tgid dev master a uid magic
Xorg 830 0 y y 0 0
xfce4-session 880 0 n y 0 1
xfwm4 943 0 n y 0 2
neverball 1095 0 n y 0 3

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Daniel Vetter <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 6 ++--
drivers/gpu/drm/drm_auth.c | 3 +-
drivers/gpu/drm/drm_debugfs.c | 10 ++++---
drivers/gpu/drm/drm_file.c | 40 +++++++++++++++++++++++--
drivers/gpu/drm/drm_ioctl.c | 3 ++
drivers/gpu/drm/nouveau/nouveau_drm.c | 5 +++-
drivers/gpu/drm/vmwgfx/vmwgfx_gem.c | 6 ++--
include/drm/drm_file.h | 13 ++++++--
8 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index ca0181332578..3f22520552bb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -960,6 +960,7 @@ static int amdgpu_debugfs_gem_info_show(struct seq_file *m, void *unused)
list_for_each_entry(file, &dev->filelist, lhead) {
struct task_struct *task;
struct drm_gem_object *gobj;
+ struct pid *pid;
int id;

/*
@@ -969,8 +970,9 @@ static int amdgpu_debugfs_gem_info_show(struct seq_file *m, void *unused)
* Therefore, we need to protect this ->comm access using RCU.
*/
rcu_read_lock();
- task = pid_task(file->pid, PIDTYPE_TGID);
- seq_printf(m, "pid %8d command %s:\n", pid_nr(file->pid),
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ seq_printf(m, "pid %8d command %s:\n", pid_nr(pid),
task ? task->comm : "<unknown>");
rcu_read_unlock();

diff --git a/drivers/gpu/drm/drm_auth.c b/drivers/gpu/drm/drm_auth.c
index cf92a9ae8034..2ed2585ded37 100644
--- a/drivers/gpu/drm/drm_auth.c
+++ b/drivers/gpu/drm/drm_auth.c
@@ -235,7 +235,8 @@ static int drm_new_set_master(struct drm_device *dev, struct drm_file *fpriv)
static int
drm_master_check_perm(struct drm_device *dev, struct drm_file *file_priv)
{
- if (file_priv->pid == task_pid(current) && file_priv->was_master)
+ if (file_priv->was_master &&
+ rcu_access_pointer(file_priv->pid) == task_pid(current))
return 0;

if (!capable(CAP_SYS_ADMIN))
diff --git a/drivers/gpu/drm/drm_debugfs.c b/drivers/gpu/drm/drm_debugfs.c
index 4855230ba2c6..b46f5ceb24c6 100644
--- a/drivers/gpu/drm/drm_debugfs.c
+++ b/drivers/gpu/drm/drm_debugfs.c
@@ -90,15 +90,17 @@ static int drm_clients_info(struct seq_file *m, void *data)
*/
mutex_lock(&dev->filelist_mutex);
list_for_each_entry_reverse(priv, &dev->filelist, lhead) {
- struct task_struct *task;
bool is_current_master = drm_is_current_master(priv);
+ struct task_struct *task;
+ struct pid *pid;

- rcu_read_lock(); /* locks pid_task()->comm */
- task = pid_task(priv->pid, PIDTYPE_TGID);
+ rcu_read_lock(); /* Locks priv->pid and pid_task()->comm! */
+ pid = rcu_dereference(priv->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
uid = task ? __task_cred(task)->euid : GLOBAL_ROOT_UID;
seq_printf(m, "%20s %5d %3d %c %c %5d %10u\n",
task ? task->comm : "<unknown>",
- pid_vnr(priv->pid),
+ pid_vnr(pid),
priv->minor->index,
is_current_master ? 'y' : 'n',
priv->authenticated ? 'y' : 'n',
diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index c1018c470047..f2f8175ece15 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -156,7 +156,7 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)
if (!file)
return ERR_PTR(-ENOMEM);

- file->pid = get_pid(task_tgid(current));
+ rcu_assign_pointer(file->pid, get_pid(task_tgid(current)));
file->minor = minor;

/* for compatibility root is always authenticated */
@@ -196,7 +196,7 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)
drm_syncobj_release(file);
if (drm_core_check_feature(dev, DRIVER_GEM))
drm_gem_release(dev, file);
- put_pid(file->pid);
+ put_pid(rcu_access_pointer(file->pid));
kfree(file);

return ERR_PTR(ret);
@@ -287,7 +287,7 @@ void drm_file_free(struct drm_file *file)

WARN_ON(!list_empty(&file->event_list));

- put_pid(file->pid);
+ put_pid(rcu_access_pointer(file->pid));
kfree(file);
}

@@ -501,6 +501,40 @@ int drm_release(struct inode *inode, struct file *filp)
}
EXPORT_SYMBOL(drm_release);

+void drm_file_update_pid(struct drm_file *filp)
+{
+ struct drm_device *dev;
+ struct pid *pid, *old;
+
+ /*
+ * Master nodes need to keep the original ownership in order for
+ * drm_master_check_perm to keep working correctly. (See comment in
+ * drm_auth.c.)
+ */
+ if (filp->was_master)
+ return;
+
+ pid = task_tgid(current);
+
+ /*
+ * Quick unlocked check since the model is a single handover followed by
+ * exclusive repeated use.
+ */
+ if (pid == rcu_access_pointer(filp->pid))
+ return;
+
+ dev = filp->minor->dev;
+ mutex_lock(&dev->filelist_mutex);
+ old = rcu_replace_pointer(filp->pid, pid, 1);
+ mutex_unlock(&dev->filelist_mutex);
+
+ if (pid != old) {
+ get_pid(pid);
+ synchronize_rcu();
+ put_pid(old);
+ }
+}
+
/**
* drm_release_noglobal - release method for DRM file
* @inode: device inode
diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
index 7c9d66ee917d..305b18d9d7b6 100644
--- a/drivers/gpu/drm/drm_ioctl.c
+++ b/drivers/gpu/drm/drm_ioctl.c
@@ -775,6 +775,9 @@ long drm_ioctl_kernel(struct file *file, drm_ioctl_t *func, void *kdata,
struct drm_device *dev = file_priv->minor->dev;
int retcode;

+ /* Update drm_file owner if fd was passed along. */
+ drm_file_update_pid(file_priv);
+
if (drm_dev_is_unplugged(dev))
return -ENODEV;

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 80f154b6adab..a763d3ee61fb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -1097,7 +1097,10 @@ nouveau_drm_open(struct drm_device *dev, struct drm_file *fpriv)
}

get_task_comm(tmpname, current);
- snprintf(name, sizeof(name), "%s[%d]", tmpname, pid_nr(fpriv->pid));
+ rcu_read_lock();
+ snprintf(name, sizeof(name), "%s[%d]",
+ tmpname, pid_nr(rcu_dereference(fpriv->pid)));
+ rcu_read_unlock();

if (!(cli = kzalloc(sizeof(*cli), GFP_KERNEL))) {
ret = -ENOMEM;
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
index f2985337aa53..3853d9bb9ab8 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
@@ -251,6 +251,7 @@ static int vmw_debugfs_gem_info_show(struct seq_file *m, void *unused)
list_for_each_entry(file, &dev->filelist, lhead) {
struct task_struct *task;
struct drm_gem_object *gobj;
+ struct pid *pid;
int id;

/*
@@ -260,8 +261,9 @@ static int vmw_debugfs_gem_info_show(struct seq_file *m, void *unused)
* Therefore, we need to protect this ->comm access using RCU.
*/
rcu_read_lock();
- task = pid_task(file->pid, PIDTYPE_TGID);
- seq_printf(m, "pid %8d command %s:\n", pid_nr(file->pid),
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ seq_printf(m, "pid %8d command %s:\n", pid_nr(pid),
task ? task->comm : "<unknown>");
rcu_read_unlock();

diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index 0d1f853092ab..27d545131d4a 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -255,8 +255,15 @@ struct drm_file {
/** @master_lookup_lock: Serializes @master. */
spinlock_t master_lookup_lock;

- /** @pid: Process that opened this file. */
- struct pid *pid;
+ /**
+ * @pid: Process that is using this file.
+ *
+ * Must only be dereferenced under a rcu_read_lock or equivalent.
+ *
+ * Updates are guarded with dev->filelist_mutex and reference must be
+ * dropped after a RCU grace period to accommodate lockless readers.
+ */
+ struct pid __rcu *pid;

/** @magic: Authentication magic, see @authenticated. */
drm_magic_t magic;
@@ -415,6 +422,8 @@ static inline bool drm_is_accel_client(const struct drm_file *file_priv)
return file_priv->minor->type == DRM_MINOR_ACCEL;
}

+void drm_file_update_pid(struct drm_file *);
+
int drm_open(struct inode *inode, struct file *filp);
int drm_open_helper(struct file *filp, struct drm_minor *minor);
ssize_t drm_read(struct file *filp, char __user *buffer,
--
2.34.1

2023-01-12 18:30:56

by Tvrtko Ursulin

[permalink] [raw]
Subject: [RFC 07/12] drm/cgroup: Add over budget signalling callback

From: Tvrtko Ursulin <[email protected]>

Add a new callback via which the drm cgroup controller notifies the
drm core that a certain process is above its allotted GPU time.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/drm_cgroup.c | 21 +++++++++++++++++++++
include/drm/drm_clients.h | 1 +
include/drm/drm_drv.h | 8 ++++++++
3 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index bc1e34f1a552..ef951421bba6 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -162,3 +162,24 @@ u64 drm_pid_get_active_time_us(struct pid *pid)
return total;
}
EXPORT_SYMBOL_GPL(drm_pid_get_active_time_us);
+
+void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget)
+{
+ struct drm_pid_clients *clients;
+
+ rcu_read_lock();
+ clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+ if (clients) {
+ struct drm_file *fpriv;
+
+ list_for_each_entry_rcu(fpriv, &clients->file_list, clink) {
+ const struct drm_cgroup_ops *cg_ops =
+ fpriv->minor->dev->driver->cg_ops;
+
+ if (cg_ops && cg_ops->signal_budget)
+ cg_ops->signal_budget(fpriv, usage, budget);
+ }
+ }
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(drm_pid_signal_budget);
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
index 7e0c0cf14f25..f3571caa35f8 100644
--- a/include/drm/drm_clients.h
+++ b/include/drm/drm_clients.h
@@ -42,5 +42,6 @@ drm_clients_migrate(struct drm_file *file_priv,
#endif

u64 drm_pid_get_active_time_us(struct pid *pid);
+void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget);

#endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index f5f0e088e1fe..0945e562821a 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -174,6 +174,14 @@ struct drm_cgroup_ops {
* Used by the DRM core when queried by the DRM cgroup controller.
*/
u64 (*active_time_us) (struct drm_file *);
+
+ /**
+ * @signal_budget:
+ *
+ * Optional callback used by the DRM core to forward over/under GPU time
+ * messages sent by the DRM cgroup controller.
+ */
+ int (*signal_budget) (struct drm_file *, u64 used, u64 budget);
};

/**
--
2.34.1

2023-01-12 18:57:58

by Tvrtko Ursulin

[permalink] [raw]
Subject: [RFC 04/12] drm/cgroup: Track clients per owning process

From: Tvrtko Ursulin <[email protected]>

To enable propagation of settings from the cgroup drm controller to drm we
need to start tracking which processes own which drm clients.

Implement that by tracking the struct pid pointer of the owning process in
a new XArray, pointing to a structure containing a list of associated
struct drm_file pointers.

Clients are added and removed under the filelist mutex and RCU list
operations are used below it to allow for lockless lookup.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/drm_cgroup.c | 133 +++++++++++++++++++++++++++++++++++
drivers/gpu/drm/drm_file.c | 20 ++++--
include/drm/drm_clients.h | 44 ++++++++++++
include/drm/drm_file.h | 4 ++
5 files changed, 198 insertions(+), 4 deletions(-)
create mode 100644 drivers/gpu/drm/drm_cgroup.c
create mode 100644 include/drm/drm_clients.h

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 496fa5a6147a..c036b1b379c4 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -59,6 +59,7 @@ drm-$(CONFIG_DRM_LEGACY) += \
drm_scatter.o \
drm_vm.o
drm-$(CONFIG_DRM_LIB_RANDOM) += lib/drm_random.o
+drm-$(CONFIG_CGROUP_DRM) += drm_cgroup.o
drm-$(CONFIG_COMPAT) += drm_ioc32.o
drm-$(CONFIG_DRM_PANEL) += drm_panel.o
drm-$(CONFIG_OF) += drm_of.o
diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
new file mode 100644
index 000000000000..d91512a560ff
--- /dev/null
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#include <drm/drm_drv.h>
+#include <drm/drm_clients.h>
+#include <drm/drm_print.h>
+
+static DEFINE_XARRAY(drm_pid_clients);
+
+static void
+__del_clients(struct drm_pid_clients *clients,
+ struct drm_file *file_priv,
+ unsigned long pid)
+{
+ list_del_rcu(&file_priv->clink);
+ if (atomic_dec_and_test(&clients->num)) {
+ xa_erase(&drm_pid_clients, pid);
+ kfree_rcu(clients, rcu);
+ }
+}
+
+void drm_clients_close(struct drm_file *file_priv)
+{
+ struct drm_device *dev = file_priv->minor->dev;
+ struct drm_pid_clients *clients;
+ struct pid *pid;
+
+ lockdep_assert_held(&dev->filelist_mutex);
+
+ pid = rcu_access_pointer(file_priv->pid);
+ clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+ if (drm_WARN_ON_ONCE(dev, !clients))
+ return;
+
+ __del_clients(clients, file_priv, (unsigned long)pid);
+}
+
+static struct drm_pid_clients *__alloc_clients(void)
+{
+ struct drm_pid_clients *clients;
+
+ clients = kmalloc(sizeof(*clients), GFP_KERNEL);
+ if (clients) {
+ atomic_set(&clients->num, 0);
+ INIT_LIST_HEAD(&clients->file_list);
+ init_rcu_head(&clients->rcu);
+ }
+
+ return clients;
+}
+
+int drm_clients_open(struct drm_file *file_priv)
+{
+ struct drm_device *dev = file_priv->minor->dev;
+ struct drm_pid_clients *clients;
+ bool new_client = false;
+ unsigned long pid;
+
+ lockdep_assert_held(&dev->filelist_mutex);
+
+ pid = (unsigned long)rcu_access_pointer(file_priv->pid);
+ clients = xa_load(&drm_pid_clients, pid);
+ if (!clients) {
+ clients = __alloc_clients();
+ if (!clients)
+ return -ENOMEM;
+ new_client = true;
+ }
+ atomic_inc(&clients->num);
+ list_add_tail_rcu(&file_priv->clink, &clients->file_list);
+ if (new_client) {
+ void *xret;
+
+ xret = xa_store(&drm_pid_clients, pid, clients, GFP_KERNEL);
+ if (xa_err(xret)) {
+ list_del_init(&file_priv->clink);
+ kfree(clients);
+ return PTR_ERR(clients);
+ }
+ }
+
+ return 0;
+}
+
+void
+drm_clients_migrate(struct drm_file *file_priv,
+ struct pid *old,
+ struct pid *new)
+{
+ struct drm_device *dev = file_priv->minor->dev;
+ struct drm_pid_clients *existing_clients;
+ struct drm_pid_clients *clients;
+
+ lockdep_assert_held(&dev->filelist_mutex);
+
+ existing_clients = xa_load(&drm_pid_clients, (unsigned long)new);
+ clients = xa_load(&drm_pid_clients, (unsigned long)old);
+
+ if (drm_WARN_ON_ONCE(dev, !clients))
+ return;
+ else if (drm_WARN_ON_ONCE(dev, clients == existing_clients))
+ return;
+
+ __del_clients(clients, file_priv, (unsigned long)old);
+
+ if (!existing_clients) {
+ void *xret;
+
+ clients = __alloc_clients();
+ if (!clients)
+ goto err;
+
+ xret = xa_store(&drm_pid_clients, (unsigned long)new, clients,
+ GFP_KERNEL);
+ if (xa_err(xret))
+ goto err;
+ } else {
+ clients = existing_clients;
+ }
+
+ atomic_inc(&clients->num);
+ list_add_tail_rcu(&file_priv->clink, &clients->file_list);
+
+ return;
+
+err:
+ rcu_read_lock();
+ drm_warn(dev, "Failed to migrate client from pid %u to %u!\n",
+ pid_nr(old), pid_nr(new));
+ rcu_read_unlock();
+}
diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index f2f8175ece15..5cf446f721f8 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -40,6 +40,7 @@
#include <linux/slab.h>

#include <drm/drm_client.h>
+#include <drm/drm_clients.h>
#include <drm/drm_drv.h>
#include <drm/drm_file.h>
#include <drm/drm_print.h>
@@ -298,6 +299,7 @@ static void drm_close_helper(struct file *filp)

mutex_lock(&dev->filelist_mutex);
list_del(&file_priv->lhead);
+ drm_clients_close(file_priv);
mutex_unlock(&dev->filelist_mutex);

drm_file_free(file_priv);
@@ -349,10 +351,8 @@ int drm_open_helper(struct file *filp, struct drm_minor *minor)

if (drm_is_primary_client(priv)) {
ret = drm_master_open(priv);
- if (ret) {
- drm_file_free(priv);
- return ret;
- }
+ if (ret)
+ goto err_free;
}

filp->private_data = priv;
@@ -360,6 +360,9 @@ int drm_open_helper(struct file *filp, struct drm_minor *minor)
priv->filp = filp;

mutex_lock(&dev->filelist_mutex);
+ ret = drm_clients_open(priv);
+ if (ret)
+ goto err_unlock;
list_add(&priv->lhead, &dev->filelist);
mutex_unlock(&dev->filelist_mutex);

@@ -387,6 +390,13 @@ int drm_open_helper(struct file *filp, struct drm_minor *minor)
#endif

return 0;
+
+err_unlock:
+ mutex_unlock(&dev->filelist_mutex);
+err_free:
+ drm_file_free(priv);
+
+ return ret;
}

/**
@@ -526,6 +536,8 @@ void drm_file_update_pid(struct drm_file *filp)
dev = filp->minor->dev;
mutex_lock(&dev->filelist_mutex);
old = rcu_replace_pointer(filp->pid, pid, 1);
+ if (pid != old)
+ drm_clients_migrate(filp, old, pid);
mutex_unlock(&dev->filelist_mutex);

if (pid != old) {
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
new file mode 100644
index 000000000000..2732fffab3f0
--- /dev/null
+++ b/include/drm/drm_clients.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#ifndef _DRM_CLIENTS_H_
+#define _DRM_CLIENTS_H_
+
+#include <linux/pid.h>
+
+#include <drm/drm_file.h>
+
+struct drm_pid_clients {
+ atomic_t num;
+ struct list_head file_list;
+ struct rcu_head rcu;
+};
+
+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+void drm_clients_close(struct drm_file *file_priv);
+int drm_clients_open(struct drm_file *file_priv);
+
+void
+drm_clients_migrate(struct drm_file *file_priv,
+ struct pid *old, struct pid *new);
+#else
+static inline void drm_clients_close(struct drm_file *file_priv)
+{
+}
+
+static inline int drm_clients_open(struct drm_file *file_priv)
+{
+ return 0;
+}
+
+static void
+drm_clients_migrate(struct drm_file *file_priv,
+ struct pid *old, struct pid *new)
+{
+
+}
+#endif
+
+#endif
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index 27d545131d4a..644c5b17d6a7 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -279,6 +279,10 @@ struct drm_file {
/** @minor: &struct drm_minor for this file. */
struct drm_minor *minor;

+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+ struct list_head clink;
+#endif
+
/**
* @object_idr:
*
--
2.34.1

2023-01-17 16:45:39

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC 04/12] drm/cgroup: Track clients per owning process


On 17/01/2023 16:03, Stanislaw Gruszka wrote:
> Hi
>
> On Thu, Jan 12, 2023 at 04:56:01PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <[email protected]>
>>
>> To enable propagation of settings from the cgroup drm controller to drm we
>> need to start tracking which processes own which drm clients.
>>
>> Implement that by tracking the struct pid pointer of the owning process in
>> a new XArray, pointing to a structure containing a list of associated
>> struct drm_file pointers.
>>
>> Clients are added and removed under the filelist mutex and RCU list
>> operations are used below it to allow for lockless lookup.
>>
>> Signed-off-by: Tvrtko Ursulin <[email protected]>
>
> <snip>
>
>> +int drm_clients_open(struct drm_file *file_priv)
>> +{
>> + struct drm_device *dev = file_priv->minor->dev;
>> + struct drm_pid_clients *clients;
>> + bool new_client = false;
>> + unsigned long pid;
>> +
>> + lockdep_assert_held(&dev->filelist_mutex);
>> +
>> + pid = (unsigned long)rcu_access_pointer(file_priv->pid);
>> + clients = xa_load(&drm_pid_clients, pid);
>> + if (!clients) {
>> + clients = __alloc_clients();
>> + if (!clients)
>> + return -ENOMEM;
>> + new_client = true;
>> + }
>> + atomic_inc(&clients->num);
>> + list_add_tail_rcu(&file_priv->clink, &clients->file_list);
>> + if (new_client) {
>> + void *xret;
>> +
>> + xret = xa_store(&drm_pid_clients, pid, clients, GFP_KERNEL);
>> + if (xa_err(xret)) {
>> + list_del_init(&file_priv->clink);
>> + kfree(clients);
>> + return PTR_ERR(clients);
>
> This looks incorrect, rather xa_err(xret) should be returned.

Thanks, looks like a copy & paste error. Noted to correct before next
public post.

Regards,

Tvrtko

2023-01-17 16:52:48

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [RFC 04/12] drm/cgroup: Track clients per owning process

Hi

On Thu, Jan 12, 2023 at 04:56:01PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <[email protected]>
>
> To enable propagation of settings from the cgroup drm controller to drm we
> need to start tracking which processes own which drm clients.
>
> Implement that by tracking the struct pid pointer of the owning process in
> a new XArray, pointing to a structure containing a list of associated
> struct drm_file pointers.
>
> Clients are added and removed under the filelist mutex and RCU list
> operations are used below it to allow for lockless lookup.
>
> Signed-off-by: Tvrtko Ursulin <[email protected]>

<snip>

> +int drm_clients_open(struct drm_file *file_priv)
> +{
> + struct drm_device *dev = file_priv->minor->dev;
> + struct drm_pid_clients *clients;
> + bool new_client = false;
> + unsigned long pid;
> +
> + lockdep_assert_held(&dev->filelist_mutex);
> +
> + pid = (unsigned long)rcu_access_pointer(file_priv->pid);
> + clients = xa_load(&drm_pid_clients, pid);
> + if (!clients) {
> + clients = __alloc_clients();
> + if (!clients)
> + return -ENOMEM;
> + new_client = true;
> + }
> + atomic_inc(&clients->num);
> + list_add_tail_rcu(&file_priv->clink, &clients->file_list);
> + if (new_client) {
> + void *xret;
> +
> + xret = xa_store(&drm_pid_clients, pid, clients, GFP_KERNEL);
> + if (xa_err(xret)) {
> + list_del_init(&file_priv->clink);
> + kfree(clients);
> + return PTR_ERR(clients);

This looks incorrect, rather xa_err(xret) should be returned.

Regards
Stanislaw

2023-01-23 15:43:09

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller

Hello Tvrtko.

Interesting work.

On Thu, Jan 12, 2023 at 04:55:57PM +0000, Tvrtko Ursulin <[email protected]> wrote:
> Because of the heterogenous hardware and driver DRM capabilities, soft limits
> are implemented as a loose co-operative (bi-directional) interface between the
> controller and DRM core.

IIUC, this periodic scanning, calculating and applying could be partly
implemented with userspace utilities. (As you write, these limits are
best effort only, so it sounds to me that a complete in-kernel
implementation is unnecessary.)

I think a better approach would be to avoid the async querying and
instead require implementing an explicit foo_charge_time(client, dur) API
(similar to how other controllers achieve this).
Your argument is the heterogeneity of devices -- does it mean there are
devices/drivers that can't implement such synchronous charging?
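
(For illustration, a minimal sketch of what such a hook could look like;
the names are hypothetical and not from the posted series:)

/* Hypothetical sketch only -- not part of the posted RFC. */
enum drmcg_budget_state {
	DRMCG_UNDER_BUDGET,
	DRMCG_OVER_BUDGET,
};

/*
 * A driver would call this each time a client completes dur_us of GPU
 * time; the controller charges the owning cgroup and synchronously
 * reports whether that group is now over its allotted budget.
 */
enum drmcg_budget_state drmcg_charge_time(struct drm_file *client, u64 dur_us);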

> DRM core provides an API to query per process GPU utilization and 2nd API to
> receive notification from the cgroup controller when the group enters or exits
> the over budget condition.

The return value of foo_charge_time() would substitute such a
notification synchronously. (By extension all clients in an affected
cgroup could be notified to achieve some broader actions.)

> Individual DRM drivers which implement the interface are expected to act on this
> in the best-effort manner only. There are no guarantees that the soft limits
> will be respected.

Back to original concern -- must all code reside in the kernel when it's
essentially advisory resource control?

> * DRM core is required to track all DRM clients belonging to processes so it
> can answer when asked how much GPU time is a process using.
> [...]
> * Individual drivers need to implement two similar hooks, but which work for
> a single DRM client. Over budget callback and GPU utilisation query.

This information is eventually aggregated for each process in a cgroup.
(And the action is carried out on a single client, not a process.)
The per-process tracking seems like an additional indirection.
Could the clients be associated directly with the DRM cgroup? [1]


Regards,
Michal

[1] I understand that sending a client's fd around is a regular operation,
so I'm not sure how cross-cg migrations would have to be handled in any
case.



2023-01-25 18:11:47

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller


Hi,

On 23/01/2023 15:42, Michal Koutný wrote:
> Hello Tvrtko.
>
> Interesting work.

Thanks!

> On Thu, Jan 12, 2023 at 04:55:57PM +0000, Tvrtko Ursulin <[email protected]> wrote:
>> Because of the heterogenous hardware and driver DRM capabilities, soft limits
>> are implemented as a loose co-operative (bi-directional) interface between the
>> controller and DRM core.
>
> IIUC, this periodic scanning, calculating and applying could be partly
> implemented with userspace utilities. (As you write, these limits are
> best effort only, so it sounds to me that a complete in-kernel
> implementation is unnecessary.)

I don't immediately see how the half-userspace implementation you
envisage would look in terms of what functionality/new APIs would be
provided by the kernel.

> I think a better approach would be to avoid the async querying and
> instead require implementing an explicit foo_charge_time(client, dur) API
> (similar to how other controllers achieve this).
> Your argument is the heterogeneity of devices -- does it mean there are
> devices/drivers that can't implement such synchronous charging?

Problem there is to find a suitable point to charge at. If for a moment
we limit the discussion to i915, out of the box we could have charging
happening anywhere from several thousand times per second to effectively never.
This is to illustrate the GPU context execution dynamics which range
from many small packets of work to multi-minute, or longer. For the
latter to be accounted for we'd still need some periodic scanning, which
would then perhaps go per driver. For the former we'd have thousands of
needless updates per second.

Hence my thinking was to pay both the cost of accounting and collecting
the usage data once per actionable event, where the latter is controlled
by some reasonable scanning period/frequency.
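
(For reference, the pattern being described boils down to roughly the
following sketch; simplified from the series, with scan_tree_and_signal()
standing in for the real tree walk:)

/* Simplified sketch of the periodic scanning described above. */
static struct delayed_work scan_work;
static unsigned int scan_period_us;

static void scan_worker(struct work_struct *work)
{
	/*
	 * Walk the cgroup tree once per period: query accumulated GPU
	 * time for each group via cg_ops->active_time_us(), compare it
	 * against the weight-derived budgets and send over/under budget
	 * signals as needed.
	 */
	scan_tree_and_signal();

	/* One actionable event per period, regardless of GPU activity. */
	schedule_delayed_work(&scan_work, usecs_to_jiffies(scan_period_us));
}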

In addition to that, a few DRM drivers already support GPU usage
querying via fdinfo, so that being externally triggered, it is next to
trivial to wire all those DRM drivers into such common DRM cgroup
controller framework. All that every driver needs to implement on top is
the "over budget" callback.

>> DRM core provides an API to query per process GPU utilization and 2nd API to
>> receive notification from the cgroup controller when the group enters or exits
>> the over budget condition.
>
> The return value of foo_charge_time() would substitute such a
> notification synchronously. (By extension all clients in an affected
> cgroup could be notified to achieve some broader actions.)

Right, it is doable in theory, but as mentioned above some rate limit
would have to be added. And the notification would still need to have
unused budget propagation through the tree, so it wouldn't work to
localize the action to the single cgroup (the one getting the charge).

>> Individual DRM drivers which implement the interface are expected to act on this
>> in the best-effort manner only. There are no guarantees that the soft limits
>> will be respected.
>
> Back to original concern -- must all code reside in the kernel when it's
> essentially advisory resource control?
>
>> * DRM core is required to track all DRM clients belonging to processes so it
>> can answer when asked how much GPU time is a process using.
>> [...]
>> * Individual drivers need to implement two similar hooks, but which work for
>> a single DRM client. Over budget callback and GPU utilisation query.
>
> This information is eventually aggregated for each process in a cgroup.
> (And the action is carried on a single client, not a process.)
> The per-process tracking seems like an additional indirection.
> Could be the clients associated directly with DRM cgroup? [1]

I think you could be right here - with some deeper integration with the
cgroup subsystem this could probably be done. It would require moving
the list of drm clients into the cgroup css state itself. Let me try and
sketch that out in the following weeks because it would be a nice
simplification if it indeed worked out.

Regards,

Tvrtko

>
>
> Regards,
> Michal
>
> [1] I understand the sending a fd of a client is a regular operation, so
> I'm not sure how cross-cg migrations would have to be handled in any
> case.

2023-01-26 13:01:43

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller

On Wed, Jan 25, 2023 at 06:11:35PM +0000, Tvrtko Ursulin <[email protected]> wrote:
> I don't immediately see how the half-userspace implementation you envisage
> would look in terms of what functionality/new APIs would be provided by the
> kernel.

Output:
drm.stat (with consumed time(s))

Input:
drm.throttle (alternatives)
- a) writing 0,1 (in rough analogy to your proposed
notifications)
- b) writing duration (in loose analogy to memory.reclaim)
- for how long GPU work should be backed off

A userspace agent would sit between these two and do the measurement
and calculation depending on given policies (weighting, throttling) and
apply the respective controls.

(In resemblance of e.g. https://denji.github.io/cpulimit/)
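
(A rough sketch of such an agent; note that drm.stat and drm.throttle are
only the proposal above, not an existing ABI, and the cgroup path is made
up:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	unsigned long long prev = 0, now;
	const unsigned long long budget_us = 100000; /* allowed per period */

	for (;;) {
		FILE *f = fopen("/sys/fs/cgroup/job/drm.stat", "r");

		if (!f || fscanf(f, "%llu", &now) != 1)
			return 1;
		fclose(f);

		/* Consumed more GPU time than allowed -> request back off. */
		if (now - prev > budget_us) {
			FILE *t = fopen("/sys/fs/cgroup/job/drm.throttle", "w");

			if (t) {
				fputs("1\n", t);
				fclose(t);
			}
		}

		prev = now;
		sleep(1); /* sampling period */
	}
}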

> Problem there is to find a suitable point to charge at. If for a moment we
> limit the discussion to i915, out of the box we could have charging
> happening anywhere from several thousand times per second to effectively never. This is
> to illustrate the GPU context execution dynamics which range from many small
> packets of work to multi-minute, or longer. For the latter to be accounted
> for we'd still need some periodic scanning, which would then perhaps go per
> driver. For the former we'd have thousands of needless updates per second.
>
> Hence my thinking was to pay both the cost of accounting and collecting the
> usage data once per actionable event, where the latter is controlled by some
> reasonable scanning period/frequency.
>
> In addition to that, a few DRM drivers already support GPU usage querying
> via fdinfo, so that being externally triggered, it is next to trivial to
> wire all those DRM drivers into such common DRM cgroup controller framework.
> All that every driver needs to implement on top is the "over budget"
> callback.

I'd also like to show a comparison with the CPU accounting and controller.
There is tick-based (~sampling) measurement of various components of CPU
time (task_group_account_field()). But the actual scheduling (weights)
or throttling is based on precise accounting (update_curr()).

So, if the goal is to have precise and guaranteed limits, it shouldn't
(cannot) be based on sampling. OTOH, if it must be sampling based due to
variability of the device landscape, it could be advisory mechanism with
the userspace component.

My 0.02€,
Michal



2023-01-26 17:04:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller

Hello,

On Thu, Jan 26, 2023 at 02:00:50PM +0100, Michal Koutný wrote:
> On Wed, Jan 25, 2023 at 06:11:35PM +0000, Tvrtko Ursulin <[email protected]> wrote:
> > I don't immediately see how the half-userspace implementation you envisage
> > would look in terms of what functionality/new APIs would be provided by the
> > kernel.
>
> Output:
> drm.stat (with consumed time(s))
>
> Input:
> drm.throttle (alternatives)
> - a) writing 0,1 (in rough analogy to your proposed
> notifications)
> - b) writing duration (in loose analogy to memory.reclaim)
> - for how long GPU work should be backed off
>
> A userspace agent would sit between these two and do the measurement
> and calculation depending on given policies (weighting, throttling) and
> apply the respective controls.
>
> (In resemblance of e.g. https://denji.github.io/cpulimit/)

Yeah, things like this can be done from userspace but if we're gonna build
the infrastructure to allow that in gpu drivers and so on, I don't see why
we wouldn't add a generic in-kernel control layer if we can implement a
proper weight based control. We can of course also expose .max style
interface to allow userspace to do whatever they wanna do with it.

> > Problem there is to find a suitable point to charge at. If for a moment we
> > limit the discussion to i915, out of the box we could have charging
> > happening anywhere from several thousand times per second to effectively never. This is
> > to illustrate the GPU context execution dynamics which range from many small
> > packets of work to multi-minute, or longer. For the latter to be accounted
> > for we'd still need some periodic scanning, which would then perhaps go per
> > driver. For the former we'd have thousands of needless updates per second.
> >
> > Hence my thinking was to pay both the cost of accounting and collecting the
> > usage data once per actionable event, where the latter is controlled by some
> > reasonable scanning period/frequency.
> >
> > In addition to that, a few DRM drivers already support GPU usage querying
> > via fdinfo, so that being externally triggered, it is next to trivial to
> > wire all those DRM drivers into such common DRM cgroup controller framework.
> > All that every driver needs to implement on top is the "over budget"
> > callback.
>
> I'd also like to show a comparison with the CPU accounting and controller.
> There is tick-based (~sampling) measurement of various components of CPU
> time (task_group_account_field()). But the actual scheduling (weights)
> or throttling is based on precise accounting (update_curr()).
>
> So, if the goal is to have precise and guaranteed limits, it shouldn't
> (cannot) be based on sampling. OTOH, if it must be sampling based due to
> variability of the device landscape, it could be advisory mechanism with
> the userspace component.

As for the specific control mechanism, yeah, charge based interface would be
more conventional and my suspicion is that transposing the current
implementation that way likely isn't too difficult. It just pushes "am I
over the limit?" decisions to the specific drivers with the core layer
telling them how much under/over budget they are. I'm curious what other gpu
driver folks think about the current RFC tho. Is at least AMD on board with
the approach?

Thanks.

--
tejun

2023-01-26 17:57:36

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller


Hi,

(Two replies in one, hope you will manage to navigate it.)

On 26/01/2023 17:04, Tejun Heo wrote:
> Hello,
>
> On Thu, Jan 26, 2023 at 02:00:50PM +0100, Michal Koutný wrote:
>> On Wed, Jan 25, 2023 at 06:11:35PM +0000, Tvrtko Ursulin <[email protected]> wrote:
>>> I don't immediately see how the half-userspace implementation you envisage
>>> would look in terms of what functionality/new APIs would be provided by the
>>> kernel.
>>
>> Output:
>> drm.stat (with consumed time(s))
>>
>> Input:
>> drm.throttle (alternatives)
>> - a) writing 0,1 (in rough analogy to your proposed
>> notifications)
>> - b) writing duration (in loose analogy to memory.reclaim)
>> - for how long GPU work should be backed off
>>
>> A userspace agent would sit between these two and do the measurement
>> and calculation depending on given policies (weighting, throttling) and
>> apply the respective controls.

Right, I wouldn't recommend drm.throttle as ABI since my idea is to
enable drivers to do as good a job as they individually can. E.g. some may
be able to be much smarter than simple throttling, or some may start off
simpler and later gain a better implementation. Some may even have
differing capability or granularity depending on the GPU model they are
driving, like in the case of i915.

So even if the RFC shows just a simple i915 implementation, the
controller itself shouldn't prevent a smarter approach (via exposed
ABI). Nor does this simple i915 implementation work equally well for
all supported GPU generations! This will be a theme common to many DRM
drivers.

Secondly, doing this in userspace would require the ability to get some
sort of an atomic snapshot of the whole tree hierarchy to account for
changes in layout of the tree and task migrations. Or some retry logic
with some added ABI fields to enable it.

Even then I think the only thing we would be able to move to userspace
is the tree-walking logic, and that does not sound like much kernel code
saved in trade for the increased inefficiency.

>> (In resemblance of e.g. https://denji.github.io/cpulimit/)
>
> Yeah, things like this can be done from userspace but if we're gonna build
> the infrastructure to allow that in gpu drivers and so on, I don't see why
> we wouldn't add a generic in-kernel control layer if we can implement a
> proper weight based control. We can of course also expose .max style
> interface to allow userspace to do whatever they wanna do with it.

Yes, agreed, and to stress it again: the ABI as proposed does not preclude
changing from scanning to charging or whatever. The idea was for it to
be compatible in concept with the CPU controller and also avoid baking
in the controlling method to individual drivers.

>>> Problem there is to find a suitable point to charge at. If for a moment we
>>> limit the discussion to i915, out of the box we could have charging
>>> happening anywhere from several thousand times per second to effectively never. This is
>>> to illustrate the GPU context execution dynamics which range from many small
>>> packets of work to multi-minute, or longer. For the latter to be accounted
>>> for we'd still need some periodic scanning, which would then perhaps go per
>>> driver. For the former we'd have thousands of needless updates per second.
>>>
>>> Hence my thinking was to pay both the cost of accounting and collecting the
>>> usage data once per actionable event, where the latter is controlled by some
>>> reasonable scanning period/frequency.
>>>
>>> In addition to that, a few DRM drivers already support GPU usage querying
>>> via fdinfo, so that being externally triggered, it is next to trivial to
>>> wire all those DRM drivers into such common DRM cgroup controller framework.
>>> All that every driver needs to implement on top is the "over budget"
>>> callback.
>>
>> I'd also like to show a comparison with the CPU accounting and controller.
>> There is tick-based (~sampling) measurement of various components of CPU
>> time (task_group_account_field()). But the actual scheduling (weights)
>> or throttling is based on precise accounting (update_curr()).
>>
>> So, if the goal is to have precise and guaranteed limits, it shouldn't
>> (cannot) be based on sampling. OTOH, if it must be sampling based due to
>> variability of the device landscape, it could be advisory mechanism with
>> the userspace component.

I don't think precise and guaranteed limits are feasible given the
heterogeneous nature of DRM driver capabilities, but I also don't think
sticking a userspace component in the middle is the way to go.

> As for the specific control mechanism, yeah, charge based interface would be
> more conventional and my suspicion is that transposing the current
> implementation that way likely isn't too difficult. It just pushes "am I
> over the limit?" decisions to the specific drivers with the core layer
> telling them how much under/over budget they are. I'm curious what other

As I have tried to explain in my previous reply, I don't think real-time
charging is feasible, because the frequency of charging events can be
both too high and too low. Too high in the sense that it doesn't bring
value apart from increased processing times, where it is not useful to
send out notifications at the same rate; and too low in the sense that
some sort of periodic query would then be needed in every driver
implementation to enable all classes of GPU clients to be properly
handled.

I just don't see any positives to the charging approach in the DRM
landscape, but for sure see some negatives. (If we ignore the positive
of it being a more typical approach, but then I think that is not enough
to outweigh the negatives.)

> gpu driver folks think about the current RFC tho. Is at least AMD on board
> with the approach?

Yes I am keenly awaiting comments from the DRM colleagues as well.

Regards,

Tvrtko

P.S. Note that Michal's idea to simplify client tracking is on my TODO
list. If that works out, some patches (actually the series itself) would
become even simpler.

2023-01-26 18:16:47

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller


On 26/01/2023 17:57, Tvrtko Ursulin wrote:
> On 26/01/2023 17:04, Tejun Heo wrote:

>> driver folks think about the current RFC tho. Is at least AMD on board
>> with
>> the approach?
>
> Yes I am keenly awaiting comments from the DRM colleagues as well.

Forgot to mention one thing on this point which may interest AMD.

Some time ago I tested the super primitive "throttling via lowering the
scheduling priority" on a GuC based i915 GPU, which has only three
supported priority levels, and FWIW it can be somewhat effective.

It certainly was effective for my main use case which is "run this GPU
workload in the background while I use the GPU for something else".

The actual test was along the lines of running a GPU hog in parallel to
an interactive client which can measure dropped frames.

With equal drm.weights the interactive client was seeing ~10 (i915
pre-GuC) or ~27 (i915 GuC) dropped frames per second (60 fps target).
With the GPU hog's drm.weight lowered to a 1:10 ratio, that dropped to ~3
dropped frames per second (all 3 before the over budget condition was
noticed by the controller).

The main takeaway here is that improved user experience is possible even
with this primitive throttling method and even on GPUs which support only
three scheduling priority levels.

Although the main thing still is that individual drivers are completely
free to improve their method of handling the over budget signal. Nothing
in the controller itself should be precluding that.

Regards,

Tvrtko

2023-01-27 10:04:47

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller

On Thu, Jan 26, 2023 at 05:57:24PM +0000, Tvrtko Ursulin <[email protected]> wrote:
> So even if the RFC shows just a simple i915 implementation, the controller
> itself shouldn't prevent a smarter approach (via exposed ABI).

scan/query + over budget notification is IMO limited in guarantees.

> [...]
> Yes agreed, and to re-stress out, the ABI as proposed does not preclude
> changing from scanning to charging or whatever. The idea was for it to be
> compatible in concept with the CPU controller and also avoid baking in the
> controlling method to individual drivers.
> [...]

But I defer to your point of rather not exposing this via the cgroup API,
to allow for possible future refinements.

> Secondly, doing this in userspace would require the ability to get some sort
> of an atomic snapshot of the whole tree hierarchy to account for changes in
> layout of the tree and task migrations. Or some retry logic with some added
> ABI fields to enable it.

Note that the proposed implementation is susceptible to miscounting due
to concurrent tree modifications and task migrations too (scanning may
not converge under frequent cgroup layout modifications, and migrating
tasks may be summed 0 or >1 times). While an in-kernel implementation may
assure the snapshot view, it'd come at a cost. (Read: since the mechanism
isn't precise anyway, I don't suggest fully synchronized scanning.)

Regards,
Michal



2023-01-27 11:43:48

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller


On 27/01/2023 10:04, Michal Koutný wrote:
> On Thu, Jan 26, 2023 at 05:57:24PM +0000, Tvrtko Ursulin <[email protected]> wrote:
>> So even if the RFC shows just a simple i915 implementation, the controller
>> itself shouldn't prevent a smarter approach (via exposed ABI).
>
> scan/query + over budget notification is IMO limited in guarantees.

It is, yes. I tried to stress that it is not a hard guarantee in any
shape or form and that the "quality" of adhering to the allocated
budget will depend on individual hw and sw capabilities.

But it is what I believe to be the best approach given a) how different
GPU drivers are in scheduling capability and b) the very fact that there
isn't a central scheduling entity, as opposed to the CPU side of things.

It is just not possible to do a hard guarantee system. GPUs do not
preempt as nicely and easily as CPUs do. And the frequency of
context switches varies widely from too fast to "never", so again,
charging would have several problems to overcome which would IMHO make
the whole setup pointless.

And not only do some GPUs not preempt nicely, some even can't do any of
this, period. Even if we stay within the lineage of hardware supported
only by i915, we have three distinct categories: 1) can't do any of
this, 2a) can do fine grained priority based scheduling with reasonable
preemption capability, 2b) ditto but without reasonable preemption
capability, and 3) like 2a) and 2b) but with the scheduler in the
firmware and currently supporting coarse priority based scheduling.

Shall I also mention that a single cgroup can contain multiple GPU
clients, all using different GPUs with a different mix of the above
listed challenges?

The main point is, should someone prove me wrong and come up with a
smarter way at some point in the future, then "drm.weight" as an ABI remains
compatible and the improvement can happen completely under the hood. In
the mean time users get external control, and _some_ ability to improve
the user experience with the scenarios such as I described yesterday.

>> [...]
>> Yes agreed, and to re-stress out, the ABI as proposed does not preclude
>> changing from scanning to charging or whatever. The idea was for it to be
>> compatible in concept with the CPU controller and also avoid baking in the
>> controlling method to individual drivers.
>> [...]
>
> But I submit to your point of rather not exposing this via cgroup API
> for possible future refinements.

Ack.

>> Secondly, doing this in userspace would require the ability to get some sort
>> of an atomic snapshot of the whole tree hierarchy to account for changes in
>> layout of the tree and task migrations. Or some retry logic with some added
>> ABI fields to enable it.
>
> Note that the proposed implementation is susceptible to miscounting due
> to concurrent tree modifications and task migrations too (scanning may
> not converge under frequent cgroup layout modifications, and migrating
> tasks may be summed 0 or >1 times). While an in-kernel implementation may
> assure the snapshot view, it'd come at a cost. (Read: since the mechanism
> isn't precise anyway, I don't suggest fully synchronized scanning.)

The part about scanning possibly not converging in my _current
implementation_ is true. For instance, clients could be constantly
coming and going; for that I took a shortcut of not bothering to
accumulate usage on process/client exit, and I just wait for two stable
periods before looking at the current state. I reckon this is possibly
okay for the real world.

Cgroup tree hierarchy modifications being the reason for not converging
can also happen, but I thought I could hand-wave that away as not a
realistic scenario. Perhaps I am not imaginative enough?

Under or over-accounting for migrating tasks I don't think can happen
since I am explicitly handling that.

Regards,

Tvrtko

2023-01-27 13:00:40

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller

On Fri, Jan 27, 2023 at 11:40:58AM +0000, Tvrtko Ursulin <[email protected]> wrote:
> The main point is, should someone prove me wrong and come up with a smarter way
> at some point in the future, then "drm.weight" as an ABI remains compatible
> and the improvement can happen completely under the hood. In the mean time
> users get external control, and _some_ ability to improve the user
> experience with the scenarios such as I described yesterday.

I'm on board now.

(I've done a mental switch of likening this rather to existing IO
control (higher variance of quanta size, worse preemption, limited
effect) than CPU control.)


> Cgroup tree hierarchy modifications being the reason for not converging can
> also happen, but I thought I could hand-wave that away as not a realistic
> scenario. Perhaps I am not imaginative enough?

My suggestion: simply skip offlined drmcgs instead of restarting the
whole iteration, as sketched below. (A respawning cgroup-wrapped job or
an intentionally adverse respawner could in my imagination cause that.)
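
Roughly (a sketch against the RFC's scanning loop, with scan_one()
standing in for the real per-group work):

	css_for_each_descendant_pre(node, &root->css) {
		struct drm_cgroup_state *drmcs = css_to_drmcs(node);

		if (!css_tryget_online(node))
			continue;	/* Skip offlined group, don't restart. */

		scan_one(drmcs);
		css_put(node);
	}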

> Under or over-accounting for migrating tasks I don't think can happen since
> I am explicitly handling that.

I'll reply to the patch for better context...

Regards,
Michal



2023-01-27 13:01:47

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

On Thu, Jan 12, 2023 at 04:56:07PM +0000, Tvrtko Ursulin <[email protected]> wrote:
> +static int drmcs_can_attach(struct cgroup_taskset *tset)
> +{
> + int ret;
> +
> + /*
> + * As processes are getting moved between groups we need to ensure
> + * both that the old group does not see a sudden downward jump in the
> + * GPU utilisation, and that the new group does not see a sudden jump
> + * up with all the GPU time clients belonging to the migrated process
> + * have accumulated.
> + *
> + * To achieve that we suspend the scanner until the migration is
> + * completed where the resume at the end ensures both groups start
> + * observing GPU utilisation from a reset state.
> + */
> +
> + ret = mutex_lock_interruptible(&drmcg_mutex);
> + if (ret)
> + return ret;
> + start_suspend_scanning();
> + mutex_unlock(&drmcg_mutex);
> +
> + finish_suspend_scanning();

Here's scanning suspension, communicated via

root_drmcs.scanning_suspended = true;
root_drmcs.suspended_period_us = root_drmcs.period_us;
root_drmcs.period_us = 0;

but I don't see those used in scan_worker() and the scanning traversal
can apparently run concurrently with a task migration.

> [...]
> +static bool
> +__start_scanning(struct drm_cgroup_state *root, unsigned int period_us)
> [...]
> + css_for_each_descendant_post(node, &root->css) {
> [...]
> + active = drmcs_get_active_time_us(drmcs);
> + if (period_us && active > drmcs->prev_active_us)
> + drmcs->active_us += active - drmcs->prev_active_us;
> + drmcs->prev_active_us = active;

drmcs_get_active_time_us() could count a task's contribution here,
the task would migrate to a different drmcs,
and it'd be counted 2nd time.




2023-01-27 13:32:07

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control


On 27/01/2023 13:01, Michal Koutný wrote:
> On Thu, Jan 12, 2023 at 04:56:07PM +0000, Tvrtko Ursulin <[email protected]> wrote:
>> +static int drmcs_can_attach(struct cgroup_taskset *tset)
>> +{
>> + int ret;
>> +
>> + /*
>> + * As processes are getting moved between groups we need to ensure
>> + * both that the old group does not see a sudden downward jump in the
>> + * GPU utilisation, and that the new group does not see a sudden jump
>> + * up with all the GPU time clients belonging to the migrated process
>> + * have accumulated.
>> + *
>> + * To achieve that we suspend the scanner until the migration is
>> + * completed where the resume at the end ensures both groups start
>> + * observing GPU utilisation from a reset state.
>> + */
>> +
>> + ret = mutex_lock_interruptible(&drmcg_mutex);
>> + if (ret)
>> + return ret;
>> + start_suspend_scanning();
>> + mutex_unlock(&drmcg_mutex);
>> +
>> + finish_suspend_scanning();
>
> Here's scanning suspension, communicated via
>
> root_drmcs.scanning_suspended = true;
> root_drmcs.suspended_period_us = root_drmcs.period_us;
> root_drmcs.period_us = 0;
>
> but I don't see those used in scan_worker() and the scanning traversal
> can apparently run concurrently with a task migration.

I think you missed the finish_suspend_scanning() part:

if (root_drmcs.suspended_period_us)
cancel_delayed_work_sync(&root_drmcs.scan_work);

So if scanning was in progress migration will wait until it finishes.
And re-start only when migration is done (drmcs_attach), or it failed
(drmcs_cancel_attach).

Not claiming I did not miss something, because I was totally new to
cgroup internals when I started working on this. So it is definitely
useful to have more eyes looking.

>> [...]
>> +static bool
>> +__start_scanning(struct drm_cgroup_state *root, unsigned int period_us)
>> [...]
>> + css_for_each_descendant_post(node, &root->css) {
>> [...]
>> + active = drmcs_get_active_time_us(drmcs);
>> + if (period_us && active > drmcs->prev_active_us)
>> + drmcs->active_us += active - drmcs->prev_active_us;
>> + drmcs->prev_active_us = active;
>
> drmcs_get_active_time_us() could count a task's contribution here,
> the task would migrate to a different drmcs,
> and it'd be counted 2nd time.

Let's see... __start_scanning() can be called from the worker, so at
most one instance at a time; no issue there.

Then from resume scanning, where it is guaranteed the worker is not
running and can't restart, since the mutex guards the restart.

Finally from drmcs_write_period_us() - yes, there __start_scanning() can
race with the worker invoking it - oops! However, this is just
a debugging aid, as the cover letter explains. This file is not intended
to be present in the final version; rather, as per earlier discussion
with Tejun, the idea is to only have a boot time option to control the
functionality (enable/disable or period).

I will nevertheless try to fix this race up for the next posting to
avoid further confusion!

Regards,

Tvrtko

2023-01-27 14:12:06

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

On Fri, Jan 27, 2023 at 01:31:54PM +0000, Tvrtko Ursulin <[email protected]> wrote:
> I think you missed the finish_suspend_scanning() part:
>
> if (root_drmcs.suspended_period_us)
> cancel_delayed_work_sync(&root_drmcs.scan_work);
>
> So if scanning was in progress migration will wait until it finishes.

Indeed, I've missed that. Thank you!

> Not claiming I did not miss something, because I was totally new to cgroup
> internals when I started working on this. So it is definitely useful to have
> more eyes looking.

The custom with (especially v2, especially horizontal) migrations
is that they're treated leniently to avoid performance costs.

I'm afraid waiting for scan in can_attach() can propagate globally (via
cgroup_update_dfl_csses() and cgroup_attach_lock()) sometimes.

OTOH, unless I misunderstood, you need to cover explicit (not task but
resource, when sending client FD around) migration anyway?
(I.e. my suggestion would be to mutually exclude scanning and explicit
migration, but not scanning and task migration, in order to avoid possible
global propagation.)

Thanks,
Michal



2023-01-27 15:21:31

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control


On 27/01/2023 14:11, Michal Koutný wrote:
> On Fri, Jan 27, 2023 at 01:31:54PM +0000, Tvrtko Ursulin <[email protected]> wrote:
>> I think you missed the finish_suspend_scanning() part:
>>
>> if (root_drmcs.suspended_period_us)
>> cancel_delayed_work_sync(&root_drmcs.scan_work);
>>
>> So if scanning was in progress migration will wait until it finishes.
>
> Indeed, I've missed that. Thank you!
>
>> Not claiming I did not miss something because I was totally new with cgroup
>> internals when I started working on this. So it is definitely useful to have
>> more eyes looking.
>
> The custom with (especially v2, especially horizontal) migrations
> is that they're treated leniently to avoid performance costs.
>
> I'm afraid waiting for scan in can_attach() can propagate globally (via
> cgroup_update_dfl_csses() and cgroup_attach_lock()) sometimes.

That something along those lines might be a concern was indeed worrying
me when coming up with the scheme. Good inside knowledge hint, thank
you. I will have a deeper look.

> OTOH, unless I misunderstood, you need to cover explicit (not task but
> resource, when sending client FD around) migration anyway?

Correct. So far that was handled outside the cgroup controller in the
drm layer and any lock dependency propagation was hidden behind RCU.
But that will likely change once I try your suggestion of eliminating
the struct pid based client tracking, and so it will become relevant.

> (I.e. my suggestion would be to mutually exclude scanning and explicit
> migration, but not scanning and task migration, in order to avoid possible
> global propagation.)

Thanks, I will look into this all hopefully shortly. Perhaps what you
suggest will come naturally with the removal of struct pid based tracking.

Regards,

Tvrtko

2023-01-28 01:11:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

On Thu, Jan 12, 2023 at 04:56:07PM +0000, Tvrtko Ursulin wrote:
...
> +	/*
> +	 * 1st pass - reset working values and update hierarchical weights and
> +	 * GPU utilisation.
> +	 */
> +	if (!__start_scanning(root, period_us))
> +		goto out_retry; /*
> +				 * Always come back later if scanner races with
> +				 * core cgroup management. (Repeated pattern.)
> +				 */
> +
> +	css_for_each_descendant_pre(node, &root->css) {
> +		struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> +		struct cgroup_subsys_state *css;
> +		unsigned int over_weights = 0;
> +		u64 unused_us = 0;
> +
> +		if (!css_tryget_online(node))
> +			goto out_retry;
> +
> +		/*
> +		 * 2nd pass - calculate initial budgets, mark over budget
> +		 * siblings and add up unused budget for the group.
> +		 */
> +		css_for_each_child(css, &drmcs->css) {
> +			struct drm_cgroup_state *sibling = css_to_drmcs(css);
> +
> +			if (!css_tryget_online(css)) {
> +				css_put(node);
> +				goto out_retry;
> +			}
> +
> +			sibling->per_s_budget_us =
> +				DIV_ROUND_UP_ULL(drmcs->per_s_budget_us *
> +						 sibling->weight,
> +						 drmcs->sum_children_weights);
> +
> +			sibling->over = sibling->active_us >
> +					sibling->per_s_budget_us;
> +			if (sibling->over)
> +				over_weights += sibling->weight;
> +			else
> +				unused_us += sibling->per_s_budget_us -
> +					     sibling->active_us;
> +
> +			css_put(css);
> +		}
> +
> +		/*
> +		 * 3rd pass - spread unused budget according to relative weights
> +		 * of over budget siblings.
> +		 */
> +		css_for_each_child(css, &drmcs->css) {
> +			struct drm_cgroup_state *sibling = css_to_drmcs(css);
> +
> +			if (!css_tryget_online(css)) {
> +				css_put(node);
> +				goto out_retry;
> +			}
> +
> +			if (sibling->over) {
> +				u64 budget_us =
> +					DIV_ROUND_UP_ULL(unused_us *
> +							 sibling->weight,
> +							 over_weights);
> +				sibling->per_s_budget_us += budget_us;
> +				sibling->over = sibling->active_us >
> +						sibling->per_s_budget_us;
> +			}
> +
> +			css_put(css);
> +		}
> +
> +		css_put(node);
> +	}
> +
> +	/*
> +	 * 4th pass - send out over/under budget notifications.
> +	 */
> +	css_for_each_descendant_post(node, &root->css) {
> +		struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> +
> +		if (!css_tryget_online(node))
> +			goto out_retry;
> +
> +		if (drmcs->over || drmcs->over_budget)
> +			signal_drm_budget(drmcs,
> +					  drmcs->active_us,
> +					  drmcs->per_s_budget_us);
> +		drmcs->over_budget = drmcs->over;
> +
> +		css_put(node);
> +	}
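
To recap the quoted passes on a toy example (numbers invented): say the
parent has a 1000us budget and two children, A with weight 200 and B
with weight 800. The 2nd pass hands A 200us and B 800us; A was active
for 500us so it is marked over budget, while B was active for 300us so
500us of its budget is unused. The 3rd pass gives all 500us of the
unused budget to A, the only over budget sibling, raising its budget to
700us, so A ends up within budget after all and the 4th pass sends no
throttle signal (unless A was over in the previous period, in which
case one more signal is sent to clear the condition).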

It keeps bothering me that the distribution logic has no memory. Maybe this
is good enough for coarse control with long cycle durations but it likely
will get in trouble if pushed to finer grained control. State keeping
doesn't require a lot of complexity. The only state that needs tracking is
each cgroup's vtime and then the core should be able to tell specific
drivers how much each cgroup is over or under fairly accurately at any given
time.
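
Something along these lines, purely as a sketch (all names made up):

struct drmcs_vtime {
	u64 vtime;	/* weight-scaled consumed GPU time */
};

#define DRMCS_WEIGHT_ONE	100

/* Charge GPU time; vtime advances slower for higher weights. */
static void drmcs_charge(struct drmcs_vtime *vt, u32 weight, u64 gpu_us)
{
	vt->vtime += div_u64(gpu_us * DRMCS_WEIGHT_ONE, weight);
}

/*
 * Positive lag means the group has consumed more than its fair share
 * relative to the tree-wide clock, negative means it is still under.
 */
static s64 drmcs_lag(struct drmcs_vtime *vt, u64 tree_vtime)
{
	return (s64)(vt->vtime - tree_vtime);
}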

That said, this isn't a blocker. What's implemented can work well
enough with a coarse enough time grain and that might be enough for the
time being; we can get back to it later. I think Michal already
mentioned it, but it might be a good idea to track active and inactive
cgroups and build the weight tree with only the active ones. There are
machines with a lot of mostly idle cgroups (tens of thousands and more)
and tree-wide scanning, even at low frequency, can become a pretty bad
bottleneck.

Thanks.

--
tejun

2023-02-02 14:26:16

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control


On 28/01/2023 01:11, Tejun Heo wrote:
> On Thu, Jan 12, 2023 at 04:56:07PM +0000, Tvrtko Ursulin wrote:
> ...
>> [patch hunk snipped, quoted in full in Tejun's message above]
>
> It keeps bothering me that the distribution logic has no memory. Maybe this
> is good enough for coarse control with long cycle durations but it likely
> will get in trouble if pushed to finer grained control. State keeping
> doesn't require a lot of complexity. The only state that needs tracking is
> each cgroup's vtime and then the core should be able to tell specific
> drivers how much each cgroup is over or under fairly accurately at any given
> time.
>
> That said, this isn't a blocker. What's implemented can work well
> enough with a coarse enough time grain and that might be enough for the
> time being; we can get back to it later. I think Michal already
> mentioned it, but it might be a good idea to track active and inactive
> cgroups and build the weight tree with only the active ones. There are
> machines with a lot of mostly idle cgroups (tens of thousands and more)
> and tree-wide scanning, even at low frequency, can become a pretty bad
> bottleneck.

Right, that's the kind of field experience (tens of thousands) I was
missing, thank you. One more item for my TODO list then, but I have a
question first.

When you say active/inactive, what are you referring to in the cgroup
world? Offline/online? My understanding was that offline is only a
temporary state while a css is getting destroyed.

Also, I am really postponing implementing those changes until I hear at
least something from the DRM community.

Regards,

Tvrtko

2023-02-02 20:00:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

Hello,

On Thu, Feb 02, 2023 at 02:26:06PM +0000, Tvrtko Ursulin wrote:
> When you say active/inactive - to what you are referring in the cgroup
> world? Offline/online? For those my understanding was offline was a
> temporary state while css is getting destroyed.

Oh, it's just based on activity. So, for example, iocost puts a cgroup
on its active list, which is scanned periodically, when an IO is issued
from an inactive cgroup. If an active cgroup doesn't have any activity
between two scans, it becomes inactive and is dropped from the list.
drm can probably use the same approach?
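
A sketch of that shape (names made up, assuming drm_cgroup_state grows
an active_node list head next to the active_us counter it already has):

static LIST_HEAD(drmcs_active_list);
static DEFINE_SPINLOCK(drmcs_active_lock);

/* Called whenever a group shows GPU activity. */
static void drmcs_note_active(struct drm_cgroup_state *drmcs)
{
	spin_lock(&drmcs_active_lock);
	if (list_empty(&drmcs->active_node))
		list_add_tail(&drmcs->active_node, &drmcs_active_list);
	spin_unlock(&drmcs_active_lock);
}

/* Called from the scan; drops groups idle since the previous scan. */
static void drmcs_prune_idle(void)
{
	struct drm_cgroup_state *drmcs, *tmp;

	spin_lock(&drmcs_active_lock);
	list_for_each_entry_safe(drmcs, tmp, &drmcs_active_list,
				 active_node)
		if (!drmcs->active_us)
			list_del_init(&drmcs->active_node);
	spin_unlock(&drmcs_active_lock);
}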

> Also, I am really postponing implementing those changes until I hear at
> least something from the DRM community.

Yeah, that sounds like a good idea.

Thanks.

--
tejun