LinuxLists.cc - [RFC 00/17] DRM scheduling cgroup controller

2022-10-19 18:34:29

Subject: [RFC 00/17] DRM scheduling cgroup controller

From: Tvrtko Ursulin <[email protected]>

This series contains two independent proposals for a DRM scheduling cgroup
controller: static priority based controller and GPU usage budget based
controller.

Motivation mostly comes from my earlier proposal where I identified that GPU
scheduling lags significantly behind what is available for CPU and IO. Whereas
back then I was proposing to somehow tie this with process nice, feedback mostly
was that people wanted cgroups. So here it is - in the world of heterogenous
computing pipelines I think it is time to do something about this gap.

Code is not finished but should survive some light experimenting with. I am
sharing it early since the topic has been controversial in the past. I hope to
demonstrate there are gains to be had in real world usage(*), today, and that
the concepts the proposal relies are well enough established and stable.

*) Specifically under ChromeOS which uses cgroups to control CPU bandwith for
VMs based on the window focused status. It can be demonstrated how GPU
scheduling control can easily be integrated into that setup.

There should be no conflict with this proposal and any efforts to implement
memory usage based controller. Skeleton DRM cgroup controller is deliberatly
purely a skeleton patch where any further functionality can be added with no
real conflicts. [In fact, perhaps scheduling is even easier to deal with than
memory accounting.]

To re-iterate, two proposal are completely functionaly independent and can be
evaluated and progressed independently.

Structure of the series is as follows:

1) Adds a skeleton DRM cgroup controller with no functionality.
2-6) First proposal - static priority based controller.
7) i915 adds support for the static priority based controller.
8-14) Second proposal - GPU usage based controller.
15-17) i915 adds support for the GPU usage based controller.

Both proposals define a delegation of duties between the tree parties: cgroup
controller, DRM core and individual drivers. Two way communication interfaces
are then defined to enable the delegation to work. Principle of discoverability
is also employed so that the level of supported functionality can be observed
in situation when there are multiple GPUs in use, each with a different set of
scheduling capabilities.

DRM static priority control
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Static priority control exposes a hierarchical control interface for the
scheduling priority support present in many DRM device drivers.

Hierarchical meaning that the child group priorities are relative to their
parent. As an example:

A=-1000
/\
/ \
/ \
B=0 C=100

This results in the effective priority of a group B of -1000 and C of -900. In
other words the fact C is configured for elevated priority is relative to its
parent being a low priority and hence is only elevated in the context of its
siblings.

The scope of individual DRM scheduling priority may be per device or per device
driver, or a combination of both, depending on the implementation. The
controller does not ensure any priority ordering across multiple DRM drivers nor
does it impose any further policy and leaves desired configuration to the system
administrator.

Individual DRM drivers are required to transparently convert the cgroup priority
into values suitable for their capabilities.

No guarantees on effectiveness or granularity are provided by the controller,
apart the available range being chosen to be an integer and hence allowing a
generic concept of normal (zero), lower (negative values) and higher (positive
values) priority.

DRM static priority interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

drm.priority_levels
One of:
1) And integer representing the minimum number of discrete priority
levels for the whole group.
Optionally followed by an asterisk ('*') indicating some DRM clients
in the group support more than the minimum number.
2) '0'- indicating one or more DRM clients in the group has no support
for static priority control.
3) 'n/a' - when there are no DRM clients in the configured group.

drm.priority
A read-write integer between -10000 and 10000 (inclusive) representing
an abstract static priority level.

drm.effective_priority
Read only integer showing the current effective priority level for the
group. Effective meaning taking into account the chain of inherited

This approach should work with all DRM drivers which support priority control,
for instance i915, amdgpu or msm. In the future possibly all that are built on
top of the drm/scheduler as well.

I considered a few options for priority control including range based (min/max
limits, or a flavour of) but ended up settling for simplest one. Mainly because
of the limited priority level support with some drivers, and so to enable adding
wide spread support as easily as possible.

All that is required to enable the setup to be useful is for the drivers to
support the concept of a low, normal and high scheduling priority, which can
then be mapped to from the abstract negatitve, default zero and positive cgroup
priorities.

If range based controls are later wanted they can be added in a backward
compatible manner. Until then priority overlap is possible meaning groups need
to be correctly configured by the "administrator" (entity configuring cgroup
usage on a given system).

As mentioned before, this controller can be easily integrated with the current
ChromeOS architecture for managing CPU bandwith of focused versus unfocused
VM windows. As the OS re-configures the respective CPU shares in a respective
cgroup on a window focus status change, DRM cgroup controller could be attached
to the same group and DRM scheduling priority changed at the same time.

I have ran an experiment where I have shown that doing this can enable the
foreground browser window hit 60fps with no dropped frames, when faced with a
Android game runnning in the background, where if DRM priorities were not used
foreground window was only able to maintain around 45fps.

DRM scheduling soft limits
~~~~~~~~~~~~~~~~~~~~~~~~~~

Because of the heterogenous hardware and driver DRM capabilities, soft limits
are implemented as a loose co-operative (bi-directional) interface between the
controller and DRM core.

The controller configures the GPU time allowed per group and periodically scans
the belonging tasks to detect the over budget condition, at which point it
invokes a callback notifying the DRM core of the condition.

DRM core provides an API to query per process GPU utilization and 2nd API to
receive notification from the cgroup controller when the group enters or exits
the over budget condition.

Individual DRM drivers which implement the interface are expected to act on this
in the best-effort manner only. There are no guarantees that the soft limits
will be respected.

DRM scheduling soft limits interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

drm.weight
Standard cgroup weight based control [1, 10000] used to configure the
relative distributing of GPU time between the sibling groups.

drm.period_us
An integer representing the period with which the controller should look
at the GPU usage by the group and potentially send the over/under budget
signal.
Value of zero (defaul) disables the soft limit checking.

drm.budget_supported
One of:
1) 'yes' - when all DRM clients in the group support the functionality.
2) 'no' - when at least one of the DRM clients does not support the
functionality.
3) 'n/a' - when there are no DRM clients in the group.

The second proposal is a little bit more advanced in concept and also a little
bit less finished. Interesting thing is that it builds upon the per client GPU
utilisation work which landed recently for a few drivers. So my thinking is that
in principle, an intersect of drivers which support both that and some sort of
priority scheduling control, could also in theory support this.

Another really interesting angle for this controller is that it mimics the same
control menthod used by the CPU scheduler. That is the proportional/weight based
GPU time budgeting. Which makes it easy to configure and does not need a new
mental model.

However, as the introduction mentions, GPUs are much more heterogenous and
therefore the controller uses very "soft" wording as to what it promises. The
general statement is that it can define budgets, notify clients when they are
over them, and let individual drivers implement best effort handling of those
conditions.

Delegation of duties in the implementation goes likes this:

* DRM cgroup controller implements the control files and the scanning loop.
* DRM core is required to track all DRM clients belonging to processes so it
can answer when asked how much GPU time is a process using.
* DRM core also provides a call back which the controller will call when a
certain process is over budget.
* Individual drivers need to implement two similar hooks, but which work for
a single DRM client. Over budget callback and GPU utilisation query.

What I have demonstrated in practice is that when wired to i915, in a really
primitive way where the over-budget condition simply lowers the scheduling
priority, the concept can be almost equally effective as the static priority
control. I say almost because the design where budget control depends on the
periodic usage scanning has a fundamental delay, so responsiveness will depend
on the scanning period, which may or may not be a problem for a particular use
case.

The unfinished part is the GPU budgeting split which currently does not
propagate unused bandwith to children, neither can share it with siblings. But
this is not due fundamental reasons, just to avoid spending too much time on it
too early.

There are also interesting conversations to be had around mental models for what
is GPU usage as a single number when faced with GPUs which have different
execution engines. To an extent this is similar to the multi-core and cgroup
CPU controller problems, but definitely goes further than that.

I deliberately did not want to include any such complications in the controller
itself and left the individual drivers to handle it. For instance in the i915
over-budget callback it will not do anything unless client's GPU usage is on a
physical engine which is oversubscribed. This enables multiple clients to be
harmlessly over budget, as long as they are not competing for the same GPU
resource.

This much for now, hope some good discussion will follow.

P.S.
A disclaimer of a kind - I was not familiar with how to implement a cgroup
controller at all when I started this prototype therefore it is quite possible
there are many bugs and misunderstandings on how it should be done.

Tvrtko Ursulin (17):
cgroup: Add the DRM cgroup controller
drm: Track clients per owning process
cgroup/drm: Support cgroup priority control
drm/cgroup: Allow safe external access to file_priv
drm: Connect priority updates to drm core
drm: Only track clients which are providing drm_cgroup_ops
drm/i915: i915 priority
drm: Allow for migration of clients
cgroup/drm: Introduce weight based drm cgroup control
drm: Add ability to query drm cgroup GPU time
drm: Add over budget signalling callback
cgroup/drm: Client exit hook
cgroup/drm: Ability to periodically scan cgroups for over budget GPU
usage
cgroup/drm: Show group budget signaling capability in sysfs
drm/i915: Migrate client to new owner on context create
drm/i915: Wire up with drm controller GPU time query
drm/i915: Implement cgroup controller over budget throttling

Documentation/admin-guide/cgroup-v2.rst | 100 +++
drivers/gpu/drm/Kconfig | 1 +
drivers/gpu/drm/Makefile | 1 +
drivers/gpu/drm/drm_cgroup.c | 294 +++++++
drivers/gpu/drm/drm_file.c | 22 +-
drivers/gpu/drm/i915/gem/i915_gem_context.c | 3 +
.../gpu/drm/i915/gem/i915_gem_execbuffer.c | 45 +-
drivers/gpu/drm/i915/i915_driver.c | 12 +
drivers/gpu/drm/i915/i915_drm_client.c | 183 ++++-
drivers/gpu/drm/i915/i915_drm_client.h | 15 +
include/drm/drm_clients.h | 50 ++
include/drm/drm_drv.h | 73 ++
include/drm/drm_file.h | 15 +
include/linux/cgroup_drm.h | 18 +
include/linux/cgroup_subsys.h | 4 +
init/Kconfig | 8 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/drm.c | 764 ++++++++++++++++++
18 files changed, 1594 insertions(+), 15 deletions(-)
create mode 100644 drivers/gpu/drm/drm_cgroup.c
create mode 100644 include/drm/drm_clients.h
create mode 100644 include/linux/cgroup_drm.h
create mode 100644 kernel/cgroup/drm.c

--
2.34.1

2022-10-19 18:35:45

by Tvrtko Ursulin

[permalink] [raw]

Subject: [RFC 16/17] drm/i915: Wire up with drm controller GPU time query

From: Tvrtko Ursulin <[email protected]>

Implement the drm_cgroup_ops->active_time_us callback.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/i915/i915_driver.c | 1 +
drivers/gpu/drm/i915/i915_drm_client.c | 106 +++++++++++++++++++------
drivers/gpu/drm/i915/i915_drm_client.h | 2 +
3 files changed, 83 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index 7912782b87cc..b949fd715202 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1896,6 +1896,7 @@ static const struct drm_ioctl_desc i915_ioctls[] = {
#ifdef CONFIG_CGROUP_DRM
static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
.priority_levels = i915_drm_priority_levels,
+ .active_time_us = i915_drm_cgroup_get_active_time_us,
};
#endif

diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 61a3cdaa7b16..8527fe80d449 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -75,23 +75,7 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients)
xa_destroy(&clients->xarray);
}

-#ifdef CONFIG_CGROUP_DRM
-unsigned int i915_drm_priority_levels(struct drm_file *file)
-{
- struct drm_i915_file_private *fpriv = file->driver_priv;
- struct i915_drm_client *client = fpriv->client;
- struct drm_i915_private *i915 = client->clients->i915;
-
- if (GRAPHICS_VER(i915) < 8)
- return 0;
- else if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
- return 3;
- else
- return 2047;
-}
-#endif
-
-#ifdef CONFIG_PROC_FS
+#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
static const char * const uabi_class_names[] = {
[I915_ENGINE_CLASS_RENDER] = "render",
[I915_ENGINE_CLASS_COPY] = "copy",
@@ -116,22 +100,92 @@ static u64 busy_add(struct i915_gem_context *ctx, unsigned int class)
return total;
}

-static void
-show_client_class(struct seq_file *m,
- struct i915_drm_client *client,
- unsigned int class)
+static u64 get_class_active_ns(struct i915_drm_client *client,
+ unsigned int class,
+ unsigned int *capacity)
{
- const struct list_head *list = &client->ctx_list;
- u64 total = atomic64_read(&client->past_runtime[class]);
- const unsigned int capacity =
- client->clients->i915->engine_uabi_class_count[class];
struct i915_gem_context *ctx;
+ u64 total;
+
+ *capacity =
+ client->clients->i915->engine_uabi_class_count[class];
+ if (!*capacity)
+ return 0;
+
+ total = atomic64_read(&client->past_runtime[class]);

rcu_read_lock();
- list_for_each_entry_rcu(ctx, list, client_link)
+ list_for_each_entry_rcu(ctx, &client->ctx_list, client_link)
total += busy_add(ctx, class);
rcu_read_unlock();

+ return total;
+}
+#endif
+
+#ifdef CONFIG_CGROUP_DRM
+unsigned int i915_drm_priority_levels(struct drm_file *file)
+{
+ struct drm_i915_file_private *fpriv = file->driver_priv;
+ struct i915_drm_client *client = fpriv->client;
+ struct drm_i915_private *i915 = client->clients->i915;
+
+ if (GRAPHICS_VER(i915) < 8)
+ return 0;
+ else if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+ return 3;
+ else
+ return 2047;
+}
+
+static bool supports_stats(struct drm_i915_private *i915)
+{
+ if (GRAPHICS_VER(i915) < 8)
+ return false;
+
+ /* temporary... */
+ if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+ return false;
+
+ return true;
+}
+
+u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file)
+{
+ struct drm_i915_file_private *fpriv = file->driver_priv;
+ struct i915_drm_client *client = fpriv->client;
+ unsigned int i;
+ u64 busy = 0;
+
+ if (!supports_stats(client->clients->i915))
+ return 0;
+
+ for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++) {
+ unsigned int capacity;
+ u64 b;
+
+ b = get_class_active_ns(client, i, &capacity);
+ if (capacity) {
+ b = DIV_ROUND_UP_ULL(b, capacity * 1000);
+ busy += b;
+ }
+ }
+
+ return busy;
+}
+#endif
+
+#ifdef CONFIG_PROC_FS
+static void
+show_client_class(struct seq_file *m,
+ struct i915_drm_client *client,
+ unsigned int class)
+{
+ unsigned int capacity;
+ u64 total;
+
+ total = get_class_active_ns(client, class, &capacity);
+
if (capacity)
seq_printf(m, "drm-engine-%s:\t%llu ns\n",
uabi_class_names[class], total);
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index bd5925241007..99b8ae01c183 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -69,4 +69,6 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients);

unsigned int i915_drm_priority_levels(struct drm_file *file);

+u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file);
+
#endif /* !__I915_DRM_CLIENT_H__ */
--
2.34.1

2022-10-19 18:37:41

by Tvrtko Ursulin

[permalink] [raw]

Subject: [RFC 11/17] drm: Add over budget signalling callback

From: Tvrtko Ursulin <[email protected]>

Add a new callback via which the drm cgroup controller is notifying the
drm core that a certain process is above its allotted GPU time.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/drm_cgroup.c | 21 +++++++++++++++++++++
include/drm/drm_clients.h | 1 +
include/drm/drm_drv.h | 8 ++++++++
3 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/drm_cgroup.c b/drivers/gpu/drm/drm_cgroup.c
index e0cadb5e5659..e36bc4333924 100644
--- a/drivers/gpu/drm/drm_cgroup.c
+++ b/drivers/gpu/drm/drm_cgroup.c
@@ -230,3 +230,24 @@ u64 drm_pid_get_active_time_us(struct pid *pid)
return total;
}
EXPORT_SYMBOL_GPL(drm_pid_get_active_time_us);
+
+void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget)
+{
+ struct drm_pid_clients *clients;
+
+ rcu_read_lock();
+ clients = xa_load(&drm_pid_clients, (unsigned long)pid);
+ if (clients) {
+ struct drm_file *fpriv;
+
+ list_for_each_entry_rcu(fpriv, &clients->file_list, clink) {
+ const struct drm_cgroup_ops *cg_ops =
+ fpriv->minor->dev->driver->cg_ops;
+
+ if (cg_ops && cg_ops->signal_budget)
+ cg_ops->signal_budget(fpriv, usage, budget);
+ }
+ }
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(drm_pid_signal_budget);
diff --git a/include/drm/drm_clients.h b/include/drm/drm_clients.h
index f25e09ed5feb..7ad09fd0a404 100644
--- a/include/drm/drm_clients.h
+++ b/include/drm/drm_clients.h
@@ -38,5 +38,6 @@ static inline void drm_clients_migrate(struct drm_file *file_priv)
unsigned int drm_pid_priority_levels(struct pid *pid, bool *non_uniform);
void drm_pid_update_priority(struct pid *pid, int priority);
u64 drm_pid_get_active_time_us(struct pid *pid);
+void drm_pid_signal_budget(struct pid *pid, u64 usage, u64 budget);

#endif
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 0f1802df01fe..07dec956ebfb 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -192,6 +192,14 @@ struct drm_cgroup_ops {
* Used by the DRM core when queried by the DRM cgroup controller.
*/
u64 (*active_time_us) (struct drm_file *);
+
+ /**
+ * @signal_budget:
+ *
+ * Optional callback used by the DRM core to forward over/under GPU time
+ * messages sent by the DRM cgroup controller.
+ */
+ void (*signal_budget) (struct drm_file *, u64 used, u64 budget);
};

/**
--
2.34.1

2022-10-19 18:55:37

by Tejun Heo

[permalink] [raw]

Subject: Re: [RFC 00/17] DRM scheduling cgroup controller

Hello,

On Wed, Oct 19, 2022 at 06:32:37PM +0100, Tvrtko Ursulin wrote:
...
> DRM static priority interface files
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> drm.priority_levels
> One of:
> 1) And integer representing the minimum number of discrete priority
> levels for the whole group.
> Optionally followed by an asterisk ('*') indicating some DRM clients
> in the group support more than the minimum number.
> 2) '0'- indicating one or more DRM clients in the group has no support
> for static priority control.
> 3) 'n/a' - when there are no DRM clients in the configured group.
>
> drm.priority
> A read-write integer between -10000 and 10000 (inclusive) representing
> an abstract static priority level.
>
> drm.effective_priority
> Read only integer showing the current effective priority level for the
> group. Effective meaning taking into account the chain of inherited

From interface POV, this is a lot worse than the second proposal and I'd
really like to avoid this. Even if we go with mapping user priority
configuration to per-driver priorities, I'd much prefer if the interface
presented to user is weight based and let each driver try to match the
resulting hierarchical weight (ie. the absolute proportion a given cgroup
should have at the point in time) as best as they can rather than exposing
opaque priority numbers to userspace whose meaning isn't defined at all.

> DRM scheduling soft limits interface files
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> drm.weight
> Standard cgroup weight based control [1, 10000] used to configure the
> relative distributing of GPU time between the sibling groups.

Please take a look at io.weight. This can follow the same convention to
express both global and per-device weights.

> drm.period_us
> An integer representing the period with which the controller should look
> at the GPU usage by the group and potentially send the over/under budget
> signal.
> Value of zero (defaul) disables the soft limit checking.

Can we not do period_us or at least make it a per-driver tuning parameter
exposed as module param? Weight, users can easily understand and configure.
period_us is a lot more an implementation detail. If we want to express the
trade-off between latency and bandwidth at the interface, we prolly should
encode the latency requirement in a more canonical way but let's leave that
for the future.

> drm.budget_supported
> One of:
> 1) 'yes' - when all DRM clients in the group support the functionality.
> 2) 'no' - when at least one of the DRM clients does not support the
> functionality.
> 3) 'n/a' - when there are no DRM clients in the group.

Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
with and I'm not sure 'no' meaning at least one device not supporting is
intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
drop this.

Another basic interface question. Is everyone happy with the drm prefix or
should it be something like gpu? Also, in the future, if there's a consensus
around how to control gpu memory, what prefix would that take?

> The second proposal is a little bit more advanced in concept and also a little
> bit less finished. Interesting thing is that it builds upon the per client GPU
> utilisation work which landed recently for a few drivers. So my thinking is that
> in principle, an intersect of drivers which support both that and some sort of
> priority scheduling control, could also in theory support this.
>
> Another really interesting angle for this controller is that it mimics the same
> control menthod used by the CPU scheduler. That is the proportional/weight based
> GPU time budgeting. Which makes it easy to configure and does not need a new
> mental model.
>
> However, as the introduction mentions, GPUs are much more heterogenous and
> therefore the controller uses very "soft" wording as to what it promises. The
> general statement is that it can define budgets, notify clients when they are
> over them, and let individual drivers implement best effort handling of those
> conditions.
>
> Delegation of duties in the implementation goes likes this:
>
> * DRM cgroup controller implements the control files and the scanning loop.
> * DRM core is required to track all DRM clients belonging to processes so it
> can answer when asked how much GPU time is a process using.
> * DRM core also provides a call back which the controller will call when a
> certain process is over budget.
> * Individual drivers need to implement two similar hooks, but which work for
> a single DRM client. Over budget callback and GPU utilisation query.
>
> What I have demonstrated in practice is that when wired to i915, in a really
> primitive way where the over-budget condition simply lowers the scheduling
> priority, the concept can be almost equally effective as the static priority
> control. I say almost because the design where budget control depends on the
> periodic usage scanning has a fundamental delay, so responsiveness will depend
> on the scanning period, which may or may not be a problem for a particular use
> case.
>
> The unfinished part is the GPU budgeting split which currently does not
> propagate unused bandwith to children, neither can share it with siblings. But
> this is not due fundamental reasons, just to avoid spending too much time on it
> too early.

Rather than doing it hierarchically on the spot, it's usually a lot cheaper
and easier to calculate the flattened hierarchical weight per leaf cgroup
and divide the bandwidth according to the eventual portions. For an example,
please take a look at block/blk-iocost.c.

I don't know much about the drm driver side, so can't comment much on it but
I do really like the idea of having the core implementation determining who
should get how much and then letting each driver enforce the target. That
seems a lot more robust and generic than trying to somehow coax and expose
per-driver priority implementations directly.

Thanks.

--
tejun

2022-10-27 14:47:09

by Tvrtko Ursulin

[permalink] [raw]

Subject: Re: [RFC 00/17] DRM scheduling cgroup controller

Hi Tejun,

On 19/10/2022 19:45, Tejun Heo wrote:
> Hello,
>
> On Wed, Oct 19, 2022 at 06:32:37PM +0100, Tvrtko Ursulin wrote:
> ...
>> DRM static priority interface files
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> drm.priority_levels
>> One of:
>> 1) And integer representing the minimum number of discrete priority
>> levels for the whole group.
>> Optionally followed by an asterisk ('*') indicating some DRM clients
>> in the group support more than the minimum number.
>> 2) '0'- indicating one or more DRM clients in the group has no support
>> for static priority control.
>> 3) 'n/a' - when there are no DRM clients in the configured group.
>>
>> drm.priority
>> A read-write integer between -10000 and 10000 (inclusive) representing
>> an abstract static priority level.
>>
>> drm.effective_priority
>> Read only integer showing the current effective priority level for the
>> group. Effective meaning taking into account the chain of inherited
>
>>From interface POV, this is a lot worse than the second proposal and I'd
> really like to avoid this. Even if we go with mapping user priority
> configuration to per-driver priorities, I'd much prefer if the interface
> presented to user is weight based and let each driver try to match the
> resulting hierarchical weight (ie. the absolute proportion a given cgroup
> should have at the point in time) as best as they can rather than exposing
> opaque priority numbers to userspace whose meaning isn't defined at all.

I actually somewhat agree here and this proposal was mainly motivated by
my desire to come up with *something* which would allow similar
_external_ control as it exists in the CPU and IO world (nice/ionice).

Because currently priority of GPU workloads cannot be controlled from
the outside at all. And especially if we consider pipelines composed of
stages where part of the processing is done on the CPU and part on the
GPU (or AI/VPU accelerator), then I think it would be really useful to
be able to do so.

To this effect I proposed connecting CPU nice with GPU priority, same as
it is connected for IO priority (so only if it hasn't been explicitly
changed away from the defaults), but at that point I was getting
feedback nudging me into direction of cgroups.

Looking at what's available in cgroups world now, I have spotted the
blkio.prio.class control. For my current use case (lower GPU scheduling
of background/unfocused windows) that would also work. Even if starting
with just two possible values - 'no-change' and 'idle' (to follow the IO
controller naming).

How would you view that as a proposal? It would be a bit tougher "sell"
to the DRM community, perhaps, given that many drivers do have
scheduling priority, but the concept of scheduling class is not really
there. Nevertheless a concept of normal-vs-background could be plausible
in my mind. It could be easily implemented by using the priority
controls available in many drivers.

>> DRM scheduling soft limits interface files
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> drm.weight
>> Standard cgroup weight based control [1, 10000] used to configure the
>> relative distributing of GPU time between the sibling groups.
>
> Please take a look at io.weight. This can follow the same convention to
> express both global and per-device weights.
>
>> drm.period_us
>> An integer representing the period with which the controller should look
>> at the GPU usage by the group and potentially send the over/under budget
>> signal.
>> Value of zero (defaul) disables the soft limit checking.
>
> Can we not do period_us or at least make it a per-driver tuning parameter
> exposed as module param? Weight, users can easily understand and configure.
> period_us is a lot more an implementation detail. If we want to express the
> trade-off between latency and bandwidth at the interface, we prolly should
> encode the latency requirement in a more canonical way but let's leave that
> for the future.

Yes agreed - for the moment (while RFC) it is handy for me to have it
controllable per group. But I agree a kernel global (modparam) should be
sufficient in the final solution.

>
>> drm.budget_supported
>> One of:
>> 1) 'yes' - when all DRM clients in the group support the functionality.
>> 2) 'no' - when at least one of the DRM clients does not support the
>> functionality.
>> 3) 'n/a' - when there are no DRM clients in the group.
>
> Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
> with and I'm not sure 'no' meaning at least one device not supporting is
> intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
> drop this.

The idea actually is for this to be per-cgroup and potentially change
dynamically. It implements the concept of "observability", how I have,
perhaps clumsily, named it.

This is because we can have a mix of DRM file descriptors in a cgroup,
not all of which support the proposed functionality. So I wanted to have
something by which the administrator can observe the status of the group.

For instance seeing some clients do not support the feature could be
signal that things have been misconfigured, or that appeal needs to be
made for driver X to start supporting the feature. Seeing a "no" there
in other words is a signal that budgeting may not really work as
expected and needs to be investigated.

> Another basic interface question. Is everyone happy with the drm prefix or
> should it be something like gpu? Also, in the future, if there's a consensus
> around how to control gpu memory, what prefix would that take?

Given the current effort to bring in AI/VPU/ML accelerator devices under
the DRM umbrella I think drm prefix is okay. At least some of those
devices (the ones I looked at) will accept the same concepts of
scheduling and memory allocation as well.

And for memory specifically, I don't see why the drm prefix would not work.

But yeah, I echo the calls for wider feedback. Given how drm cgroup
controller has been on a wish list for ages I have to say I expected a
bit more interest. :)

>> The second proposal is a little bit more advanced in concept and also a little
>> bit less finished. Interesting thing is that it builds upon the per client GPU
>> utilisation work which landed recently for a few drivers. So my thinking is that
>> in principle, an intersect of drivers which support both that and some sort of
>> priority scheduling control, could also in theory support this.
>>
>> Another really interesting angle for this controller is that it mimics the same
>> control menthod used by the CPU scheduler. That is the proportional/weight based
>> GPU time budgeting. Which makes it easy to configure and does not need a new
>> mental model.
>>
>> However, as the introduction mentions, GPUs are much more heterogenous and
>> therefore the controller uses very "soft" wording as to what it promises. The
>> general statement is that it can define budgets, notify clients when they are
>> over them, and let individual drivers implement best effort handling of those
>> conditions.
>>
>> Delegation of duties in the implementation goes likes this:
>>
>> * DRM cgroup controller implements the control files and the scanning loop.
>> * DRM core is required to track all DRM clients belonging to processes so it
>> can answer when asked how much GPU time is a process using.
>> * DRM core also provides a call back which the controller will call when a
>> certain process is over budget.
>> * Individual drivers need to implement two similar hooks, but which work for
>> a single DRM client. Over budget callback and GPU utilisation query.
>>
>> What I have demonstrated in practice is that when wired to i915, in a really
>> primitive way where the over-budget condition simply lowers the scheduling
>> priority, the concept can be almost equally effective as the static priority
>> control. I say almost because the design where budget control depends on the
>> periodic usage scanning has a fundamental delay, so responsiveness will depend
>> on the scanning period, which may or may not be a problem for a particular use
>> case.
>>
>> The unfinished part is the GPU budgeting split which currently does not
>> propagate unused bandwith to children, neither can share it with siblings. But
>> this is not due fundamental reasons, just to avoid spending too much time on it
>> too early.
>
> Rather than doing it hierarchically on the spot, it's usually a lot cheaper
> and easier to calculate the flattened hierarchical weight per leaf cgroup
> and divide the bandwidth according to the eventual portions. For an example,
> please take a look at block/blk-iocost.c.

This seems exactly what I had in mind (but haven't implemented it yet).
So in this RFC I have budget splitting per group where each tree level
adds up to "100%" and the thing which I have not implemented is
"borrowing" or yielding (how blk-iocost.c calls it, or donating) unused
budgets to siblings.

I am very happy to hear my idea is the right one and someone already
implemented it. Thanks for this pointer!

> I don't know much about the drm driver side, so can't comment much on it but
> I do really like the idea of having the core implementation determining who
> should get how much and then letting each driver enforce the target. That
> seems a lot more robust and generic than trying to somehow coax and expose
> per-driver priority implementations directly.

Thanks, and thanks for having a detailed read and providing great
feedback so far!

Regards,

Tvrtko

2022-10-31 20:34:04

by Tejun Heo

[permalink] [raw]

Subject: Re: [RFC 00/17] DRM scheduling cgroup controller

Hello,

On Thu, Oct 27, 2022 at 03:32:00PM +0100, Tvrtko Ursulin wrote:
> Looking at what's available in cgroups world now, I have spotted the
> blkio.prio.class control. For my current use case (lower GPU scheduling of
> background/unfocused windows) that would also work. Even if starting with
> just two possible values - 'no-change' and 'idle' (to follow the IO
> controller naming).

I wouldn't follow that example. That's only meaningful within the context of
bfq and it probabaly shouldn't have been merged in the first place.

> How would you view that as a proposal? It would be a bit tougher "sell" to
> the DRM community, perhaps, given that many drivers do have scheduling
> priority, but the concept of scheduling class is not really there.
> Nevertheless a concept of normal-vs-background could be plausible in my
> mind. It could be easily implemented by using the priority controls
> available in many drivers.

I don't feel great about that.

* The semantics aren't clearly defined. While not immediately obvious in the
interface, the task nice levels have definite mappings to weight values
and thus clear meanings. I don't think it's a good idea to leave the
interface semantics open to interpretation.

* Maybe GPUs are better but my experience with optional hardware features in
the storage world has been that vendors diverge wildly and unexpectedly to
the point many features are mostly useless. There are fewer GPU vendors
and more software effort behind each, so maybe the situation is better but
I think it'd be helpul to keep some skepticism.

* Even when per-vendor or per-driver features are consistent enough to be
useful, they often aren't thought through enough to be truly useful. e.g.
nvme has priority features but they aren't really that useful because they
can't do much without congestion control on the issuer side and once you
have congestion control on the issuer side which is usually a lot more
complex (e.g. dealing with work-conserving hierarchical weight
distributions, priority inversions and so on), you can achieve most of
what you need in terms of resource control from the issuer side anyway.

So, I'd much prefer to have a fuller solution on the kernel side which
integrates per-vendor/driver features where they make sense.

> > > drm.budget_supported
> > > One of:
> > > 1) 'yes' - when all DRM clients in the group support the functionality.
> > > 2) 'no' - when at least one of the DRM clients does not support the
> > > functionality.
> > > 3) 'n/a' - when there are no DRM clients in the group.
> >
> > Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
> > with and I'm not sure 'no' meaning at least one device not supporting is
> > intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
> > drop this.
>
> The idea actually is for this to be per-cgroup and potentially change
> dynamically. It implements the concept of "observability", how I have,
> perhaps clumsily, named it.
>
> This is because we can have a mix of DRM file descriptors in a cgroup, not
> all of which support the proposed functionality. So I wanted to have
> something by which the administrator can observe the status of the group.
>
> For instance seeing some clients do not support the feature could be signal
> that things have been misconfigured, or that appeal needs to be made for
> driver X to start supporting the feature. Seeing a "no" there in other words
> is a signal that budgeting may not really work as expected and needs to be
> investigated.

I still don't see how this is per-cgroup given that it's indicating whether
the driver supports some feature. Also, the eventual goal would be
supporting the same control mechanisms across most (if not all) GPUs, right?

> > Rather than doing it hierarchically on the spot, it's usually a lot cheaper
> > and easier to calculate the flattened hierarchical weight per leaf cgroup
> > and divide the bandwidth according to the eventual portions. For an example,
> > please take a look at block/blk-iocost.c.
>
> This seems exactly what I had in mind (but haven't implemented it yet). So
> in this RFC I have budget splitting per group where each tree level adds up
> to "100%" and the thing which I have not implemented is "borrowing" or
> yielding (how blk-iocost.c calls it, or donating) unused budgets to
> siblings.
>
> I am very happy to hear my idea is the right one and someone already
> implemented it. Thanks for this pointer!

The budget donation thing in iocost is necessary only because it wants to
make the hot path local to the cgroup because io control has to support very
high decision rate. For time-slicing GPU, it's likely that following the
current hierarchical weight on the spot is enough.

Thanks.

--
tejun

2022-11-09 17:28:29

by Tvrtko Ursulin

[permalink] [raw]

Subject: Re: [RFC 00/17] DRM scheduling cgroup controller

On 31/10/2022 20:20, Tejun Heo wrote:
> Hello,
>
> On Thu, Oct 27, 2022 at 03:32:00PM +0100, Tvrtko Ursulin wrote:
>> Looking at what's available in cgroups world now, I have spotted the
>> blkio.prio.class control. For my current use case (lower GPU scheduling of
>> background/unfocused windows) that would also work. Even if starting with
>> just two possible values - 'no-change' and 'idle' (to follow the IO
>> controller naming).
>
> I wouldn't follow that example. That's only meaningful within the context of
> bfq and it probabaly shouldn't have been merged in the first place.
>
>> How would you view that as a proposal? It would be a bit tougher "sell" to
>> the DRM community, perhaps, given that many drivers do have scheduling
>> priority, but the concept of scheduling class is not really there.
>> Nevertheless a concept of normal-vs-background could be plausible in my
>> mind. It could be easily implemented by using the priority controls
>> available in many drivers.
>
> I don't feel great about that.
>
> * The semantics aren't clearly defined. While not immediately obvious in the
> interface, the task nice levels have definite mappings to weight values
> and thus clear meanings. I don't think it's a good idea to leave the
> interface semantics open to interpretation.

Agreed it is not clearly defined, but it was the same with nice levels.
As in it changed a lot over the years how exactly they behave (with a
few scheduler rewrites) and they were constantly at least somewhat
useful as a mean of external control.

Nevertheless you will notice I have dropped everything priority based
from the v2 of the RFC to simplify the conversation going forward.

> * Maybe GPUs are better but my experience with optional hardware features in
> the storage world has been that vendors diverge wildly and unexpectedly to
> the point many features are mostly useless. There are fewer GPU vendors
> and more software effort behind each, so maybe the situation is better but
> I think it'd be helpul to keep some skepticism.
>
> * Even when per-vendor or per-driver features are consistent enough to be
> useful, they often aren't thought through enough to be truly useful. e.g.
> nvme has priority features but they aren't really that useful because they
> can't do much without congestion control on the issuer side and once you
> have congestion control on the issuer side which is usually a lot more
> complex (e.g. dealing with work-conserving hierarchical weight
> distributions, priority inversions and so on), you can achieve most of
> what you need in terms of resource control from the issuer side anyway.

GPUs will not be fully uniform either, especially in the terms of how
well the controls work, which is why I am spelling out how this is only
attempting to do "soft limits", everywhere in the documentation, cover
letter and patch commit message.

But at least concept of GPU time feels to me like a very universal one
so should be something which we can base the control on.

> So, I'd much prefer to have a fuller solution on the kernel side which
> integrates per-vendor/driver features where they make sense.
>
>>>> drm.budget_supported
>>>> One of:
>>>> 1) 'yes' - when all DRM clients in the group support the functionality.
>>>> 2) 'no' - when at least one of the DRM clients does not support the
>>>> functionality.
>>>> 3) 'n/a' - when there are no DRM clients in the group.
>>>
>>> Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
>>> with and I'm not sure 'no' meaning at least one device not supporting is
>>> intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
>>> drop this.
>>
>> The idea actually is for this to be per-cgroup and potentially change
>> dynamically. It implements the concept of "observability", how I have,
>> perhaps clumsily, named it.
>>
>> This is because we can have a mix of DRM file descriptors in a cgroup, not
>> all of which support the proposed functionality. So I wanted to have
>> something by which the administrator can observe the status of the group.
>>
>> For instance seeing some clients do not support the feature could be signal
>> that things have been misconfigured, or that appeal needs to be made for
>> driver X to start supporting the feature. Seeing a "no" there in other words
>> is a signal that budgeting may not really work as expected and needs to be
>> investigated.
>
> I still don't see how this is per-cgroup given that it's indicating whether
> the driver supports some feature. Also, the eventual goal would be
> supporting the same control mechanisms across most (if not all) GPUs, right?

I have dropped this from v2 as well, to focus the feedback on most
important points.

But in general the problem it wanted to address is that a single cgroup
can contain multiple processes, each with zero to N open DRM file
descriptors to any random GPU which happens to be installed in the
system. And it can all change dynamically. It may be different vendors
or different hardware generations, where some do not support the
required functionality to support the cgroup controller.

So I wanted to give the sysadmin some visibility if at any given time
the configuration applied to a cgroup has a chance to work fully, or
only partially.

For instance with i915 we can have two supported devices in a laptop -
integrated and discrete. Integrated can support the controller well,
while for discrete is work in progress, maybe comes in the next kernel
release. And then we can have this:

1)

cgexec -g drm:gfx/clients glxgears

This runs fully on integrated and budgeting works as expected.

2) DRI_PRIME=1 cgexec -g drm:gfx/clients glxgears

This one runs on the discrete GPU where budgeting does not work yet.
While at the same time there can be another client in the same cgroup
for which budgeting works. Or in other words:

cgexec -g drm:gfx/clients transcode_me_a_video
DRI_PRIME=1 cgexec -g drm:gfx/clients run_a_game

In this case game is not taking part in budgeting, even though it is a
same driver. Maybe it will in a few kernel releases but not yet. Or in
case of super old GPUs maybe support never gets added.

Perhaps I am over complicating things and what it would enough is to log
the capability per device during driver probe:

i915 0000:00:02.0: DRM cgroup support: weights
i915 0000:01:00.0: DRM cgroup support: none

?

>>> Rather than doing it hierarchically on the spot, it's usually a lot cheaper
>>> and easier to calculate the flattened hierarchical weight per leaf cgroup
>>> and divide the bandwidth according to the eventual portions. For an example,
>>> please take a look at block/blk-iocost.c.
>>
>> This seems exactly what I had in mind (but haven't implemented it yet). So
>> in this RFC I have budget splitting per group where each tree level adds up
>> to "100%" and the thing which I have not implemented is "borrowing" or
>> yielding (how blk-iocost.c calls it, or donating) unused budgets to
>> siblings.
>>
>> I am very happy to hear my idea is the right one and someone already
>> implemented it. Thanks for this pointer!
>
> The budget donation thing in iocost is necessary only because it wants to
> make the hot path local to the cgroup because io control has to support very
> high decision rate. For time-slicing GPU, it's likely that following the
> current hierarchical weight on the spot is enough.

I think I completed this part in v2. At least some quick smoke testing
showed me that budgets now correctly propagate through the tree.

Not guaranteeing no bugs just yet and there are certainly still things
to polish up in v2.

Regards,

Tvrtko