2023-10-24 16:12:37

by Tvrtko Ursulin

[permalink] [raw]
Subject: [RFC v6 0/8] DRM scheduling cgroup controller

From: Tvrtko Ursulin <[email protected]>

This series contains a proposal for a DRM cgroup controller which implements a
weight based hierarchical GPU usage budget approach and is similar in concept to
some of the existing controllers like CPU and IO.

Motivation mostly comes from my earlier proposal where I identified that GPU
scheduling lags significantly behind what is available for CPU and IO. In the
modern world of heterogenous computing pipelines I think it would be good to
close this gap.

Originally my proposal was also to tie the DRM scheduling priority with process
nice, also similar to CPU and IO, and to add explicit priority control on top,
but the feedback was that scheduling priority is to abstract so that part of the
proposal was dropped.

This series hope to demonstrate there are gains to be had in real world
usage(*), today, that the concepts the proposal relies are well enough
established and stable, and that wiring up DRM drivers into the controller would
be straightforward.

*) Specifically under ChromeOS which uses cgroups to control CPU bandwith for
VMs based on the window focused status. It can be demonstrated how GPU
scheduling control can easily be integrated into that setup.

*) Another real world example later in the cover letter.

There should be no conflict with this proposal and any efforts to implement
memory usage based controller. Skeleton DRM cgroup controller is deliberatly
purely a skeleton patch where any further functionality can be added with no
real conflicts. [In fact, perhaps scheduling is even easier to deal with than
memory accounting.]

Structure of the series is as follows:

1) Adds a skeleton DRM cgroup controller with no functionality.
2-5) Laying down some infrastructure to enable the controller.
6) The scheduling controller itself.
7) i915 support for the scheduling controller.
8) Expose GPU utilisation from the controller.

The proposals defines a delegation of duties between the tree parties: cgroup
controller, DRM core and individual drivers. Two way communication interfaces
are then defined to enable the delegation to work.

DRM weight based time control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The controller configures the GPU time allowed per group and periodically scans
the belonging tasks to detect the over budget condition, at which point it
invokes a callback notifying the DRM core of the condition.

Because of the heterogenous hardware and driver DRM capabilities, time control
is implemented as a loose co-operative (bi-directional) interface between the
controller and DRM core.

DRM core provides an API to query per process GPU utilization and 2nd API to
receive notification from the cgroup controller when the group enters or exits
the over budget condition.

Individual DRM drivers which implement the interface are expected to act on this
in a best-effort manner. There are no guarantees that the time budget will be
respected.

DRM weight based time control interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

drm.stat
A read-only flat-keyed file.

Contains these fields:

- usage_usec - GPU time used by the group, recursively including all
child groups.

drm.weight
Standard cgroup weight based control [1, 10000] used to configure the
relative distributing of GPU time between the sibling groups.

This builds upon the per client GPU utilisation work which landed recently for a
few drivers. My thinking is that in principle, an intersection of drivers which
support both that and some sort of scheduling control, like priorities, could
support this controller relatively easily.

Another really interesting angle for this controller is that it mimics the same
control menthod used by the CPU scheduler. That is the proportional/weight based
GPU time budgeting. Which makes it easy to configure and does not need a new
mental model.

However, as the introduction mentions, GPUs are much more heterogenous and
therefore the controller uses very "soft" wording as to what it promises. The
general statement is that it can define budgets, notify clients when they are
over them, and let individual drivers implement best effort handling of those
conditions.

Delegation of duties in the implementation goes likes this:

* DRM cgroup controller implements the control files, the scanning loop and
tracks the DRM clients associated with each cgroup. It provides an API DRM
core needs to call to (de)register and migrate clients.
* DRM core defines two call-backs which the core calls directly: First for
querying GPU time by a client and second for notifying the client that it
is over budget. It calls controller API for (de)registering clients and
migrating then between tasks on file descriptor hand over.
* Individual drivers implement the above mentioned callbacks and register
them with the DRM core.

What I have demonstrated in practice is that when wired to i915, even in a
primitive way where the over-budget condition simply lowers the scheduling
priority, the concept can be almost equally effective as the static priority
control. I say almost because the design where budget control depends on the
periodic usage scanning has a fundamental initial reaction delay, so
responsiveness will depend on the scanning period, which may or may not be a
problem for a particular use case.

There are also interesting conversations to be had around mental models for what
is GPU usage as a single number when faced with GPUs which have different
execution engines. To an extent this is similar to the multi-core and cgroup
CPU controller problems, but with GPUs it even goes further than that.

I deliberately did not want to include any such complications in the controller
itself and left the individual drivers to handle it. For instance in the i915
over-budget callback it will not do anything unless client's GPU usage is on a
physical engine which is oversubscribed. This enables multiple clients to be
harmlessly over budget, as long as they are not competing for the same GPU
resource.

Example usage from within a Linux desktop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard Linux distributions like Ubuntu already uses cgroups heavily for
session management and that could easily be extended with the DRM controller.

After logging into the system graphically we can enable the DRM controller
throughout the cgroups hierarchy:

echo +drm > /sys/fs/cgroup/cgroup.subtree_control
echo +drm > /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo +drm > /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.subtree_control

Next we will open two SSH sessions, just so separate cgroups are handily created
by systemd for this experiment.

Roughly simultaneously we run the following two benchmarks in each session
respectively:

1)
./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

2)
vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

(The reason for vsync off here is because I struggled to find an easily runnable
and at the same time demanding enough benchmark, or to run on a screen large
enough to make even some simple ones demanding enough.)

With this test we get 252fps from GpuTest and 96fps from GfxBenchmark.

Premise here is that one of these GPU intensive benchmarks is intended to be ran
by the user with lower priority. Imagine kicking off some background compute
processing and continuing to use the UI for other tasks. Hence the user will now
re-run the test by first lowering the weight control of the first session (DRM
cgroup):

1)
echo 50 | sudo tee /sys/fs/cgroup/`cut -d':' -f3 /proc/self/cgroup`/drm.weight
./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

2)
vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

In this case we will see that GpuTest has recorded 208fps (~18% down) and
GfxBenchmark 114fps (18% up), demonstrating that even a very simple approach of
wiring up i915 to the DRM cgroup controller can enable somewhat effective
external GPU scheduling control.

* Note here that default weight is 100, so setting 50 for the background session
is asking the controller to give it half as much GPU bandwidth.

We can also show an example of GPU utilisation querying on a standard Linux
install:

localhost:/sys/fs/cgroup# grep -H . ./drm.stat ./user.slice/drm.stat ./system.slice/drm.stat
./drm.stat:usage_usec 59019083
./user.slice/drm.stat:usage_usec 48249318
./system.slice/drm.stat:usage_usec 10769765

Here it is visible that the GPU utilisation can be correctly queried per cgroup.

v2:
* Prefaced the series with some core DRM work as suggested by Christian.
* Dropped the priority based controller for now.
* Dropped the introspection cgroup controller file.
* Implemented unused budget sharing/propagation.
* Some small fixes/tweak as per review feedback and in general.

v3:
* Dropped one upstreamed patch.
* Logging cleanup (use DRM macros where available).

v4:
* Dropped the struct pid tracking indirection in favour of tracking individual
DRM clients directly in the controller. (Michal Koutný)
* Added boot time param for configuring the scanning period. (Tejun Heo)
* Improved spreading of unused budget to over budget clients, regardless of
their location in the tree so that all unused budget can be utilized.
* Made scanning more robust by not re-starting it on every client de-
registration and removal. Instead new baseline GPU activity data is simply
collected on those events and next scan invocation can proceed as scheduled.
* Dropped the debugging aids from the series.

v5:
* Exposed GPU utilisation.
* Added memory stats.

v6:
* Dropped memory stats to avoid diluting focus.
* Fixed CONFIG_CGROUP_DRM=n build.
* Dropped one resolved question from the code.
* Emit logging if scanning period modparam is set too low.
* Survive tree modifications while recording the baseline.
* Retired the "soft limits" wording. (Tejun)
* Tweaked the stats file to mimic the CPU controller. (Tejun)

TODOs/Opens:

* For now (RFC) I haven't implemented the 2nd suggestion from Tejun of having
a shadow tree which would only contain groups with DRM clients. (Purpose
being less nodes to traverse in the scanning loop.)

* Allowing per DRM card configuration and queries is deliberatly left out but
it is compatible in principle with the current proposal.

Where today we have, for drm.weight:

100

We can later extend with:

100
card0 80
card1 20

Similarly for drm.stat:

usage_usec 1000
card0.usage_usec 500
card1.usage_usec 500

Tvrtko Ursulin (8):
cgroup: Add the DRM cgroup controller
drm/cgroup: Track DRM clients per cgroup
drm/cgroup: Add ability to query drm cgroup GPU time
drm/cgroup: Add over budget signalling callback
drm/cgroup: Only track clients which are providing drm_cgroup_ops
cgroup/drm: Introduce weight based drm cgroup control
drm/i915: Implement cgroup controller over budget throttling
cgroup/drm: Expose GPU utilisation

Documentation/admin-guide/cgroup-v2.rst | 39 ++
drivers/gpu/drm/drm_file.c | 6 +
.../gpu/drm/i915/gem/i915_gem_execbuffer.c | 38 +-
drivers/gpu/drm/i915/i915_driver.c | 11 +
drivers/gpu/drm/i915/i915_drm_client.c | 203 +++++-
drivers/gpu/drm/i915/i915_drm_client.h | 11 +
include/drm/drm_drv.h | 36 ++
include/drm/drm_file.h | 6 +
include/linux/cgroup_drm.h | 29 +
include/linux/cgroup_subsys.h | 4 +
init/Kconfig | 7 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/drm.c | 608 ++++++++++++++++++
13 files changed, 989 insertions(+), 10 deletions(-)
create mode 100644 include/linux/cgroup_drm.h
create mode 100644 kernel/cgroup/drm.c

--
2.39.2


2023-10-24 16:12:38

by Tvrtko Ursulin

[permalink] [raw]
Subject: [RFC 4/8] drm/cgroup: Add over budget signalling callback

From: Tvrtko Ursulin <[email protected]>

Add a new callback via which the drm cgroup controller is notifying the
drm core that a certain process is above its allotted GPU time.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
include/drm/drm_drv.h | 8 ++++++++
kernel/cgroup/drm.c | 16 ++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index d1cee5899cde..c518f03b9f0f 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -173,6 +173,14 @@ struct drm_cgroup_ops {
* Used by the DRM core when queried by the DRM cgroup controller.
*/
u64 (*active_time_us) (struct drm_file *);
+
+ /**
+ * @signal_budget:
+ *
+ * Optional callback used by the DRM core to forward over/under GPU time
+ * messages sent by the DRM cgroup controller.
+ */
+ int (*signal_budget) (struct drm_file *, u64 used, u64 budget);
};

/**
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index acdb76635b60..68f31797c4f0 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -51,6 +51,22 @@ static u64 drmcs_get_active_time_us(struct drm_cgroup_state *drmcs)
return total;
}

+static void
+drmcs_signal_budget(struct drm_cgroup_state *drmcs, u64 usage, u64 budget)
+{
+ struct drm_file *fpriv;
+
+ lockdep_assert_held(&drmcg_mutex);
+
+ list_for_each_entry(fpriv, &drmcs->clients, clink) {
+ const struct drm_cgroup_ops *cg_ops =
+ fpriv->minor->dev->driver->cg_ops;
+
+ if (cg_ops && cg_ops->signal_budget)
+ cg_ops->signal_budget(fpriv, usage, budget);
+ }
+}
+
static void drmcs_free(struct cgroup_subsys_state *css)
{
struct drm_cgroup_state *drmcs = css_to_drmcs(css);
--
2.39.2

2023-10-24 16:12:47

by Tvrtko Ursulin

[permalink] [raw]
Subject: [RFC 6/8] cgroup/drm: Introduce weight based drm cgroup control

From: Tvrtko Ursulin <[email protected]>

Similar to CPU scheduling, implement a concept of weight in the drm cgroup
controller.

Uses the same range and default as the CPU controller - CGROUP_WEIGHT_MIN,
CGROUP_WEIGHT_DFL and CGROUP_WEIGHT_MAX.

Later each cgroup is assigned a time budget proportionaly based on the
relative weights of it's siblings. This time budget is in turn split by
the group's children and so on.

This will be used to implement a soft, or best effort signal from drm
cgroup to drm core notifying about groups which are over their allotted
budget.

No guarantees that the limit can be enforced are provided or implied.

Checking of GPU usage is done periodically by the controller which can be
configured via drmcg_period_ms kernel boot parameter and which defaults
to 2s.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Tejun Heo <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 31 ++
kernel/cgroup/drm.c | 422 +++++++++++++++++++++++-
2 files changed, 450 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index b26b5274eaaf..841533527b7b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2418,6 +2418,37 @@ HugeTLB Interface Files
hugetlb pages of <hugepagesize> in this cgroup. Only active in
use hugetlb pages are included. The per-node values are in bytes.

+DRM
+---
+
+The DRM controller allows configuring weight based time control.
+
+DRM weight based time control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The controller configures the GPU time allowed per group and periodically scans
+the belonging tasks to detect the over budget condition, at which point it
+invokes a callback notifying the DRM core of the condition.
+
+Because of the heterogenous hardware and driver DRM capabilities, time control
+is implemented as a loose co-operative (bi-directional) interface between the
+controller and DRM core.
+
+DRM core provides an API to query per process GPU utilization and 2nd API to
+receive notification from the cgroup controller when the group enters or exits
+the over budget condition.
+
+Individual DRM drivers which implement the interface are expected to act on this
+in a best-effort manner. There are no guarantees that the time budget will be
+respected.
+
+DRM weight based time control interface files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ drm.weight
+ Standard cgroup weight based control [1, 10000] used to configure the
+ relative distributing of GPU time between the sibling groups.
+
Misc
----

diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 60e1f3861576..1d1570bf3e90 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -6,7 +6,9 @@
#include <linux/cgroup.h>
#include <linux/cgroup_drm.h>
#include <linux/list.h>
+#include <linux/moduleparam.h>
#include <linux/mutex.h>
+#include <linux/signal.h>
#include <linux/slab.h>

#include <drm/drm_drv.h>
@@ -15,10 +17,28 @@ struct drm_cgroup_state {
struct cgroup_subsys_state css;

struct list_head clients;
+
+ unsigned int weight;
+
+ unsigned int sum_children_weights;
+
+ bool over;
+ bool over_budget;
+
+ u64 per_s_budget_us;
+ u64 prev_active_us;
+ u64 active_us;
};

struct drm_root_cgroup_state {
struct drm_cgroup_state drmcs;
+
+ unsigned int period_us;
+
+ unsigned int last_scan_duration_us;
+ ktime_t prev_timestamp;
+
+ struct delayed_work scan_work;
};

static struct drm_root_cgroup_state root_drmcs = {
@@ -27,6 +47,9 @@ static struct drm_root_cgroup_state root_drmcs = {

static DEFINE_MUTEX(drmcg_mutex);

+static int drmcg_period_ms = 2000;
+module_param(drmcg_period_ms, int, 0644);
+
static inline struct drm_cgroup_state *
css_to_drmcs(struct cgroup_subsys_state *css)
{
@@ -67,12 +90,272 @@ drmcs_signal_budget(struct drm_cgroup_state *drmcs, u64 usage, u64 budget)
}
}

+static u64
+drmcs_read_weight(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+ return drmcs->weight;
+}
+
+static int
+drmcs_write_weight(struct cgroup_subsys_state *css, struct cftype *cftype,
+ u64 weight)
+{
+ struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+ int ret;
+
+ if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ return -ERANGE;
+
+ ret = mutex_lock_interruptible(&drmcg_mutex);
+ if (ret)
+ return ret;
+ drmcs->weight = weight;
+ mutex_unlock(&drmcg_mutex);
+
+ return 0;
+}
+
+static bool __start_scanning(unsigned int period_us)
+{
+ struct drm_cgroup_state *root = &root_drmcs.drmcs;
+ struct cgroup_subsys_state *node;
+ ktime_t start, now;
+ bool ok = false;
+
+ lockdep_assert_held(&drmcg_mutex);
+
+ start = ktime_get();
+ if (period_us > root_drmcs.last_scan_duration_us)
+ period_us -= root_drmcs.last_scan_duration_us;
+
+ rcu_read_lock();
+
+ css_for_each_descendant_post(node, &root->css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+
+ if (!css_tryget_online(node))
+ goto out;
+
+ drmcs->active_us = 0;
+ drmcs->sum_children_weights = 0;
+
+ if (period_us && node == &root->css)
+ drmcs->per_s_budget_us =
+ DIV_ROUND_UP_ULL((u64)period_us * USEC_PER_SEC,
+ USEC_PER_SEC);
+ else
+ drmcs->per_s_budget_us = 0;
+
+ css_put(node);
+ }
+
+ css_for_each_descendant_post(node, &root->css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+ struct drm_cgroup_state *parent;
+ u64 active;
+
+ if (!css_tryget_online(node))
+ goto out;
+ if (!node->parent) {
+ css_put(node);
+ continue;
+ }
+ if (!css_tryget_online(node->parent)) {
+ css_put(node);
+ goto out;
+ }
+ parent = css_to_drmcs(node->parent);
+
+ active = drmcs_get_active_time_us(drmcs);
+ if (period_us && active > drmcs->prev_active_us)
+ drmcs->active_us += active - drmcs->prev_active_us;
+ drmcs->prev_active_us = active;
+
+ parent->active_us += drmcs->active_us;
+ parent->sum_children_weights += drmcs->weight;
+
+ css_put(node);
+ css_put(&parent->css);
+ }
+
+ ok = true;
+ now = ktime_get();
+ root_drmcs.last_scan_duration_us = ktime_to_us(ktime_sub(now, start));
+ root_drmcs.prev_timestamp = now;
+
+out:
+ rcu_read_unlock();
+
+ return ok;
+}
+
+static void scan_worker(struct work_struct *work)
+{
+ struct drm_cgroup_state *root = &root_drmcs.drmcs;
+ struct cgroup_subsys_state *node;
+ unsigned int period_us;
+
+ mutex_lock(&drmcg_mutex);
+
+ rcu_read_lock();
+
+ if (WARN_ON_ONCE(!css_tryget_online(&root->css))) {
+ rcu_read_unlock();
+ mutex_unlock(&drmcg_mutex);
+ return;
+ }
+
+ period_us = ktime_to_us(ktime_sub(ktime_get(),
+ root_drmcs.prev_timestamp));
+
+ /*
+ * 1st pass - reset working values and update hierarchical weights and
+ * GPU utilisation.
+ */
+ if (!__start_scanning(period_us))
+ goto out_retry; /*
+ * Always come back later if scanner races with
+ * core cgroup management. (Repeated pattern.)
+ */
+
+ css_for_each_descendant_pre(node, &root->css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+ struct cgroup_subsys_state *css;
+ u64 reused_us = 0, unused_us = 0;
+ unsigned int over_weights = 0;
+
+ if (!css_tryget_online(node))
+ goto out_retry;
+
+ /*
+ * 2nd pass - calculate initial budgets, mark over budget
+ * siblings and add up unused budget for the group.
+ */
+ css_for_each_child(css, &drmcs->css) {
+ struct drm_cgroup_state *sibling = css_to_drmcs(css);
+
+ if (!css_tryget_online(css)) {
+ css_put(node);
+ goto out_retry;
+ }
+
+ sibling->per_s_budget_us =
+ DIV_ROUND_UP_ULL(drmcs->per_s_budget_us *
+ sibling->weight,
+ drmcs->sum_children_weights);
+
+ sibling->over = sibling->active_us >
+ sibling->per_s_budget_us;
+ if (sibling->over)
+ over_weights += sibling->weight;
+ else
+ unused_us += sibling->per_s_budget_us -
+ sibling->active_us;
+
+ css_put(css);
+ }
+
+ /*
+ * 3rd pass - spread unused budget according to relative weights
+ * of over budget siblings.
+ */
+ while (over_weights && reused_us < unused_us) {
+ unsigned int under = 0;
+
+ unused_us -= reused_us;
+ reused_us = 0;
+
+ css_for_each_child(css, &drmcs->css) {
+ struct drm_cgroup_state *sibling;
+ u64 extra_us, max_us, need_us;
+
+ if (!css_tryget_online(css)) {
+ css_put(node);
+ goto out_retry;
+ }
+
+ sibling = css_to_drmcs(css);
+ if (!sibling->over) {
+ css_put(css);
+ continue;
+ }
+
+ extra_us = DIV_ROUND_UP_ULL(unused_us *
+ sibling->weight,
+ over_weights);
+ max_us = sibling->per_s_budget_us + extra_us;
+ if (max_us > sibling->active_us)
+ need_us = sibling->active_us -
+ sibling->per_s_budget_us;
+ else
+ need_us = extra_us;
+ reused_us += need_us;
+ sibling->per_s_budget_us += need_us;
+ sibling->over = sibling->active_us >
+ sibling->per_s_budget_us;
+ if (!sibling->over)
+ under += sibling->weight;
+
+ css_put(css);
+ }
+
+ over_weights -= under;
+ }
+
+ css_put(node);
+ }
+
+ /*
+ * 4th pass - send out over/under budget notifications.
+ */
+ css_for_each_descendant_post(node, &root->css) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+
+ if (!css_tryget_online(node))
+ goto out_retry;
+
+ if (drmcs->over || drmcs->over_budget)
+ drmcs_signal_budget(drmcs,
+ drmcs->active_us,
+ drmcs->per_s_budget_us);
+ drmcs->over_budget = drmcs->over;
+
+ css_put(node);
+ }
+
+out_retry:
+ rcu_read_unlock();
+ mutex_unlock(&drmcg_mutex);
+
+ period_us = READ_ONCE(root_drmcs.period_us);
+ if (period_us)
+ schedule_delayed_work(&root_drmcs.scan_work,
+ usecs_to_jiffies(period_us));
+
+ css_put(&root->css);
+}
+
static void drmcs_free(struct cgroup_subsys_state *css)
{
- struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+ if (css != &root_drmcs.drmcs.css)
+ kfree(css_to_drmcs(css));
+}

- if (drmcs != &root_drmcs.drmcs)
- kfree(drmcs);
+static void record_baseline_utilisation(void)
+{
+ /*
+ * Re-capture baseline group GPU times to avoid downward jumps.
+ *
+ * __start_scanning can fail if hierarchy members transition their
+ * online status while it is traversing the tree, so retry with a little
+ * bit of back-off to be nice, although it is not really needed but
+ * callers are also not latency sensitive, especially since retrying is
+ * very unlikely during stable system operation.
+ */
+ while (!__start_scanning(0))
+ synchronize_rcu();
}

static struct cgroup_subsys_state *
@@ -82,6 +365,7 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)

if (!parent_css) {
drmcs = &root_drmcs.drmcs;
+ INIT_DELAYED_WORK(&root_drmcs.scan_work, scan_worker);
} else {
drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
if (!drmcs)
@@ -90,9 +374,128 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
INIT_LIST_HEAD(&drmcs->clients);
}

+ drmcs->weight = CGROUP_WEIGHT_DFL;
+
return &drmcs->css;
}

+static int drmcs_online(struct cgroup_subsys_state *css)
+{
+ if (css == &root_drmcs.drmcs.css && drmcg_period_ms) {
+ const int min_period_ms = 500;
+ int period_ms;
+
+ mutex_lock(&drmcg_mutex);
+ record_baseline_utilisation();
+ if (drmcg_period_ms < min_period_ms) {
+ period_ms = min_period_ms;
+ pr_notice("Capping DRM control group scanning to %ums\n",
+ period_ms);
+ } else {
+ period_ms = drmcg_period_ms;
+ }
+ root_drmcs.period_us = period_ms * 1000;
+ mod_delayed_work(system_wq,
+ &root_drmcs.scan_work,
+ usecs_to_jiffies(root_drmcs.period_us));
+ mutex_unlock(&drmcg_mutex);
+ }
+
+ return 0;
+}
+
+static void drmcs_offline(struct cgroup_subsys_state *css)
+{
+ bool flush = false;
+
+ if (css != &root_drmcs.drmcs.css)
+ return;
+
+ mutex_lock(&drmcg_mutex);
+ if (root_drmcs.period_us) {
+ root_drmcs.period_us = 0;
+ cancel_delayed_work(&root_drmcs.scan_work);
+ flush = true;
+ }
+ mutex_unlock(&drmcg_mutex);
+
+ if (flush)
+ flush_delayed_work(&root_drmcs.scan_work);
+}
+
+static struct drm_cgroup_state *old_drmcs;
+
+static int drmcs_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *task;
+
+ task = cgroup_taskset_first(tset, &css);
+ old_drmcs = css_to_drmcs(task_css(task, drm_cgrp_id));
+
+ return 0;
+}
+
+static void drmcs_attach(struct cgroup_taskset *tset)
+{
+ struct drm_cgroup_state *old = old_drmcs;
+ struct cgroup_subsys_state *css;
+ struct drm_file *fpriv, *next;
+ struct drm_cgroup_state *new;
+ struct task_struct *task;
+ bool migrated = false;
+
+ if (!old)
+ return;
+
+ task = cgroup_taskset_first(tset, &css);
+ new = css_to_drmcs(task_css(task, drm_cgrp_id));
+ if (new == old)
+ return;
+
+ mutex_lock(&drmcg_mutex);
+
+ list_for_each_entry_safe(fpriv, next, &old->clients, clink) {
+ cgroup_taskset_for_each(task, css, tset) {
+ struct cgroup_subsys_state *old_css;
+
+ if (task->flags & PF_KTHREAD)
+ continue;
+ if (!thread_group_leader(task))
+ continue;
+
+ new = css_to_drmcs(task_css(task, drm_cgrp_id));
+ if (WARN_ON_ONCE(new == old))
+ continue;
+
+ if (rcu_access_pointer(fpriv->pid) != task_tgid(task))
+ continue;
+
+ if (WARN_ON_ONCE(fpriv->__css != &old->css))
+ continue;
+
+ old_css = fpriv->__css;
+ fpriv->__css = &new->css;
+ css_get(fpriv->__css);
+ list_move_tail(&fpriv->clink, &new->clients);
+ css_put(old_css);
+ migrated = true;
+ }
+ }
+
+ if (migrated)
+ record_baseline_utilisation();
+
+ mutex_unlock(&drmcg_mutex);
+
+ old_drmcs = NULL;
+}
+
+static void drmcs_cancel_attach(struct cgroup_taskset *tset)
+{
+ old_drmcs = NULL;
+}
+
void drmcgroup_client_open(struct drm_file *file_priv)
{
struct drm_cgroup_state *drmcs;
@@ -121,6 +524,7 @@ void drmcgroup_client_close(struct drm_file *file_priv)
mutex_lock(&drmcg_mutex);
list_del(&file_priv->clink);
file_priv->__css = NULL;
+ record_baseline_utilisation();
mutex_unlock(&drmcg_mutex);

css_put(&drmcs->css);
@@ -144,6 +548,7 @@ void drmcgroup_client_migrate(struct drm_file *file_priv)
if (src != dst) {
file_priv->__css = &dst->css; /* Keeps the reference. */
list_move_tail(&file_priv->clink, &dst->clients);
+ record_baseline_utilisation();
}

mutex_unlock(&drmcg_mutex);
@@ -153,12 +558,23 @@ void drmcgroup_client_migrate(struct drm_file *file_priv)
EXPORT_SYMBOL_GPL(drmcgroup_client_migrate);

struct cftype files[] = {
+ {
+ .name = "weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = drmcs_read_weight,
+ .write_u64 = drmcs_write_weight,
+ },
{ } /* Zero entry terminates. */
};

struct cgroup_subsys drm_cgrp_subsys = {
.css_alloc = drmcs_alloc,
.css_free = drmcs_free,
+ .css_online = drmcs_online,
+ .css_offline = drmcs_offline,
+ .can_attach = drmcs_can_attach,
+ .attach = drmcs_attach,
+ .cancel_attach = drmcs_cancel_attach,
.early_init = false,
.legacy_cftypes = files,
.dfl_cftypes = files,
--
2.39.2

2023-10-24 16:12:57

by Tvrtko Ursulin

[permalink] [raw]
Subject: [RFC 8/8] cgroup/drm: Expose GPU utilisation

From: Tvrtko Ursulin <[email protected]>

To support container use cases where external orchestrators want to make
deployment and migration decisions based on GPU load and capacity, we can
expose the GPU load as seen by the controller in a new drm.active_us
field. This field contains a monotonic cumulative time cgroup has spent
executing GPU loads, as reported by the DRM drivers being used by group
members.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Eero Tamminen <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 8 +++++++
kernel/cgroup/drm.c | 29 ++++++++++++++++++++++++-
2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 841533527b7b..9ac8ab65161c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2445,6 +2445,14 @@ respected.
DRM weight based time control interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+ drm.stat
+ A read-only flat-keyed file.
+
+ Contains these fields:
+
+ - usage_usec - GPU time used by the group, recursively including all
+ child groups.
+
drm.weight
Standard cgroup weight based control [1, 10000] used to configure the
relative distributing of GPU time between the sibling groups.
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 1d1570bf3e90..127730990301 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -25,6 +25,8 @@ struct drm_cgroup_state {
bool over;
bool over_budget;

+ u64 total_us;
+
u64 per_s_budget_us;
u64 prev_active_us;
u64 active_us;
@@ -117,6 +119,24 @@ drmcs_write_weight(struct cgroup_subsys_state *css, struct cftype *cftype,
return 0;
}

+static int drmcs_show_stat(struct seq_file *sf, void *v)
+{
+ struct drm_cgroup_state *drmcs = css_to_drmcs(seq_css(sf));
+ u64 val;
+
+#ifndef CONFIG_64BIT
+ mutex_lock(&drmcg_mutex);
+#endif
+ val = drmcs->total_us;
+#ifndef CONFIG_64BIT
+ mutex_unlock(&drmcg_mutex);
+#endif
+
+ seq_printf(sf, "usage_usec %llu\n", val);
+
+ return 0;
+}
+
static bool __start_scanning(unsigned int period_us)
{
struct drm_cgroup_state *root = &root_drmcs.drmcs;
@@ -169,11 +189,14 @@ static bool __start_scanning(unsigned int period_us)
parent = css_to_drmcs(node->parent);

active = drmcs_get_active_time_us(drmcs);
- if (period_us && active > drmcs->prev_active_us)
+ if (period_us && active > drmcs->prev_active_us) {
drmcs->active_us += active - drmcs->prev_active_us;
+ drmcs->total_us += drmcs->active_us;
+ }
drmcs->prev_active_us = active;

parent->active_us += drmcs->active_us;
+ parent->total_us += drmcs->active_us;
parent->sum_children_weights += drmcs->weight;

css_put(node);
@@ -564,6 +587,10 @@ struct cftype files[] = {
.read_u64 = drmcs_read_weight,
.write_u64 = drmcs_write_weight,
},
+ {
+ .name = "stat",
+ .seq_show = drmcs_show_stat,
+ },
{ } /* Zero entry terminates. */
};

--
2.39.2

2023-10-24 16:13:09

by Tvrtko Ursulin

[permalink] [raw]
Subject: [RFC 7/8] drm/i915: Implement cgroup controller over budget throttling

From: Tvrtko Ursulin <[email protected]>

When notified by the drm core we are over our allotted time budget, i915
instance will check if any of the GPU engines it is reponsible for is
fully saturated. If it is, and the client in question is using that
engine, it will throttle it.

For now throttling is done simplistically by lowering the scheduling
priority while clients are throttled.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
.../gpu/drm/i915/gem/i915_gem_execbuffer.c | 38 +++-
drivers/gpu/drm/i915/i915_driver.c | 11 +
drivers/gpu/drm/i915/i915_drm_client.c | 203 +++++++++++++++++-
drivers/gpu/drm/i915/i915_drm_client.h | 11 +
4 files changed, 253 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 683fd8d3151c..f87935a030a1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3086,6 +3086,42 @@ static void retire_requests(struct intel_timeline *tl, struct i915_request *end)
break;
}

+#ifdef CONFIG_CGROUP_DRM
+static unsigned int
+__get_class(struct drm_i915_file_private *fpriv, const struct i915_request *rq)
+{
+ unsigned int class;
+
+ class = rq->context->engine->uabi_class;
+
+ if (WARN_ON_ONCE(class >= ARRAY_SIZE(fpriv->client->throttle)))
+ class = 0;
+
+ return class;
+}
+
+static void copy_priority(struct i915_sched_attr *attr,
+ const struct i915_execbuffer *eb,
+ const struct i915_request *rq)
+{
+ struct drm_i915_file_private *file_priv = eb->file->driver_priv;
+ int prio;
+
+ *attr = eb->gem_context->sched;
+
+ prio = file_priv->client->throttle[__get_class(file_priv, rq)];
+ if (prio)
+ attr->priority = prio;
+}
+#else
+static void copy_priority(struct i915_sched_attr *attr,
+ const struct i915_execbuffer *eb,
+ const struct i915_request *rq)
+{
+ *attr = eb->gem_context->sched;
+}
+#endif
+
static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq,
int err, bool last_parallel)
{
@@ -3102,7 +3138,7 @@ static int eb_request_add(struct i915_execbuffer *eb, struct i915_request *rq,

/* Check that the context wasn't destroyed before submission */
if (likely(!intel_context_is_closed(eb->context))) {
- attr = eb->gem_context->sched;
+ copy_priority(&attr, eb, rq);
} else {
/* Serialise with context_close via the add_to_timeline */
i915_request_set_error_once(rq, -ENOENT);
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index 8a0e2c745e1f..450bbcfc16af 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1794,6 +1794,13 @@ static const struct drm_ioctl_desc i915_ioctls[] = {
DRM_IOCTL_DEF_DRV(I915_GEM_VM_DESTROY, i915_gem_vm_destroy_ioctl, DRM_RENDER_ALLOW),
};

+#ifdef CONFIG_CGROUP_DRM
+static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
+ .active_time_us = i915_drm_cgroup_get_active_time_us,
+ .signal_budget = i915_drm_cgroup_signal_budget,
+};
+#endif
+
/*
* Interface history:
*
@@ -1823,6 +1830,10 @@ static const struct drm_driver i915_drm_driver = {
.postclose = i915_driver_postclose,
.show_fdinfo = PTR_IF(IS_ENABLED(CONFIG_PROC_FS), i915_drm_client_fdinfo),

+#ifdef CONFIG_CGROUP_DRM
+ .cg_ops = &i915_drm_cgroup_ops,
+#endif
+
.gem_prime_import = i915_gem_prime_import,

.dumb_create = i915_gem_dumb_create,
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 2a44b3876cb5..403baf8c86ad 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -4,6 +4,7 @@
*/

#include <linux/kernel.h>
+#include <linux/ktime.h>
#include <linux/slab.h>
#include <linux/types.h>

@@ -40,7 +41,7 @@ void __i915_drm_client_free(struct kref *kref)
kfree(client);
}

-#ifdef CONFIG_PROC_FS
+#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
static const char * const uabi_class_names[] = {
[I915_ENGINE_CLASS_RENDER] = "render",
[I915_ENGINE_CLASS_COPY] = "copy",
@@ -65,20 +66,204 @@ static u64 busy_add(struct i915_gem_context *ctx, unsigned int class)
return total;
}

+static u64 get_class_active_ns(struct i915_drm_client *client,
+ struct drm_i915_private *i915,
+ unsigned int class,
+ unsigned int *capacity)
+{
+ struct i915_gem_context *ctx;
+ u64 total;
+
+ *capacity = i915->engine_uabi_class_count[class];
+ if (!*capacity)
+ return 0;
+
+ total = atomic64_read(&client->past_runtime[class]);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ctx, &client->ctx_list, client_link)
+ total += busy_add(ctx, class);
+ rcu_read_unlock();
+
+ return total;
+}
+
+static bool supports_stats(struct drm_i915_private *i915)
+{
+ return GRAPHICS_VER(i915) >= 8;
+}
+#endif
+
+#if defined(CONFIG_CGROUP_DRM)
+u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file)
+{
+ struct drm_i915_file_private *fpriv = file->driver_priv;
+ struct i915_drm_client *client = fpriv->client;
+ struct drm_i915_private *i915 = fpriv->i915;
+ unsigned int i;
+ u64 busy = 0;
+
+ if (!supports_stats(i915))
+ return 0;
+
+ for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++) {
+ unsigned int capacity;
+ u64 b;
+
+ b = get_class_active_ns(client, i915, i, &capacity);
+ if (capacity) {
+ b = DIV_ROUND_UP_ULL(b, capacity * 1000);
+ busy += b;
+ }
+ }
+
+ return busy;
+}
+
+int i915_drm_cgroup_signal_budget(struct drm_file *file, u64 usage, u64 budget)
+{
+ struct drm_i915_file_private *fpriv = file->driver_priv;
+ u64 class_usage[I915_LAST_UABI_ENGINE_CLASS + 1];
+ u64 class_last[I915_LAST_UABI_ENGINE_CLASS + 1];
+ struct i915_drm_client *client = fpriv->client;
+ struct drm_i915_private *i915 = fpriv->i915;
+ struct intel_engine_cs *engine;
+ bool over = usage > budget;
+ struct task_struct *task;
+ struct pid *pid;
+ unsigned int i;
+ ktime_t unused;
+ int ret = 0;
+ u64 t;
+
+ if (!supports_stats(i915))
+ return -EINVAL;
+
+ if (usage == 0 && budget == 0)
+ return 0;
+
+ rcu_read_lock();
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ if (over) {
+ client->over_budget++;
+ if (!client->over_budget)
+ client->over_budget = 2;
+
+ drm_dbg(&i915->drm, "%s[%u] over budget (%llu/%llu)\n",
+ task ? task->comm : "<unknown>", pid_vnr(pid),
+ usage, budget);
+ } else {
+ client->over_budget = 0;
+ memset(client->class_last, 0, sizeof(client->class_last));
+ memset(client->throttle, 0, sizeof(client->throttle));
+
+ drm_dbg(&i915->drm, "%s[%u] un-throttled; under budget\n",
+ task ? task->comm : "<unknown>", pid_vnr(pid));
+
+ rcu_read_unlock();
+ return 0;
+ }
+ rcu_read_unlock();
+
+ memset(class_usage, 0, sizeof(class_usage));
+ for_each_uabi_engine(engine, i915)
+ class_usage[engine->uabi_class] +=
+ ktime_to_ns(intel_engine_get_busy_time(engine, &unused));
+
+ memcpy(class_last, client->class_last, sizeof(class_last));
+ memcpy(client->class_last, class_usage, sizeof(class_last));
+
+ for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++)
+ class_usage[i] -= class_last[i];
+
+ t = client->last;
+ client->last = ktime_get_raw_ns();
+ t = client->last - t;
+
+ if (client->over_budget == 1)
+ return 0;
+
+ for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++) {
+ u64 client_class_usage[I915_LAST_UABI_ENGINE_CLASS + 1];
+ unsigned int capacity, rel_usage;
+
+ if (!i915->engine_uabi_class_count[i])
+ continue;
+
+ t = DIV_ROUND_UP_ULL(t, 1000);
+ class_usage[i] = DIV_ROUND_CLOSEST_ULL(class_usage[i], 1000);
+ rel_usage = DIV_ROUND_CLOSEST_ULL(class_usage[i] * 100ULL,
+ t *
+ i915->engine_uabi_class_count[i]);
+ if (rel_usage < 95) {
+ /* Physical class not oversubsribed. */
+ if (client->throttle[i]) {
+ client->throttle[i] = 0;
+
+ rcu_read_lock();
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ drm_dbg(&i915->drm,
+ "%s[%u] un-throttled; physical class %s utilisation %u%%\n",
+ task ? task->comm : "<unknown>",
+ pid_vnr(pid),
+ uabi_class_names[i],
+ rel_usage);
+ rcu_read_unlock();
+ }
+ continue;
+ }
+
+ client_class_usage[i] =
+ get_class_active_ns(client, i915, i, &capacity);
+ if (client_class_usage[i]) {
+ int permille;
+
+ ret |= 1;
+
+ permille = DIV_ROUND_CLOSEST_ULL((usage - budget) *
+ 1000,
+ budget);
+ client->throttle[i] =
+ DIV_ROUND_CLOSEST(permille *
+ I915_CONTEXT_MIN_USER_PRIORITY,
+ 1000);
+ if (client->throttle[i] <
+ I915_CONTEXT_MIN_USER_PRIORITY)
+ client->throttle[i] =
+ I915_CONTEXT_MIN_USER_PRIORITY;
+
+ rcu_read_lock();
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ drm_dbg(&i915->drm,
+ "%s[%u] %d‰ over budget, throttled to priority %d; physical class %s utilisation %u%%\n",
+ task ? task->comm : "<unknown>",
+ pid_vnr(pid),
+ permille,
+ client->throttle[i],
+ uabi_class_names[i],
+ rel_usage);
+ rcu_read_unlock();
+ }
+ }
+
+ return ret;
+}
+#endif
+
+#ifdef CONFIG_PROC_FS
static void
show_client_class(struct drm_printer *p,
struct drm_i915_private *i915,
struct i915_drm_client *client,
unsigned int class)
{
- const unsigned int capacity = i915->engine_uabi_class_count[class];
- u64 total = atomic64_read(&client->past_runtime[class]);
- struct i915_gem_context *ctx;
+ unsigned int capacity;
+ u64 total;

- rcu_read_lock();
- list_for_each_entry_rcu(ctx, &client->ctx_list, client_link)
- total += busy_add(ctx, class);
- rcu_read_unlock();
+ total = get_class_active_ns(client, i915, class, &capacity);

if (capacity)
drm_printf(p, "drm-engine-%s:\t%llu ns\n",
@@ -102,7 +287,7 @@ void i915_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
* ******************************************************************
*/

- if (GRAPHICS_VER(i915) < 8)
+ if (!supports_stats(i915))
return;

for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++)
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 67816c912bca..396dbb0780cc 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -29,6 +29,13 @@ struct i915_drm_client {
* @past_runtime: Accumulation of pphwsp runtimes from closed contexts.
*/
atomic64_t past_runtime[I915_LAST_UABI_ENGINE_CLASS + 1];
+
+#ifdef CONFIG_CGROUP_DRM
+ int throttle[I915_LAST_UABI_ENGINE_CLASS + 1];
+ unsigned int over_budget;
+ u64 last;
+ u64 class_last[I915_LAST_UABI_ENGINE_CLASS + 1];
+#endif
};

static inline struct i915_drm_client *
@@ -49,4 +56,8 @@ struct i915_drm_client *i915_drm_client_alloc(void);

void i915_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file);

+u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file);
+int i915_drm_cgroup_signal_budget(struct drm_file *file,
+ u64 usage, u64 budget);
+
#endif /* !__I915_DRM_CLIENT_H__ */
--
2.39.2

2024-02-07 15:28:26

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC 6/8] cgroup/drm: Introduce weight based drm cgroup control

Hello.

(I hope I'm replying to the latest iteration and it has some validitiy
still. Sorry for my late reply. Few points caught my attention.)

On Tue, Oct 24, 2023 at 05:07:25PM +0100, Tvrtko Ursulin <[email protected]> wrote:
> @@ -15,10 +17,28 @@ struct drm_cgroup_state {
> struct cgroup_subsys_state css;
>
> struct list_head clients;
> +
> + unsigned int weight;
> +
> + unsigned int sum_children_weights;
> +
> + bool over;
> + bool over_budget;
> +
> + u64 per_s_budget_us;

Nit: sounds reversed (cf USEC_PER_SEC).

> +static int drmcg_period_ms = 2000;
> +module_param(drmcg_period_ms, int, 0644);
> +

cgroup subsystems as loadable modules were abandoded long time ago.
I'm not sure if this works as expected then.
The common way is __seutp(), see for instance __setup() in mm/memcontrol.c

> +static bool __start_scanning(unsigned int period_us)
...
> + css_for_each_descendant_post(node, &root->css) {
> + struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> + struct drm_cgroup_state *parent;
> + u64 active;
> +
> + if (!css_tryget_online(node))
> + goto out;
> + if (!node->parent) {
> + css_put(node);
> + continue;
> + }

I think the parent check can go first witout put'ting (RCU is enough for
node).

> + if (!css_tryget_online(node->parent)) {
> + css_put(node);
> + goto out;
> + }

Online parent is implied onlined node. So this can be removed.

...
> +
> + css_put(node);
> + }

I wonder if the first passes could be implemented with rstat flushing
and then only invoke signalling based on the data. (As that's
best-effort).

> +
> + /*
> + * 4th pass - send out over/under budget notifications.
> + */
> + css_for_each_descendant_post(node, &root->css) {
> + struct drm_cgroup_state *drmcs = css_to_drmcs(node);
> +
> + if (!css_tryget_online(node))
> + goto out_retry;
> +
> + if (drmcs->over || drmcs->over_budget)
> + drmcs_signal_budget(drmcs,
> + drmcs->active_us,
> + drmcs->per_s_budget_us);
> + drmcs->over_budget = drmcs->over;
> +
> + css_put(node);
> + }

> static struct cgroup_subsys_state *
> @@ -82,6 +365,7 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
>
> if (!parent_css) {
> drmcs = &root_drmcs.drmcs;
> + INIT_DELAYED_WORK(&root_drmcs.scan_work, scan_worker);

This reminds me discussion
https://lore.kernel.org/lkml/rlm36iypckvxol2edyr25jyo4imvlidtepbcjdaa2ouvwh3wjq@pqyuk3v2jesb/

I.e. worker needn't be initialized if controller is "invisible".

> +static void drmcs_attach(struct cgroup_taskset *tset)
> +{
> + struct drm_cgroup_state *old = old_drmcs;
> + struct cgroup_subsys_state *css;
> + struct drm_file *fpriv, *next;
> + struct drm_cgroup_state *new;
> + struct task_struct *task;
> + bool migrated = false;
> +
> + if (!old)
> + return;
> +
> + task = cgroup_taskset_first(tset, &css);
> + new = css_to_drmcs(task_css(task, drm_cgrp_id));
> + if (new == old)
> + return;
> +
> + mutex_lock(&drmcg_mutex);
> +
> + list_for_each_entry_safe(fpriv, next, &old->clients, clink) {
> + cgroup_taskset_for_each(task, css, tset) {
> + struct cgroup_subsys_state *old_css;
> +
> + if (task->flags & PF_KTHREAD)
> + continue;

I'm curious, is this because of particular kthreads? Or would it fail
somehow below? (Like people should not migrate kthreads normally, so
their expectation cannot be high.)

Michal


Attachments:
(No filename) (3.39 kB)
signature.asc (235.00 B)
Download all attachments