2023-07-12 12:14:29

by Tvrtko Ursulin

Subject: [RFC v5 00/17] DRM cgroup controller with scheduling control and memory stats

From: Tvrtko Ursulin <[email protected]>

This series contains a proposal for a DRM cgroup controller which implements
weight based hierarchical GPU usage budgeting, similar in concept to some of the
existing controllers, and also exposes GPU memory usage as a read-only field.

Motivation mostly comes from my earlier proposal where I identified that GPU
scheduling lags significantly behind what is available for CPU and IO. Whereas
back then I was proposing to somehow tie this with process nice, the feedback
was mostly that people wanted cgroups. So here it is - in the world of
heterogeneous computing pipelines I think it is time to do something about this
gap.

Code is not finished but should survive some light experimentation. I am
sharing it early since the topic has been controversial in the past. I hope to
demonstrate there are gains to be had in real world usage(*), today, and that
the concepts the proposal relies on are well enough established and stable.

*) Specifically under ChromeOS which uses cgroups to control CPU bandwidth for
VMs based on window focus status. It can be demonstrated how GPU scheduling
control can easily be integrated into that setup.

*) Another real world example later in the cover letter.

There should be no conflict between this proposal and any efforts to implement
a memory usage based controller. The skeleton DRM cgroup controller is
deliberately a pure skeleton patch to which any further functionality can be
added with no real conflicts. [In fact, perhaps scheduling is even easier to
deal with than memory accounting.]

Structure of the series is as follows:

1-5) A separate/different series which adds fdinfo memory stats support to
i915. This is only a pre-requisite for patches 16-17 so can be ignored in the
scope of this series.
6) Improve client ownership tracking in DRM core. Also a pre-requisite which
can be ignored.
7) Adds a skeleton DRM cgroup controller with no functionality.
8-11) Laying down some infrastructure to enable the controller.
12) The scheduling controller itself.
13-14) i915 support for the scheduling controller.
15) Expose GPU utilisation from the controller.
16) Add memory stats plumbing and core logic to the controller.
17) i915 support for controller memory stats.

The proposal defines a delegation of duties between the three parties: cgroup
controller, DRM core and individual drivers. Two way communication interfaces
are then defined to enable the delegation to work.

DRM scheduling soft limits
~~~~~~~~~~~~~~~~~~~~~~~~~~

Because of the heterogeneous DRM hardware and driver capabilities, soft limits
are implemented as a loose co-operative (bi-directional) interface between the
controller and DRM core.

The controller configures the GPU time allowed per group and periodically scans
the tasks belonging to each group to detect the over budget condition, at which
point it invokes a callback notifying the DRM core of the condition.

The DRM core provides an API to query per process GPU utilisation and a second
API to receive notifications from the cgroup controller when the group enters
or exits the over budget condition.

Individual DRM drivers which implement the interface are expected to act on
this in a best-effort manner only. There are no guarantees that the soft limits
will be respected.

DRM controller interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

drm.active_us
GPU time used by the group recursively including all child groups.

drm.weight
	Standard cgroup weight based control [1, 10000] used to configure the
	relative distribution of GPU time between the sibling groups.

drm.memory.stat
A nested file containing cumulative memory statistics for the whole
sub-hierarchy, broken down into separate GPUs and separate memory
regions supported by the latter.

For example::

$ cat drm.memory.stat
card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Card designation corresponds to the DRM device names and multiple line
entries can be present per card.

Memory region names should be expected to be driver specific with the
exception of 'system' which is standardised and applicable for GPUs
which can operate on system memory buffers.

Sub-keys 'resident' and 'purgeable' are optional.

Per category region usage is reported in bytes.

This builds upon the per client GPU utilisation work which landed recently for
a few drivers. My thinking is that, in principle, the intersection of drivers
which support both that and some sort of scheduling control, like priorities,
could also support this controller.

Another really interesting angle for this controller is that it mimics the same
control method used by the CPU scheduler, that is, proportional/weight based
GPU time budgeting. This makes it easy to configure and does not require a new
mental model.

However, as the introduction mentions, GPUs are much more heterogeneous and
therefore the controller uses very "soft" wording as to what it promises. The
general statement is that it can define budgets, notify clients when they are
over them, and let individual drivers implement best effort handling of those
conditions.

Delegation of duties in the implementation goes like this:

* DRM cgroup controller implements the control files, the scanning loop and
tracks the DRM clients associated with each cgroup. It provides the API the
DRM core needs to call to (de)register and migrate clients.
* DRM core defines two call-backs which the core calls directly: the first for
querying the GPU time used by a client and the second for notifying the client
that it is over budget. It calls the controller API for (de)registering clients
and migrating them between tasks on file descriptor handover.
* Individual drivers implement the above mentioned callbacks and register them
with the DRM core (a minimal sketch follows below).
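
A minimal sketch of what the driver side of this delegation amounts to,
modelled on the i915 wiring from patches 13-14 (the "foo" driver and its helper
are hypothetical):

static u64 foo_cgroup_active_time_us(struct drm_file *file)
{
        /* Sum of GPU time consumed by work submitted via this file. */
        return foo_client_busy_time_us(file->driver_priv); /* hypothetical helper */
}

static const struct drm_cgroup_ops foo_cgroup_ops = {
        .active_time_us = foo_cgroup_active_time_us,
        .signal_budget  = foo_cgroup_signal_budget, /* sketched further below */
};

/* And in the driver's struct drm_driver, guarded by CONFIG_CGROUP_DRM: */
/*      .cg_ops = &foo_cgroup_ops, */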

What I have demonstrated in practice is that when wired up to i915, in a really
primitive way where the over-budget condition simply lowers the scheduling
priority, the concept can be almost as effective as the static priority
control. I say almost because the design, where budget control depends on
periodic usage scanning, has a fundamental delay, so responsiveness will depend
on the scanning period, which may or may not be a problem for a particular use
case.

There are also interesting conversations to be had around mental models for
what GPU usage as a single number means when faced with GPUs which have
different execution engines. To an extent this is similar to the multi-core and
cgroup CPU controller problems, but definitely goes further than that.

I deliberately did not want to include any such complications in the controller
itself and left the individual drivers to handle it. For instance the i915
over-budget callback will not do anything unless the client's GPU usage is on
an oversubscribed physical engine. This enables multiple clients to be
harmlessly over budget, as long as they are not competing for the same GPU
resource.

Example usage from within a Linux desktop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard Linux distributions like Ubuntu already use cgroups heavily for
session management and that could easily be extended with the DRM controller.

After logging into the system graphically we can enable the DRM controller
throughout the cgroups hierarchy:

echo +drm > /sys/fs/cgroup/cgroup.subtree_control
echo +drm > /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo +drm > /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.subtree_control

Next we will open two SSH sessions, just so separate cgroups are handily created
by systemd for this experiment.

Roughly simultaneously we run the following two benchmarks in each session
respectively:

1)
./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

2)
vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

(The only reason for vsync off here is that I struggled to find an easily
runnable and demanding enough benchmark, or to run on a screen large enough to
make even simpler ones demanding.)

With this test we get 252fps from GpuTest and 96fps from GfxBenchmark.

The premise here is that one of these GPU intensive benchmarks is intended to
be run by the user with lower priority. Imagine kicking off some background
compute processing and continuing to use the UI for other tasks. Hence the user
will now re-run the test by first lowering the weight control of the first
session (DRM cgroup):

1)
echo 50 | sudo tee /sys/fs/cgroup/`cut -d':' -f3 /proc/self/cgroup`/drm.weight
./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

2)
vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

In this case we will see that GpuTest has recorded 208fps (~18% down) and
GfxBenchmark 114fps (18% up), demonstrating that even a very simple approach of
wiring up i915 to the DRM cgroup controller can enable external GPU scheduling
control.

* Note here that the default weight is 100, so setting 50 for the background
session is asking the controller to give it half as much GPU bandwidth as its
sibling, that is roughly a one third versus two thirds split of contended GPU
time.

v2:
* Prefaced the series with some core DRM work as suggested by Christian.
* Dropped the priority based controller for now.
* Dropped the introspection cgroup controller file.
* Implemented unused budget sharing/propagation.
* Some small fixes/tweaks as per review feedback and in general.

v3:
* Dropped one upstreamed patch.
* Logging cleanup (use DRM macros where available).

v4:
* Dropped the struct pid tracking indirection in favour of tracking individual
DRM clients directly in the controller. (Michal Koutný)
* Added boot time param for configuring the scanning period. (Tejun Heo)
* Improved spreading of unused budget to over budget clients, regardless of
their location in the tree so that all unused budget can be utilized.
* Made scanning more robust by not re-starting it on every client de-
registration and removal. Instead new baseline GPU activity data is simply
collected on those events and next scan invocation can proceed as scheduled.
* Dropped the debugging aids from the series.

v5:
* Exposed GPU utilisation.
* Added memory stats.

TODOs/Opens:

* For now (RFC) I haven't implemented the second suggestion from Tejun of
having a shadow tree which would only contain groups with DRM clients. (The
purpose being fewer nodes to traverse in the scanning loop.)

* Feedback from people interested in drm.active_us and drm.memory.stat is
required to understand the use cases and the usefulness of these fields.

Memory stats are something which was easy to add to my series, since I was
already working on the fdinfo memory stats patches, but the question is how
useful they are.

And for drm.active_us the question is how useful this "flat" GPU utilisation
metric is. I expect it could be challenging with mixed containers, e.g. one
container using GPU engine A while the other uses B. For those use cases we
might need to design something with more fine-grained visibility.

Cc: Eero Tamminen <[email protected]>

Tvrtko Ursulin (17):
drm/i915: Add ability for tracking buffer objects per client
drm/i915: Record which client owns a VM
drm/i915: Track page table backing store usage
drm/i915: Account ring buffer and context state storage
drm/i915: Implement fdinfo memory stats printing
drm: Update file owner during use
cgroup: Add the DRM cgroup controller
drm/cgroup: Track DRM clients per cgroup
drm/cgroup: Add ability to query drm cgroup GPU time
drm/cgroup: Add over budget signalling callback
drm/cgroup: Only track clients which are providing drm_cgroup_ops
cgroup/drm: Introduce weight based drm cgroup control
drm/i915: Wire up with drm controller GPU time query
drm/i915: Implement cgroup controller over budget throttling
cgroup/drm: Expose GPU utilisation
cgroup/drm: Expose memory stats
drm/i915: Wire up to the drm cgroup memory stats

Documentation/admin-guide/cgroup-v2.rst | 56 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 6 +-
drivers/gpu/drm/drm_auth.c | 3 +-
drivers/gpu/drm/drm_debugfs.c | 10 +-
drivers/gpu/drm/drm_file.c | 46 +-
drivers/gpu/drm/drm_ioctl.c | 3 +
drivers/gpu/drm/i915/gem/i915_gem_context.c | 11 +-
.../gpu/drm/i915/gem/i915_gem_context_types.h | 3 +
.../gpu/drm/i915/gem/i915_gem_execbuffer.c | 38 +-
drivers/gpu/drm/i915/gem/i915_gem_object.c | 13 +-
.../gpu/drm/i915/gem/i915_gem_object_types.h | 12 +
.../gpu/drm/i915/gem/selftests/mock_context.c | 4 +-
drivers/gpu/drm/i915/gt/intel_context.c | 14 +
drivers/gpu/drm/i915/gt/intel_gtt.c | 6 +
drivers/gpu/drm/i915/gt/intel_gtt.h | 1 +
drivers/gpu/drm/i915/i915_driver.c | 13 +
drivers/gpu/drm/i915/i915_drm_client.c | 383 ++++++++-
drivers/gpu/drm/i915/i915_drm_client.h | 59 ++
drivers/gpu/drm/nouveau/nouveau_drm.c | 5 +-
drivers/gpu/drm/vmwgfx/vmwgfx_gem.c | 6 +-
include/drm/drm_drv.h | 97 +++
include/drm/drm_file.h | 19 +-
include/linux/cgroup_drm.h | 29 +
include/linux/cgroup_subsys.h | 4 +
init/Kconfig | 7 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/drm.c | 741 ++++++++++++++++++
27 files changed, 1557 insertions(+), 33 deletions(-)
create mode 100644 include/linux/cgroup_drm.h
create mode 100644 kernel/cgroup/drm.c

--
2.39.2



2023-07-12 12:15:09

by Tvrtko Ursulin

Subject: [PATCH 07/17] cgroup: Add the DRM cgroup controller

From: Tvrtko Ursulin <[email protected]>

Skeleton controller without any functionality.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
include/linux/cgroup_drm.h | 9 ++++++
include/linux/cgroup_subsys.h | 4 +++
init/Kconfig | 7 ++++
kernel/cgroup/Makefile | 1 +
kernel/cgroup/drm.c | 60 +++++++++++++++++++++++++++++++++++
5 files changed, 81 insertions(+)
create mode 100644 include/linux/cgroup_drm.h
create mode 100644 kernel/cgroup/drm.c

diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
new file mode 100644
index 000000000000..8ef66a47619f
--- /dev/null
+++ b/include/linux/cgroup_drm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef _CGROUP_DRM_H
+#define _CGROUP_DRM_H
+
+#endif /* _CGROUP_DRM_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 445235487230..49460494a010 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
SUBSYS(misc)
#endif

+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+SUBSYS(drm)
+#endif
+
/*
* The following subsystems are not supported on the default hierarchy.
*/
diff --git a/init/Kconfig b/init/Kconfig
index f7f65af4ee12..485cff856c0b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1065,6 +1065,13 @@ config CGROUP_RDMA
Attaching processes with active RDMA resources to the cgroup
hierarchy is allowed even if can cross the hierarchy's limit.

+config CGROUP_DRM
+ bool "DRM controller"
+ help
+ Provides the DRM subsystem controller.
+
+ ...
+
config CGROUP_FREEZER
bool "Freezer controller"
help
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 12f8457ad1f9..849bd2917477 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -6,4 +6,5 @@ obj-$(CONFIG_CGROUP_PIDS) += pids.o
obj-$(CONFIG_CGROUP_RDMA) += rdma.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CGROUP_MISC) += misc.o
+obj-$(CONFIG_CGROUP_DRM) += drm.o
obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
new file mode 100644
index 000000000000..02c8eaa633d3
--- /dev/null
+++ b/kernel/cgroup/drm.c
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_drm.h>
+#include <linux/slab.h>
+
+struct drm_cgroup_state {
+ struct cgroup_subsys_state css;
+};
+
+struct drm_root_cgroup_state {
+ struct drm_cgroup_state drmcs;
+};
+
+static struct drm_root_cgroup_state root_drmcs;
+
+static inline struct drm_cgroup_state *
+css_to_drmcs(struct cgroup_subsys_state *css)
+{
+ return container_of(css, struct drm_cgroup_state, css);
+}
+
+static void drmcs_free(struct cgroup_subsys_state *css)
+{
+ struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+
+ if (drmcs != &root_drmcs.drmcs)
+ kfree(drmcs);
+}
+
+static struct cgroup_subsys_state *
+drmcs_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct drm_cgroup_state *drmcs;
+
+ if (!parent_css) {
+ drmcs = &root_drmcs.drmcs;
+ } else {
+ drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
+ if (!drmcs)
+ return ERR_PTR(-ENOMEM);
+ }
+
+ return &drmcs->css;
+}
+
+struct cftype files[] = {
+ { } /* Zero entry terminates. */
+};
+
+struct cgroup_subsys drm_cgrp_subsys = {
+ .css_alloc = drmcs_alloc,
+ .css_free = drmcs_free,
+ .early_init = false,
+ .legacy_cftypes = files,
+ .dfl_cftypes = files,
+};
--
2.39.2


2023-07-12 12:16:46

by Tvrtko Ursulin

Subject: [PATCH 09/17] drm/cgroup: Add ability to query drm cgroup GPU time

From: Tvrtko Ursulin <[email protected]>

Add a driver callback and core helper which allow querying the time spent
on GPUs for processes belonging to a group.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
include/drm/drm_drv.h | 28 ++++++++++++++++++++++++++++
kernel/cgroup/drm.c | 20 ++++++++++++++++++++
2 files changed, 48 insertions(+)

diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index b77f2c7275b7..3116fa4dbc48 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -151,6 +151,24 @@ enum drm_driver_feature {
DRIVER_HAVE_IRQ = BIT(30),
};

+/**
+ * struct drm_cgroup_ops
+ *
+ * This structure contains a number of callbacks that drivers can provide if
+ * they are able to support one or more of the functionalities implemented by
+ * the DRM cgroup controller.
+ */
+struct drm_cgroup_ops {
+ /**
+ * @active_time_us:
+ *
+ * Optional callback for reporting the GPU time consumed by this client.
+ *
+ * Used by the DRM core when queried by the DRM cgroup controller.
+ */
+ u64 (*active_time_us) (struct drm_file *);
+};
+
/**
* struct drm_driver - DRM driver structure
*
@@ -428,6 +446,16 @@ struct drm_driver {
*/
const struct file_operations *fops;

+#ifdef CONFIG_CGROUP_DRM
+ /**
+ * @cg_ops:
+ *
+ * Optional pointer to driver callbacks facilitating integration with
+ * the DRM cgroup controller.
+ */
+ const struct drm_cgroup_ops *cg_ops;
+#endif
+
#ifdef CONFIG_DRM_LEGACY
/* Everything below here is for legacy driver, never use! */
/* private: */
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index d702be1b441f..acdb76635b60 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -9,6 +9,8 @@
#include <linux/mutex.h>
#include <linux/slab.h>

+#include <drm/drm_drv.h>
+
struct drm_cgroup_state {
struct cgroup_subsys_state css;

@@ -31,6 +33,24 @@ css_to_drmcs(struct cgroup_subsys_state *css)
return container_of(css, struct drm_cgroup_state, css);
}

+static u64 drmcs_get_active_time_us(struct drm_cgroup_state *drmcs)
+{
+ struct drm_file *fpriv;
+ u64 total = 0;
+
+ lockdep_assert_held(&drmcg_mutex);
+
+ list_for_each_entry(fpriv, &drmcs->clients, clink) {
+ const struct drm_cgroup_ops *cg_ops =
+ fpriv->minor->dev->driver->cg_ops;
+
+ if (cg_ops && cg_ops->active_time_us)
+ total += cg_ops->active_time_us(fpriv);
+ }
+
+ return total;
+}
+
static void drmcs_free(struct cgroup_subsys_state *css)
{
struct drm_cgroup_state *drmcs = css_to_drmcs(css);
--
2.39.2


2023-07-12 12:19:00

by Tvrtko Ursulin

Subject: [PATCH 11/17] drm/cgroup: Only track clients which are providing drm_cgroup_ops

From: Tvrtko Ursulin <[email protected]>

To reduce the amount of tracking going on, especially with drivers which
will not support any sort of control from the drm cgroup controller side,
let us express the functionality as opt-in and use the presence of
drm_cgroup_ops as the activation criterion.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
kernel/cgroup/drm.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 68f31797c4f0..60e1f3861576 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -97,6 +97,9 @@ void drmcgroup_client_open(struct drm_file *file_priv)
{
struct drm_cgroup_state *drmcs;

+ if (!file_priv->minor->dev->driver->cg_ops)
+ return;
+
drmcs = css_to_drmcs(task_get_css(current, drm_cgrp_id));

mutex_lock(&drmcg_mutex);
@@ -112,6 +115,9 @@ void drmcgroup_client_close(struct drm_file *file_priv)

drmcs = css_to_drmcs(file_priv->__css);

+ if (!file_priv->minor->dev->driver->cg_ops)
+ return;
+
mutex_lock(&drmcg_mutex);
list_del(&file_priv->clink);
file_priv->__css = NULL;
@@ -126,6 +132,9 @@ void drmcgroup_client_migrate(struct drm_file *file_priv)
struct drm_cgroup_state *src, *dst;
struct cgroup_subsys_state *old;

+ if (!file_priv->minor->dev->driver->cg_ops)
+ return;
+
mutex_lock(&drmcg_mutex);

old = file_priv->__css;
--
2.39.2


2023-07-12 12:20:06

by Tvrtko Ursulin

Subject: [PATCH 17/17] drm/i915: Wire up to the drm cgroup memory stats

From: Tvrtko Ursulin <[email protected]>

Simply refactor the existing helpers which collate the data for fdinfo
and share them with thin drm cgroup controller callbacks.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/i915/i915_driver.c | 4 +
drivers/gpu/drm/i915/i915_drm_client.c | 183 ++++++++++++++++---------
drivers/gpu/drm/i915/i915_drm_client.h | 11 +-
3 files changed, 129 insertions(+), 69 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index 3b9d47c2097b..a299edc9eb79 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1795,6 +1795,10 @@ static const struct drm_ioctl_desc i915_ioctls[] = {
static const struct drm_cgroup_ops i915_drm_cgroup_ops = {
.active_time_us = i915_drm_cgroup_get_active_time_us,
.signal_budget = i915_drm_cgroup_signal_budget,
+
+ .num_memory_regions = i915_drm_cgroup_num_memory_regions,
+ .memory_region_name = i915_drm_cgroup_memory_region_name,
+ .memory_stats = i915_drm_cgroup_memory_stats,
};
#endif

diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 9be007b10523..c54b1ac753c6 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -29,7 +29,7 @@ struct i915_drm_client *i915_drm_client_alloc(void)
kref_init(&client->kref);
spin_lock_init(&client->ctx_lock);
INIT_LIST_HEAD(&client->ctx_list);
-#ifdef CONFIG_PROC_FS
+#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
spin_lock_init(&client->objects_lock);
INIT_LIST_HEAD(&client->objects_list);
#endif
@@ -46,6 +46,89 @@ void __i915_drm_client_free(struct kref *kref)
}

#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
+static void
+obj_meminfo(struct drm_i915_gem_object *obj,
+ struct drm_memory_stats *stats,
+ unsigned int num)
+{
+ struct intel_memory_region *mr;
+ u64 sz = obj->base.size;
+ enum intel_region_id id;
+ unsigned int i;
+
+ /* Attribute size and shared to all possible memory regions. */
+ for (i = 0; i < obj->mm.n_placements; i++) {
+ mr = obj->mm.placements[i];
+ id = mr->id;
+
+ if (WARN_ON_ONCE(id >= num))
+ return;
+
+ if (obj->base.handle_count > 1)
+ stats[id].shared += sz;
+ else
+ stats[id].private += sz;
+ }
+
+ /* Attribute other categories to only the current region. */
+ mr = obj->mm.region;
+ if (mr)
+ id = mr->id;
+ else
+ id = INTEL_REGION_SMEM;
+
+ if (WARN_ON_ONCE(id >= num))
+ return;
+
+ if (!obj->mm.n_placements) {
+ if (obj->base.handle_count > 1)
+ stats[id].shared += sz;
+ else
+ stats[id].private += sz;
+ }
+
+ if (i915_gem_object_has_pages(obj)) {
+ stats[id].resident += sz;
+
+ if (!dma_resv_test_signaled(obj->base.resv,
+ dma_resv_usage_rw(true)))
+ stats[id].active += sz;
+ else if (i915_gem_object_is_shrinkable(obj) &&
+ obj->mm.madv == I915_MADV_DONTNEED)
+ stats[id].purgeable += sz;
+ }
+}
+
+static void
+memory_stats(struct drm_file *file,
+ struct drm_memory_stats *stats,
+ unsigned int num)
+{
+ struct drm_i915_file_private *fpriv = file->driver_priv;
+ struct i915_drm_client *client = fpriv->client;
+ struct drm_i915_gem_object *obj;
+ struct list_head *pos;
+ unsigned int id;
+
+ /* Public objects. */
+ spin_lock(&file->table_lock);
+ idr_for_each_entry(&file->object_idr, obj, id)
+ obj_meminfo(obj, stats, num);
+ spin_unlock(&file->table_lock);
+
+ /* Internal objects. */
+ rcu_read_lock();
+ list_for_each_rcu(pos, &client->objects_list) {
+ obj = i915_gem_object_get_rcu(list_entry(pos, typeof(*obj),
+ client_link));
+ if (!obj)
+ continue;
+ obj_meminfo(obj, stats, num);
+ i915_gem_object_put(obj);
+ }
+ rcu_read_unlock();
+}
+
static const char * const uabi_class_names[] = {
[I915_ENGINE_CLASS_RENDER] = "render",
[I915_ENGINE_CLASS_COPY] = "copy",
@@ -255,83 +338,47 @@ int i915_drm_cgroup_signal_budget(struct drm_file *file, u64 usage, u64 budget)

return ret;
}
+
+unsigned int i915_drm_cgroup_num_memory_regions(const struct drm_device *dev)
+{
+ return INTEL_REGION_UNKNOWN;
+}
+
+const char *i915_drm_cgroup_memory_region_name(const struct drm_device *dev,
+ unsigned int index)
+{
+ const struct drm_i915_private *i915 = to_i915(dev);
+
+ if (index < ARRAY_SIZE(i915->mm.regions)) {
+ struct intel_memory_region *mr = i915->mm.regions[index];
+
+ if (mr)
+ return mr->name;
+ }
+
+ return NULL;
+}
+
+unsigned int i915_drm_cgroup_memory_stats(struct drm_file *file,
+ struct drm_memory_stats *stats,
+ unsigned int num)
+{
+ memory_stats(file, stats, num);
+
+ return DRM_GEM_OBJECT_RESIDENT | DRM_GEM_OBJECT_PURGEABLE;
+}
#endif

#ifdef CONFIG_PROC_FS
-static void
-obj_meminfo(struct drm_i915_gem_object *obj,
- struct drm_memory_stats stats[INTEL_REGION_UNKNOWN])
-{
- struct intel_memory_region *mr;
- u64 sz = obj->base.size;
- enum intel_region_id id;
- unsigned int i;
-
- /* Attribute size and shared to all possible memory regions. */
- for (i = 0; i < obj->mm.n_placements; i++) {
- mr = obj->mm.placements[i];
- id = mr->id;
-
- if (obj->base.handle_count > 1)
- stats[id].shared += sz;
- else
- stats[id].private += sz;
- }
-
- /* Attribute other categories to only the current region. */
- mr = obj->mm.region;
- if (mr)
- id = mr->id;
- else
- id = INTEL_REGION_SMEM;
-
- if (!obj->mm.n_placements) {
- if (obj->base.handle_count > 1)
- stats[id].shared += sz;
- else
- stats[id].private += sz;
- }
-
- if (i915_gem_object_has_pages(obj)) {
- stats[id].resident += sz;
-
- if (!dma_resv_test_signaled(obj->base.resv,
- dma_resv_usage_rw(true)))
- stats[id].active += sz;
- else if (i915_gem_object_is_shrinkable(obj) &&
- obj->mm.madv == I915_MADV_DONTNEED)
- stats[id].purgeable += sz;
- }
-}
-
static void show_meminfo(struct drm_printer *p, struct drm_file *file)
{
struct drm_memory_stats stats[INTEL_REGION_UNKNOWN] = {};
struct drm_i915_file_private *fpriv = file->driver_priv;
- struct i915_drm_client *client = fpriv->client;
struct drm_i915_private *i915 = fpriv->i915;
- struct drm_i915_gem_object *obj;
struct intel_memory_region *mr;
- struct list_head *pos;
unsigned int id;

- /* Public objects. */
- spin_lock(&file->table_lock);
- idr_for_each_entry(&file->object_idr, obj, id)
- obj_meminfo(obj, stats);
- spin_unlock(&file->table_lock);
-
- /* Internal objects. */
- rcu_read_lock();
- list_for_each_rcu(pos, &client->objects_list) {
- obj = i915_gem_object_get_rcu(list_entry(pos, typeof(*obj),
- client_link));
- if (!obj)
- continue;
- obj_meminfo(obj, stats);
- i915_gem_object_put(obj);
- }
- rcu_read_unlock();
+ memory_stats(file, stats, ARRAY_SIZE(stats));

for_each_memory_region(mr, i915, id)
drm_print_memory_stats(p,
@@ -382,7 +429,9 @@ void i915_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++)
show_client_class(p, i915, file_priv->client, i);
}
+#endif

+#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
void i915_drm_client_add_object(struct i915_drm_client *client,
struct drm_i915_gem_object *obj)
{
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 6eadc9596b8f..8b34be25e887 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -29,7 +29,7 @@ struct i915_drm_client {
spinlock_t ctx_lock; /* For add/remove from ctx_list. */
struct list_head ctx_list; /* List of contexts belonging to client. */

-#ifdef CONFIG_PROC_FS
+#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
/**
* @objects_lock: lock protecting @objects_list
*/
@@ -74,7 +74,7 @@ struct i915_drm_client *i915_drm_client_alloc(void);

void i915_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file);

-#ifdef CONFIG_PROC_FS
+#if defined(CONFIG_PROC_FS) || defined(CONFIG_CGROUP_DRM)
void i915_drm_client_add_object(struct i915_drm_client *client,
struct drm_i915_gem_object *obj);
bool i915_drm_client_remove_object(struct drm_i915_gem_object *obj);
@@ -101,4 +101,11 @@ u64 i915_drm_cgroup_get_active_time_us(struct drm_file *file);
int i915_drm_cgroup_signal_budget(struct drm_file *file,
u64 usage, u64 budget);

+unsigned int i915_drm_cgroup_num_memory_regions(const struct drm_device *);
+const char *i915_drm_cgroup_memory_region_name(const struct drm_device *,
+ unsigned int index);
+unsigned int i915_drm_cgroup_memory_stats(struct drm_file *,
+ struct drm_memory_stats *,
+ unsigned int num);
+
#endif /* !__I915_DRM_CLIENT_H__ */
--
2.39.2


2023-07-12 12:20:29

by Tvrtko Ursulin

Subject: [PATCH 06/17] drm: Update file owner during use

From: Tvrtko Ursulin <[email protected]>

With the typical model where the display server opens the file descriptor
and then hands it over to the client(*), we were showing stale data in
debugfs.

Fix it by updating the drm_file->pid on ioctl access from a different
process.

The field is also made RCU protected to allow for lockless readers. Update
side is protected with dev->filelist_mutex.

Before:

$ cat /sys/kernel/debug/dri/0/clients
command pid dev master a uid magic
Xorg 2344 0 y y 0 0
Xorg 2344 0 n y 0 2
Xorg 2344 0 n y 0 3
Xorg 2344 0 n y 0 4

After:

$ cat /sys/kernel/debug/dri/0/clients
command tgid dev master a uid magic
Xorg 830 0 y y 0 0
xfce4-session 880 0 n y 0 1
xfwm4 943 0 n y 0 2
neverball 1095 0 n y 0 3

*)
A more detailed and historically accurate description of the various handover
implementations, kindly provided by Emil Velikov:

"""
The traditional model, the server was the orchestrator managing the
primary device node. From the fd, to the master status and
authentication. But looking at the fd alone, this has varied across
the years.

IIRC in the DRI1 days, Xorg (libdrm really) would have a list of open
fd(s) and reuse those whenever needed, DRI2 the client was responsible
for open() themselves and with DRI3 the fd was passed to the client.

Around the inception of DRI3 and systemd-logind, the latter became
another possible orchestrator. Whereby Xorg and Wayland compositors
could ask it for the fd. For various reasons (hysterical and genuine
ones) Xorg has a fallback path going the open(), whereas Wayland
compositors are moving to solely relying on logind... some never had
fallback even.

Over the past few years, more projects have emerged which provide
functionality similar (be that on API level, Dbus, or otherwise) to
systemd-logind.
"""

v2:
* Fixed typo in commit text and added a fine historical explanation
from Emil.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Daniel Vetter <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Emil Velikov <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 6 ++--
drivers/gpu/drm/drm_auth.c | 3 +-
drivers/gpu/drm/drm_debugfs.c | 10 ++++---
drivers/gpu/drm/drm_file.c | 40 +++++++++++++++++++++++--
drivers/gpu/drm/drm_ioctl.c | 3 ++
drivers/gpu/drm/nouveau/nouveau_drm.c | 5 +++-
drivers/gpu/drm/vmwgfx/vmwgfx_gem.c | 6 ++--
include/drm/drm_file.h | 13 ++++++--
8 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 74055cba3dc9..849097dff02b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -963,6 +963,7 @@ static int amdgpu_debugfs_gem_info_show(struct seq_file *m, void *unused)
list_for_each_entry(file, &dev->filelist, lhead) {
struct task_struct *task;
struct drm_gem_object *gobj;
+ struct pid *pid;
int id;

/*
@@ -972,8 +973,9 @@ static int amdgpu_debugfs_gem_info_show(struct seq_file *m, void *unused)
* Therefore, we need to protect this ->comm access using RCU.
*/
rcu_read_lock();
- task = pid_task(file->pid, PIDTYPE_TGID);
- seq_printf(m, "pid %8d command %s:\n", pid_nr(file->pid),
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ seq_printf(m, "pid %8d command %s:\n", pid_nr(pid),
task ? task->comm : "<unknown>");
rcu_read_unlock();

diff --git a/drivers/gpu/drm/drm_auth.c b/drivers/gpu/drm/drm_auth.c
index cf92a9ae8034..2ed2585ded37 100644
--- a/drivers/gpu/drm/drm_auth.c
+++ b/drivers/gpu/drm/drm_auth.c
@@ -235,7 +235,8 @@ static int drm_new_set_master(struct drm_device *dev, struct drm_file *fpriv)
static int
drm_master_check_perm(struct drm_device *dev, struct drm_file *file_priv)
{
- if (file_priv->pid == task_pid(current) && file_priv->was_master)
+ if (file_priv->was_master &&
+ rcu_access_pointer(file_priv->pid) == task_pid(current))
return 0;

if (!capable(CAP_SYS_ADMIN))
diff --git a/drivers/gpu/drm/drm_debugfs.c b/drivers/gpu/drm/drm_debugfs.c
index 4855230ba2c6..b46f5ceb24c6 100644
--- a/drivers/gpu/drm/drm_debugfs.c
+++ b/drivers/gpu/drm/drm_debugfs.c
@@ -90,15 +90,17 @@ static int drm_clients_info(struct seq_file *m, void *data)
*/
mutex_lock(&dev->filelist_mutex);
list_for_each_entry_reverse(priv, &dev->filelist, lhead) {
- struct task_struct *task;
bool is_current_master = drm_is_current_master(priv);
+ struct task_struct *task;
+ struct pid *pid;

- rcu_read_lock(); /* locks pid_task()->comm */
- task = pid_task(priv->pid, PIDTYPE_TGID);
+ rcu_read_lock(); /* Locks priv->pid and pid_task()->comm! */
+ pid = rcu_dereference(priv->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
uid = task ? __task_cred(task)->euid : GLOBAL_ROOT_UID;
seq_printf(m, "%20s %5d %3d %c %c %5d %10u\n",
task ? task->comm : "<unknown>",
- pid_vnr(priv->pid),
+ pid_vnr(pid),
priv->minor->index,
is_current_master ? 'y' : 'n',
priv->authenticated ? 'y' : 'n',
diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 883d83bc0e3d..e692770ef6d3 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -160,7 +160,7 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)

/* Get a unique identifier for fdinfo: */
file->client_id = atomic64_inc_return(&ident);
- file->pid = get_pid(task_tgid(current));
+ rcu_assign_pointer(file->pid, get_pid(task_tgid(current)));
file->minor = minor;

/* for compatibility root is always authenticated */
@@ -200,7 +200,7 @@ struct drm_file *drm_file_alloc(struct drm_minor *minor)
drm_syncobj_release(file);
if (drm_core_check_feature(dev, DRIVER_GEM))
drm_gem_release(dev, file);
- put_pid(file->pid);
+ put_pid(rcu_access_pointer(file->pid));
kfree(file);

return ERR_PTR(ret);
@@ -291,7 +291,7 @@ void drm_file_free(struct drm_file *file)

WARN_ON(!list_empty(&file->event_list));

- put_pid(file->pid);
+ put_pid(rcu_access_pointer(file->pid));
kfree(file);
}

@@ -505,6 +505,40 @@ int drm_release(struct inode *inode, struct file *filp)
}
EXPORT_SYMBOL(drm_release);

+void drm_file_update_pid(struct drm_file *filp)
+{
+ struct drm_device *dev;
+ struct pid *pid, *old;
+
+ /*
+ * Master nodes need to keep the original ownership in order for
+ * drm_master_check_perm to keep working correctly. (See comment in
+ * drm_auth.c.)
+ */
+ if (filp->was_master)
+ return;
+
+ pid = task_tgid(current);
+
+ /*
+ * Quick unlocked check since the model is a single handover followed by
+ * exclusive repeated use.
+ */
+ if (pid == rcu_access_pointer(filp->pid))
+ return;
+
+ dev = filp->minor->dev;
+ mutex_lock(&dev->filelist_mutex);
+ old = rcu_replace_pointer(filp->pid, pid, 1);
+ mutex_unlock(&dev->filelist_mutex);
+
+ if (pid != old) {
+ get_pid(pid);
+ synchronize_rcu();
+ put_pid(old);
+ }
+}
+
/**
* drm_release_noglobal - release method for DRM file
* @inode: device inode
diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
index 8e9afe7af19c..78f9387e79fe 100644
--- a/drivers/gpu/drm/drm_ioctl.c
+++ b/drivers/gpu/drm/drm_ioctl.c
@@ -774,6 +774,9 @@ long drm_ioctl_kernel(struct file *file, drm_ioctl_t *func, void *kdata,
struct drm_device *dev = file_priv->minor->dev;
int retcode;

+ /* Update drm_file owner if fd was passed along. */
+ drm_file_update_pid(file_priv);
+
if (drm_dev_is_unplugged(dev))
return -ENODEV;

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index ca3bb8075357..cd15759815a8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -1101,7 +1101,10 @@ nouveau_drm_open(struct drm_device *dev, struct drm_file *fpriv)
}

get_task_comm(tmpname, current);
- snprintf(name, sizeof(name), "%s[%d]", tmpname, pid_nr(fpriv->pid));
+ rcu_read_lock();
+ snprintf(name, sizeof(name), "%s[%d]",
+ tmpname, pid_nr(rcu_dereference(fpriv->pid)));
+ rcu_read_unlock();

if (!(cli = kzalloc(sizeof(*cli), GFP_KERNEL))) {
ret = -ENOMEM;
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
index c0da89e16e6f..a07e5b7e2f2f 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
@@ -232,6 +232,7 @@ static int vmw_debugfs_gem_info_show(struct seq_file *m, void *unused)
list_for_each_entry(file, &dev->filelist, lhead) {
struct task_struct *task;
struct drm_gem_object *gobj;
+ struct pid *pid;
int id;

/*
@@ -241,8 +242,9 @@ static int vmw_debugfs_gem_info_show(struct seq_file *m, void *unused)
* Therefore, we need to protect this ->comm access using RCU.
*/
rcu_read_lock();
- task = pid_task(file->pid, PIDTYPE_TGID);
- seq_printf(m, "pid %8d command %s:\n", pid_nr(file->pid),
+ pid = rcu_dereference(file->pid);
+ task = pid_task(pid, PIDTYPE_TGID);
+ seq_printf(m, "pid %8d command %s:\n", pid_nr(pid),
task ? task->comm : "<unknown>");
rcu_read_unlock();

diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index 966912053cb0..c76249d5467e 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -256,8 +256,15 @@ struct drm_file {
/** @master_lookup_lock: Serializes @master. */
spinlock_t master_lookup_lock;

- /** @pid: Process that opened this file. */
- struct pid *pid;
+ /**
+ * @pid: Process that is using this file.
+ *
+ * Must only be dereferenced under a rcu_read_lock or equivalent.
+ *
+ * Updates are guarded with dev->filelist_mutex and reference must be
+ * dropped after a RCU grace period to accommodate lockless readers.
+ */
+ struct pid __rcu *pid;

/** @client_id: A unique id for fdinfo */
u64 client_id;
@@ -420,6 +427,8 @@ static inline bool drm_is_accel_client(const struct drm_file *file_priv)
return file_priv->minor->type == DRM_MINOR_ACCEL;
}

+void drm_file_update_pid(struct drm_file *);
+
int drm_open(struct inode *inode, struct file *filp);
int drm_open_helper(struct file *filp, struct drm_minor *minor);
ssize_t drm_read(struct file *filp, char __user *buffer,
--
2.39.2


2023-07-12 12:21:07

by Tvrtko Ursulin

Subject: [PATCH 05/17] drm/i915: Implement fdinfo memory stats printing

From: Tvrtko Ursulin <[email protected]>

Use the newly added drm_print_memory_stats helper to show memory
utilisation of our objects in drm/driver specific fdinfo output.

To collect the stats we walk the per memory regions object lists
and accumulate object size into the respective drm_memory_stats
categories.

Objects with multiple possible placements are reported in multiple
regions for total and shared sizes, while other categories are
counted only for the currently active region.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Aravind Iddamsetty <[email protected]>
Cc: Rob Clark <[email protected]>
---
drivers/gpu/drm/i915/i915_drm_client.c | 85 ++++++++++++++++++++++++++
1 file changed, 85 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index a61356012df8..9e7a6075ee25 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -45,6 +45,89 @@ void __i915_drm_client_free(struct kref *kref)
}

#ifdef CONFIG_PROC_FS
+static void
+obj_meminfo(struct drm_i915_gem_object *obj,
+ struct drm_memory_stats stats[INTEL_REGION_UNKNOWN])
+{
+ struct intel_memory_region *mr;
+ u64 sz = obj->base.size;
+ enum intel_region_id id;
+ unsigned int i;
+
+ /* Attribute size and shared to all possible memory regions. */
+ for (i = 0; i < obj->mm.n_placements; i++) {
+ mr = obj->mm.placements[i];
+ id = mr->id;
+
+ if (obj->base.handle_count > 1)
+ stats[id].shared += sz;
+ else
+ stats[id].private += sz;
+ }
+
+ /* Attribute other categories to only the current region. */
+ mr = obj->mm.region;
+ if (mr)
+ id = mr->id;
+ else
+ id = INTEL_REGION_SMEM;
+
+ if (!obj->mm.n_placements) {
+ if (obj->base.handle_count > 1)
+ stats[id].shared += sz;
+ else
+ stats[id].private += sz;
+ }
+
+ if (i915_gem_object_has_pages(obj)) {
+ stats[id].resident += sz;
+
+ if (!dma_resv_test_signaled(obj->base.resv,
+ dma_resv_usage_rw(true)))
+ stats[id].active += sz;
+ else if (i915_gem_object_is_shrinkable(obj) &&
+ obj->mm.madv == I915_MADV_DONTNEED)
+ stats[id].purgeable += sz;
+ }
+}
+
+static void show_meminfo(struct drm_printer *p, struct drm_file *file)
+{
+ struct drm_memory_stats stats[INTEL_REGION_UNKNOWN] = {};
+ struct drm_i915_file_private *fpriv = file->driver_priv;
+ struct i915_drm_client *client = fpriv->client;
+ struct drm_i915_private *i915 = fpriv->i915;
+ struct drm_i915_gem_object *obj;
+ struct intel_memory_region *mr;
+ struct list_head *pos;
+ unsigned int id;
+
+ /* Public objects. */
+ spin_lock(&file->table_lock);
+ idr_for_each_entry(&file->object_idr, obj, id)
+ obj_meminfo(obj, stats);
+ spin_unlock(&file->table_lock);
+
+ /* Internal objects. */
+ rcu_read_lock();
+ list_for_each_rcu(pos, &client->objects_list) {
+ obj = i915_gem_object_get_rcu(list_entry(pos, typeof(*obj),
+ client_link));
+ if (!obj)
+ continue;
+ obj_meminfo(obj, stats);
+ i915_gem_object_put(obj);
+ }
+ rcu_read_unlock();
+
+ for_each_memory_region(mr, i915, id)
+ drm_print_memory_stats(p,
+ &stats[id],
+ DRM_GEM_OBJECT_RESIDENT |
+ DRM_GEM_OBJECT_PURGEABLE,
+ mr->name);
+}
+
static const char * const uabi_class_names[] = {
[I915_ENGINE_CLASS_RENDER] = "render",
[I915_ENGINE_CLASS_COPY] = "copy",
@@ -106,6 +189,8 @@ void i915_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
* ******************************************************************
*/

+ show_meminfo(p, file);
+
if (GRAPHICS_VER(i915) < 8)
return;

--
2.39.2


2023-07-12 12:21:37

by Tvrtko Ursulin

Subject: [PATCH 08/17] drm/cgroup: Track DRM clients per cgroup

From: Tvrtko Ursulin <[email protected]>

To enable propagation of settings from the cgroup DRM controller to DRM
and vice-versa, we need to start tracking to which cgroups DRM clients
belong.

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/drm_file.c | 6 ++++
include/drm/drm_file.h | 6 ++++
include/linux/cgroup_drm.h | 20 ++++++++++++
kernel/cgroup/drm.c | 62 +++++++++++++++++++++++++++++++++++++-
4 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index e692770ef6d3..794ffb487472 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -32,6 +32,7 @@
*/

#include <linux/anon_inodes.h>
+#include <linux/cgroup_drm.h>
#include <linux/dma-fence.h>
#include <linux/file.h>
#include <linux/module.h>
@@ -304,6 +305,8 @@ static void drm_close_helper(struct file *filp)
list_del(&file_priv->lhead);
mutex_unlock(&dev->filelist_mutex);

+ drmcgroup_client_close(file_priv);
+
drm_file_free(file_priv);
}

@@ -367,6 +370,8 @@ int drm_open_helper(struct file *filp, struct drm_minor *minor)
list_add(&priv->lhead, &dev->filelist);
mutex_unlock(&dev->filelist_mutex);

+ drmcgroup_client_open(priv);
+
#ifdef CONFIG_DRM_LEGACY
#ifdef __alpha__
/*
@@ -533,6 +538,7 @@ void drm_file_update_pid(struct drm_file *filp)
mutex_unlock(&dev->filelist_mutex);

if (pid != old) {
+ drmcgroup_client_migrate(filp);
get_pid(pid);
synchronize_rcu();
put_pid(old);
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index c76249d5467e..e8fb3a39d482 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -30,6 +30,7 @@
#ifndef _DRM_FILE_H_
#define _DRM_FILE_H_

+#include <linux/cgroup.h>
#include <linux/types.h>
#include <linux/completion.h>
#include <linux/idr.h>
@@ -283,6 +284,11 @@ struct drm_file {
/** @minor: &struct drm_minor for this file. */
struct drm_minor *minor;

+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+ struct cgroup_subsys_state *__css;
+ struct list_head clink;
+#endif
+
/**
* @object_idr:
*
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h
index 8ef66a47619f..176431842d8e 100644
--- a/include/linux/cgroup_drm.h
+++ b/include/linux/cgroup_drm.h
@@ -6,4 +6,24 @@
#ifndef _CGROUP_DRM_H
#define _CGROUP_DRM_H

+#include <drm/drm_file.h>
+
+#if IS_ENABLED(CONFIG_CGROUP_DRM)
+void drmcgroup_client_open(struct drm_file *file_priv);
+void drmcgroup_client_close(struct drm_file *file_priv);
+void drmcgroup_client_migrate(struct drm_file *file_priv);
+#else
+static inline void drmcgroup_client_open(struct drm_file *file_priv)
+{
+}
+
+static inline void drmcgroup_client_close(struct drm_file *file_priv)
+{
+}
+
+static inline void drmcgroup_client_migrate(struct drm_file *file_priv)
+{
+}
+#endif
+
#endif /* _CGROUP_DRM_H */
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 02c8eaa633d3..d702be1b441f 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -5,17 +5,25 @@

#include <linux/cgroup.h>
#include <linux/cgroup_drm.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
#include <linux/slab.h>

struct drm_cgroup_state {
struct cgroup_subsys_state css;
+
+ struct list_head clients;
};

struct drm_root_cgroup_state {
struct drm_cgroup_state drmcs;
};

-static struct drm_root_cgroup_state root_drmcs;
+static struct drm_root_cgroup_state root_drmcs = {
+ .drmcs.clients = LIST_HEAD_INIT(root_drmcs.drmcs.clients),
+};
+
+static DEFINE_MUTEX(drmcg_mutex);

static inline struct drm_cgroup_state *
css_to_drmcs(struct cgroup_subsys_state *css)
@@ -42,11 +50,63 @@ drmcs_alloc(struct cgroup_subsys_state *parent_css)
drmcs = kzalloc(sizeof(*drmcs), GFP_KERNEL);
if (!drmcs)
return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&drmcs->clients);
}

return &drmcs->css;
}

+void drmcgroup_client_open(struct drm_file *file_priv)
+{
+ struct drm_cgroup_state *drmcs;
+
+ drmcs = css_to_drmcs(task_get_css(current, drm_cgrp_id));
+
+ mutex_lock(&drmcg_mutex);
+ file_priv->__css = &drmcs->css; /* Keeps the reference. */
+ list_add_tail(&file_priv->clink, &drmcs->clients);
+ mutex_unlock(&drmcg_mutex);
+}
+EXPORT_SYMBOL_GPL(drmcgroup_client_open);
+
+void drmcgroup_client_close(struct drm_file *file_priv)
+{
+ struct drm_cgroup_state *drmcs;
+
+ drmcs = css_to_drmcs(file_priv->__css);
+
+ mutex_lock(&drmcg_mutex);
+ list_del(&file_priv->clink);
+ file_priv->__css = NULL;
+ mutex_unlock(&drmcg_mutex);
+
+ css_put(&drmcs->css);
+}
+EXPORT_SYMBOL_GPL(drmcgroup_client_close);
+
+void drmcgroup_client_migrate(struct drm_file *file_priv)
+{
+ struct drm_cgroup_state *src, *dst;
+ struct cgroup_subsys_state *old;
+
+ mutex_lock(&drmcg_mutex);
+
+ old = file_priv->__css;
+ src = css_to_drmcs(old);
+ dst = css_to_drmcs(task_get_css(current, drm_cgrp_id));
+
+ if (src != dst) {
+ file_priv->__css = &dst->css; /* Keeps the reference. */
+ list_move_tail(&file_priv->clink, &dst->clients);
+ }
+
+ mutex_unlock(&drmcg_mutex);
+
+ css_put(old);
+}
+EXPORT_SYMBOL_GPL(drmcgroup_client_migrate);
+
struct cftype files[] = {
{ } /* Zero entry terminates. */
};
--
2.39.2


2023-07-12 12:46:34

by Tvrtko Ursulin

Subject: [PATCH 16/17] cgroup/drm: Expose memory stats

From: Tvrtko Ursulin <[email protected]>

With a few DRM drivers exposing per client memory stats via the fdinfo
interface already, we can add support for exposing the same data (albeit
aggregated for cgroup hierarchies) via the drm cgroup controller.

Add some driver callbacks and controller code to use them, walking the
sub-tree, collating the numbers, and presenting them in a new field
named drm.memory.stat.

Example file content:

$ cat drm.memory.stat
card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Data is generated on demand for simplicity of implementation, i.e. no running
totals are kept or accounted during migrations and such. Various
optimisations such as cheaper collection of data are possible but
deliberately left out for now.

Overall, the feature is deemed to be useful to container orchestration
software (and manual management).

Limits, either soft or hard, are not envisaged to be implemented on top of
this approach due to the on demand nature of collecting the stats.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Eero Tamminen <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 22 ++++
include/drm/drm_drv.h | 61 ++++++++++
kernel/cgroup/drm.c | 149 ++++++++++++++++++++++++
3 files changed, 232 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index bbe986366f4a..1891c7d98206 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2452,6 +2452,28 @@ DRM scheduling soft limits interface files
Standard cgroup weight based control [1, 10000] used to configure the
relative distributing of GPU time between the sibling groups.

+ drm.memory.stat
+ A nested file containing cumulative memory statistics for the whole
+ sub-hierarchy, broken down into separate GPUs and separate memory
+ regions supported by the latter.
+
+ For example::
+
+ $ cat drm.memory.stat
+ card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
+ card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
+
+ Card designation corresponds to the DRM device names and multiple line
+ entries can be present per card.
+
+ Memory region names should be expected to be driver specific with the
+ exception of 'system' which is standardised and applicable for GPUs
+ which can operate on system memory buffers.
+
+ Sub-keys 'resident' and 'purgeable' are optional.
+
+ Per category region usage is reported in bytes.
+
Misc
----

diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 29e11a87bf75..2ea9a46b5031 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -41,6 +41,7 @@ struct drm_minor;
struct dma_buf;
struct dma_buf_attachment;
struct drm_display_mode;
+struct drm_memory_stats;
struct drm_mode_create_dumb;
struct drm_printer;
struct sg_table;
@@ -175,6 +176,66 @@ struct drm_cgroup_ops {
* messages sent by the DRM cgroup controller.
*/
int (*signal_budget) (struct drm_file *, u64 used, u64 budget);
+
+
+ /**
+ * @num_memory_regions:
+ *
+ * Optional callback reporting the number of memory regions driver
+ * supports.
+ *
+ * Callback is allowed to report a larger number than present memory
+ * regions, but @memory_region_name is then supposed to return NULL for
+ * those indices.
+ *
+ * Used by the DRM core when queried by the DRM cgroup controller.
+ *
+ * All three callbacks of @num_memory_regions, @memory_region_name and
+ * @memory_stats need to be implemented for DRM cgroup memory stats
+ * support.
+ */
+ unsigned int (*num_memory_regions) (const struct drm_device *);
+
+ /**
+ * @memory_region_name:
+ *
+ * Optional callback reporting the name of the queried memory region.
+ *
+ * Can be NULL if the memory region index is not supported by the
+ * passed in device.
+ *
+ * Used by the DRM core when queried by the DRM cgroup controller.
+ *
+ * All three callbacks of @num_memory_regions, @memory_region_name and
+ * @memory_stats need to be implemented for DRM cgroup memory stats
+ * support.
+ */
+ const char * (*memory_region_name) (const struct drm_device *,
+ unsigned int index);
+
+ /**
+	 * @memory_stats:
+ *
+ * Optional callback adding to the passed in array of struct
+ * drm_memory_stats objects.
+ *
+ * Number of objects in the array is passed in the @num argument.
+ *
+ * Returns a bitmask of supported enum drm_gem_object_status by the
+ * driver instance.
+ *
+	 * Callback is only allowed to add to the existing fields and should
+ * never clear them.
+ *
+ * Used by the DRM core when queried by the DRM cgroup controller.
+ *
+ * All three callbacks of @num_memory_regions, @memory_region_name and
+ * @memory_stats need to be implemented for DRM cgroup memory stats
+ * support.
+ */
+ unsigned int (*memory_stats) (struct drm_file *,
+ struct drm_memory_stats *,
+ unsigned int num);
};

/**
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 7c20d4ebc634..22fc180dd659 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -5,6 +5,7 @@

#include <linux/cgroup.h>
#include <linux/cgroup_drm.h>
+#include <linux/device.h>
#include <linux/list.h>
#include <linux/moduleparam.h>
#include <linux/mutex.h>
@@ -12,6 +13,8 @@
#include <linux/slab.h>

#include <drm/drm_drv.h>
+#include <drm/drm_file.h>
+#include <drm/drm_gem.h>

struct drm_cgroup_state {
struct cgroup_subsys_state css;
@@ -133,6 +136,147 @@ drmcs_read_total_us(struct cgroup_subsys_state *css, struct cftype *cft)
return val;
}

+struct drmcs_stat {
+ const struct drm_device *dev;
+ const struct drm_cgroup_ops *cg_ops;
+ const char *device_name;
+ unsigned int regions;
+ enum drm_gem_object_status flags;
+ struct drm_memory_stats *stats;
+};
+
+static int drmcs_seq_show_memory(struct seq_file *sf, void *v)
+{
+ struct cgroup_subsys_state *node;
+ struct drmcs_stat *stats = NULL;
+ unsigned int num_devices, i;
+ int ret;
+
+ /*
+ * We could avoid taking the cgroup_lock and just walk the tree under
+ * RCU but then allocating temporary storage becomes a problem. So for
+ * now keep it simple and take the lock.
+ */
+ cgroup_lock();
+
+ /* Protect against client migrations and clients disappearing. */
+ ret = mutex_lock_interruptible(&drmcg_mutex);
+ if (ret) {
+ cgroup_unlock();
+ return ret;
+ }
+
+ num_devices = 0;
+ css_for_each_descendant_pre(node, seq_css(sf)) {
+ struct drm_cgroup_state *drmcs = css_to_drmcs(node);
+ struct drm_file *fpriv;
+
+ list_for_each_entry(fpriv, &drmcs->clients, clink) {
+ const struct drm_cgroup_ops *cg_ops =
+ fpriv->minor->dev->driver->cg_ops;
+ const char *device_name = dev_name(fpriv->minor->kdev);
+ struct drmcs_stat *stat;
+ unsigned int regions;
+
+ /* Does this driver supports memory stats? */
+ if (cg_ops &&
+ cg_ops->num_memory_regions &&
+ cg_ops->memory_region_name &&
+ cg_ops->memory_stats)
+ regions =
+ cg_ops->num_memory_regions(fpriv->minor->dev);
+ else
+ regions = 0;
+
+ if (!regions)
+ continue;
+
+ /* Have we seen this device before? */
+ stat = NULL;
+ for (i = 0; i < num_devices; i++) {
+ if (!strcmp(stats[i].device_name,
+ device_name)) {
+ stat = &stats[i];
+ break;
+ }
+ }
+
+ /* If not allocate space for it. */
+ if (!stat) {
+ stats = krealloc_array(stats, num_devices + 1,
+ sizeof(*stats),
+ GFP_USER);
+ if (!stats) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ stat = &stats[num_devices++];
+ stat->dev = fpriv->minor->dev;
+ stat->cg_ops = cg_ops;
+ stat->device_name = device_name;
+ stat->flags = 0;
+ stat->regions = regions;
+ stat->stats =
+ kcalloc(regions,
+ sizeof(struct drm_memory_stats),
+ GFP_USER);
+ if (!stat->stats) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ }
+
+ /* Accumulate the stats for this device+client. */
+ stat->flags |= cg_ops->memory_stats(fpriv,
+ stat->stats,
+ stat->regions);
+ }
+ }
+
+ for (i = 0; i < num_devices; i++) {
+ struct drmcs_stat *stat = &stats[i];
+ unsigned int j;
+
+ for (j = 0; j < stat->regions; j++) {
+ const char *name =
+ stat->cg_ops->memory_region_name(stat->dev, j);
+
+ if (!name)
+ continue;
+
+ seq_printf(sf,
+ "%s region=%s total=%llu shared=%llu active=%llu",
+ stat->device_name,
+ name,
+ stat->stats[j].private +
+ stat->stats[j].shared,
+ stat->stats[j].shared,
+ stat->stats[j].active);
+
+ if (stat->flags & DRM_GEM_OBJECT_RESIDENT)
+ seq_printf(sf, " resident=%llu",
+ stat->stats[j].resident);
+
+ if (stat->flags & DRM_GEM_OBJECT_PURGEABLE)
+ seq_printf(sf, " purgeable=%llu",
+ stat->stats[j].purgeable);
+
+ seq_puts(sf, "\n");
+ }
+ }
+
+out:
+ mutex_unlock(&drmcg_mutex);
+ cgroup_unlock();
+
+ for (i = 0; i < num_devices; i++)
+ kfree(stats[i].stats);
+ kfree(stats);
+
+ return ret;
+}
+
static bool __start_scanning(unsigned int period_us)
{
struct drm_cgroup_state *root = &root_drmcs.drmcs;
@@ -575,6 +719,11 @@ struct cftype files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = drmcs_read_total_us,
},
+ {
+ .name = "memory.stat",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = drmcs_seq_show_memory,
+ },
{ } /* Zero entry terminates. */
};

--
2.39.2
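
For context, the driver side of the above interface is just the three
callbacks. A minimal sketch for a hypothetical driver with two fixed memory
regions could look roughly as below - the callback signatures are inferred
from how kernel/cgroup/drm.c calls them in this patch, and everything
prefixed foo_, including the foo_client_add_memory_stats() helper, is
invented purely for illustration:

static const char * const foo_region_names[] = { "system", "vram0" };

static unsigned int foo_cg_num_memory_regions(struct drm_device *dev)
{
        return ARRAY_SIZE(foo_region_names);
}

static const char *foo_cg_memory_region_name(const struct drm_device *dev,
                                             unsigned int index)
{
        /* Out of range regions are simply skipped by the controller. */
        return index < ARRAY_SIZE(foo_region_names) ?
               foo_region_names[index] : NULL;
}

static unsigned int foo_cg_memory_stats(struct drm_file *file,
                                        struct drm_memory_stats *stats,
                                        unsigned int num)
{
        /* Hypothetical helper adding this client's usage per region. */
        foo_client_add_memory_stats(file->driver_priv, stats, num);

        /* Tell the controller which optional fields were filled in. */
        return DRM_GEM_OBJECT_RESIDENT | DRM_GEM_OBJECT_PURGEABLE;
}

/* Scheduling related ops from the earlier patches omitted here. */
static const struct drm_cgroup_ops foo_cg_ops = {
        .num_memory_regions     = foo_cg_num_memory_regions,
        .memory_region_name     = foo_cg_memory_region_name,
        .memory_stats           = foo_cg_memory_stats,
};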


2023-07-12 12:53:40

by Tvrtko Ursulin

[permalink] [raw]
Subject: [PATCH 15/17] cgroup/drm: Expose GPU utilisation

From: Tvrtko Ursulin <[email protected]>

To support container use cases where external orchestrators want to make
deployment and migration decisions based on GPU load and capacity, we can
expose the GPU load as seen by the controller in a new drm.active_us
field. This field contains a monotonic cumulative time cgroup has spent
executing GPU loads, as reported by the DRM drivers being used by group
members.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Eero Tamminen <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 3 +++
kernel/cgroup/drm.c | 26 ++++++++++++++++++++++++-
2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index da350858c59f..bbe986366f4a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2445,6 +2445,9 @@ will be respected.
DRM scheduling soft limits interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+ drm.active_us
+ GPU time used by the group recursively including all child groups.
+
drm.weight
Standard cgroup weight based control [1, 10000] used to configure the
relative distributing of GPU time between the sibling groups.
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index b244e3d828cc..7c20d4ebc634 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -25,6 +25,8 @@ struct drm_cgroup_state {
bool over;
bool over_budget;

+ u64 total_us;
+
u64 per_s_budget_us;
u64 prev_active_us;
u64 active_us;
@@ -117,6 +119,20 @@ drmcs_write_weight(struct cgroup_subsys_state *css, struct cftype *cftype,
return 0;
}

+static u64
+drmcs_read_total_us(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct drm_cgroup_state *drmcs = css_to_drmcs(css);
+ u64 val;
+
+ /* Mutex being overkill unless arch cannot atomically read u64.. */
+ mutex_lock(&drmcg_mutex);
+ val = drmcs->total_us;
+ mutex_unlock(&drmcg_mutex);
+
+ return val;
+}
+
static bool __start_scanning(unsigned int period_us)
{
struct drm_cgroup_state *root = &root_drmcs.drmcs;
@@ -169,11 +185,14 @@ static bool __start_scanning(unsigned int period_us)
parent = css_to_drmcs(node->parent);

active = drmcs_get_active_time_us(drmcs);
- if (period_us && active > drmcs->prev_active_us)
+ if (period_us && active > drmcs->prev_active_us) {
drmcs->active_us += active - drmcs->prev_active_us;
+ drmcs->total_us += drmcs->active_us;
+ }
drmcs->prev_active_us = active;

parent->active_us += drmcs->active_us;
+ parent->total_us += drmcs->active_us;
parent->sum_children_weights += drmcs->weight;

css_put(node);
@@ -551,6 +570,11 @@ struct cftype files[] = {
.read_u64 = drmcs_read_weight,
.write_u64 = drmcs_write_weight,
},
+ {
+ .name = "active_us",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = drmcs_read_total_us,
+ },
{ } /* Zero entry terminates. */
};

--
2.39.2


2023-07-19 21:53:14

by T.J. Mercier

[permalink] [raw]
Subject: Re: [RFC v5 00/17] DRM cgroup controller with scheduling control and memory stats

On Wed, Jul 12, 2023 at 4:47 AM Tvrtko Ursulin
<[email protected]> wrote:
>
> drm.memory.stat
> A nested file containing cumulative memory statistics for the whole
> sub-hierarchy, broken down into separate GPUs and separate memory
> regions supported by the latter.
>
> For example::
>
> $ cat drm.memory.stat
> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>
> Card designation corresponds to the DRM device names and multiple line
> entries can be present per card.
>
> Memory region names should be expected to be driver specific with the
> exception of 'system' which is standardised and applicable for GPUs
> which can operate on system memory buffers.
>
> Sub-keys 'resident' and 'purgeable' are optional.
>
> Per category region usage is reported in bytes.
>
> * Feedback from people interested in drm.active_us and drm.memory.stat is
> required to understand the use cases and their usefulness (of the fields).
>
> Memory stats are something which was easy to add to my series, since I was
> already working on the fdinfo memory stats patches, but the question is how
> useful it is.
>
Hi Tvrtko,

I think this style of driver-defined categories for reporting of
memory could potentially allow us to eliminate the GPU memory tracking
tracepoint used on Android (gpu_mem_total). This would involve reading
drm.memory.stat at the root cgroup (I see it's currently disabled on
the root), which means traversing the whole cgroup tree under the
cgroup lock to generate the values on-demand. This would be done
rarely, but I still wonder what the cost of that would turn out to be.
The drm_memory_stats categories in the output don't seem like a big
value-add for this use-case, but no real objection to them being
there. I know it's called the DRM cgroup controller, but it'd be nice
if there were a way to make the mem tracking part work for any driver
that wishes to participate as many of our devices don't use a DRM
driver. But making that work doesn't look like it would fit very
cleanly into this controller, so I'll just shut up now.

Thanks!
-T.J.

2023-07-20 11:17:52

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC v5 00/17] DRM cgroup controller with scheduling control and memory stats


Hi,

On 19/07/2023 21:31, T.J. Mercier wrote:
> On Wed, Jul 12, 2023 at 4:47 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>> drm.memory.stat
>> A nested file containing cumulative memory statistics for the whole
>> sub-hierarchy, broken down into separate GPUs and separate memory
>> regions supported by the latter.
>>
>> For example::
>>
>> $ cat drm.memory.stat
>> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>>
>> Card designation corresponds to the DRM device names and multiple line
>> entries can be present per card.
>>
>> Memory region names should be expected to be driver specific with the
>> exception of 'system' which is standardised and applicable for GPUs
>> which can operate on system memory buffers.
>>
>> Sub-keys 'resident' and 'purgeable' are optional.
>>
>> Per category region usage is reported in bytes.
>>
>> * Feedback from people interested in drm.active_us and drm.memory.stat is
>> required to understand the use cases and their usefulness (of the fields).
>>
>> Memory stats are something which was easy to add to my series, since I was
>> already working on the fdinfo memory stats patches, but the question is how
>> useful it is.
>>
> Hi Tvrtko,
>
> I think this style of driver-defined categories for reporting of
> memory could potentially allow us to eliminate the GPU memory tracking
> tracepoint used on Android (gpu_mem_total). This would involve reading
> drm.memory.stat at the root cgroup (I see it's currently disabled on

I can make it available under root too, I don't think there is any
technical reason to not have it. In fact, now that I look at it again,
memory.stat is present on root so that would align with my general
guideline to keep the two as similar as possible.

> the root), which means traversing the whole cgroup tree under the
> cgroup lock to generate the values on-demand. This would be done
> rarely, but I still wonder what the cost of that would turn out to be.

Yeah that's ugly. I could eliminate cgroup_lock by being a bit smarter.
Just didn't think it worth it for the RFC.

Basically, to account memory stats for any sub-tree I need the equivalent
of one struct drm_memory_stats per DRM device present in the hierarchy. So
I could pre-allocate a few and restart if we run out of spares, or
something. They are really small, so pre-allocating a good number, based
on past state or something, should work well enough. Or even pre-allocating
for the total number of DRM devices in the system, as a pessimistic and
safe option for most reasonable deployments.
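
For illustration, the shape of it could be something like the below -
purely a sketch, where drmcs_collect_under_rcu() is a made up stand-in for
the per client accumulation loop from the patch and the per region arrays
would be pre-allocated the same way:

static int drmcs_show_memory_prealloc(struct seq_file *sf)
{
        unsigned int guess = 4; /* e.g. device count seen on the last read */
        struct drmcs_stat *stats;
        unsigned int needed;

        for (;;) {
                stats = kcalloc(guess, sizeof(*stats), GFP_KERNEL);
                if (!stats)
                        return -ENOMEM;

                rcu_read_lock();
                needed = drmcs_collect_under_rcu(seq_css(sf), stats, guess);
                rcu_read_unlock();

                if (needed <= guess)
                        break;

                /* Guessed too small - free, enlarge and walk again. */
                kfree(stats);
                guess = needed;
        }

        /* ... print and free as in the current patch ... */
        kfree(stats);
        return 0;
}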

> The drm_memory_stats categories in the output don't seem like a big
> value-add for this use-case, but no real objection to them being

You mean the fact there are different categories is not a value add for
your use case because you would only use one?

The idea was to align 1:1 with DRM memory stats fdinfo and somewhat
emulate how memory.stat also offers a breakdown.

> there. I know it's called the DRM cgroup controller, but it'd be nice
> if there were a way to make the mem tracking part work for any driver
> that wishes to participate as many of our devices don't use a DRM
> driver. But making that work doesn't look like it would fit very

Ah that would be a challenge indeed to which I don't have any answers
right now.

Hm, if you have a DRM device somewhere in the chain, memory stats would
still show up. Like if you had a dma-buf producer which is not a DRM
driver, but then that buffer was imported by a DRM driver, it would show
up in a cgroup. Or vice-versa. But if there aren't any in the whole
chain then it would not.

> cleanly into this controller, so I'll just shut up now.

Not at all, good feedback!

Regards,

Tvrtko

2023-07-20 18:43:19

by T.J. Mercier

[permalink] [raw]
Subject: Re: [RFC v5 00/17] DRM cgroup controller with scheduling control and memory stats

On Thu, Jul 20, 2023 at 3:55 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> Hi,
>
> On 19/07/2023 21:31, T.J. Mercier wrote:
> > On Wed, Jul 12, 2023 at 4:47 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >> drm.memory.stat
> >> A nested file containing cumulative memory statistics for the whole
> >> sub-hierarchy, broken down into separate GPUs and separate memory
> >> regions supported by the latter.
> >>
> >> For example::
> >>
> >> $ cat drm.memory.stat
> >> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> >> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
> >>
> >> Card designation corresponds to the DRM device names and multiple line
> >> entries can be present per card.
> >>
> >> Memory region names should be expected to be driver specific with the
> >> exception of 'system' which is standardised and applicable for GPUs
> >> which can operate on system memory buffers.
> >>
> >> Sub-keys 'resident' and 'purgeable' are optional.
> >>
> >> Per category region usage is reported in bytes.
> >>
> >> * Feedback from people interested in drm.active_us and drm.memory.stat is
> >> required to understand the use cases and their usefulness (of the fields).
> >>
> >> Memory stats are something which was easy to add to my series, since I was
> >> already working on the fdinfo memory stats patches, but the question is how
> >> useful it is.
> >>
> > Hi Tvrtko,
> >
> > I think this style of driver-defined categories for reporting of
> > memory could potentially allow us to eliminate the GPU memory tracking
> > tracepoint used on Android (gpu_mem_total). This would involve reading
> > drm.memory.stat at the root cgroup (I see it's currently disabled on
>
> I can put it available under root too, don't think there is any
> technical reason to not have it. In fact, now that I look at it again,
> memory.stat is present on root so that would align with my general
> guideline to keep the two as similar as possible.
>
> > the root), which means traversing the whole cgroup tree under the
> > cgroup lock to generate the values on-demand. This would be done
> > rarely, but I still wonder what the cost of that would turn out to be.
>
> Yeah that's ugly. I could eliminate cgroup_lock by being a bit smarter.
> Just didn't think it worth it for the RFC.
>
> Basically to account memory stats for any sub-tree I need the equivalent
> one struct drm_memory_stats per DRM device present in the hierarchy. So
> I could pre-allocate a few and restart if run out of spares, or
> something. They are really small so pre-allocating a good number, based
> on past state or something, should would good enough. Or even total
> number of DRM devices in a system as a pessimistic and safe option for
> most reasonable deployments.
>
> > The drm_memory_stats categories in the output don't seem like a big
> > value-add for this use-case, but no real objection to them being
>
> You mean the fact there are different categories is not a value add for
> your use case because you would only use one?
>
Exactly, I guess that'd be just "private" (or pick another one) for
the driver-defined "regions" where
shared/private/resident/purgeable/active aren't really applicable.
That doesn't seem like a big problem to me since you already need an
understanding of what a driver-defined region means. It's just adding
a requirement to understand what fields are used, and a driver can
document that in the same place as the region itself. That does mean
performing arithmetic on values from different drivers might not make
sense. But this is just my perspective from trying to fit the
gpu_mem_total tracepoint here. I think we could probably change the
way drivers that use it report memory to fit closer into the
drm_memory_stats categories.

> The idea was to align 1:1 with DRM memory stats fdinfo and somewhat
> emulate how memory.stat also offers a breakdown.
>
> > there. I know it's called the DRM cgroup controller, but it'd be nice
> > if there were a way to make the mem tracking part work for any driver
> > that wishes to participate as many of our devices don't use a DRM
> > driver. But making that work doesn't look like it would fit very
>
> Ah that would be a challenge indeed to which I don't have any answers
> right now.
>
> Hm if you have a DRM device somewhere in the chain memory stats would
> still show up. Like if you had a dma-buf producer which is not a DRM
> driver, but then that buffer was imported by a DRM driver, it would show
> up in a cgroup. Or vice-versa. But if there aren't any in the whole
> chain then it would not.
>
Creating a dummy DRM driver underneath an existing driver as an
adaptation layer also came to mind, but yeah... probably not. :)

By the way I still want to try to add tracking for dma-bufs backed by
system memory to memcg, but I'm trying to get memcg v2 up and running
for us first. I don't think that should conflict with the tracking
here.

> > cleanly into this controller, so I'll just shut up now.
>
> Not all all, good feedback!
>
> Regards,
>
> Tvrtko

2023-07-21 22:29:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats

On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
> $ cat drm.memory.stat
> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>
> Data is generated on demand for simplicity of implementation, ie. no running
> totals are kept or accounted during migrations and such. Various
> optimisations such as cheaper collection of data are possible but
> deliberately left out for now.
>
> Overall, the feature is deemed to be useful to container orchestration
> software (and manual management).
>
> Limits, either soft or hard, are not envisaged to be implemented on top of
> this approach due to the on demand nature of collecting the stats.

So, yeah, if you want to add memory controls, we better think through how
the fd ownership migration should work.

Thanks.

--
tejun

2023-07-21 22:34:12

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 15/17] cgroup/drm: Expose GPU utilisation

On Wed, Jul 12, 2023 at 12:46:03PM +0100, Tvrtko Ursulin wrote:
> + drm.active_us
> + GPU time used by the group recursively including all child groups.

Maybe instead add drm.stat and have "usage_usec" inside? That'd be more
consistent with cpu side.

Thanks.

--
tejun

2023-07-21 22:37:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 08/17] drm/cgroup: Track DRM clients per cgroup

Hello,

On Wed, Jul 12, 2023 at 12:45:56PM +0100, Tvrtko Ursulin wrote:
> +void drmcgroup_client_migrate(struct drm_file *file_priv)
> +{
> + struct drm_cgroup_state *src, *dst;
> + struct cgroup_subsys_state *old;
> +
> + mutex_lock(&drmcg_mutex);
> +
> + old = file_priv->__css;
> + src = css_to_drmcs(old);
> + dst = css_to_drmcs(task_get_css(current, drm_cgrp_id));
> +
> + if (src != dst) {
> + file_priv->__css = &dst->css; /* Keeps the reference. */
> + list_move_tail(&file_priv->clink, &dst->clients);
> + }
> +
> + mutex_unlock(&drmcg_mutex);
> +
> + css_put(old);
> +}
> +EXPORT_SYMBOL_GPL(drmcgroup_client_migrate);

So, you're implicitly migrating the fd to the latest ioctl user on the first
access. While this may work for state-less control like usage time, it
will likely cause problems if you later want to add stateful control for
memory consumption. e.g. What happens if the new destination doesn't have
enough budget to accommodate the fd's usage? Let's say we allow over-commit
with follow-up reclaim or oom kill actions; if so, can we guarantee that all
memory ownership can always be updated? If not, what happens after
failure?

If DRM needs to transfer fd ownership with resources attached to it, it
likely would be a good idea to make that an explicit operation so that the
attempt can be verified and failed if necessary.

Thanks.

--
tejun

2023-07-21 22:44:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 15/17] cgroup/drm: Expose GPU utilisation

On Fri, Jul 21, 2023 at 12:19:32PM -1000, Tejun Heo wrote:
> On Wed, Jul 12, 2023 at 12:46:03PM +0100, Tvrtko Ursulin wrote:
> > + drm.active_us
> > + GPU time used by the group recursively including all child groups.
>
> Maybe instead add drm.stat and have "usage_usec" inside? That'd be more
> consistent with cpu side.

Also, shouldn't this be keyed by the drm device?

--
tejun

2023-07-25 14:25:19

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH 15/17] cgroup/drm: Expose GPU utilisation


On 21/07/2023 23:20, Tejun Heo wrote:
> On Fri, Jul 21, 2023 at 12:19:32PM -1000, Tejun Heo wrote:
>> On Wed, Jul 12, 2023 at 12:46:03PM +0100, Tvrtko Ursulin wrote:
>>> + drm.active_us
>>> + GPU time used by the group recursively including all child groups.
>>
>> Maybe instead add drm.stat and have "usage_usec" inside? That'd be more
>> consistent with cpu side.

Could be, but no strong opinion from my side either way. Perhaps it boils down to what could be put in the file, I mean to decide whether keyed format makes sense or not.

> Also, shouldn't this be keyed by the drm device?

It could have that too, or it could come later. The fun with GPUs is that it could be keyed not only by the device, but also by the type of the GPU engine. (Which are a) vendor specific and b) some are fully independent, some partially so, and some not at all - so it could get complicated semantics wise really fast.)

If for now I'd go with drm.stat/usage_usec containing the total time spent, how would you suggest adding per device granularity? Files as documented are either flat keyed or nested keyed, not both at the same time. So something like:

usage_usec 100000
card0 usage_usec 50000
card1 usage_usec 50000

Would or would not fly? Have two files along the lines of drm.stat and drm.dev_stat?

While on this general topic, you will notice that for memory stats I have _sort of_ nested keyed per device format, for example on integrated Intel GPU:

$ cat drm.memory.stat
card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

On a discrete Intel GPU two more lines would appear, with the local and local-system memory regions. And on some server class multi-tile GPUs there would be even more regions, with more than one device local memory region. Users do want to see this granularity for container use cases at least.

Anyway, this may not be compatible with the nested key format as documented in cgroup-v2.rst, although it does not explicitly say.

Should I cheat and create key names based on device and memory region name and let userspace parse it? Like:

$ cat drm.memory.stat
card0.system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0.stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Regards,

Tvrtko

2023-07-25 22:16:39

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 15/17] cgroup/drm: Expose GPU utilisation

Hello,

On Tue, Jul 25, 2023 at 03:08:40PM +0100, Tvrtko Ursulin wrote:
> > Also, shouldn't this be keyed by the drm device?
>
> It could have that too, or it could come later. Fun with GPUs that it not
> only could be keyed by the device, but also by the type of the GPU engine.
> (Which are a) vendor specific and b) some are fully independent, some
> partially so, and some not at all - so it could get complicated semantics
> wise really fast.)

I see.

> If for now I'd go with drm.stat/usage_usec containing the total time spent
> how would you suggest adding per device granularity? Files as documented
> are either flat or nested, not both at the same time. So something like:
>
> usage_usec 100000
> card0 usage_usec 50000
> card1 usage_usec 50000
>
> Would or would not fly? Have two files along the lines of drm.stat and drm.dev_stat?

Please follow one of the pre-defined formats. If you want to have card
identifier and field key, it should be a nested keyed file like io.stat.

> While on this general topic, you will notice that for memory stats I have
> _sort of_ nested keyed per device format, for example on integrated Intel
> GPU:
>
> $ cat drm.memory.stat
> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>
> If on a discrete Intel GPU two more lines would appear with memory
> regions of local and local-system. But then on some server class
> multi-tile GPUs even further regions with more than one device local
> memory region. And users do want to see this granularity for container use
> cases at least.
>
> Anyway, this may not be compatible with the nested key format as
> documented in cgroup-v2.rst, although it does not explicitly say.
>
> Should I cheat and create key names based on device and memory region name
> and let userspace parse it? Like:
>
> $ cat drm.memory.stat
> card0.system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
> card0.stolen-system total=0 shared=0 active=0 resident=0 purgeable=0

Yeah, this looks better to me. If they're reporting different values for the
same fields, they're separate keys.

Thanks.

--
tejun

2023-07-26 10:29:23

by Maarten Lankhorst

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats

Hey,

On 2023-07-22 00:21, Tejun Heo wrote:
> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>> $ cat drm.memory.stat
>> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>>
>> Data is generated on demand for simplicty of implementation ie. no running
>> totals are kept or accounted during migrations and such. Various
>> optimisations such as cheaper collection of data are possible but
>> deliberately left out for now.
>>
>> Overall, the feature is deemed to be useful to container orchestration
>> software (and manual management).
>>
>> Limits, either soft or hard, are not envisaged to be implemented on top of
>> this approach due on demand nature of collecting the stats.
>
> So, yeah, if you want to add memory controls, we better think through how
> the fd ownership migration should work.
I've taken a look at the series, since I have been working on cgroup
memory eviction.

The scheduling stuff will work for i915, since it has a purely software
execlist scheduler, but I don't think it will work for GuC (firmware)
scheduling or other drivers that use the generic drm scheduler.

For something like this, you would probably want it to work inside the
drm scheduler first. Presumably, this can be done by setting a weight on
each runqueue, and perhaps adding a callback to update one for a running
queue. Calculating the weights hierarchically might be fun..

I have taken a look at how the rest of cgroup controllers change
ownership when moved to a different cgroup, and the answer was: not at
all. If we attempt to create the scheduler controls only on the first
time the fd is used, you could probably get rid of all the tracking.
This can be done very easily with the drm scheduler.

WRT memory, I think the consensus is to track system memory like normal
memory. Stolen memory doesn't need to be tracked. It's kernel only
memory, used for internal bookkeeping only.

The only time userspace can directly manipulate stolen memory, is by
mapping the pinned initial framebuffer to its own address space. The
only allocation it can do is when a framebuffer is displayed, and
framebuffer compression creates some stolen memory. Userspace is not
aware of this though, and has no way to manipulate those contents.

Cheers,
~Maarten

2023-07-26 13:44:18

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats


On 26/07/2023 11:14, Maarten Lankhorst wrote:
> Hey,
>
> On 2023-07-22 00:21, Tejun Heo wrote:
>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>>    $ cat drm.memory.stat
>>>    card0 region=system total=12898304 shared=0 active=0
>>> resident=12111872 purgeable=167936
>>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0
>>> purgeable=0
>>>
>>> Data is generated on demand for simplicty of implementation ie. no
>>> running
>>> totals are kept or accounted during migrations and such. Various
>>> optimisations such as cheaper collection of data are possible but
>>> deliberately left out for now.
>>>
>>> Overall, the feature is deemed to be useful to container orchestration
>>> software (and manual management).
>>>
>>> Limits, either soft or hard, are not envisaged to be implemented on
>>> top of
>>> this approach due on demand nature of collecting the stats.
>>
>> So, yeah, if you want to add memory controls, we better think through how
>> the fd ownership migration should work.
> I've taken a look at the series, since I have been working on cgroup
> memory eviction.
>
> The scheduling stuff will work for i915, since it has a purely software
> execlist scheduler, but I don't think it will work for GuC (firmware)
> scheduling or other drivers that use the generic drm scheduler.

It actually works - I used to have a blurb in the cover letter about it
but apparently I dropped it. Just a bit less well with many clients,
since there are fewer priority levels.

All that the design requires from the individual drivers is some way to
react to the "you are over budget by this much" signal. The rest is
driver and backend specific.

> For something like this,  you would probably want it to work inside the
> drm scheduler first. Presumably, this can be done by setting a weight on
> each runqueue, and perhaps adding a callback to update one for a running
> queue. Calculating the weights hierarchically might be fun..

It is not needed to work in drm scheduler first. In fact drm scheduler
based drivers can plug into what I have since it already has the notion
of scheduling priorities.

They would only need to implement a hook which allows the cgroup
controller to query client GPU utilisation and another to receive the
over budget signal.
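
Roughly, and only to illustrate the shape of it (the names and signatures
below are made up, not the exact RFC interface):

struct drm_cgroup_sched_hooks_example {
        /* Cumulative GPU time, in micro-seconds, used by this client. */
        u64 (*active_time_us)(struct drm_file *file);

        /*
         * Called when the client's cgroup goes over, or drops back under,
         * its budget; the driver reacts in a best-effort manner, for
         * instance by adjusting the scheduling priority of the client.
         */
        void (*budget_signal)(struct drm_file *file, u64 used_us,
                              u64 budget_us);
};

For a drm scheduler based driver the budget signal could simply map onto
moving the client's entities to a lower or higher priority run queue, which
is in spirit what the i915 patches in this series do with their own
scheduler.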

Amdgpu and msm AFAIK could be easy candidates because they both support
per client utilisation and priorities.

Looks like I need to put all this info back into the cover letter.

Also, hierarchic weights and time budgets are all already there. What
could be done later is make this all smarter and respect the time budget
with more precision. That would however, in many cases including Intel,
require co-operation with the firmware. In any case it is only work in
the implementation, while the cgroup control interface remains the same.

> I have taken a look at how the rest of cgroup controllers change
> ownership when moved to a different cgroup, and the answer was: not at
> all. If we attempt to create the scheduler controls only on the first
> time the fd is used, you could probably get rid of all the tracking.

Can you send a CPU file descriptor from process A to process B and have
CPU usage belonging to process B show up in process A's cgroup, or
vice-versa? Nope, I am not making any sense, am I? My point being it is
not like-to-like, model is different.

No ownership transfer would mean in wide deployments all GPU utilisation
would be assigned to Xorg and so there is no point to any of this. No
way to throttle a cgroup with un-important GPU clients for instance.

> This can be done very easily with the drm scheduler.
>
> WRT memory, I think the consensus is to track system memory like normal
> memory. Stolen memory doesn't need to be tracked. It's kernel only
> memory, used for internal bookkeeping  only.
>
> The only time userspace can directly manipulate stolen memory, is by
> mapping the pinned initial framebuffer to its own address space. The
> only allocation it can do is when a framebuffer is displayed, and
> framebuffer compression creates some stolen memory. Userspace is not
> aware of this though, and has no way to manipulate those contents.

Stolen memory is irrelevant and not something the cgroup controller knows
about. The point is drivers say which memory regions they have and their
utilisation.

Imagine instead of stolen it said vram0, or on Intel multi-tile it shows
local0 and local1. People working with containers are interested to see
this breakdown. I guess the parallel and use case here is closer to
memory.numa_stat.

Regards,

Tvrtko

2023-07-26 17:13:37

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats


On 21/07/2023 23:21, Tejun Heo wrote:
> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>> $ cat drm.memory.stat
>> card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
>> card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
>>
>> Data is generated on demand for simplicty of implementation ie. no running
>> totals are kept or accounted during migrations and such. Various
>> optimisations such as cheaper collection of data are possible but
>> deliberately left out for now.
>>
>> Overall, the feature is deemed to be useful to container orchestration
>> software (and manual management).
>>
>> Limits, either soft or hard, are not envisaged to be implemented on top of
>> this approach due on demand nature of collecting the stats.
>
> So, yeah, if you want to add memory controls, we better think through how
> the fd ownership migration should work.

It would be quite easy to make the implicit migration fail - it is just a
matter of failing the first ioctl, which is what triggers the migration,
after the file descriptor is accessed by a new owner.

But I don't think I can really add that in the RFC given I have no hard
controls or anything like that.

With GPU usage throttling it doesn't really apply, at least I don't
think it does, since even when migrated to a lower budget group it would
just get immediately de-prioritized.

I don't think hard GPU time limits are feasible in general, and while
soft might be, again I don't see that any limiting would necessarily
have to run immediately on implicit migration.

Second part of the story are hypothetical/future memory controls.

I think the first thing to say is that implicit migration is important, but
it is not really established practice to use the file descriptor from two
places or to migrate more than once. It is simply a fresh fd which gets sent
to clients from Xorg, which is one of the legacy ways of doing things.

So we probably can just ignore that given no significant amount of
memory ownership would be getting migrated.

And for drm.memory.stat I think what I have is good enough - both
private and shared data get accounted, for any clients that have handles
to particular buffers.

Maarten was working on memory controls so maybe he would have more
thoughts on memory ownership and implicit migration.

But I don't think there is anything incompatible with that and
drm.memory.stats as proposed here, given how the categories reported are
the established ones from the DRM fdinfo spec, and it is fact of the
matter that we can have multiple memory regions per driver.

The main thing that would change between this RFC and future memory
controls in the area of drm.memory.stat is the implementation - it would
have to get changed under the hood from "collect on query" to "account
at allocation/free/etc". But that is just implementation details.

Regards,

Tvrtko

2023-07-26 20:31:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats

Hello,

On Wed, Jul 26, 2023 at 05:44:28PM +0100, Tvrtko Ursulin wrote:
...
> > So, yeah, if you want to add memory controls, we better think through how
> > the fd ownership migration should work.
>
> It would be quite easy to make the implicit migration fail - just the matter
> of failing the first ioctl, which is what triggers the migration, after the
> file descriptor access from a new owner.

So, it'd be best if there's no migration involved at all as per the
discussion with Maarten.

> But I don't think I can really add that in the RFC given I have no hard
> controls or anything like that.
>
> With GPU usage throttling it doesn't really apply, at least I don't think it
> does, since even when migrated to a lower budget group it would just get
> immediately de-prioritized.
>
> I don't think hard GPU time limits are feasible in general, and while soft
> might be, again I don't see that any limiting would necessarily have to run
> immediately on implicit migration.

Yeah, I wouldn't worry about hard allocation of GPU time. CPU RT control
does that but it's barely used.

> Second part of the story are hypothetical/future memory controls.
>
> I think first thing to say is that implicit migration is important, but it
> is not really established to use the file descriptor from two places or to
> migrate more than once. It is simply fresh fd which gets sent to clients
> from Xorg, which is one of the legacy ways of doing things.
>
> So we probably can just ignore that given no significant amount of memory
> ownership would be getting migrated.

So, if this is the case, it'd be better to clarify this. ie. if the summary is:

fd gets assigned to the user with a certain action at which point the fd
doesn't have significant resources attached to it and the fd can't be moved
to some other cgroup afterwards.

then, everything is pretty simple. No need to worry about migration at all.
fd just gets assigned once at the beginning and everything gets accounted
towards that afterwards.

> And for drm.memory.stat I think what I have is good enough - both private
> and shared data get accounted, for any clients that have handles to
> particular buffers.
>
> Maarten was working on memory controls so maybe he would have more thoughts
> on memory ownership and implicit migration.
>
> But I don't think there is anything incompatible with that and
> drm.memory.stats as proposed here, given how the categories reported are the
> established ones from the DRM fdinfo spec, and it is fact of the matter that
> we can have multiple memory regions per driver.
>
> The main thing that would change between this RFC and future memory controls
> in the area of drm.memory.stat is the implementation - it would have to get
> changed under the hood from "collect on query" to "account at
> allocation/free/etc". But that is just implementation details.

I'd much prefer to straighten this out before adding a preliminary stat-only
thing. If the previously described ownership model is sufficient, none of
this is complicated, right? We can just add counters to track the resources
and print them out.

Thanks.

--
tejun

2023-07-26 20:42:28

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats

Hello,

On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote:
> > So, yeah, if you want to add memory controls, we better think through how
> > the fd ownership migration should work.
>
> I've taken a look at the series, since I have been working on cgroup memory
> eviction.
>
> The scheduling stuff will work for i915, since it has a purely software
> execlist scheduler, but I don't think it will work for GuC (firmware)
> scheduling or other drivers that use the generic drm scheduler.
>
> For something like this, you would probably want it to work inside the drm
> scheduler first. Presumably, this can be done by setting a weight on each
> runqueue, and perhaps adding a callback to update one for a running queue.
> Calculating the weights hierarchically might be fun..

I don't have any idea on this front. The basic idea of making high level
distribution decisions in core code and letting individual drivers enforce
that in a way which fits them the best makes sense to me but I don't know
enough to have an opinion here.

> I have taken a look at how the rest of cgroup controllers change ownership
> when moved to a different cgroup, and the answer was: not at all. If we

For persistent resources, that's the general rule. Whoever instantiates a
resource gets to own it until the resource gets freed. There is an exception
with the pid controller and there are discussions around whether we want
some sort of migration behavior with memcg but yes by and large instantiator
being the owner is the general model cgroup follows.

> attempt to create the scheduler controls only on the first time the fd is
> used, you could probably get rid of all the tracking.
> This can be done very easily with the drm scheduler.
>
> WRT memory, I think the consensus is to track system memory like normal
> memory. Stolen memory doesn't need to be tracked. It's kernel only memory,
> used for internal bookkeeping only.
>
> The only time userspace can directly manipulate stolen memory, is by mapping
> the pinned initial framebuffer to its own address space. The only allocation
> it can do is when a framebuffer is displayed, and framebuffer compression
> creates some stolen memory. Userspace is not
> aware of this though, and has no way to manipulate those contents.

So, my dumb understanding:

* Ownership of an fd can be established on the first ioctl call and doesn't
need to be migrated afterwards. There are no persistent resources to
migrate at the time of the first call.

* Memory then can be tracked in a similar way to memcg. Memory gets charged
to the initial instantiator and doesn't need to be moved around
afterwards. There may be some discrepancies around stolen memory but the
magnitude of inaccuracy introduced that way is limited and bound and can
be safely ignored.

Is that correct?

Thanks.

--
tejun

2023-07-27 13:10:22

by Maarten Lankhorst

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats

Hey,

On 2023-07-26 13:41, Tvrtko Ursulin wrote:
>
> On 26/07/2023 11:14, Maarten Lankhorst wrote:
>> Hey,
>>
>> On 2023-07-22 00:21, Tejun Heo wrote:
>>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>>>    $ cat drm.memory.stat
>>>>    card0 region=system total=12898304 shared=0 active=0
>>>> resident=12111872 purgeable=167936
>>>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0
>>>> purgeable=0
>>>>
>>>> Data is generated on demand for simplicty of implementation ie. no
>>>> running
>>>> totals are kept or accounted during migrations and such. Various
>>>> optimisations such as cheaper collection of data are possible but
>>>> deliberately left out for now.
>>>>
>>>> Overall, the feature is deemed to be useful to container orchestration
>>>> software (and manual management).
>>>>
>>>> Limits, either soft or hard, are not envisaged to be implemented on
>>>> top of
>>>> this approach due on demand nature of collecting the stats.
>>>
>>> So, yeah, if you want to add memory controls, we better think through
>>> how
>>> the fd ownership migration should work.
>> I've taken a look at the series, since I have been working on cgroup
>> memory eviction.
>>
>> The scheduling stuff will work for i915, since it has a purely
>> software execlist scheduler, but I don't think it will work for GuC
>> (firmware) scheduling or other drivers that use the generic drm
>> scheduler.
>
> It actually works - I used to have a blurb in the cover letter about it
> but apparently I dropped it. Just a bit less well with many clients,
> since there are fewer priority levels.
>
> All that the design requires from the invididual drivers is some way to
> react to the "you are over budget by this much" signal. The rest is
> driver and backend specific.

What I mean is that this signal may not be applicable since the drm
scheduler just schedules jobs that run. Adding a weight might be done in
hardware, since it's responsible for scheduling which context gets to
run. The over budget signal is useless in that case, and you just need
to set a scheduling priority for the hardware instead.

>> For something like this,  you would probably want it to work inside
>> the drm scheduler first. Presumably, this can be done by setting a
>> weight on each runqueue, and perhaps adding a callback to update one
>> for a running queue. Calculating the weights hierarchically might be
>> fun..
>
> It is not needed to work in drm scheduler first. In fact drm scheduler
> based drivers can plug into what I have since it already has the notion
> of scheduling priorities.
>
> They would only need to implement a hook which allow the cgroup
> controller to query client GPU utilisation and another to received the
> over budget signal.
>
> Amdgpu and msm AFAIK could be easy candidates because they both support
> per client utilisation and priorities.
>
> Looks like I need to put all this info back into the cover letter.
>
> Also, hierarchic weights and time budgets are all already there. What
> could be done later is make this all smarter and respect the time budget
> with more precision. That would however, in many cases including Intel,
> require co-operation with the firmware. In any case it is only work in
> the implementation, while the cgroup control interface remains the same.
>
>> I have taken a look at how the rest of cgroup controllers change
>> ownership when moved to a different cgroup, and the answer was: not at
>> all. If we attempt to create the scheduler controls only on the first
>> time the fd is used, you could probably get rid of all the tracking.
>
> Can you send a CPU file descriptor from process A to process B and have
> CPU usage belonging to process B show up in process' A cgroup, or
> vice-versa? Nope, I am not making any sense, am I? My point being it is
> not like-to-like, model is different.
>
> No ownership transfer would mean in wide deployments all GPU utilisation
> would be assigned to Xorg and so there is no point to any of this. No
> way to throttle a cgroup with un-important GPU clients for instance.
If you just grab the current process' cgroup when a drm_sched_entity is
created, you don't have everything charged to X.org. No need for
complicated ownership tracking in drm_file. The equivalent should be done
in i915 as well when a context is created, as it's not using the
drm scheduler.
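
Hand-wavy sketch of what I mean (the cg_css member on the entity is
hypothetical; task_get_css()/css_put() are the same calls the RFC already
uses for drm_file):

/* At entity creation - pin the creating task's DRM cgroup. */
static void example_entity_bind_cgroup(struct drm_sched_entity *entity)
{
        entity->cg_css = task_get_css(current, drm_cgrp_id);
}

/* At entity teardown - drop the reference again. */
static void example_entity_unbind_cgroup(struct drm_sched_entity *entity)
{
        css_put(entity->cg_css);
}

All GPU time consumed by jobs submitted through the entity would then be
charged to the cgroup pinned at creation time, regardless of which process
ends up holding the fd.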

>> This can be done very easily with the drm scheduler.
>>
>> WRT memory, I think the consensus is to track system memory like
>> normal memory. Stolen memory doesn't need to be tracked. It's kernel
>> only memory, used for internal bookkeeping  only.
>>
>> The only time userspace can directly manipulate stolen memory, is by
>> mapping the pinned initial framebuffer to its own address space. The
>> only allocation it can do is when a framebuffer is displayed, and
>> framebuffer compression creates some stolen memory. Userspace is not
>> aware of this though, and has no way to manipulate those contents.
>
> Stolen memory is irrelevant and not something cgroup controller knows
> about. Point is drivers say which memory regions they have and their
> utilisation.
>
> Imagine instead of stolen it said vram0, or on Intel multi-tile it shows
> local0 and local1. People working with containers are interested to see
> this breakdown. I guess the parallel and use case here is closer to
> memory.numa_stat.
Correct, but for the same reason, I think it might be more useful to
split up the weight too.

A single scheduling weight for the global GPU might be less useful than
per engine, or per tile perhaps..

Cheers,
~Maarten

2023-07-27 14:07:09

by Maarten Lankhorst

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats

Hey,

On 2023-07-26 21:44, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote:
>>> So, yeah, if you want to add memory controls, we better think through how
>>> the fd ownership migration should work.
>>
>> I've taken a look at the series, since I have been working on cgroup memory
>> eviction.
>>
>> The scheduling stuff will work for i915, since it has a purely software
>> execlist scheduler, but I don't think it will work for GuC (firmware)
>> scheduling or other drivers that use the generic drm scheduler.
>>
>> For something like this, you would probably want it to work inside the drm
>> scheduler first. Presumably, this can be done by setting a weight on each
>> runqueue, and perhaps adding a callback to update one for a running queue.
>> Calculating the weights hierarchically might be fun..
>
> I don't have any idea on this front. The basic idea of making high level
> distribution decisions in core code and letting individual drivers enforce
> that in a way which fits them the best makes sense to me but I don't know
> enough to have an opinion here.
>
>> I have taken a look at how the rest of cgroup controllers change ownership
>> when moved to a different cgroup, and the answer was: not at all. If we
>
> For persistent resources, that's the general rule. Whoever instantiates a
> resource gets to own it until the resource gets freed. There is an exception
> with the pid controller and there are discussions around whether we want
> some sort of migration behavior with memcg but yes by and large instantiator
> being the owner is the general model cgroup follows.
>
>> attempt to create the scheduler controls only on the first time the fd is
>> used, you could probably get rid of all the tracking.
>> This can be done very easily with the drm scheduler.
>>
>> WRT memory, I think the consensus is to track system memory like normal
>> memory. Stolen memory doesn't need to be tracked. It's kernel only memory,
>> used for internal bookkeeping only.
>>
>> The only time userspace can directly manipulate stolen memory, is by mapping
>> the pinned initial framebuffer to its own address space. The only allocation
>> it can do is when a framebuffer is displayed, and framebuffer compression
>> creates some stolen memory. Userspace is not
>> aware of this though, and has no way to manipulate those contents.
>
> So, my dumb understanding:
>
> * Ownership of an fd can be established on the first ioctl call and doesn't
> need to be migrated afterwards. There are no persistent resources to
> migration on the first call.
>
> * Memory then can be tracked in a similar way to memcg. Memory gets charged
> to the initial instantiator and doesn't need to be moved around
> afterwards. There may be some discrepancies around stolen memory but the
> magnitude of inaccuracy introduced that way is limited and bound and can
> be safely ignored.
>
> Is that correct?

Hey,

Yeah mostly, I think we can stop tracking stolen memory. I stopped doing
that for Xe, there is literally nothing to control for userspace in there.

Cheers,
Maarten

2023-07-27 17:13:32

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats


On 27/07/2023 12:54, Maarten Lankhorst wrote:
> Hey,
>
> On 2023-07-26 13:41, Tvrtko Ursulin wrote:
>>
>> On 26/07/2023 11:14, Maarten Lankhorst wrote:
>>> Hey,
>>>
>>> On 2023-07-22 00:21, Tejun Heo wrote:
>>>> On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
>>>>>    $ cat drm.memory.stat
>>>>>    card0 region=system total=12898304 shared=0 active=0
>>>>> resident=12111872 purgeable=167936
>>>>>    card0 region=stolen-system total=0 shared=0 active=0 resident=0
>>>>> purgeable=0
>>>>>
>>>>> Data is generated on demand for simplicty of implementation ie. no
>>>>> running
>>>>> totals are kept or accounted during migrations and such. Various
>>>>> optimisations such as cheaper collection of data are possible but
>>>>> deliberately left out for now.
>>>>>
>>>>> Overall, the feature is deemed to be useful to container orchestration
>>>>> software (and manual management).
>>>>>
>>>>> Limits, either soft or hard, are not envisaged to be implemented on
>>>>> top of
>>>>> this approach due on demand nature of collecting the stats.
>>>>
>>>> So, yeah, if you want to add memory controls, we better think
>>>> through how
>>>> the fd ownership migration should work.
>>> I've taken a look at the series, since I have been working on cgroup
>>> memory eviction.
>>>
>>> The scheduling stuff will work for i915, since it has a purely
>>> software execlist scheduler, but I don't think it will work for GuC
>>> (firmware) scheduling or other drivers that use the generic drm
>>> scheduler.
>>
>> It actually works - I used to have a blurb in the cover letter about
>> it but apparently I dropped it. Just a bit less well with many
>> clients, since there are fewer priority levels.
>>
>> All that the design requires from the invididual drivers is some way
>> to react to the "you are over budget by this much" signal. The rest is
>> driver and backend specific.
>
> What I mean is that this signal may not be applicable since the drm
> scheduler just schedules jobs that run. Adding a weight might be done in
> hardware, since it's responsible for  scheduling which context gets to
> run. The over budget signal is useless in that case, and you just need
> to set a scheduling priority for the hardware instead.

The over budget callback lets the driver know its assigned budget and
its current utilisation. Already with that data drivers could implement
something smarter than what I did in my RFC. So I don't think the callback
is completely useless even for some smarter implementation which
potentially ties into firmware scheduling.

Anyway, I maintain this is implementation details.

>>> For something like this,  you would probably want it to work inside
>>> the drm scheduler first. Presumably, this can be done by setting a
>>> weight on each runqueue, and perhaps adding a callback to update one
>>> for a running queue. Calculating the weights hierarchically might be
>>> fun..
>>
>> It is not needed to work in drm scheduler first. In fact drm scheduler
>> based drivers can plug into what I have since it already has the
>> notion of scheduling priorities.
>>
>> They would only need to implement a hook which allow the cgroup
>> controller to query client GPU utilisation and another to received the
>> over budget signal.
>>
>> Amdgpu and msm AFAIK could be easy candidates because they both
>> support per client utilisation and priorities.
>>
>> Looks like I need to put all this info back into the cover letter.
>>
>> Also, hierarchic weights and time budgets are all already there. What
>> could be done later is make this all smarter and respect the time
>> budget with more precision. That would however, in many cases
>> including Intel, require co-operation with the firmware. In any case
>> it is only work in the implementation, while the cgroup control
>> interface remains the same.
>>
>>> I have taken a look at how the rest of cgroup controllers change
>>> ownership when moved to a different cgroup, and the answer was: not
>>> at all. If we attempt to create the scheduler controls only on the
>>> first time the fd is used, you could probably get rid of all the
>>> tracking.
>>
>> Can you send a CPU file descriptor from process A to process B and
>> have CPU usage belonging to process B show up in process' A cgroup, or
>> vice-versa? Nope, I am not making any sense, am I? My point being it
>> is not like-to-like, model is different.
>>
>> No ownership transfer would mean in wide deployments all GPU
>> utilisation would be assigned to Xorg and so there is no point to any
>> of this. No way to throttle a cgroup with un-important GPU clients for
>> instance.
> If you just grab the current process' cgroup when a drm_sched_entity is
> created, you don't have everything charged to X.org. No need for
> complicated ownership tracking in drm_file. The same equivalent should
> be done in i915 as well when a context is created as it's not using the
> drm scheduler.

Okay, so essentially nuking the concept that DRM clients belong to one
cgroup and instead tracking at the context level. That is an interesting
idea. I suspect the implementation could require somewhat generalizing the
concept of an "execution context", or at least expressing it via the DRM
cgroup controller.

I can give this a spin, or at least some more detailed thought, once we
close on a few more details regarding charging in general.

>>> This can be done very easily with the drm scheduler.
>>>
>>> WRT memory, I think the consensus is to track system memory like
>>> normal memory. Stolen memory doesn't need to be tracked. It's kernel
>>> only memory, used for internal bookkeeping  only.
>>>
>>> The only time userspace can directly manipulate stolen memory, is by
>>> mapping the pinned initial framebuffer to its own address space. The
>>> only allocation it can do is when a framebuffer is displayed, and
>>> framebuffer compression creates some stolen memory. Userspace is not
>>> aware of this though, and has no way to manipulate those contents.
>>
>> Stolen memory is irrelevant and not something cgroup controller knows
>> about. Point is drivers say which memory regions they have and their
>> utilisation.
>>
>> Imagine instead of stolen it said vram0, or on Intel multi-tile it
>> shows local0 and local1. People working with containers are interested
>> to see this breakdown. I guess the parallel and use case here is
>> closer to memory.numa_stat.
> Correct, but for the same reason, I think it might be more useful to
> split up the weight too.
>
> A single scheduling weight for the global GPU might be less useful than
> per engine, or per tile perhaps..

Yeah, there is some complexity there for sure and it could be a larger
write up. In short, per engine stuff tends to work out in practice as is,
given how each driver can decide what to do upon receiving the signal.

In the i915 RFC for instance, if it gets the "over budget" signal from the
group, but it sees that the physical engines belonging to this specific
GPU are not over-subscribed, it simply omits any throttling. Which in
practice works out fine for two clients competing for different engines.
The same would apply for multiple GPUs (or tiles with our stuff) in the
same cgroup.
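
In pseudo-code the i915 reaction is roughly the below, where the helper
names are illustrative and not the actual functions from the RFC:

static void example_i915_over_budget(struct drm_file *file, u64 overage_us)
{
        /* Nothing is actually contended - ignore the nominal overage. */
        if (!example_any_engine_oversubscribed(file))
                return;

        /* Otherwise de-prioritise this client's contexts. */
        example_lower_client_priority(file);
}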

Going back to the single scheduling weight versus more fine grained control.
We could choose to follow for instance the io.weight format. Start with
drm.weight being "default 100" and later extend to per card (or more):

"""
default 100
card0 20
card1 50
"""

In this case I would convert drm.weight to this format straight away for
the next respin, it just wouldn't support per card entries yet.

Regards,

Tvrtko

2023-07-27 17:31:42

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats


On 27/07/2023 14:42, Maarten Lankhorst wrote:
> On 2023-07-26 21:44, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Jul 26, 2023 at 12:14:24PM +0200, Maarten Lankhorst wrote:
>>>> So, yeah, if you want to add memory controls, we better think
>>>> through how
>>>> the fd ownership migration should work.
>>>
>>> I've taken a look at the series, since I have been working on cgroup
>>> memory eviction.
>>>
>>> The scheduling stuff will work for i915, since it has a purely software
>>> execlist scheduler, but I don't think it will work for GuC (firmware)
>>> scheduling or other drivers that use the generic drm scheduler.
>>>
>>> For something like this, you would probably want it to work inside
>>> the drm scheduler first. Presumably, this can be done by setting a
>>> weight on each runqueue, and perhaps adding a callback to update one
>>> for a running queue. Calculating the weights hierarchically might be
>>> fun..
>>
>> I don't have any idea on this front. The basic idea of making high level
>> distribution decisions in core code and letting individual drivers
>> enforce
>> that in a way which fits them the best makes sense to me but I don't know
>> enough to have an opinion here.
>>
>>> I have taken a look at how the rest of cgroup controllers change
>>> ownership when moved to a different cgroup, and the answer was: not
>>> at all. If we
>>
>> For persistent resources, that's the general rule. Whoever
>> instantiates a resource gets to own it until the resource gets freed.
>> There is an exception with the pid controller and there are discussions
>> around whether we want some sort of migration behavior with memcg but
>> yes by and large instantiator being the owner is the general model
>> cgroup follows.
>>
>>> attempt to create the scheduler controls only on the first time the
>>> fd is used, you could probably get rid of all the tracking.
>>> This can be done very easily with the drm scheduler.
>>>
>>> WRT memory, I think the consensus is to track system memory like
>>> normal memory. Stolen memory doesn't need to be tracked. It's kernel
>>> only memory, used for internal bookkeeping only.
>>>
>>> The only time userspace can directly manipulate stolen memory is by
>>> mapping the pinned initial framebuffer to its own address space. The
>>> only allocation it can do is when a framebuffer is displayed, and
>>> framebuffer compression creates some stolen memory. Userspace is not
>>> aware of this though, and has no way to manipulate those contents.
>>
>> So, my dumb understanding:
>>
>> * Ownership of an fd can be established on the first ioctl call and
>>    doesn't need to be migrated afterwards. There are no persistent
>>    resources to migrate on the first call.

Yes, the keyword is "can". Trouble is that migration may or may not happen.

One may choose the "Plasma X.org" session type in the login manager and
all DRM fds would be under Xorg if not migrated. Or one may choose "Plasma
Wayland" and migration wouldn't matter. But the former, I think, has a
huge deployed base, so not supporting implicit migration would be a
significant asterisk next to the controller.

>> * Memory then can be tracked in a similar way to memcg. Memory gets
>>    charged to the initial instantiator and doesn't need to be moved
>>    around afterwards. There may be some discrepancies around stolen
>>    memory but the magnitude of inaccuracy introduced that way is
>>    limited and bounded and can be safely ignored.
>>
>> Is that correct?
>
> Hey,
>
> Yeah mostly, I think we can stop tracking stolen memory. I stopped doing
> that for Xe, there is literally nothing to control for userspace in there.

Right, but for reporting purposes stolen is a red herring. In this RFC I
simply report on all memory regions known by the driver. As I said in the
other reply, imagine the keys are 'system' and 'vram0'. The point was just
to illustrate the multiplicity of regions.
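
Purely illustrative - the real field names and syntax are whatever the
memory stats patch ends up defining - a read with two regions could then
look something like:

"""
system 367001600
vram0 1073741824
"""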

Regards,

Tvrtko

2023-07-28 15:01:42

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH 16/17] cgroup/drm: Expose memory stats


One additional thought on one sub-topic:

On 27/07/2023 18:08, Tvrtko Ursulin wrote:

[snip]

>>>> For something like this, you would probably want it to work inside
>>>> the drm scheduler first. Presumably, this can be done by setting a
>>>> weight on each runqueue, and perhaps adding a callback to update one
>>>> for a running queue. Calculating the weights hierarchically might be
>>>> fun..
>>>
>>> It does not need to work in the drm scheduler first. In fact drm
>>> scheduler based drivers can plug into what I have since it already
>>> has the notion of scheduling priorities.
>>>
>>> They would only need to implement a hook which allows the cgroup
>>> controller to query client GPU utilisation and another to receive
>>> the over budget signal.
>>>
>>> Amdgpu and msm AFAIK could be easy candidates because they both
>>> support per client utilisation and priorities.
>>>
>>> Looks like I need to put all this info back into the cover letter.
>>>
>>> Also, hierarchic weights and time budgets are all already there. What
>>> could be done later is make this all smarter and respect the time
>>> budget with more precision. That would however, in many cases
>>> including Intel, require co-operation with the firmware. In any case
>>> it is only work in the implementation, while the cgroup control
>>> interface remains the same.
>>>
>>>> I have taken a look at how the rest of cgroup controllers change
>>>> ownership when moved to a different cgroup, and the answer was: not
>>>> at all. If we attempt to create the scheduler controls only on the
>>>> first time the fd is used, you could probably get rid of all the
>>>> tracking.
>>>
>>> Can you send a CPU file descriptor from process A to process B and
>>> have CPU usage belonging to process B show up in process A's cgroup,
>>> or vice-versa? Nope, I am not making any sense, am I? My point being
>>> it is not like-for-like; the model is different.
>>>
>>> No ownership transfer would mean in wide deployments all GPU
>>> utilisation would be assigned to Xorg and so there is no point to any
>>> of this. No way to throttle a cgroup with un-important GPU clients
>>> for instance.
>> If you just grab the current process' cgroup when a drm_sched_entity
>> is created, you don't have everything charged to X.org. No need for
>> complicated ownership tracking in drm_file. The same equivalent should
>> be done in i915 as well when a context is created as it's not using
>> the drm scheduler.
>
> Okay, so essentially nuking the concept that a DRM client belongs to one
> cgroup and instead tracking at the context level. That is an interesting
> idea. I suspect the implementation could require somewhat generalizing
> the concept of an "execution context", or at least expressing it via the
> DRM cgroup controller.
>
> I can give this a spin, or at least some more detailed thought, once we
> close on a few more details regarding charging in general.

I didn't get much time to brainstorm this just yet; only one downside
came to mind later - with this approach, for i915 we wouldn't correctly
attribute any GPU activity done in the receiving process against our
default contexts(*). That activity would still be accounted to the
sending process.

How much of a problem that would be in practice remains to be
investigated, including whether it applies to other drivers too. If there
is a good amount of deployed userspace which uses the default context,
then it would be a bit messy.

Regards,

Tvrtko

*) For non-DRM and non-i915 people: the default context is a GPU
submission context implicitly created when the device node is opened. It
always remains valid, including in the receiving process if the fd is
passed over using SCM_RIGHTS.
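
For readers less familiar with the mechanism, a minimal sketch of that
hand-over (error handling omitted; the device path and socket plumbing
are illustrative only):

"""
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Process A: open the DRM node (this implicitly creates A's default
 * context) and pass the fd to process B over a Unix domain socket. */
static int send_drm_fd(int unix_sock)
{
	int fd = open("/dev/dri/card0", O_RDWR);
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

	/*
	 * With context-level tracking, anything B later submits on this
	 * fd's default context would still be charged to A.
	 */
	return sendmsg(unix_sock, &msg, 0);
}
"""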