2021-09-30 17:19:36

by Tvrtko Ursulin

Subject: [RFC 0/6] CPU + GPU synchronised priority scheduling

From: Tvrtko Ursulin <[email protected]>

This is a somewhat early sketch of one of my ideas, intended for early feedback
from the core scheduler experts. The first patch and the last two patches in
the series are the most interesting ones for people outside of i915. (Note I
did not copy everyone on all patches but just on the cover letter for context;
the rest should be available from the mailing list.)

The general idea is that the current processing landscape seems to be more and
more composed of pipelines where computations are done on multiple hardware
devices. Furthermore, some of the non-CPU devices, like in this case many GPUs
supported by the i915 driver, actually support priority based scheduling, which
is currently rather inaccessible to the user (in terms of being able to control
it from the outside).

From these two statements a question arises: how do we allow for a simple,
effective and consolidated user experience? In other words, why should a user
not be able to do something like:

$ nice ffmpeg ...transcode my videos...
$ my-favourite-game

And have the nice hint apply to GPU parts of the transcode pipeline as well?

Another reason why I started thinking about this is that I noticed that the
Chrome browser, for instance, uses nice to de-prioritise background tabs. So
again, having that decision propagate to the GPU rendering pipeline sounds like
a big plus for the overall user experience.

This RFC implements the idea, with the hairy part being the notifier chain I
added to enable dynamic adjustments. It is a global notifier, which raises a
few questions, so I am very curious what the experts will think here. Please
see the opens in the first patch for more on this. The last two patches are the
ones which implement a hash table in i915 so it can associate the notifier
callback with the correct GPU rendering contexts.

On a more positive note, the thing seems to even work as is. For instance I
roughly simulated the above scenario by running a GPU hog at three nice levels
and a GfxBench TRex in parallel (as a game proxy). This is what I got:

 GPU hog nice | TRex fps
--------------+---------
            0 |     34.8
           10 |     38.0
          -10 |     30.8

So it is visible that the feature can improve the user experience. The question
is just whether people are happy with this method of implementing it.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>

Tvrtko Ursulin (6):
sched: Add nice value change notifier
drm/i915: Explicitly track DRM clients
drm/i915: Make GEM contexts track DRM clients
drm/i915: Track all user contexts per client
drm/i915: Keep track of registered clients indexed by task struct
drm/i915: Connect task and GPU scheduling priorities

drivers/gpu/drm/i915/Makefile | 5 +-
drivers/gpu/drm/i915/gem/i915_gem_context.c | 20 +++
.../gpu/drm/i915/gem/i915_gem_context_types.h | 6 +
.../drm/i915/gt/intel_execlists_submission.c | 2 +-
drivers/gpu/drm/i915/i915_drm_client.c | 129 ++++++++++++++++++
drivers/gpu/drm/i915/i915_drm_client.h | 71 ++++++++++
drivers/gpu/drm/i915/i915_drv.c | 6 +
drivers/gpu/drm/i915/i915_drv.h | 5 +
drivers/gpu/drm/i915/i915_gem.c | 21 ++-
drivers/gpu/drm/i915/i915_request.c | 2 +-
drivers/gpu/drm/i915/i915_request.h | 5 +
drivers/gpu/drm/i915/i915_scheduler.c | 3 +-
drivers/gpu/drm/i915/i915_scheduler.h | 14 ++
drivers/gpu/drm/i915/i915_scheduler_types.h | 8 ++
include/linux/sched.h | 5 +
kernel/sched/core.c | 37 ++++-
16 files changed, 330 insertions(+), 9 deletions(-)
create mode 100644 drivers/gpu/drm/i915/i915_drm_client.c
create mode 100644 drivers/gpu/drm/i915/i915_drm_client.h

--
2.30.2


2021-09-30 17:19:36

by Tvrtko Ursulin

Subject: [RFC 1/6] sched: Add nice value change notifier

From: Tvrtko Ursulin <[email protected]>

Implement a simple notifier chain via which interested parties can track when
a process's nice value changes. It is simple in the sense that the chain is
global, so each user has to filter for the tasks it is interested in.

The register_user_nice_notifier and unregister_user_nice_notifier functions
are provided for registering and unregistering callbacks, and each callback is
passed the new nice value and a pointer to the task_struct being modified.
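For illustration, the chain can be modelled in userspace. The sketch below is
a hypothetical, lock-free simplification of the atomic notifier chain pattern;
demo_register, demo_call_chain and record_nice are illustrative names only and
not part of the patch (the real kernel chain is RCU-protected):

```c
#include <stddef.h>

#define NOTIFY_DONE 0

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb, unsigned long val,
			     void *ptr);
	struct notifier_block *next;
};

/* Global chain, mirroring user_nice_notifier_list being global. */
static struct notifier_block *nice_chain;

static void demo_register(struct notifier_block *nb)
{
	nb->next = nice_chain;	/* push onto the head of the chain */
	nice_chain = nb;
}

/* Walk the chain, handing each callback the new nice value and the task. */
static int demo_call_chain(unsigned long nice, void *task)
{
	struct notifier_block *nb;

	for (nb = nice_chain; nb; nb = nb->next)
		nb->notifier_call(nb, nice, task);
	return NOTIFY_DONE;
}

/* Example consumer: records the last nice value it was notified about. */
static long last_seen_nice;

static int record_nice(struct notifier_block *nb, unsigned long val, void *ptr)
{
	(void)nb;
	(void)ptr;
	last_seen_nice = (long)val;
	return NOTIFY_DONE;
}

static struct notifier_block record_nb = { .notifier_call = record_nice };
```

A driver interested in nice changes would do the equivalent of
demo_register() via register_user_nice_notifier(), and set_user_nice() plays
the role of demo_call_chain().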

Opens:
* Security. Would some sort of per-process mechanism be better and
  feasible?
* Should it all go behind a Kconfig option, to be selected by interested
  drivers?

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 5 +++++
kernel/sched/core.c | 37 ++++++++++++++++++++++++++++++++++++-
2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 39039ce8ac4c..45ae9eca38c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2309,4 +2309,9 @@ static inline void sched_core_free(struct task_struct *tsk) { }
static inline void sched_core_fork(struct task_struct *p) { }
#endif

+struct notifier_block;
+
+extern int register_user_nice_notifier(struct notifier_block *);
+extern int unregister_user_nice_notifier(struct notifier_block *);
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bba4128a3e6..26ff75d6fe00 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6864,10 +6864,42 @@ static inline int rt_effective_prio(struct task_struct *p, int prio)
}
#endif

+ATOMIC_NOTIFIER_HEAD(user_nice_notifier_list);
+
+/**
+ * register_user_nice_notifier - Register function to be called when task nice changes
+ * @nb: Info about notifier function to be called
+ *
+ * Registers a function with the list of functions to be called when task nice
+ * value changes.
+ *
+ * Currently always returns zero, as atomic_notifier_chain_register()
+ * always returns zero.
+ */
+int register_user_nice_notifier(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_register(&user_nice_notifier_list, nb);
+}
+EXPORT_SYMBOL(register_user_nice_notifier);
+
+/**
+ * unregister_user_nice_notifier - Unregister previously registered user nice notifier
+ * @nb: Hook to be unregistered
+ *
+ * Unregisters a previously registered user nice notifier function.
+ *
+ * Returns zero on success, or %-ENOENT on failure.
+ */
+int unregister_user_nice_notifier(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_unregister(&user_nice_notifier_list, nb);
+}
+EXPORT_SYMBOL(unregister_user_nice_notifier);
+
void set_user_nice(struct task_struct *p, long nice)
{
bool queued, running;
- int old_prio;
+ int old_prio, ret;
struct rq_flags rf;
struct rq *rq;

@@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p, long nice)
*/
p->sched_class->prio_changed(rq, p, old_prio);

+ ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
+ WARN_ON_ONCE(ret != NOTIFY_DONE);
+
out_unlock:
task_rq_unlock(rq, p, &rf);
}
--
2.30.2

2021-09-30 17:19:47

by Tvrtko Ursulin

Subject: [RFC 2/6] drm/i915: Explicitly track DRM clients

From: Tvrtko Ursulin <[email protected]>

Tracking DRM clients more explicitly will allow later patches to
accumulate past and current GPU usage in a centralised place and also
consolidate access to owning task pid/name.

A unique client id is also assigned for the purpose of distinguishing
between, and consolidating, multiple file descriptors owned by the same
process.

v2:
Chris Wilson:
* Enclose new members into dedicated structs.
* Protect against failed sysfs registration.

v3:
* sysfs_attr_init.

v4:
* Fix for internal clients.

v5:
* Use cyclic ida for client id. (Chris)
* Do not leak pid reference. (Chris)
* Tidy code with some locals.

v6:
* Use xa_alloc_cyclic to simplify locking. (Chris)
* No need to unregister individual sysfs files. (Chris)
* Rebase on top of fpriv kref.
* Track client closed status and reflect in sysfs.

v7:
* Make drm_client more standalone concept.

v8:
* Simplify sysfs show. (Chris)
* Always track name and pid.

v9:
* Fix cyclic id assignment.

v10:
* No need for a mutex around xa_alloc_cyclic.
* Refactor sysfs into own function.
* Unregister sysfs before freeing pid and name.
* Move clients setup into own function.

v11:
* Call clients init directly from driver init. (Chris)

v12:
* Do not fail client add on id wrap. (Maciej)

v13 (Lucas): Rebase.

v14:
* Dropped sysfs bits.

v15:
* Dropped tracking of pid and name.
* Dropped RCU freeing of the client object.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Reviewed-by: Chris Wilson <[email protected]> # v11
Reviewed-by: Aravind Iddamsetty <[email protected]> # v11
Signed-off-by: Chris Wilson <[email protected]>
---
drivers/gpu/drm/i915/Makefile | 5 +-
drivers/gpu/drm/i915/i915_drm_client.c | 68 ++++++++++++++++++++++++++
drivers/gpu/drm/i915/i915_drm_client.h | 50 +++++++++++++++++++
drivers/gpu/drm/i915/i915_drv.c | 6 +++
drivers/gpu/drm/i915/i915_drv.h | 5 ++
drivers/gpu/drm/i915/i915_gem.c | 21 ++++++--
6 files changed, 150 insertions(+), 5 deletions(-)
create mode 100644 drivers/gpu/drm/i915/i915_drm_client.c
create mode 100644 drivers/gpu/drm/i915/i915_drm_client.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 5c8e022a7383..005b5df425a1 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -32,8 +32,9 @@ subdir-ccflags-y += -I$(srctree)/$(src)
# Please keep these build lists sorted!

# core driver code
-i915-y += i915_drv.o \
- i915_config.o \
+i915-y += i915_config.o \
+ i915_drm_client.o \
+ i915_drv.o \
i915_irq.o \
i915_getparam.o \
i915_mitigations.o \
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
new file mode 100644
index 000000000000..e61e9ba15256
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include "i915_drm_client.h"
+#include "i915_gem.h"
+#include "i915_utils.h"
+
+void i915_drm_clients_init(struct i915_drm_clients *clients,
+ struct drm_i915_private *i915)
+{
+ clients->i915 = i915;
+ clients->next_id = 0;
+
+ xa_init_flags(&clients->xarray, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
+}
+
+struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients)
+{
+ struct i915_drm_client *client;
+ struct xarray *xa = &clients->xarray;
+ int ret;
+
+ client = kzalloc(sizeof(*client), GFP_KERNEL);
+ if (!client)
+ return ERR_PTR(-ENOMEM);
+
+ xa_lock_irq(xa);
+ ret = __xa_alloc_cyclic(xa, &client->id, client, xa_limit_32b,
+ &clients->next_id, GFP_KERNEL);
+ xa_unlock_irq(xa);
+ if (ret < 0)
+ goto err;
+
+ kref_init(&client->kref);
+ client->clients = clients;
+
+ return client;
+
+err:
+ kfree(client);
+
+ return ERR_PTR(ret);
+}
+
+void __i915_drm_client_free(struct kref *kref)
+{
+ struct i915_drm_client *client =
+ container_of(kref, typeof(*client), kref);
+ struct xarray *xa = &client->clients->xarray;
+ unsigned long flags;
+
+ xa_lock_irqsave(xa, flags);
+ __xa_erase(xa, client->id);
+ xa_unlock_irqrestore(xa, flags);
+ kfree(client);
+}
+
+void i915_drm_clients_fini(struct i915_drm_clients *clients)
+{
+ GEM_BUG_ON(!xa_empty(&clients->xarray));
+ xa_destroy(&clients->xarray);
+}
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
new file mode 100644
index 000000000000..e8986ad51176
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#ifndef __I915_DRM_CLIENT_H__
+#define __I915_DRM_CLIENT_H__
+
+#include <linux/kref.h>
+#include <linux/xarray.h>
+
+struct drm_i915_private;
+
+struct i915_drm_clients {
+ struct drm_i915_private *i915;
+
+ struct xarray xarray;
+ u32 next_id;
+};
+
+struct i915_drm_client {
+ struct kref kref;
+
+ unsigned int id;
+
+ struct i915_drm_clients *clients;
+};
+
+void i915_drm_clients_init(struct i915_drm_clients *clients,
+ struct drm_i915_private *i915);
+
+static inline struct i915_drm_client *
+i915_drm_client_get(struct i915_drm_client *client)
+{
+ kref_get(&client->kref);
+ return client;
+}
+
+void __i915_drm_client_free(struct kref *kref);
+
+static inline void i915_drm_client_put(struct i915_drm_client *client)
+{
+ kref_put(&client->kref, __i915_drm_client_free);
+}
+
+struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients);
+
+void i915_drm_clients_fini(struct i915_drm_clients *clients);
+
+#endif /* !__I915_DRM_CLIENT_H__ */
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index ed7b421cad44..bd2e8d074d93 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -68,6 +68,7 @@
#include "gt/intel_rc6.h"

#include "i915_debugfs.h"
+#include "i915_drm_client.h"
#include "i915_drv.h"
#include "i915_ioc32.h"
#include "i915_irq.h"
@@ -344,6 +345,8 @@ static int i915_driver_early_probe(struct drm_i915_private *dev_priv)

intel_gt_init_early(&dev_priv->gt, dev_priv);

+ i915_drm_clients_init(&dev_priv->clients, dev_priv);
+
i915_gem_init_early(dev_priv);

/* This must be called before any calls to HAS_PCH_* */
@@ -363,6 +366,7 @@ static int i915_driver_early_probe(struct drm_i915_private *dev_priv)

err_gem:
i915_gem_cleanup_early(dev_priv);
+ i915_drm_clients_fini(&dev_priv->clients);
intel_gt_driver_late_release(&dev_priv->gt);
intel_region_ttm_device_fini(dev_priv);
err_ttm:
@@ -382,6 +386,7 @@ static void i915_driver_late_release(struct drm_i915_private *dev_priv)
intel_irq_fini(dev_priv);
intel_power_domains_cleanup(dev_priv);
i915_gem_cleanup_early(dev_priv);
+ i915_drm_clients_fini(&dev_priv->clients);
intel_gt_driver_late_release(&dev_priv->gt);
intel_region_ttm_device_fini(dev_priv);
vlv_suspend_cleanup(dev_priv);
@@ -997,6 +1002,7 @@ static void i915_driver_postclose(struct drm_device *dev, struct drm_file *file)
struct drm_i915_file_private *file_priv = file->driver_priv;

i915_gem_context_close(file);
+ i915_drm_client_put(file_priv->client);

kfree_rcu(file_priv, rcu);

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 63c939891f1c..de2ba95b81f6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -96,6 +96,7 @@
#include "intel_wakeref.h"
#include "intel_wopcm.h"

+#include "i915_drm_client.h"
#include "i915_gem.h"
#include "i915_gem_gtt.h"
#include "i915_gpu_error.h"
@@ -284,6 +285,8 @@ struct drm_i915_file_private {
/** ban_score: Accumulated score of all ctx bans and fast hangs. */
atomic_t ban_score;
unsigned long hang_timestamp;
+
+ struct i915_drm_client *client;
};

/* Interface history:
@@ -1238,6 +1241,8 @@ struct drm_i915_private {

struct i915_pmu pmu;

+ struct i915_drm_clients clients;
+
struct i915_hdcp_comp_master *hdcp_master;
bool hdcp_comp_added;

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 981e383d1a5d..3df78babe1a9 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1177,25 +1177,40 @@ void i915_gem_cleanup_early(struct drm_i915_private *dev_priv)
int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file)
{
struct drm_i915_file_private *file_priv;
- int ret;
+ struct i915_drm_client *client;
+ int ret = -ENOMEM;

DRM_DEBUG("\n");

file_priv = kzalloc(sizeof(*file_priv), GFP_KERNEL);
if (!file_priv)
- return -ENOMEM;
+ goto err_alloc;
+
+ client = i915_drm_client_add(&i915->clients);
+ if (IS_ERR(client)) {
+ ret = PTR_ERR(client);
+ goto err_client;
+ }

file->driver_priv = file_priv;
file_priv->dev_priv = i915;
file_priv->file = file;
+ file_priv->client = client;

file_priv->bsd_engine = -1;
file_priv->hang_timestamp = jiffies;

ret = i915_gem_context_open(i915, file);
if (ret)
- kfree(file_priv);
+ goto err_context;
+
+ return 0;

+err_context:
+ i915_drm_client_put(client);
+err_client:
+ kfree(file_priv);
+err_alloc:
return ret;
}

--
2.30.2

2021-09-30 17:21:24

by Tvrtko Ursulin

Subject: [RFC 6/6] drm/i915: Connect task and GPU scheduling priorities

From: Tvrtko Ursulin <[email protected]>

Introduce the concept of a context nice value which exactly matches the
process nice, and also make use of the task user nice notifier chain to be
able to dynamically adjust GPU scheduling. Apart from runtime adjustments, a
context also inherits the nice value when created.

Context nice is utilised secondary to context priority, and only when the
latter has been left at the default setting, in order to avoid disturbing any
application-made choices of low or high priority (batch processing and maybe
latency sensitive compositing, respectively). In those cases the nice value
adjusts the effective priority in the narrow band of -19 to +20 around
I915_CONTEXT_DEFAULT_PRIORITY.
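The intended mapping can be sketched as below; effective_priority is a
hypothetical userspace stand-in for the patch's i915_sched_attr_priority(),
assuming I915_CONTEXT_DEFAULT_PRIORITY is 0:

```c
/* Nice only takes effect while the context priority is at the default
 * (0 here): nice [-20, 19] is inverted into an effective priority band
 * of [-19, +20], so nicer tasks get lower GPU priority. */
static int effective_priority(int priority, int nice)
{
	if (priority)		/* an explicit user priority always wins */
		return priority;

	return -nice;		/* higher nice => lower effective priority */
}
```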

Opens:
* The [-19, +20] range should work well with GuC scheduling since it
  effectively translates to three available buckets there, but should
  it perhaps be stretched to cover the full [-1023, +1023] range?
* The current implementation only has the nice value apply to future
  submissions against the context. Is it worth dealing with already
  submitted work? I opted to avoid the complication since the benefit
  seems marginal outside of the endless compute workloads (which would
  probably start up with the correct priority).

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/i915/gem/i915_gem_context.c | 1 +
.../drm/i915/gt/intel_execlists_submission.c | 2 +-
drivers/gpu/drm/i915/i915_drm_client.c | 38 +++++++++++++++++--
drivers/gpu/drm/i915/i915_drm_client.h | 3 ++
drivers/gpu/drm/i915/i915_request.c | 2 +-
drivers/gpu/drm/i915/i915_request.h | 5 +++
drivers/gpu/drm/i915/i915_scheduler.c | 3 +-
drivers/gpu/drm/i915/i915_scheduler.h | 14 +++++++
drivers/gpu/drm/i915/i915_scheduler_types.h | 8 ++++
9 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 8d4d687ab1d0..fed0733cb652 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -257,6 +257,7 @@ proto_context_create(struct drm_i915_private *i915, unsigned int flags)
if (i915->params.enable_hangcheck)
pc->user_flags |= BIT(UCONTEXT_PERSISTENCE);
pc->sched.priority = I915_PRIORITY_NORMAL;
+ pc->sched.nice = task_nice(current);

if (flags & I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE) {
if (!HAS_EXECLISTS(i915)) {
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 7147fe80919e..13c41f1f8ba5 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -250,7 +250,7 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)

static int rq_prio(const struct i915_request *rq)
{
- return READ_ONCE(rq->sched.attr.priority);
+ return i915_request_priority(rq);
}

static int effective_prio(const struct i915_request *rq)
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 82b9636482ef..13ac2311eb84 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -8,9 +8,33 @@
#include <linux/types.h>

#include "i915_drm_client.h"
+#include "gem/i915_gem_context.h"
#include "i915_gem.h"
#include "i915_utils.h"

+static int
+clients_notify(struct notifier_block *nb, unsigned long val, void *ptr)
+{
+ struct i915_drm_clients *clients =
+ container_of(nb, typeof(*clients), prio_notifier);
+ struct i915_drm_client *client;
+ unsigned long flags;
+
+ read_lock_irqsave(&clients->lock, flags);
+ hash_for_each_possible(clients->tasks, client, node, (uintptr_t)ptr) {
+ struct i915_gem_context *ctx;
+
+ if (client->owner != ptr)
+ continue;
+
+ list_for_each_entry_rcu(ctx, &client->ctx_list, client_link)
+ ctx->sched.nice = (int)val;
+ }
+ read_unlock_irqrestore(&clients->lock, flags);
+
+ return NOTIFY_DONE;
+}
+
void i915_drm_clients_init(struct i915_drm_clients *clients,
struct drm_i915_private *i915)
{
@@ -21,6 +45,10 @@ void i915_drm_clients_init(struct i915_drm_clients *clients,

rwlock_init(&clients->lock);
hash_init(clients->tasks);
+
+ memset(&clients->prio_notifier, 0, sizeof(clients->prio_notifier));
+ clients->prio_notifier.notifier_call = clients_notify;
+ register_user_nice_notifier(&clients->prio_notifier);
}

struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients)
@@ -63,9 +91,9 @@ void __i915_drm_client_free(struct kref *kref)
struct xarray *xa = &clients->xarray;
unsigned long flags;

- write_lock(&clients->lock);
+ write_lock_irq(&clients->lock);
hash_del(&client->node);
- write_unlock(&clients->lock);
+ write_unlock_irq(&clients->lock);

xa_lock_irqsave(xa, flags);
__xa_erase(xa, client->id);
@@ -75,6 +103,8 @@ void __i915_drm_client_free(struct kref *kref)

void i915_drm_clients_fini(struct i915_drm_clients *clients)
{
+ unregister_user_nice_notifier(&clients->prio_notifier);
+
GEM_BUG_ON(!xa_empty(&clients->xarray));
xa_destroy(&clients->xarray);
}
@@ -88,12 +118,12 @@ void i915_drm_client_update_owner(struct i915_drm_client *client,
return;

clients = client->clients;
- write_lock(&clients->lock);
+ write_lock_irq(&clients->lock);
if (READ_ONCE(client->owner) != owner) {
if (client->owner)
hash_del(&client->node);
client->owner = owner;
hash_add(clients->tasks, &client->node, (uintptr_t)owner);
}
- write_unlock(&clients->lock);
+ write_unlock_irq(&clients->lock);
}
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 42fd79f0558a..dda26aa42ac9 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -9,6 +9,7 @@
#include <linux/hashtable.h>
#include <linux/kref.h>
#include <linux/list.h>
+#include <linux/notifier.h>
#include <linux/rwlock.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
@@ -24,6 +25,8 @@ struct i915_drm_clients {

rwlock_t lock;
DECLARE_HASHTABLE(tasks, 6);
+
+ struct notifier_block prio_notifier;
};

struct i915_drm_client {
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 79da5eca60af..a8c6f3a64895 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1930,7 +1930,7 @@ static int print_sched_attr(const struct i915_sched_attr *attr,
return x;

x += snprintf(buf + x, len - x,
- " prio=%d", attr->priority);
+ " prio=%d nice=%d", attr->priority, attr->nice);

return x;
}
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 7bd9ed20623e..c2c4c344837e 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -399,6 +399,11 @@ long i915_request_wait(struct i915_request *rq,
#define I915_WAIT_PRIORITY BIT(1) /* small priority bump for the request */
#define I915_WAIT_ALL BIT(2) /* used by i915_gem_object_wait() */

+static inline int i915_request_priority(const struct i915_request *rq)
+{
+ return i915_sched_attr_priority(&rq->sched.attr);
+}
+
void i915_request_show(struct drm_printer *m,
const struct i915_request *rq,
const char *prefix,
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 762127dd56c5..3883511f90ea 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -155,7 +155,7 @@ lock_sched_engine(struct i915_sched_node *node,
static void __i915_schedule(struct i915_sched_node *node,
const struct i915_sched_attr *attr)
{
- const int prio = max(attr->priority, node->attr.priority);
+ const int prio = max(i915_sched_attr_priority(attr), node->attr.priority);
struct i915_sched_engine *sched_engine;
struct i915_dependency *dep, *p;
struct i915_dependency stack;
@@ -305,6 +305,7 @@ void i915_sched_node_init(struct i915_sched_node *node)
void i915_sched_node_reinit(struct i915_sched_node *node)
{
node->attr.priority = I915_PRIORITY_INVALID;
+ node->attr.nice = 0;
node->semaphores = 0;
node->flags = 0;

diff --git a/drivers/gpu/drm/i915/i915_scheduler.h b/drivers/gpu/drm/i915/i915_scheduler.h
index 0b9b86af6c7f..75ccc9f55d14 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.h
+++ b/drivers/gpu/drm/i915/i915_scheduler.h
@@ -38,6 +38,20 @@ void i915_sched_node_fini(struct i915_sched_node *node);
void i915_schedule(struct i915_request *request,
const struct i915_sched_attr *attr);

+static inline int i915_sched_attr_priority(const struct i915_sched_attr *attr)
+{
+ int prio = attr->priority;
+
+ /*
+ * Only allow I915_CONTEXT_DEFAULT_PRIORITY to be affected by the
+ * nice setting.
+ */
+ if (!prio)
+ prio = -attr->nice;
+
+ return prio;
+}
+
struct list_head *
i915_sched_lookup_priolist(struct i915_sched_engine *sched_engine, int prio);

diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
index b0a1b58c7893..34124cfc93e9 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -29,6 +29,14 @@ struct i915_sched_attr {
* The &drm_i915_private.kernel_context is assigned the lowest priority.
*/
int priority;
+
+ /**
+ * @nice: context nice level
+ *
+ * Nice level follows the CPU scheduler nice value as set for the
+ * process owning the GPU context.
+ */
+ int nice;
};

/*
--
2.30.2

2021-09-30 17:21:26

by Tvrtko Ursulin

Subject: [RFC 3/6] drm/i915: Make GEM contexts track DRM clients

From: Tvrtko Ursulin <[email protected]>

Make GEM contexts keep a reference to i915_drm_client for the whole of their
lifetime, which will come in handy in following patches.

v2: Don't bother supporting selftests contexts from debugfs. (Chris)
v3 (Lucas): Finish constructing ctx before adding it to the list
v4 (Ram): Rebase.
v5: Trivial rebase for proto ctx changes.
v6: Rebase after clients no longer track name and pid.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Reviewed-by: Chris Wilson <[email protected]> # v5
Reviewed-by: Aravind Iddamsetty <[email protected]> # v5
Signed-off-by: Chris Wilson <[email protected]>
---
drivers/gpu/drm/i915/gem/i915_gem_context.c | 5 +++++
drivers/gpu/drm/i915/gem/i915_gem_context_types.h | 3 +++
2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 8208fd5b72c3..9296d69681d7 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -956,6 +956,9 @@ static void i915_gem_context_release_work(struct work_struct *work)
if (vm)
i915_vm_put(vm);

+ if (ctx->client)
+ i915_drm_client_put(ctx->client);
+
mutex_destroy(&ctx->engines_mutex);
mutex_destroy(&ctx->lut_mutex);

@@ -1373,6 +1376,8 @@ static void gem_context_register(struct i915_gem_context *ctx,
ctx->file_priv = fpriv;

ctx->pid = get_task_pid(current, PIDTYPE_PID);
+ ctx->client = i915_drm_client_get(fpriv->client);
+
snprintf(ctx->name, sizeof(ctx->name), "%s[%d]",
current->comm, pid_nr(ctx->pid));

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index c4617e4d9fa9..598c57ac5cdf 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -277,6 +277,9 @@ struct i915_gem_context {
/** @link: place with &drm_i915_private.context_list */
struct list_head link;

+ /** @client: struct i915_drm_client */
+ struct i915_drm_client *client;
+
/**
* @ref: reference count
*
--
2.30.2

2021-09-30 17:21:29

by Tvrtko Ursulin

Subject: [RFC 5/6] drm/i915: Keep track of registered clients indexed by task struct

From: Tvrtko Ursulin <[email protected]>

A simple hash table of registered clients, indexed by the task struct
pointer, is kept to be used in a following patch.
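The lookup pattern can be sketched in userspace as follows; demo_add,
demo_lookup and the shift-based bucket function are illustrative stand-ins
for the kernel's hash_add()/hash_for_each_possible() helpers, not the actual
implementation:

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_BUCKETS 64	/* DECLARE_HASHTABLE(tasks, 6) also gives 2^6 */

struct demo_client {
	const void *owner;		/* task pointer used as the key */
	struct demo_client *next;	/* bucket chain */
};

static struct demo_client *buckets[NUM_BUCKETS];

/* Hash the pointer value itself; low bits are dropped since heap
 * pointers tend to share alignment. */
static unsigned int bucket_of(const void *owner)
{
	return ((uintptr_t)owner >> 4) % NUM_BUCKETS;
}

static void demo_add(struct demo_client *client, const void *owner)
{
	unsigned int b = bucket_of(owner);

	client->owner = owner;
	client->next = buckets[b];
	buckets[b] = client;
}

static struct demo_client *demo_lookup(const void *owner)
{
	struct demo_client *c;

	/* Walk only the matching bucket, re-checking the owner since
	 * distinct pointers can collide into the same bucket. */
	for (c = buckets[bucket_of(owner)]; c; c = c->next)
		if (c->owner == owner)
			return c;
	return NULL;
}
```

The following patch performs this kind of bucket walk from the nice notifier
callback, comparing client->owner against the notified task pointer.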

Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 ++
drivers/gpu/drm/i915/i915_drm_client.c | 31 ++++++++++++++++++++-
drivers/gpu/drm/i915/i915_drm_client.h | 13 +++++++++
3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index d1992ba59ed8..8d4d687ab1d0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -1932,6 +1932,8 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
return -EIO;
}

+ i915_drm_client_update_owner(ext_data.fpriv->client, current);
+
ext_data.pc = proto_context_create(i915, args->flags);
if (IS_ERR(ext_data.pc))
return PTR_ERR(ext_data.pc);
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 91a8559bebf7..82b9636482ef 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -18,6 +18,9 @@ void i915_drm_clients_init(struct i915_drm_clients *clients,
clients->next_id = 0;

xa_init_flags(&clients->xarray, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
+
+ rwlock_init(&clients->lock);
+ hash_init(clients->tasks);
}

struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients)
@@ -42,6 +45,8 @@ struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients)
INIT_LIST_HEAD(&client->ctx_list);
client->clients = clients;

+ i915_drm_client_update_owner(client, current);
+
return client;

err:
@@ -54,9 +59,14 @@ void __i915_drm_client_free(struct kref *kref)
{
struct i915_drm_client *client =
container_of(kref, typeof(*client), kref);
- struct xarray *xa = &client->clients->xarray;
+ struct i915_drm_clients *clients = client->clients;
+ struct xarray *xa = &clients->xarray;
unsigned long flags;

+ write_lock(&clients->lock);
+ hash_del(&client->node);
+ write_unlock(&clients->lock);
+
xa_lock_irqsave(xa, flags);
__xa_erase(xa, client->id);
xa_unlock_irqrestore(xa, flags);
@@ -68,3 +78,22 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients)
GEM_BUG_ON(!xa_empty(&clients->xarray));
xa_destroy(&clients->xarray);
}
+
+void i915_drm_client_update_owner(struct i915_drm_client *client,
+ struct task_struct *owner)
+{
+ struct i915_drm_clients *clients;
+
+ if (READ_ONCE(client->owner) == owner)
+ return;
+
+ clients = client->clients;
+ write_lock(&clients->lock);
+ if (READ_ONCE(client->owner) != owner) {
+ if (client->owner)
+ hash_del(&client->node);
+ client->owner = owner;
+ hash_add(clients->tasks, &client->node, (uintptr_t)owner);
+ }
+ write_unlock(&clients->lock);
+}
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 0207dfad4568..42fd79f0558a 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -6,8 +6,11 @@
#ifndef __I915_DRM_CLIENT_H__
#define __I915_DRM_CLIENT_H__

+#include <linux/hashtable.h>
#include <linux/kref.h>
#include <linux/list.h>
+#include <linux/rwlock.h>
+#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/xarray.h>

@@ -18,6 +21,9 @@ struct i915_drm_clients {

struct xarray xarray;
u32 next_id;
+
+ rwlock_t lock;
+ DECLARE_HASHTABLE(tasks, 6);
};

struct i915_drm_client {
@@ -28,6 +34,9 @@ struct i915_drm_client {
spinlock_t ctx_lock; /* For add/remove from ctx_list. */
struct list_head ctx_list; /* List of contexts belonging to client. */

+ struct task_struct *owner; /* No reference kept, never dereferenced. */
+ struct hlist_node node;
+
struct i915_drm_clients *clients;
};

@@ -52,4 +61,8 @@ struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients);

void i915_drm_clients_fini(struct i915_drm_clients *clients);

+void i915_drm_client_update_owner(struct i915_drm_client *client,
+ struct task_struct *owner);
+
+
#endif /* !__I915_DRM_CLIENT_H__ */
--
2.30.2

2021-09-30 17:22:03

by Tvrtko Ursulin

Subject: [RFC 4/6] drm/i915: Track all user contexts per client

From: Tvrtko Ursulin <[email protected]>

We soon want to start answering questions like how much GPU time a
context belonging to a client which has exited is still using.

To enable this we start tracking all contexts belonging to a client on a
separate list.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Reviewed-by: Aravind Iddamsetty <[email protected]>
Reviewed-by: Chris Wilson <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
---
drivers/gpu/drm/i915/gem/i915_gem_context.c | 12 ++++++++++++
drivers/gpu/drm/i915/gem/i915_gem_context_types.h | 3 +++
drivers/gpu/drm/i915/i915_drm_client.c | 2 ++
drivers/gpu/drm/i915/i915_drm_client.h | 5 +++++
4 files changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 9296d69681d7..d1992ba59ed8 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -1169,6 +1169,7 @@ static void set_closed_name(struct i915_gem_context *ctx)

static void context_close(struct i915_gem_context *ctx)
{
+ struct i915_drm_client *client;
struct i915_address_space *vm;

/* Flush any concurrent set_engines() */
@@ -1205,6 +1206,13 @@ static void context_close(struct i915_gem_context *ctx)
list_del(&ctx->link);
spin_unlock(&ctx->i915->gem.contexts.lock);

+ client = ctx->client;
+ if (client) {
+ spin_lock(&client->ctx_lock);
+ list_del_rcu(&ctx->client_link);
+ spin_unlock(&client->ctx_lock);
+ }
+
mutex_unlock(&ctx->mutex);

/*
@@ -1385,6 +1393,10 @@ static void gem_context_register(struct i915_gem_context *ctx,
old = xa_store(&fpriv->context_xa, id, ctx, GFP_KERNEL);
WARN_ON(old);

+ spin_lock(&ctx->client->ctx_lock);
+ list_add_tail_rcu(&ctx->client_link, &ctx->client->ctx_list);
+ spin_unlock(&ctx->client->ctx_lock);
+
spin_lock(&i915->gem.contexts.lock);
list_add_tail(&ctx->link, &i915->gem.contexts.list);
spin_unlock(&i915->gem.contexts.lock);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 598c57ac5cdf..b878e1b13b38 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -280,6 +280,9 @@ struct i915_gem_context {
/** @client: struct i915_drm_client */
struct i915_drm_client *client;

+ /** @client_link: &i915_drm_client.ctx_list */
+ struct list_head client_link;
+
/**
* @ref: reference count
*
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index e61e9ba15256..91a8559bebf7 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -38,6 +38,8 @@ struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients)
goto err;

kref_init(&client->kref);
+ spin_lock_init(&client->ctx_lock);
+ INIT_LIST_HEAD(&client->ctx_list);
client->clients = clients;

return client;
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index e8986ad51176..0207dfad4568 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -7,6 +7,8 @@
#define __I915_DRM_CLIENT_H__

#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
#include <linux/xarray.h>

struct drm_i915_private;
@@ -23,6 +25,9 @@ struct i915_drm_client {

unsigned int id;

+ spinlock_t ctx_lock; /* For add/remove from ctx_list. */
+ struct list_head ctx_list; /* List of contexts belonging to client. */
+
struct i915_drm_clients *clients;
};

--
2.30.2

2021-09-30 19:25:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 1/6] sched: Add nice value change notifier

On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
> void set_user_nice(struct task_struct *p, long nice)
> {
> bool queued, running;
> - int old_prio;
> + int old_prio, ret;
> struct rq_flags rf;
> struct rq *rq;
>
> @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p, long nice)
> */
> p->sched_class->prio_changed(rq, p, old_prio);
>
> + ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
> + WARN_ON_ONCE(ret != NOTIFY_DONE);
> +
> out_unlock:
> task_rq_unlock(rq, p, &rf);
> }

No, we're not going to call out to exported, and potentially unbounded,
functions under scheduler locks.

2021-09-30 19:27:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/6] CPU + GPU synchronised priority scheduling

On Thu, Sep 30, 2021 at 06:15:46PM +0100, Tvrtko Ursulin wrote:

> (Note I did not copy
> everyone on all patches but just the cover letter for context and the rest
> should be available from the mailing list.)

In general, if you can't be arsed to send it to me, I can't be arsed to
dig it out. I've got plenty other email I can read without having to go
looking for more.

2021-10-01 09:08:11

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC 1/6] sched: Add nice value change notifier


Hi Peter,

On 30/09/2021 19:33, Peter Zijlstra wrote:
> On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
>> void set_user_nice(struct task_struct *p, long nice)
>> {
>> bool queued, running;
>> - int old_prio;
>> + int old_prio, ret;
>> struct rq_flags rf;
>> struct rq *rq;
>>
>> @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p, long nice)
>> */
>> p->sched_class->prio_changed(rq, p, old_prio);
>>
>> + ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
>> + WARN_ON_ONCE(ret != NOTIFY_DONE);
>> +
>> out_unlock:
>> task_rq_unlock(rq, p, &rf);
>> }
>
> No, we're not going to call out to exported, and potentially unbounded,
> functions under scheduler locks.

Agreed, that's another good point on why it is even more hairy, as I
generally alluded to in the cover letter.

Do you have any immediate thoughts on possible alternatives?

Like for instance if I did a queue_work from set_user_nice and then ran
a notifier chain async from a worker? I haven't yet looked at what
repercussions that would have in terms of having to cancel pending
workers when tasks exit. I can try to prototype that and see how it
would look.

There is of course the example of ioprio, which solves runtime
adjustments via a dedicated system call. But I don't currently feel that
a third one would be a good solution. At least I don't see a case for
being able to decouple the priority of CPU and GPU computations.

Have I opened a large can of worms? :)

Regards,

Tvrtko

2021-10-01 12:57:44

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC 1/6] sched: Add nice value change notifier


On 01/10/2021 10:04, Tvrtko Ursulin wrote:
>
> Hi Peter,
>
> On 30/09/2021 19:33, Peter Zijlstra wrote:
>> On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
>>>   void set_user_nice(struct task_struct *p, long nice)
>>>   {
>>>       bool queued, running;
>>> -    int old_prio;
>>> +    int old_prio, ret;
>>>       struct rq_flags rf;
>>>       struct rq *rq;
>>> @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p, long
>>> nice)
>>>        */
>>>       p->sched_class->prio_changed(rq, p, old_prio);
>>> +    ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice,
>>> p);
>>> +    WARN_ON_ONCE(ret != NOTIFY_DONE);
>>> +
>>>   out_unlock:
>>>       task_rq_unlock(rq, p, &rf);
>>>   }
>>
>> No, we're not going to call out to exported, and potentially unbounded,
>> functions under scheduler locks.
>
> Agreed, that's another good point on why it is even more hairy, as I
> generally alluded to in the cover letter.
>
> Do you have any immediate thoughts on possible alternatives?
>
> Like for instance if I did a queue_work from set_user_nice and then ran
> a notifier chain async from a worker? I haven't yet looked at what
> repercussions that would have in terms of having to cancel pending
> workers when tasks exit. I can try to prototype that and see how it
> would look.

Hm, or I simply move calling the notifier chain to after task_rq_unlock?
That would still leave it running under the tasklist lock, so probably
still quite bad.

Or another option - I stash the tasks aside on a private list (adding a
new list_head to task_struct), with an elevated task reference count, and
run the notifier chain outside any locked sections, at the end of the
setpriority syscall. This way only the syscall caller pays the cost of
any misbehaving notifiers in the chain. A further improvement could be
per-task notifiers, but that would grow task_struct more.
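Sketched out, that option might look like the fragment below. This is hypothetical and not compile-tested; it assumes a new task_struct member (user_nice_link here, an invented name) and uses the sys_setpriority() structure from the actual kernel only loosely:

```c
/*
 * Hypothetical sketch: sys_setpriority() collects affected tasks on a
 * local list while holding tasklist_lock, then runs the notifier chain
 * after all locks are dropped, so only this caller pays for slow
 * notifiers.
 */
LIST_HEAD(nice_list);
struct task_struct *p, *next;

rcu_read_lock();
read_lock(&tasklist_lock);
/* ...for each matching task p... */
get_task_struct(p);
list_add_tail(&p->user_nice_link, &nice_list);	/* invented member */
error = set_one_prio(p, niceval, error);
/* ... */
read_unlock(&tasklist_lock);
rcu_read_unlock();

/* Lock-free section: notify about all adjusted tasks. */
list_for_each_entry_safe(p, next, &nice_list, user_nice_link) {
	list_del(&p->user_nice_link);
	atomic_notifier_call_chain(&user_nice_notifier_list,
				   task_nice(p), p);
	put_task_struct(p);
}
```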

Regards,

Tvrtko

> There is of course the example of ioprio, which solves runtime
> adjustments via a dedicated system call. But I don't currently feel that
> a third one would be a good solution. At least I don't see a case for
> being able to decouple the priority of CPU and GPU computations.
>
> Have I opened a large can of worms? :)
>
> Regards,
>
> Tvrtko

2021-10-01 16:10:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 1/6] sched: Add nice value change notifier

On Fri, Oct 01, 2021 at 11:32:16AM +0100, Tvrtko Ursulin wrote:
>
> On 01/10/2021 10:04, Tvrtko Ursulin wrote:
> >
> > Hi Peter,
> >
> > On 30/09/2021 19:33, Peter Zijlstra wrote:
> > > On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
> > > >   void set_user_nice(struct task_struct *p, long nice)
> > > >   {
> > > >       bool queued, running;
> > > > -    int old_prio;
> > > > +    int old_prio, ret;
> > > >       struct rq_flags rf;
> > > >       struct rq *rq;
> > > > @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p,
> > > > long nice)
> > > >        */
> > > >       p->sched_class->prio_changed(rq, p, old_prio);
> > > > +    ret = atomic_notifier_call_chain(&user_nice_notifier_list,
> > > > nice, p);
> > > > +    WARN_ON_ONCE(ret != NOTIFY_DONE);
> > > > +
> > > >   out_unlock:
> > > >       task_rq_unlock(rq, p, &rf);
> > > >   }
> > >
> > > No, we're not going to call out to exported, and potentially unbounded,
> > > functions under scheduler locks.
> >
> > Agreed, that's another good point on why it is even more hairy, as I
> > generally alluded to in the cover letter.
> >
> > Do you have any immediate thoughts on possible alternatives?
> >
> > Like for instance if I did a queue_work from set_user_nice and then ran
> > a notifier chain async from a worker? I haven't yet looked at what
> > repercussions that would have in terms of having to cancel pending
> > workers when tasks exit. I can try to prototype that and see how it
> > would look.
>
> Hm, or I simply move calling the notifier chain to after task_rq_unlock?
> That would still leave it running under the tasklist lock, so probably
> still quite bad.

Hmm? That's for normalize_rt_tasks() only, right? Just don't have it
call the notifier in that special case (that's a magic sysrq thing
anyway).
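The special-casing suggested here could be done with an internal variant of set_user_nice() that skips the notifier. The sketch below is hypothetical; __set_user_nice() is an invented name and the body is elided:

```c
/*
 * Hypothetical sketch: split set_user_nice() into an internal helper
 * taking a 'notify' flag, so the magic-sysrq normalization path can
 * avoid the notifier chain entirely.
 */
static void __set_user_nice(struct task_struct *p, long nice, bool notify)
{
	/* ...existing set_user_nice() body... */

	if (notify)
		atomic_notifier_call_chain(&user_nice_notifier_list,
					   nice, p);
}

void set_user_nice(struct task_struct *p, long nice)
{
	__set_user_nice(p, nice, true);
}

/* normalize_rt_tasks() (sysrq-n) would then call the quiet variant: */
/*	__set_user_nice(p, 0, false);	*/
```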

2021-10-04 08:16:45

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [RFC 1/6] sched: Add nice value change notifier


On 01/10/2021 16:48, Peter Zijlstra wrote:
> On Fri, Oct 01, 2021 at 11:32:16AM +0100, Tvrtko Ursulin wrote:
>>
>> On 01/10/2021 10:04, Tvrtko Ursulin wrote:
>>>
>>> Hi Peter,
>>>
>>> On 30/09/2021 19:33, Peter Zijlstra wrote:
>>>> On Thu, Sep 30, 2021 at 06:15:47PM +0100, Tvrtko Ursulin wrote:
>>>>>   void set_user_nice(struct task_struct *p, long nice)
>>>>>   {
>>>>>       bool queued, running;
>>>>> -    int old_prio;
>>>>> +    int old_prio, ret;
>>>>>       struct rq_flags rf;
>>>>>       struct rq *rq;
>>>>> @@ -6913,6 +6945,9 @@ void set_user_nice(struct task_struct *p,
>>>>> long nice)
>>>>>        */
>>>>>       p->sched_class->prio_changed(rq, p, old_prio);
>>>>> +    ret = atomic_notifier_call_chain(&user_nice_notifier_list,
>>>>> nice, p);
>>>>> +    WARN_ON_ONCE(ret != NOTIFY_DONE);
>>>>> +
>>>>>   out_unlock:
>>>>>       task_rq_unlock(rq, p, &rf);
>>>>>   }
>>>>
>>>> No, we're not going to call out to exported, and potentially unbounded,
>>>> functions under scheduler locks.
>>>
>>> Agreed, that's another good point on why it is even more hairy, as I
>>> generally alluded to in the cover letter.
>>>
>>> Do you have any immediate thoughts on possible alternatives?
>>>
>>> Like for instance if I did a queue_work from set_user_nice and then ran
>>> a notifier chain async from a worker? I haven't yet looked at what
>>> repercussions that would have in terms of having to cancel pending
>>> workers when tasks exit. I can try to prototype that and see how it
>>> would look.
>>
>> Hm, or I simply move calling the notifier chain to after task_rq_unlock?
>> That would still leave it running under the tasklist lock, so probably
>> still quite bad.
>
> Hmm? That's for normalize_rt_tasks() only, right? Just don't have it
> call the notifier in that special case (that's a magic sysrq thing
> anyway).

You mean my point about tasklist_lock? No, I am interested in the
syscall path as well. The call chain looks like this:

sys_setpriority()
{
...
rcu_read_lock();
read_lock(&tasklist_lock);
...
set_one_prio()
set_user_nice()
{
...
task_rq_lock();
-> my notifier from this RFC [1]
task_rq_unlock();
-> I can move the notifier here for _some_ improvement [2]
}
...
read_unlock(&tasklist_lock);
rcu_read_unlock();
}

So this RFC had the notifier call chain at [1], which I understood was
the thing you initially pointed out was horrible, being under a scheduler
lock.

I can trivially move it to [2] but that still leaves it under the
tasklist lock. I don't have a good feel for how much better that would
be. If not good enough then I will look for a smarter solution with less
opportunity for global impact.

Regards,

Tvrtko

2021-10-04 08:43:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 1/6] sched: Add nice value change notifier

On Mon, Oct 04, 2021 at 09:12:37AM +0100, Tvrtko Ursulin wrote:
> On 01/10/2021 16:48, Peter Zijlstra wrote:

> > Hmm? That's for normalize_rt_tasks() only, right? Just don't have it
> > call the notifier in that special case (that's a magic sysrq thing
> > anyway).
>
> You mean my point about tasklist_lock? No, I am interested in the
> syscall path as well. The call chain looks like this:

Urgh, I always miss that because it lives outside of sched.. :/

> sys_setpriority()
> {
> ...
> rcu_read_lock();
> read_lock(&tasklist_lock);
> ...
> set_one_prio()
> set_user_nice()
> {
> ...
> task_rq_lock();
> -> my notifier from this RFC [1]
> task_rq_unlock();
> -> I can move the notifier here for _some_ improvement [2]
> }
> ...
> read_unlock(&tasklist_lock);
> rcu_read_unlock();
> }
>
> So this RFC had the notifier call chain at [1], which I understood was the
> thing you initially pointed out was horrible, being under a scheduler lock.
>
> I can trivially move it to [2] but that still leaves it under the tasklist
> lock. I don't have a good feel for how much better that would be. If not
> good enough then I will look for a smarter solution with less opportunity
> for global impact.

So tasklist_lock is pretty terrible and effectively unbounded already
(just create more tasks etc.), so adding a notifier call there shouldn't
really make it much worse.