From: Tvrtko Ursulin <[email protected]>
This is a somewhat early sketch of one of my ideas intended for early feedback
from the core scheduler experts. First and last two patches in the series are
the most interesting ones for people outside of i915. (Note I did not copy
everyone on all patches but just the cover letter for context and the rest
should be available from the mailing list.)
General idea is that current processing landscape seems to be more and more
composed of pipelines where computations are done on multiple hardware devices.
Furthermore some of the non-CPU devices, like in this case many GPUs supported
by the i915 driver, actually support priority based scheduling which is
currently rather inaccesible to the user (in terms of being able to control it
from the outside).
From these two statements a question arises on how to allow for a simple,
effective and consolidated user experience. In other words why user would not be
able to do something like:
$ nice ffmmpeg ...transcode my videos...
$ my-favourite-game
And have the nice hint apply to GPU parts of the transcode pipeline as well?
Another reason why I started thinking about this is that I noticed Chrome
browser for instance uses nice to de-prioritise background tabs. So again,
having that decision propagate to the GPU rendering pipeline sounds like a big
plus to the overall user experience.
This RFC implements this idea with the hairy part being the notifier chain I
added to enable dynamic adjustments. It is a global notifier which raises a few
questions so I am very curious what experts will think here. Please see the
opens in the first patch for more on this.
Last patch ("drm/i915: Connect with the process nice change notifier")
demonstrates how i915 can use the notifier. With a bit of notable tracking being
required and addedd in "drm/i915: Keep track of registered clients indexed by
task struct".
On a more positive note the thing seems to even work as is. For instance I
roughly simulated the above scenario by running a GPU hog at three nice levels
and a GfxBench TRex in parallel (as a game proxy). This is what I got:
GPU hog nice | TRex fps
--------------+---------------
not running | 48.9
0 | 42.7
10 | 47.9
-10 | 29.0
When re-niced the background GPU hog has a much smaller effect on the
performance of the game user is running in the foreground. So it appears the
feature can indeed improve the user experience. Question is just if people are
happy with this method of implementing it.
v2:
* Moved notifier outside task_rq_lock.
* Some improvements and restructuring on the i915 side of the series.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Tvrtko Ursulin (8):
sched: Add nice value change notifier
drm/i915: Explicitly track DRM clients
drm/i915: Make GEM contexts track DRM clients
drm/i915: Track all user contexts per client
drm/i915: Keep track of registered clients indexed by task struct
drm/i915: Make some recently added vfuncs use full scheduling
attribute
drm/i915: Inherit process nice for context scheduling priority
drm/i915: Connect with the process nice change notifier
drivers/gpu/drm/i915/Makefile | 5 +-
drivers/gpu/drm/i915/gem/i915_gem_context.c | 20 +++
.../gpu/drm/i915/gem/i915_gem_context_types.h | 6 +
.../drm/i915/gt/intel_execlists_submission.c | 6 +-
.../gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +-
drivers/gpu/drm/i915/i915_drm_client.c | 130 ++++++++++++++++++
drivers/gpu/drm/i915/i915_drm_client.h | 71 ++++++++++
drivers/gpu/drm/i915/i915_drv.c | 6 +
drivers/gpu/drm/i915/i915_drv.h | 5 +
drivers/gpu/drm/i915/i915_gem.c | 21 ++-
drivers/gpu/drm/i915/i915_request.c | 2 +-
drivers/gpu/drm/i915/i915_request.h | 5 +
drivers/gpu/drm/i915/i915_scheduler.c | 16 ++-
drivers/gpu/drm/i915/i915_scheduler.h | 14 ++
drivers/gpu/drm/i915/i915_scheduler_types.h | 12 +-
include/linux/sched.h | 5 +
kernel/sched/core.c | 37 ++++-
17 files changed, 346 insertions(+), 18 deletions(-)
create mode 100644 drivers/gpu/drm/i915/i915_drm_client.c
create mode 100644 drivers/gpu/drm/i915/i915_drm_client.h
--
2.30.2
From: Tvrtko Ursulin <[email protected]>
Implement a simple notifier chain via which interested parties can track
when process nice value changes. Simple because it is global so each user
would have to track which tasks it is interested in.
First intended use case are GPU drivers using task nice as priority hint
when scheduling GPU contexts belonging to respective clients.
To use register_user_nice_notifier and unregister_user_nice_notifier
functions are provided and new nice value and pointer to task_struct
being modified passed to the callbacks.
v2:
* Move the notifier chain outside task_rq_lock. (Peter)
Opens:
* Security. Would some sort of a per process mechanism be better and
feasible?
x Peter Zijlstra thinks it may be passable now that it is outside
core scheduler locks.
* Put it all behind kconfig to be selected by interested drivers?
Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 5 +++++
kernel/sched/core.c | 37 ++++++++++++++++++++++++++++++++++++-
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c1a927ddec64..1fcec88e5dbc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2309,4 +2309,9 @@ static inline void sched_core_free(struct task_struct *tsk) { }
static inline void sched_core_fork(struct task_struct *p) { }
#endif
+struct notifier_block;
+
+extern int register_user_nice_notifier(struct notifier_block *);
+extern int unregister_user_nice_notifier(struct notifier_block *);
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bba4128a3e6..fc90b603bb6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6864,10 +6864,42 @@ static inline int rt_effective_prio(struct task_struct *p, int prio)
}
#endif
+ATOMIC_NOTIFIER_HEAD(user_nice_notifier_list);
+
+/**
+ * register_user_nice_notifier - Register function to be called when task nice changes
+ * @nb: Info about notifier function to be called
+ *
+ * Registers a function with the list of functions to be called when task nice
+ * value changes.
+ *
+ * Currently always returns zero, as atomic_notifier_chain_register()
+ * always returns zero.
+ */
+int register_user_nice_notifier(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_register(&user_nice_notifier_list, nb);
+}
+EXPORT_SYMBOL(register_user_nice_notifier);
+
+/**
+ * unregister_user_nice_notifier - Unregister previously registered user nice notifier
+ * @nb: Hook to be unregistered
+ *
+ * Unregisters a previously registered user nice notifier function.
+ *
+ * Returns zero on success, or %-ENOENT on failure.
+ */
+int unregister_user_nice_notifier(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_unregister(&user_nice_notifier_list, nb);
+}
+EXPORT_SYMBOL(unregister_user_nice_notifier);
+
void set_user_nice(struct task_struct *p, long nice)
{
bool queued, running;
- int old_prio;
+ int old_prio, ret;
struct rq_flags rf;
struct rq *rq;
@@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, long nice)
out_unlock:
task_rq_unlock(rq, p, &rf);
+
+ ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
+ WARN_ON_ONCE(ret != NOTIFY_DONE);
}
EXPORT_SYMBOL(set_user_nice);
--
2.30.2
From: Tvrtko Ursulin <[email protected]>
A simple hash table of registered clients indexed by the task struct
pointer is kept to be used in a following patch.
Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 ++
drivers/gpu/drm/i915/i915_drm_client.c | 31 ++++++++++++++++++++-
drivers/gpu/drm/i915/i915_drm_client.h | 13 +++++++++
3 files changed, 45 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index d1992ba59ed8..8d4d687ab1d0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -1932,6 +1932,8 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
return -EIO;
}
+ i915_drm_client_update_owner(ext_data.fpriv->client, current);
+
ext_data.pc = proto_context_create(i915, args->flags);
if (IS_ERR(ext_data.pc))
return PTR_ERR(ext_data.pc);
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 91a8559bebf7..82b9636482ef 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -18,6 +18,9 @@ void i915_drm_clients_init(struct i915_drm_clients *clients,
clients->next_id = 0;
xa_init_flags(&clients->xarray, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
+
+ rwlock_init(&clients->lock);
+ hash_init(clients->tasks);
}
struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients)
@@ -42,6 +45,8 @@ struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients)
INIT_LIST_HEAD(&client->ctx_list);
client->clients = clients;
+ i915_drm_client_update_owner(client, current);
+
return client;
err:
@@ -54,9 +59,14 @@ void __i915_drm_client_free(struct kref *kref)
{
struct i915_drm_client *client =
container_of(kref, typeof(*client), kref);
- struct xarray *xa = &client->clients->xarray;
+ struct i915_drm_clients *clients = client->clients;
+ struct xarray *xa = &clients->xarray;
unsigned long flags;
+ write_lock(&clients->lock);
+ hash_del(&client->node);
+ write_unlock(&clients->lock);
+
xa_lock_irqsave(xa, flags);
__xa_erase(xa, client->id);
xa_unlock_irqrestore(xa, flags);
@@ -68,3 +78,22 @@ void i915_drm_clients_fini(struct i915_drm_clients *clients)
GEM_BUG_ON(!xa_empty(&clients->xarray));
xa_destroy(&clients->xarray);
}
+
+void i915_drm_client_update_owner(struct i915_drm_client *client,
+ struct task_struct *owner)
+{
+ struct i915_drm_clients *clients;
+
+ if (READ_ONCE(client->owner) == owner)
+ return;
+
+ clients = client->clients;
+ write_lock(&clients->lock);
+ if (READ_ONCE(client->owner) != owner) {
+ if (client->owner)
+ hash_del(&client->node);
+ client->owner = owner;
+ hash_add(clients->tasks, &client->node, (uintptr_t)owner);
+ }
+ write_unlock(&clients->lock);
+}
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 0207dfad4568..42fd79f0558a 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -6,8 +6,11 @@
#ifndef __I915_DRM_CLIENT_H__
#define __I915_DRM_CLIENT_H__
+#include <linux/hashtable.h>
#include <linux/kref.h>
#include <linux/list.h>
+#include <linux/rwlock.h>
+#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/xarray.h>
@@ -18,6 +21,9 @@ struct i915_drm_clients {
struct xarray xarray;
u32 next_id;
+
+ rwlock_t lock;
+ DECLARE_HASHTABLE(tasks, 6);
};
struct i915_drm_client {
@@ -28,6 +34,9 @@ struct i915_drm_client {
spinlock_t ctx_lock; /* For add/remove from ctx_list. */
struct list_head ctx_list; /* List of contexts belonging to client. */
+ struct task_struct *owner; /* No reference kept, never dereferenced. */
+ struct hlist_node node;
+
struct i915_drm_clients *clients;
};
@@ -52,4 +61,8 @@ struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients);
void i915_drm_clients_fini(struct i915_drm_clients *clients);
+void i915_drm_client_update_owner(struct i915_drm_client *client,
+ struct task_struct *owner);
+
+
#endif /* !__I915_DRM_CLIENT_H__ */
--
2.30.2
From: Tvrtko Ursulin <[email protected]>
Connect i915 with the process nice change notifier so that our scheduling
can react to runtime adjustments, on top of previously added nice value
inheritance at context create time.
To achieve this we use the previously added map of clients per owning
tasks in combination with the list of GEM contexts per client.
To avoid possibly unnecessary complications the updated context nice value
will only apply to future submissions against the context.
Signed-off-by: Tvrtko Ursulin <[email protected]>
---
drivers/gpu/drm/i915/i915_drm_client.c | 31 ++++++++++++++++++++++++++
drivers/gpu/drm/i915/i915_drm_client.h | 3 +++
2 files changed, 34 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index 82b9636482ef..e34c1228f65b 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -7,10 +7,35 @@
#include <linux/slab.h>
#include <linux/types.h>
+#include "gem/i915_gem_context.h"
#include "i915_drm_client.h"
#include "i915_gem.h"
#include "i915_utils.h"
+static int
+clients_notify(struct notifier_block *nb, unsigned long val, void *ptr)
+{
+ struct i915_drm_clients *clients =
+ container_of(nb, typeof(*clients), prio_notifier);
+ struct i915_drm_client *client;
+
+ rcu_read_lock();
+ read_lock(&clients->lock);
+ hash_for_each_possible(clients->tasks, client, node, (uintptr_t)ptr) {
+ struct i915_gem_context *ctx;
+
+ if (client->owner != ptr)
+ continue;
+
+ list_for_each_entry_rcu(ctx, &client->ctx_list, client_link)
+ ctx->sched.nice = (int)val;
+ }
+ read_unlock(&clients->lock);
+ rcu_read_unlock();
+
+ return NOTIFY_DONE;
+}
+
void i915_drm_clients_init(struct i915_drm_clients *clients,
struct drm_i915_private *i915)
{
@@ -21,6 +46,10 @@ void i915_drm_clients_init(struct i915_drm_clients *clients,
rwlock_init(&clients->lock);
hash_init(clients->tasks);
+
+ memset(&clients->prio_notifier, 0, sizeof(clients->prio_notifier));
+ clients->prio_notifier.notifier_call = clients_notify;
+ register_user_nice_notifier(&clients->prio_notifier);
}
struct i915_drm_client *i915_drm_client_add(struct i915_drm_clients *clients)
@@ -75,6 +104,8 @@ void __i915_drm_client_free(struct kref *kref)
void i915_drm_clients_fini(struct i915_drm_clients *clients)
{
+ unregister_user_nice_notifier(&clients->prio_notifier);
+
GEM_BUG_ON(!xa_empty(&clients->xarray));
xa_destroy(&clients->xarray);
}
diff --git a/drivers/gpu/drm/i915/i915_drm_client.h b/drivers/gpu/drm/i915/i915_drm_client.h
index 42fd79f0558a..dda26aa42ac9 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.h
+++ b/drivers/gpu/drm/i915/i915_drm_client.h
@@ -9,6 +9,7 @@
#include <linux/hashtable.h>
#include <linux/kref.h>
#include <linux/list.h>
+#include <linux/notifier.h>
#include <linux/rwlock.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
@@ -24,6 +25,8 @@ struct i915_drm_clients {
rwlock_t lock;
DECLARE_HASHTABLE(tasks, 6);
+
+ struct notifier_block prio_notifier;
};
struct i915_drm_client {
--
2.30.2
HI Tvrtko
On 2021/10/4 22:36, Tvrtko Ursulin wrote:
> void set_user_nice(struct task_struct *p, long nice)
> {
> bool queued, running;
> - int old_prio;
> + int old_prio, ret;
> struct rq_flags rf;
> struct rq *rq;
>
> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, long nice)
>
> out_unlock:
> task_rq_unlock(rq, p, &rf);
> +
> + ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
> + WARN_ON_ONCE(ret != NOTIFY_DONE);
> }
How about adding a new "io_nice" to task_struct,and move the call chain to
sched_setattr/getattr, there are two benefits:
1. Decoupled with fair scheduelr. In our use case, high priority tasks often
use rt scheduler.
2. The range of value don't need to be bound to -20~19 or 0~139
On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John) <[email protected]> wrote:
>
> HI Tvrtko
>
> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
> > void set_user_nice(struct task_struct *p, long nice)
> > {
> > bool queued, running;
> > - int old_prio;
> > + int old_prio, ret;
> > struct rq_flags rf;
> > struct rq *rq;
> >
> > @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, long nice)
> >
> > out_unlock:
> > task_rq_unlock(rq, p, &rf);
> > +
> > + ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
> > + WARN_ON_ONCE(ret != NOTIFY_DONE);
> > }
> How about adding a new "io_nice" to task_struct,and move the call chain to
> sched_setattr/getattr, there are two benefits:
We already have an ionice for block io scheduler. hardly can this new io_nice
be generic to all I/O. it seems the patchset is trying to link
process' nice with
GPU's scheduler, to some extent, it makes more senses than having a
common ionice because we have a lot of IO devices in the systems, we don't
know which I/O the ionice of task_struct should be applied to.
Maybe we could have an ionice dedicated for GPU just like ionice for CFQ
of bio/request scheduler.
>
> 1. Decoupled with fair scheduelr. In our use case, high priority tasks often
> use rt scheduler.
Is it possible to tell GPU RT as we are telling them CFS nice?
> 2. The range of value don't need to be bound to -20~19 or 0~139
>
could build a mapping between the priorities of process and GPU. It seems
not a big deal.
Thanks
barry
Hi,
On 06/10/2021 08:58, Barry Song wrote:
> On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John) <[email protected]> wrote:
>>
>> HI Tvrtko
>>
>> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
>>> void set_user_nice(struct task_struct *p, long nice)
>>> {
>>> bool queued, running;
>>> - int old_prio;
>>> + int old_prio, ret;
>>> struct rq_flags rf;
>>> struct rq *rq;
>>>
>>> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, long nice)
>>>
>>> out_unlock:
>>> task_rq_unlock(rq, p, &rf);
>>> +
>>> + ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
>>> + WARN_ON_ONCE(ret != NOTIFY_DONE);
>>> }
>> How about adding a new "io_nice" to task_struct,and move the call chain to
>> sched_setattr/getattr, there are two benefits:
>
> We already have an ionice for block io scheduler. hardly can this new io_nice
> be generic to all I/O. it seems the patchset is trying to link
> process' nice with
> GPU's scheduler, to some extent, it makes more senses than having a
> common ionice because we have a lot of IO devices in the systems, we don't
> know which I/O the ionice of task_struct should be applied to.
>
> Maybe we could have an ionice dedicated for GPU just like ionice for CFQ
> of bio/request scheduler.
Thought crossed my mind but I couldn't see the practicality of a 3rd
nice concept. I mean even to start with I struggle a bit with the
usefulness of existing ionice vs nice. Like coming up with practical
examples of usecases where it makes sense to decouple the two priorities.
From a different angle I did think inheriting CPU nice makes sense for
GPU workloads. This is because today, and more so in the future,
computations on a same data set do flow from one to the other.
Like maybe a simple example of batch image processing where CPU decodes,
GPU does a transform and then CPU encodes. Or a different mix, doesn't
really matter, since the main point it is one computing pipeline from
users point of view.
In this example perhaps everything could be handled in userspace so
that's another argument to be had. Userspace could query the current
scheduling attributes before submitting work to the processing pipeline
and adjust using respective uapi.
Downside would be inability to react to changes after the work is
already running which may not be too serious limitation outside the
world of multi-minute compute workloads. And latter are probably special
case enough that would be configured explicitly.
>>
>> 1. Decoupled with fair scheduelr. In our use case, high priority tasks often
>> use rt scheduler.
>
> Is it possible to tell GPU RT as we are telling them CFS nice?
Yes of course. We could create a common notification "data packet" which
would be sent from both entry points and provide more data than just the
nice value. Consumers (of the notifier chain) could then decide for
themselves what they want to do with the data.
Regards,
Tvrtko
>
>> 2. The range of value don't need to be bound to -20~19 or 0~139
>>
>
> could build a mapping between the priorities of process and GPU. It seems
> not a big deal.
>
> Thanks
> barry
>
On Thu, Oct 7, 2021 at 2:44 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> Hi,
>
> On 06/10/2021 08:58, Barry Song wrote:
> > On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John) <[email protected]> wrote:
> >>
> >> HI Tvrtko
> >>
> >> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
> >>> void set_user_nice(struct task_struct *p, long nice)
> >>> {
> >>> bool queued, running;
> >>> - int old_prio;
> >>> + int old_prio, ret;
> >>> struct rq_flags rf;
> >>> struct rq *rq;
> >>>
> >>> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, long nice)
> >>>
> >>> out_unlock:
> >>> task_rq_unlock(rq, p, &rf);
> >>> +
> >>> + ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
> >>> + WARN_ON_ONCE(ret != NOTIFY_DONE);
> >>> }
> >> How about adding a new "io_nice" to task_struct,and move the call chain to
> >> sched_setattr/getattr, there are two benefits:
> >
> > We already have an ionice for block io scheduler. hardly can this new io_nice
> > be generic to all I/O. it seems the patchset is trying to link
> > process' nice with
> > GPU's scheduler, to some extent, it makes more senses than having a
> > common ionice because we have a lot of IO devices in the systems, we don't
> > know which I/O the ionice of task_struct should be applied to.
> >
> > Maybe we could have an ionice dedicated for GPU just like ionice for CFQ
> > of bio/request scheduler.
>
> Thought crossed my mind but I couldn't see the practicality of a 3rd
> nice concept. I mean even to start with I struggle a bit with the
> usefulness of existing ionice vs nice. Like coming up with practical
> examples of usecases where it makes sense to decouple the two priorities.
>
> From a different angle I did think inheriting CPU nice makes sense for
> GPU workloads. This is because today, and more so in the future,
> computations on a same data set do flow from one to the other.
>
> Like maybe a simple example of batch image processing where CPU decodes,
> GPU does a transform and then CPU encodes. Or a different mix, doesn't
> really matter, since the main point it is one computing pipeline from
> users point of view.
>
I am on it. but I am also seeing two problems here:
1. nice is not global in linux. For example, if you have two cgroups, cgroup A
has more quota then cgroup B. Tasks in B won't win even if it has a lower nice.
cgroups will run proportional-weight time-based division of CPU.
2. Historically, we had dynamic nice which was adjusted based on the average
sleep/running time; right now, we don't have dynamic nice, but virtual time
still make tasks which sleep more preempt other tasks with the same nice
or even lower nice.
virtual time += physical time/weight by nice
so, static nice number doesn't always make sense to decide preemption.
So it seems your patch only works under some simple situation for example
no cgroups, tasks have similar sleep/running time.
> In this example perhaps everything could be handled in userspace so
> that's another argument to be had. Userspace could query the current
> scheduling attributes before submitting work to the processing pipeline
> and adjust using respective uapi.
>
> Downside would be inability to react to changes after the work is
> already running which may not be too serious limitation outside the
> world of multi-minute compute workloads. And latter are probably special
> case enough that would be configured explicitly.
>
> >>
> >> 1. Decoupled with fair scheduelr. In our use case, high priority tasks often
> >> use rt scheduler.
> >
> > Is it possible to tell GPU RT as we are telling them CFS nice?
>
> Yes of course. We could create a common notification "data packet" which
> would be sent from both entry points and provide more data than just the
> nice value. Consumers (of the notifier chain) could then decide for
> themselves what they want to do with the data.
RT should have the same problem with CFS once we have cgroups.
>
> Regards,
>
> Tvrtko
>
> >
> >> 2. The range of value don't need to be bound to -20~19 or 0~139
> >>
> >
> > could build a mapping between the priorities of process and GPU. It seems
> > not a big deal.
> >
> > Thanks
> > barry
> >
Thanks
barry
On 06/10/2021 21:21, Barry Song wrote:
> On Thu, Oct 7, 2021 at 2:44 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> Hi,
>>
>> On 06/10/2021 08:58, Barry Song wrote:
>>> On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John) <[email protected]> wrote:
>>>>
>>>> HI Tvrtko
>>>>
>>>> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
>>>>> void set_user_nice(struct task_struct *p, long nice)
>>>>> {
>>>>> bool queued, running;
>>>>> - int old_prio;
>>>>> + int old_prio, ret;
>>>>> struct rq_flags rf;
>>>>> struct rq *rq;
>>>>>
>>>>> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, long nice)
>>>>>
>>>>> out_unlock:
>>>>> task_rq_unlock(rq, p, &rf);
>>>>> +
>>>>> + ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
>>>>> + WARN_ON_ONCE(ret != NOTIFY_DONE);
>>>>> }
>>>> How about adding a new "io_nice" to task_struct,and move the call chain to
>>>> sched_setattr/getattr, there are two benefits:
>>>
>>> We already have an ionice for block io scheduler. hardly can this new io_nice
>>> be generic to all I/O. it seems the patchset is trying to link
>>> process' nice with
>>> GPU's scheduler, to some extent, it makes more senses than having a
>>> common ionice because we have a lot of IO devices in the systems, we don't
>>> know which I/O the ionice of task_struct should be applied to.
>>>
>>> Maybe we could have an ionice dedicated for GPU just like ionice for CFQ
>>> of bio/request scheduler.
>>
>> Thought crossed my mind but I couldn't see the practicality of a 3rd
>> nice concept. I mean even to start with I struggle a bit with the
>> usefulness of existing ionice vs nice. Like coming up with practical
>> examples of usecases where it makes sense to decouple the two priorities.
>>
>> From a different angle I did think inheriting CPU nice makes sense for
>> GPU workloads. This is because today, and more so in the future,
>> computations on a same data set do flow from one to the other.
>>
>> Like maybe a simple example of batch image processing where CPU decodes,
>> GPU does a transform and then CPU encodes. Or a different mix, doesn't
>> really matter, since the main point it is one computing pipeline from
>> users point of view.
>>
>
> I am on it. but I am also seeing two problems here:
> 1. nice is not global in linux. For example, if you have two cgroups, cgroup A
> has more quota then cgroup B. Tasks in B won't win even if it has a lower nice.
> cgroups will run proportional-weight time-based division of CPU.
>
> 2. Historically, we had dynamic nice which was adjusted based on the average
> sleep/running time; right now, we don't have dynamic nice, but virtual time
> still make tasks which sleep more preempt other tasks with the same nice
> or even lower nice.
> virtual time += physical time/weight by nice
> so, static nice number doesn't always make sense to decide preemption.
>
> So it seems your patch only works under some simple situation for example
> no cgroups, tasks have similar sleep/running time.
Yes, I broadly agree with your assessment. Although there are plans for
adding cgroup support to i915 scheduling, I doubt as fine grained
control and exact semantics as there are on the CPU side will happen.
Mostly because the drive seems to be for more micro-controller managed
scheduling which adds further challenges in connecting the two sides
together.
But when you say it is a problem, I would characterize it more a
weakness in terms of being only a subset of possible control. It is
still richer (better?) than what currently exists and as demonstrated
with benchmarks in my cover letter it can deliver improvements in user
experience. If in the mid term future we can extend it with cgroup
support then the concept should still apply and get closer to how you
described nice works in the CPU world.
Main question in my mind is whether the idea of adding the
sched_attr/priority notifier to the kernel can be justified. Because as
mentioned before, everything apart from adjusting currently running GPU
jobs could be done purely in userspace. Stack changes would be quite
extensive and all, but that is not usually a good enough reason to put
something in the kernel. That's why it is an RFC an invitation to discuss.
Even ionice inherits from nice (see task_nice_ioprio()) so I think
argument can be made for drivers as well.
Regards,
Tvrtko
>> In this example perhaps everything could be handled in userspace so
>> that's another argument to be had. Userspace could query the current
>> scheduling attributes before submitting work to the processing pipeline
>> and adjust using respective uapi.
>>
>> Downside would be inability to react to changes after the work is
>> already running which may not be too serious limitation outside the
>> world of multi-minute compute workloads. And latter are probably special
>> case enough that would be configured explicitly.
>>
>>>>
>>>> 1. Decoupled with fair scheduelr. In our use case, high priority tasks often
>>>> use rt scheduler.
>>>
>>> Is it possible to tell GPU RT as we are telling them CFS nice?
>>
>> Yes of course. We could create a common notification "data packet" which
>> would be sent from both entry points and provide more data than just the
>> nice value. Consumers (of the notifier chain) could then decide for
>> themselves what they want to do with the data.
>
> RT should have the same problem with CFS once we have cgroups.
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>>> 2. The range of value don't need to be bound to -20~19 or 0~139
>>>>
>>>
>>> could build a mapping between the priorities of process and GPU. It seems
>>> not a big deal.
>>>
>>> Thanks
>>> barry
>>>
>
> Thanks
> barry
>
On 07/10/2021 09:50, Tvrtko Ursulin wrote:
>
> On 06/10/2021 21:21, Barry Song wrote:
>> On Thu, Oct 7, 2021 at 2:44 AM Tvrtko Ursulin
>> <[email protected]> wrote:
>>>
>>>
>>> Hi,
>>>
>>> On 06/10/2021 08:58, Barry Song wrote:
>>>> On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John)
>>>> <[email protected]> wrote:
>>>>>
>>>>> HI Tvrtko
>>>>>
>>>>> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
>>>>>> void set_user_nice(struct task_struct *p, long nice)
>>>>>> {
>>>>>> bool queued, running;
>>>>>> - int old_prio;
>>>>>> + int old_prio, ret;
>>>>>> struct rq_flags rf;
>>>>>> struct rq *rq;
>>>>>>
>>>>>> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p,
>>>>>> long nice)
>>>>>>
>>>>>> out_unlock:
>>>>>> task_rq_unlock(rq, p, &rf);
>>>>>> +
>>>>>> + ret = atomic_notifier_call_chain(&user_nice_notifier_list,
>>>>>> nice, p);
>>>>>> + WARN_ON_ONCE(ret != NOTIFY_DONE);
>>>>>> }
>>>>> How about adding a new "io_nice" to task_struct,and move the call
>>>>> chain to
>>>>> sched_setattr/getattr, there are two benefits:
>>>>
>>>> We already have an ionice for block io scheduler. hardly can this
>>>> new io_nice
>>>> be generic to all I/O. it seems the patchset is trying to link
>>>> process' nice with
>>>> GPU's scheduler, to some extent, it makes more senses than having a
>>>> common ionice because we have a lot of IO devices in the systems, we
>>>> don't
>>>> know which I/O the ionice of task_struct should be applied to.
>>>>
>>>> Maybe we could have an ionice dedicated for GPU just like ionice for
>>>> CFQ
>>>> of bio/request scheduler.
>>>
>>> Thought crossed my mind but I couldn't see the practicality of a 3rd
>>> nice concept. I mean even to start with I struggle a bit with the
>>> usefulness of existing ionice vs nice. Like coming up with practical
>>> examples of usecases where it makes sense to decouple the two
>>> priorities.
>>>
>>> From a different angle I did think inheriting CPU nice makes sense for
>>> GPU workloads. This is because today, and more so in the future,
>>> computations on a same data set do flow from one to the other.
>>>
>>> Like maybe a simple example of batch image processing where CPU decodes,
>>> GPU does a transform and then CPU encodes. Or a different mix, doesn't
>>> really matter, since the main point it is one computing pipeline from
>>> users point of view.
>>>
>>
>> I am on it. but I am also seeing two problems here:
>> 1. nice is not global in linux. For example, if you have two cgroups,
>> cgroup A
>> has more quota then cgroup B. Tasks in B won't win even if it has a
>> lower nice.
>> cgroups will run proportional-weight time-based division of CPU.
>>
>> 2. Historically, we had dynamic nice which was adjusted based on the
>> average
>> sleep/running time; right now, we don't have dynamic nice, but virtual
>> time
>> still make tasks which sleep more preempt other tasks with the same nice
>> or even lower nice.
>> virtual time += physical time/weight by nice
>> so, static nice number doesn't always make sense to decide preemption.
>>
>> So it seems your patch only works under some simple situation for example
>> no cgroups, tasks have similar sleep/running time.
>
> Yes, I broadly agree with your assessment. Although there are plans for
> adding cgroup support to i915 scheduling, I doubt as fine grained
> control and exact semantics as there are on the CPU side will happen.
>
> Mostly because the drive seems to be for more micro-controller managed
> scheduling which adds further challenges in connecting the two sides
> together.
>
> But when you say it is a problem, I would characterize it more a
> weakness in terms of being only a subset of possible control. It is
> still richer (better?) than what currently exists and as demonstrated
> with benchmarks in my cover letter it can deliver improvements in user
> experience. If in the mid term future we can extend it with cgroup
> support then the concept should still apply and get closer to how you
> described nice works in the CPU world.
>
> Main question in my mind is whether the idea of adding the
> sched_attr/priority notifier to the kernel can be justified. Because as
> mentioned before, everything apart from adjusting currently running GPU
> jobs could be done purely in userspace. Stack changes would be quite
> extensive and all, but that is not usually a good enough reason to put
> something in the kernel. That's why it is an RFC an invitation to discuss.
>
> Even ionice inherits from nice (see task_nice_ioprio()) so I think
> argument can be made for drivers as well.
Now that I wrote this, I had a little bit of a light bulb moment. If I
abandon the idea of adjusting the priority of already submitted work
items, then I can do much of what I want purely from within the confines
of i915.
I simply add code to inherit from current task nice on every new work
item submission. This should probably bring the majority of the benefit
I measured.
Regards,
Tvrtko
On Thu, Oct 7, 2021 at 10:09 PM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 07/10/2021 09:50, Tvrtko Ursulin wrote:
> >
> > On 06/10/2021 21:21, Barry Song wrote:
> >> On Thu, Oct 7, 2021 at 2:44 AM Tvrtko Ursulin
> >> <[email protected]> wrote:
> >>>
> >>>
> >>> Hi,
> >>>
> >>> On 06/10/2021 08:58, Barry Song wrote:
> >>>> On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John)
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> HI Tvrtko
> >>>>>
> >>>>> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
> >>>>>> void set_user_nice(struct task_struct *p, long nice)
> >>>>>> {
> >>>>>> bool queued, running;
> >>>>>> - int old_prio;
> >>>>>> + int old_prio, ret;
> >>>>>> struct rq_flags rf;
> >>>>>> struct rq *rq;
> >>>>>>
> >>>>>> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p,
> >>>>>> long nice)
> >>>>>>
> >>>>>> out_unlock:
> >>>>>> task_rq_unlock(rq, p, &rf);
> >>>>>> +
> >>>>>> + ret = atomic_notifier_call_chain(&user_nice_notifier_list,
> >>>>>> nice, p);
> >>>>>> + WARN_ON_ONCE(ret != NOTIFY_DONE);
> >>>>>> }
> >>>>> How about adding a new "io_nice" to task_struct,and move the call
> >>>>> chain to
> >>>>> sched_setattr/getattr, there are two benefits:
> >>>>
> >>>> We already have an ionice for block io scheduler. hardly can this
> >>>> new io_nice
> >>>> be generic to all I/O. it seems the patchset is trying to link
> >>>> process' nice with
> >>>> GPU's scheduler, to some extent, it makes more senses than having a
> >>>> common ionice because we have a lot of IO devices in the systems, we
> >>>> don't
> >>>> know which I/O the ionice of task_struct should be applied to.
> >>>>
> >>>> Maybe we could have an ionice dedicated for GPU just like ionice for
> >>>> CFQ
> >>>> of bio/request scheduler.
> >>>
> >>> Thought crossed my mind but I couldn't see the practicality of a 3rd
> >>> nice concept. I mean even to start with I struggle a bit with the
> >>> usefulness of existing ionice vs nice. Like coming up with practical
> >>> examples of usecases where it makes sense to decouple the two
> >>> priorities.
> >>>
> >>> From a different angle I did think inheriting CPU nice makes sense for
> >>> GPU workloads. This is because today, and more so in the future,
> >>> computations on a same data set do flow from one to the other.
> >>>
> >>> Like maybe a simple example of batch image processing where CPU decodes,
> >>> GPU does a transform and then CPU encodes. Or a different mix, doesn't
> >>> really matter, since the main point it is one computing pipeline from
> >>> users point of view.
> >>>
> >>
> >> I am on it. but I am also seeing two problems here:
> >> 1. nice is not global in linux. For example, if you have two cgroups,
> >> cgroup A
> >> has more quota then cgroup B. Tasks in B won't win even if it has a
> >> lower nice.
> >> cgroups will run proportional-weight time-based division of CPU.
> >>
> >> 2. Historically, we had dynamic nice which was adjusted based on the
> >> average
> >> sleep/running time; right now, we don't have dynamic nice, but virtual
> >> time
> >> still make tasks which sleep more preempt other tasks with the same nice
> >> or even lower nice.
> >> virtual time += physical time/weight by nice
> >> so, static nice number doesn't always make sense to decide preemption.
> >>
> >> So it seems your patch only works under some simple situation for example
> >> no cgroups, tasks have similar sleep/running time.
> >
> > Yes, I broadly agree with your assessment. Although there are plans for
> > adding cgroup support to i915 scheduling, I doubt as fine grained
> > control and exact semantics as there are on the CPU side will happen.
> >
> > Mostly because the drive seems to be for more micro-controller managed
> > scheduling which adds further challenges in connecting the two sides
> > together.
> >
> > But when you say it is a problem, I would characterize it more a
> > weakness in terms of being only a subset of possible control. It is
> > still richer (better?) than what currently exists and as demonstrated
> > with benchmarks in my cover letter it can deliver improvements in user
> > experience. If in the mid term future we can extend it with cgroup
> > support then the concept should still apply and get closer to how you
> > described nice works in the CPU world.
> >
> > Main question in my mind is whether the idea of adding the
> > sched_attr/priority notifier to the kernel can be justified. Because as
> > mentioned before, everything apart from adjusting currently running GPU
> > jobs could be done purely in userspace. Stack changes would be quite
> > extensive and all, but that is not usually a good enough reason to put
> > something in the kernel. That's why it is an RFC an invitation to discuss.
> >
> > Even ionice inherits from nice (see task_nice_ioprio()) so I think
> > argument can be made for drivers as well.
>
> Now that I wrote this, I had a little bit of a light bulb moment. If I
> abandon the idea of adjusting the priority of already submitted work
> items, then I can do much of what I want purely from within the confines
> of i915.
>
> I simply add code to inherit from current task nice on every new work
> item submission. This should probably bring the majority of the benefit
> I measured.
I think the idea makes sense to link the process's priority with the GPU's
scheduler. I have no doubt about this.
My question is more of what is the best way to implement this.
Android has bg_non_interactive cgroup with much lower weight for
background processes. interactive tasks, on the other hand, are placed
in another cgroup with much higer weight. So Android depends on
cgroup to improve user experience.
Chrome browser in your cover-letter uses nice to de-prioritise background
tabs. this works perfectly as the whole chrome should be in the same
cgroup, so changing nice will improve/decrease the resource gotten by
tasks in this cgroup. But once we have two cgroups, bringing this nice
belonging to the cgroup to the global scheduler of GPU will somehow
break the aim.
For example, if we have two cgroup A and B
/sys/fs/cgroup/cpu$ sudo sh -c 'echo 4096 > A/cpu.shares'
/sys/fs/cgroup/cpu$ sudo sh -c 'echo 512 > B/cpu.shares'
task in B with lower nice will get more GPU than task in A. But actually A group
has 8X weight of B. So the result seems wrong. especially real users like
Android does depend on cgroup.
I don't know how to overcome this "weakness", it seems not easy.
>
> Regards,
>
> Tvrtko
Thanks
barry