2023-09-15 15:28:59

by Peter Zijlstra

Subject: [PATCH 0/2] sched/eevdf: sched_attr::sched_runtime slice hint

Hi,

As promised a while ago, here is a new version of the variable slice length
hint stuff. Back when I asked for comments on the latency-nice vs slice
length thing, there was very limited feedback on-list, but a number of
people have expressed interest in the slice length hint.


I'm still working on improving the wakeup latency -- but especially after commit:

63304558ba5d ("sched/eevdf: Curb wakeup-preemption")

it needs a little more work. Everything I tried so far made it worse.

As is, it behaves ok-ish:

root@ivb-ep:~/bench# cat doit-latency-slice.sh
#!/bin/bash

perf bench sched messaging -g 40 -l 12000 &

sleep 1
chrt3 -o --sched-runtime $((`cat /debug/sched/base_slice_ns`*10)) 0 cyclictest --policy other -D 5 -q -H 20000 --histfile data.txt ; grep Latencies data.txt
chrt3 -o --sched-runtime 0 0 cyclictest --policy other -D 5 -q -H 20000 --histfile data.txt ; grep Latencies data.txt
chrt3 -o --sched-runtime $((`cat /debug/sched/base_slice_ns`/10)) 0 cyclictest --policy other -D 5 -q -H 20000 --histfile data.txt ; grep Latencies data.txt

wait $!
root@ivb-ep:~/bench# ./doit-latency-slice.sh
# Running 'sched/messaging' benchmark:
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00060
# Avg Latencies: 00990
# Max Latencies: 224925
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00020
# Avg Latencies: 00656
# Max Latencies: 37595
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00016
# Avg Latencies: 00354
# Max Latencies: 16687
# 20 sender and receiver processes per group
# 40 groups == 1600 processes run

Total time: 38.246 [sec]


(chrt3 is a hacked-up version of util-linux/chrt that allows --sched-runtime unconditionally.)
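
For reference, here is a minimal sketch of what chrt3 presumably does
under the hood: set sched_attr::sched_runtime through sched_setattr() on
a SCHED_OTHER task, assuming the patched kernel accepts the field as the
slice hint. The three runs above request 10x, the default, and 0.1x the
base slice respectively; max latency drops from ~225ms to ~17ms as the
requested slice shrinks.

#define _GNU_SOURCE
#include <sched.h>		/* SCHED_OTHER */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

/* SCHED_ATTR_SIZE_VER0 layout; the later fields are not needed here */
struct sched_attr {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;
	__s32 sched_nice;
	__u32 sched_priority;
	__u64 sched_runtime;	/* the slice hint, in nanoseconds */
	__u64 sched_deadline;
	__u64 sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;
	attr.sched_runtime = 300000;	/* ask for a short, ~0.3ms slice */

	/* pid 0 == calling task, no flags; glibc has no wrapper for this */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}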


2023-09-17 01:09:51

by Qais Yousef

Subject: Re: [PATCH 0/2] sched/eevdf: sched_attr::sched_runtime slice hint

On 09/15/23 14:43, [email protected] wrote:
> Hi,
>
> As promised a while ago, here is a new version of the variable slice length
> hint stuff. Back when I asked for comments on the latency-nice vs slice
> length thing, there was very limited feedback on-list, but a number of
> people have expressed interest in the slice length hint.

I did try to give feedback then, but I don't see my email on lore, so I'm
not sure it arrived.

As it stands, having any interface that helps describe latency-sensitive
tasks is much desired! So I am desperate enough to take whatever.

But at the same time, I do see that we will not stop here. We get a lot
of conflicting requirements from different workloads, and I think we need
a framework that provides a sensible way to describe those needs so the
scheduler can manage resources better.

I wasn't sure if you were still planning to send an interface or not, so
I was working on this potential way (patch below) to provide a generic
framework for sched-qos. It's only a shell, as I haven't had a chance to
implement the WAKEUP_LATENCY hint yet.

I did add the concept of grouping, so a hint can be meaningful for a group
of tasks sharing a cookie, borrowing the concept from core scheduling.
I've seen many times how priorities (or nice) were used incorrectly under
the assumption that they apply to the app's tasks only, which might be the
case with autogroup on some systems. But it struck a chord that there's
a perception/desire not to apply them globally but only relative to a group
of tasks the app cares about. So I catered for describing such cases.

I was still trying to wrap my head around implementing the WAKEUP_LATENCY
hint, but the idea I had is to represent WAKEUP_LATENCY in microseconds
and somehow translate that into lag, which is what I thought would act as
our admission control. Based on your patch, it seems it might be simpler
than that.

I was still thinking this through, to be honest. But it seems it's either
speak now or forever hold your peace, so here you go :)


Cheers

--
Qais Yousef

--->8---

From d6c83e05a81ac4ca34e99cb1f56d1acdacc63362 Mon Sep 17 00:00:00 2001
From: Qais Yousef <[email protected]>
Date: Sat, 26 Aug 2023 17:39:31 +0100
Subject: [PATCH] sched: Add a new sched-qos interface

The need to describe the conflicting demands of various workloads has
never been higher. Both hardware and software have moved rapidly in the
past decade, system usage has become more diverse, and the number of
workloads expected to run on the same machine, whether in mobile or
server markets, has created a big dilemma over how to best manage those
requirements.

The problem is that we lack mechanisms that allow these workloads to
describe what they need, and that then allow the kernel to make a best
effort to manage those demands transparently, based on the hardware it
is running on and the current system state.

Examples of conflicting requirements that come up frequently:

        1. Improve wake-up latency for SCHED_OTHER. Many tasks
           end up using SCHED_FIFO/SCHED_RR to compensate for this
           shortcoming. RT tasks lack power management and fairness
           and can be hard and error-prone to use correctly and
           portably.

        2. Prefer spreading vs. packing on wake-up for a group of
           tasks. Geekbench-like workloads would benefit from
           parallelising on different CPUs; hackbench-like
           workloads can benefit from waking up on the same CPU or
           on a CPU that is closer in the cache hierarchy.

        3. Nice values for SCHED_OTHER are system wide and require
           privileges. Many workloads would like a way to set a
           relative nice value so their tasks can preempt each
           other, but neither impact nor be impacted by tasks
           belonging to different workloads on the system.

        4. Provide a way to tag some tasks as 'background' to keep
           them out of the way. SCHED_IDLE is too strong for some
           of these tasks, yet they can be computationally heavy.
           Example tasks are garbage collectors: their work is both
           important and not important.

Whether any of these use cases warrants an additional QoS hint is
something to be discussed individually. But the main point is to
introduce an interface that is extendable to cater for those
requirements and potentially more. Wake-up latency is the major driving
use case; it has been brewing for years already, and it is the first
QoS hint to be introduced in later patches.

It is desired to have apps (and benchmarks!) directly use this
interface for optimal perf/watt. But in the absence of such support, it
should be possible to write a userspace daemon that monitors workloads
and applies these QoS hints on apps' behalf, based on analysis done by
anyone interested in improving the performance of those workloads.

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
 Documentation/scheduler/index.rst     |  1 +
 Documentation/scheduler/sched-qos.rst | 44 +++++++++++++++++++++++++
 include/uapi/linux/sched.h            |  4 +++
 include/uapi/linux/sched/types.h      | 46 +++++++++++++++++++++++++
 kernel/sched/core.c                   |  3 ++
 tools/include/uapi/linux/sched.h      |  4 +++
 6 files changed, 102 insertions(+)
create mode 100644 Documentation/scheduler/sched-qos.rst

diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 3170747226f6..fef59d7cd8e2 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -20,6 +20,7 @@ Scheduler
sched-rt-group
sched-stats
sched-debug
+ sched-qos

text_files

diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
new file mode 100644
index 000000000000..0911261cb124
--- /dev/null
+++ b/Documentation/scheduler/sched-qos.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Scheduler QoS
+=============
+
+1. Introduction
+===============
+
+Different workloads have different scheduling requirements to operate
+optimally. The same applies to tasks within the same workload.
+
+To enable smarter usage of system resources and to cater for the conflicting
+demands of various tasks, Scheduler QoS provides a mechanism to pass more
+information about those demands so that the scheduler can make a best effort
+to honour them.
+
+ @sched_qos_type what QoS hint to apply
+ @sched_qos_value value of the QoS hint
+ @sched_qos_cookie magic cookie to tag a group of tasks for which the QoS
+ applies. If 0, the hint applies globally, system
+ wide. If not 0, the hint is relative only to tasks
+ that have the same cookie value.
+
+QoS hints are set once and, by design, are not inherited by children. The
+rationale is that each task has its own individual characteristics and it
+is encouraged to describe each of these separately. Also, since system
+resources are finite, there's a limit to what can be done to honour these
+requests before reaching a tipping point where there are too many requests
+for a particular QoS to service all of them at once, and some will start
+to lose out. For example, if 10 tasks require better wake-up latencies on
+a 4-CPU SMP system and they all wake up at once, only 4 can perceive the
+hint as honoured and the rest will have to wait. Inheritance can easily
+turn these 10 into 100 or 1000, at which point the QoS hint will rapidly
+lose its meaning and effectiveness. The chances of 10 tasks waking up at
+the same time are lower than those of 100, and lower still than of 1000.
+
+To set multiple QoS hints, a syscall is required for each. This is a
+trade-off made to reduce the churn of extending the interface: the hope
+is for it to evolve as workloads and hardware get more sophisticated and
+the need for extensions arises; when that happens, it should be simpler
+to add the kernel extension and let userspace pick it up readily by
+setting the newly added flag, without having to update the whole of
+sched_attr.
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..67ef99f64ddc 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -102,6 +102,9 @@ struct clone_args {
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
};
+
+/* placeholder enumerator: empty enums are invalid in C */
+enum sched_qos_type { SCHED_QOS_MAX };
#endif

#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
@@ -132,6 +135,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_QOS 0x80

#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index f2c4589d4dbf..8c2658ffe4bd 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -98,6 +98,48 @@ struct sched_param {
* scheduled on a CPU with no more capacity than the specified value.
*
* A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Scheduler QoS
+ * =============
+ *
+ * Different workloads have different scheduling requirements to operate
+ * optimally. The same applies to tasks within the same workload.
+ *
+ * To enable smarter usage of system resources and to cater for the conflicting
+ * demands of various tasks, Scheduler QoS provides a mechanism to pass more
+ * information about those demands so that the scheduler can make a best effort
+ * to honour them.
+ *
+ * @sched_qos_type what QoS hint to apply
+ * @sched_qos_value value of the QoS hint
+ * @sched_qos_cookie magic cookie to tag a group of tasks for which the QoS
+ * applies. If 0, the hint applies globally, system
+ * wide. If not 0, the hint is relative only to tasks
+ * that have the same cookie value.
+ *
+ * QoS hints are set once and, by design, are not inherited by children. The
+ * rationale is that each task has its own individual characteristics and it
+ * is encouraged to describe each of these separately. Also, since system
+ * resources are finite, there's a limit to what can be done to honour these
+ * requests before reaching a tipping point where there are too many requests
+ * for a particular QoS to service all of them at once, and some will start
+ * to lose out. For example, if 10 tasks require better wake-up latencies on
+ * a 4-CPU SMP system and they all wake up at once, only 4 can perceive the
+ * hint as honoured and the rest will have to wait. Inheritance can easily
+ * turn these 10 into 100 or 1000, at which point the QoS hint will rapidly
+ * lose its meaning and effectiveness. The chances of 10 tasks waking up at
+ * the same time are lower than those of 100, and lower still than of 1000.
+ *
+ * To set multiple QoS hints, a syscall is required for each. This is a
+ * trade-off made to reduce the churn of extending the interface: the hope
+ * is for it to evolve as workloads and hardware get more sophisticated and
+ * the need for extensions arises; when that happens, it should be simpler
+ * to add the kernel extension and let userspace pick it up readily by
+ * setting the newly added flag, without having to update the whole of
+ * sched_attr.
+ *
+ * Details about the available QoS hints can be found in:
+ * Documentation/scheduler/sched-qos.rst
*/
struct sched_attr {
__u32 size;
@@ -120,6 +162,10 @@ struct sched_attr {
__u32 sched_util_min;
__u32 sched_util_max;

+ __u32 sched_qos_type;
+ __s64 sched_qos_value;
+ __u32 sched_qos_cookie;
+
};

#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index efe3848978a0..efc658f0f6e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7680,6 +7680,9 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

+ if (attr->sched_flags & SCHED_FLAG_QOS)
+ return -ENOSYS;
+
/*
* SCHED_DEADLINE bandwidth accounting relies on stable cpusets
* information.
diff --git a/tools/include/uapi/linux/sched.h b/tools/include/uapi/linux/sched.h
index 3bac0a8ceab2..67ef99f64ddc 100644
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -102,6 +102,9 @@ struct clone_args {
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
};
+
+/* placeholder enumerator: empty enums are invalid in C */
+enum sched_qos_type { SCHED_QOS_MAX };
#endif

#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
@@ -132,6 +135,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_QOS 0x80

#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
--
2.34.1

2023-09-18 11:17:53

by Mike Galbraith

Subject: Re: [PATCH 0/2] sched/eevdf: sched_attr::sched_runtime slice hint

On Sat, 2023-09-16 at 22:33 +0100, Qais Yousef wrote:
>
> Examples of conflicting requirements that come up frequently:
>
>         1. Improve wake-up latency for SCHED_OTHER. Many tasks
>            end up using SCHED_FIFO/SCHED_RR to compensate for this
>            shortcoming. RT tasks lack power management and fairness
>            and can be hard and error-prone to use correctly and
>            portably.

This bit appears to be dealt with about as nicely as it can be in a
fair class by the latency nice patch set, which deals with both
individual tasks and groups thereof, i.e. it has cgroups support.

Its trade of slice for latency fits EEVDF nicely IMHO. As its name
implies, the trade agreement language is relative niceness, which I find
more appropriate than time units, whose use would put the deal squarely
into the realm of RT and thus has no place in a fair class.

I don't yet know how effective it is. I dinged up schedtool to play
with both it and $subject, but have yet to target any pet piglets or
measure the impact of the shiny new lipstick cannon.

>         2. Prefer spreading vs. packing on wake-up for a group of
>            tasks. Geekbench-like workloads would benefit from
>            parallelising on different CPUs; hackbench-like
>            workloads can benefit from waking up on the same CPU or
>            on a CPU that is closer in the cache hierarchy.
>
>         3. Nice values for SCHED_OTHER are system wide and require
>            privileges. Many workloads would like a way to set a
>            relative nice value so their tasks can preempt each
>            other, but neither impact nor be impacted by tasks
>            belonging to different workloads on the system.
>
>         4. Provide a way to tag some tasks as 'background' to keep
>            them out of the way. SCHED_IDLE is too strong for some
>            of these tasks, yet they can be computationally heavy.
>            Example tasks are garbage collectors: their work is both
>            important and not important.

All three of those make my eyebrows twitch mightily, even in their
not-well-defined form: any notion of applying badges to identify groups
of tasks would constitute the creation of yet another cgroups.

-Mike

2023-09-20 01:45:55

by Qais Yousef

Subject: Re: [PATCH 0/2] sched/eevdf: sched_attr::sched_runtime slice hint

On 09/18/23 05:43, Mike Galbraith wrote:
> On Sat, 2023-09-16 at 22:33 +0100, Qais Yousef wrote:
> >
> > Examples of conflicting requirements that come up frequently:
> >
> >         1. Improve wake-up latency for SCHED_OTHER. Many tasks
> >            end up using SCHED_FIFO/SCHED_RR to compensate for this
> >            shortcoming. RT tasks lack power management and fairness
> >            and can be hard and error-prone to use correctly and
> >            portably.
>
> This bit appears to be dealt with about as nicely as it can be in a
> fair class by the latency nice patch set, which deals with both
> individual tasks and groups thereof, i.e. it has cgroups support.

AFAIU, latency_nice is no longer going forward, but I could be mistaken.

> Its trade of slice for latency fits EEVDF nicely IMHO. As its name
> implies, the trade agreement language is relative niceness, which I find
> more appropriate than time units, whose use would put the deal squarely
> into the realm of RT and thus has no place in a fair class.

Nice (or latency nice) values are a global indication that only makes
sense within the specific context they were tested in, like RT priorities.

An abstract notion is fine if you have a better suggestion, but being
globally relative is a problem IMO. The intended consumers are application
writers, who have no prior knowledge of the system they'll be running on.
I think that was the main point against latency_nice, IIUC.

> I don't yet know how effective it is. I dinged up schedtool to play
> with both it and $subject, but have yet to target any pet piglets or
> measure the impact of the shiny new lipstick cannon.
>
> >         2. Prefer spreading vs. packing on wake-up for a group of
> >            tasks. Geekbench-like workloads would benefit from
> >            parallelising on different CPUs; hackbench-like
> >            workloads can benefit from waking up on the same CPU or
> >            on a CPU that is closer in the cache hierarchy.
> >
> >         3. Nice values for SCHED_OTHER are system wide and require
> >            privileges. Many workloads would like a way to set a
> >            relative nice value so their tasks can preempt each
> >            other, but neither impact nor be impacted by tasks
> >            belonging to different workloads on the system.
> >
> >         4. Provide a way to tag some tasks as 'background' to keep
> >            them out of the way. SCHED_IDLE is too strong for some
> >            of these tasks, yet they can be computationally heavy.
> >            Example tasks are garbage collectors: their work is both
> >            important and not important.
>
> All three of those make my eyebrows twitch mightily, even in their
> not-well-defined form: any notion of applying badges to identify groups
> of tasks would constitute the creation of yet another cgroups.

cgroups require root privileges, and they are intended for sysadmins to
split system resources between apps. They don't help an app describe the
relationship between its tasks, nor any requirements its tasks have to do
their job properly; rather, they impose something on the app regardless
of what it wants.


Cheers

--
Qais Yousef

2023-09-20 07:30:47

by Mike Galbraith

Subject: Re: [PATCH 0/2] sched/eevdf: sched_attr::sched_runtime slice hint

On Tue, 2023-09-19 at 22:08 +0100, Qais Yousef wrote:
> On 09/18/23 05:43, Mike Galbraith wrote:
> > On Sat, 2023-09-16 at 22:33 +0100, Qais Yousef wrote:
> > >
> > > Examples of conflicting requirements that come up frequently:
> > >
> > >         1. Improve wake-up latency for SCHED_OTHER. Many tasks
> > >            end up using SCHED_FIFO/SCHED_RR to compensate for this
> > >            shortcoming. RT tasks lack power management and fairness
> > >            and can be hard and error-prone to use correctly and
> > >            portably.
> >
> > This bit appears to be dealt with about as nicely as it can be in a
> > fair class by the latency nice patch set, which deals with both
> > individual tasks and groups thereof, i.e. it has cgroups support.
>
> AFAIU, latency_nice is no longer going forward, but I could be mistaken.

Effectively it is, both making the same request under the hood, the
difference being the trade-negotiation idiom.

I took both out for a spin for no particularly good reason. The only thing
silly-looking in the result is one clipping at OMG, the other at OMFG.

> > All three of those make my eyebrows twitch mightily, even in their
> > not-well-defined form: any notion of applying badges to identify groups
> > of tasks would constitute the creation of yet another cgroups.
>
> cgroups require root privileges, and they are intended for sysadmins to
> split system resources between apps. They don't help an app describe the
> relationship between its tasks, nor any requirements its tasks have to do
> their job properly; rather, they impose something on the app regardless
> of what it wants.

The whys and wherefores are clear. I suspect that the addition of another
task-group interface, with conflicting scheduling parameters, policies,
hopes and/or prayers to be dealt with at each and every level of the
existing hierarchy, is going to be a hard sell, but who knows, maybe
that skeleton looks more attractive to maintainers than it does to me.
I suppose we'll find out once you hang some meat on it.

-Mike