2022-09-25 15:15:20

by Vincent Guittot

Subject: [PATCH v5 0/7] Add latency priority for CFS class

This patchset restarts the work on adding a latency priority to describe
the latency tolerance of CFS tasks.

The patches [1-3] have been done by Parth:
https://lore.kernel.org/lkml/[email protected]/

I have just rebased them and moved the setting of the latency priority
outside the priority update. I have removed the Reviewed-by tags because
the patches are 2 years old.
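
As an illustration of the per-task interface, a task can set its own
latency nice value with the extended sched_setattr() syscall. The sketch
below is only an example: the sched_latency_nice field and the
SCHED_FLAG_LATENCY_NICE value mirror this series' uapi changes and are
not part of released kernel headers, so the structure is redefined
locally and both names should be treated as assumptions.

/*
 * Minimal userspace sketch: make the calling task latency sensitive.
 * The sched_latency_nice field and the SCHED_FLAG_LATENCY_NICE flag
 * are assumptions that mirror this series' uapi changes.
 */
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr_ln {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* new field added by this series */
};

#define SCHED_FLAG_LATENCY_NICE	0x80ULL	/* assumed flag value */

int main(void)
{
	struct sched_attr_ln attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = -20;	/* most latency sensitive */

	/* pid 0 means "the calling task" */
	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}
	return 0;
}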

This aims to be a generic interface, and the following patches are one use
of it to improve the scheduling latency of CFS tasks.

Patch [4] uses the latency nice priority to define a latency offset and
then to decide whether a CFS task can or should preempt the currently
running task. The patch gives some test results with cyclictest and
hackbench to highlight the benefit of latency priority for short
interactive tasks or long CPU-intensive tasks.
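
To make the mechanism more concrete, below is a small standalone model
(not the kernel code) of how such a latency offset can bias the wakeup
preemption check: the vruntime difference is corrected by the offsets
before being compared with the wakeup granularity, so a latency
sensitive wakee preempts more easily and a latency tolerant one less
easily. All names and numbers are illustrative only.

/*
 * Standalone toy model of the wakeup preemption decision described
 * above; it is NOT the kernel implementation, only an illustration
 * of how a latency offset can bias the vruntime comparison.
 */
#include <stdio.h>

/* illustrative constant, roughly a wakeup granularity in ns */
#define WAKEUP_GRAN_NS	1000000L

struct toy_entity {
	long vruntime;		/* ns, smaller means "owed" more CPU time */
	long latency_offset;	/* ns, negative for latency sensitive tasks */
};

/* difference of offsets, mirroring the wakeup_latency_gran() idea */
static long toy_latency_gran(const struct toy_entity *curr,
			     const struct toy_entity *wakee)
{
	return wakee->latency_offset - curr->latency_offset;
}

/* returns 1 if the waking entity should preempt the current one */
static int toy_should_preempt(const struct toy_entity *curr,
			      const struct toy_entity *wakee)
{
	long vdiff = curr->vruntime - wakee->vruntime;

	/* a negative gran for the wakee makes preemption easier */
	vdiff -= toy_latency_gran(curr, wakee);

	return vdiff > WAKEUP_GRAN_NS;
}

int main(void)
{
	struct toy_entity curr  = { .vruntime = 1000000, .latency_offset = 0 };
	struct toy_entity wakee = { .vruntime = 1500000, .latency_offset = 0 };

	printf("default wakee preempts: %d\n",
	       toy_should_preempt(&curr, &wakee));

	/* latency nice -20 maps to a large negative offset in this model */
	wakee.latency_offset = -3000000;
	printf("latency-sensitive wakee preempts: %d\n",
	       toy_should_preempt(&curr, &wakee));
	return 0;
}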

Patch [5] adds support for the latency nice priority to task groups by
adding a cpu.latency.nice field. The range is [-20:19], as for setting the
task latency priority.
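
For example, a management daemon could set the value for a whole group by
writing to this file. The sketch below is only illustrative; the
"/sys/fs/cgroup/app" path is a made-up example group and assumes cgroup v2
with the cpu controller enabled.

/*
 * Illustrative only: write a latency nice value into a cgroup v2
 * cpu.latency.nice file. "/sys/fs/cgroup/app" is a made-up example
 * group and assumes the cpu controller is enabled there.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int set_group_latency_nice(const char *cgroup, int nice)
{
	char path[256];
	int fd, ret;

	snprintf(path, sizeof(path), "%s/cpu.latency.nice", cgroup);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	ret = dprintf(fd, "%d\n", nice) < 0 ? -1 : 0;
	close(fd);
	return ret;
}

int main(void)
{
	/* make every task in the "app" group latency sensitive */
	if (set_group_latency_nice("/sys/fs/cgroup/app", -20))
		perror("cpu.latency.nice");
	return 0;
}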

Patch [6] makes sched_core take the latency offset into account.

Patch [7] adds an rb tree to cover some corner cases where a latency
sensitive task (priority < 0) is preempted by a high priority task (RT/DL)
or fails to preempt one. This patch ensures that tasks will get at least
a slice of sched_min_granularity of priority at wakeup. The patch gives
results that show the benefit in addition to patch 4.

I have also backported the patchset onto a DragonBoard RB3 with an Android
mainline kernel based on v5.18 for a quick test. I have used the
TouchLatency app, which is part of AOSP and described as a very good
test to highlight jitter and jank frame sources of a system [1].
In addition to the app, I have added some short running tasks waking up
regularly (using the 8 CPUs for 4 ms every 37777 us) to stress the system
without overloading it (and with EAS disabled). The first results show that
the patchset helps to reduce the missed deadline frames from 5% to less
than 0.1% when the cpu.latency.nice of the task groups is set.

I have also tested the patchset with the modified version of the alsa
latency test that has been shared by Tim. The test quickly hits xruns with
the default latency nice priority 0, but is able to run without underruns
with a latency nice of -20 while hackbench is running simultaneously.


[1] https://source.android.com/docs/core/debug/eval_perf#touchlatency

Changes since v4:
- Removed permission checks to set the latency priority. This enables users
without elevated privileges, like audio applications, to set their latency
priority, as requested by Tim.
- Removed cpu.latency and replaced it with cpu.latency.nice so we keep a
generic interface not tied to latency_offset, which can be used to
implement other latency features.
- Added an entry in Documentation/admin-guide/cgroup-v2.rst to describe
cpu.latency.nice.
- Fixed some typos.

Changes since v3:
- Fixed 2 compilation warnings raised by kernel test robot <[email protected]>

Changes since v2:
- Set a latency_offset field instead of saving a weight and computing it
on the fly.
- Made latency_offset available for task groups: cpu.latency
- Fixed some corner cases to make latency sensitive tasks schedule first and
added an rb tree for latency sensitive tasks.

Changes since v1:
- fixed typos
- moved some code into the right patch to make bisect happy
- simplified and fixed how the weight is computed
- added support of sched core in patch 7

Parth Shah (3):
sched: Introduce latency-nice as a per-task attribute
sched/core: Propagate parent task's latency requirements to the child
task
sched: Allow sched_{get,set}attr to change latency_nice of the task

Vincent Guittot (4):
sched/fair: Take into account latency priority at wakeup
sched/fair: Add sched group latency support
sched/core: Support latency priority with sched core
sched/fair: Add latency list

Documentation/admin-guide/cgroup-v2.rst | 8 +
include/linux/sched.h | 5 +
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 19 +++
init/init_task.c | 1 +
kernel/sched/core.c | 106 +++++++++++++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 189 +++++++++++++++++++++++-
kernel/sched/sched.h | 37 ++++-
tools/include/uapi/linux/sched.h | 4 +-
10 files changed, 366 insertions(+), 8 deletions(-)

--
2.17.1


2022-09-25 15:36:59

by Vincent Guittot

Subject: [PATCH v5 1/7] sched: Introduce latency-nice as a per-task attribute

From: Parth Shah <[email protected]>

Latency-nice indicates the latency requirements of a task with respect
to the other tasks in the system. The value of the attribute can be within
the range of [-20, 19], both inclusive, to be in line with the task nice
values.

latency_nice = -20 indicates that the task requires the lowest latency,
as compared to tasks having latency_nice = +19.

The latency_nice attribute may only affect the CFS scheduling class, by
taking latency requirements from userspace.

Additionally, add debugging bits for the newly added latency_nice
attribute.

Signed-off-by: Parth Shah <[email protected]>
[rebase]
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/debug.c | 1 +
kernel/sched/sched.h | 18 ++++++++++++++++++
3 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 15e3bd96e4ce..6805f378a9c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -783,6 +783,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned int rt_priority;
+ int latency_nice;

struct sched_entity se;
struct sched_rt_entity rt;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bb3d63bdf4ae..a3f7876217a6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1042,6 +1042,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
#endif
P(policy);
P(prio);
+ P(latency_nice);
if (task_has_dl_policy(p)) {
P(dl.runtime);
P(dl.deadline);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1fc198be1ffd..eeb6efb0b610 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -125,6 +125,24 @@ extern int sched_rr_timeslice;
*/
#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))

+/*
+ * Latency nice is meant to provide scheduler hints about the relative
+ * latency requirements of a task with respect to other tasks.
+ * Thus a task with latency_nice == 19 can be hinted as the task with no
+ * latency requirements, in contrast to the task with latency_nice == -20
+ * which should be given priority in terms of lower latency.
+ */
+#define MAX_LATENCY_NICE 19
+#define MIN_LATENCY_NICE -20
+
+#define LATENCY_NICE_WIDTH \
+ (MAX_LATENCY_NICE - MIN_LATENCY_NICE + 1)
+
+/*
+ * Default tasks should be treated as a task with latency_nice = 0.
+ */
+#define DEFAULT_LATENCY_NICE 0
+
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
* The extra resolution improves shares distribution and load balancing of
--
2.17.1

2022-09-25 15:40:42

by Vincent Guittot

Subject: [PATCH v5 6/7] sched/core: Support latency priority with sched core

Take wakeup_latency_gran() into account when ordering the CFS threads.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74e42d19c1ce..e524e892d118 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11443,6 +11443,10 @@ bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
delta = (s64)(sea->vruntime - seb->vruntime) +
(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);

+ /* Take into account latency prio */
+ delta -= wakeup_latency_gran(sea, seb);
+
+
return delta > 0;
}
#else
--
2.17.1

2022-10-12 16:13:04

by K Prateek Nayak

Subject: Re: [PATCH v5 0/7] Add latency priority for CFS class

Hello Vincent,

Sharing results from testing on dual socket Zen3 system (2 x 64C/128T)

tl;dr

o I don't see any regression when workloads are running with
DEFAULT_LATENCY_NICE
o I can reproduce results similar to the ones reported in Patch 4 for
hackbench with latency nice 19, and for hackbench and cyclictest
with various combinations of latency nice values.
o I can see improvements to tail latency for schbench with hackbench
running in the background.
o There is an unexpected non-linear behavior observed for a couple of
cases that I cannot explain yet (marked with "^" in the detailed results).
I have not yet gotten to the bottom of it, but if I've missed
something, please do let me know.

Detailed results are shared below:

On 9/25/2022 8:09 PM, Vincent Guittot wrote:
> This patchset restarts the work about adding a latency priority to describe
> the latency tolerance of cfs tasks.
>
> The patches [1-3] have been done by Parth:
> https://lore.kernel.org/lkml/[email protected]/
>
> I have just rebased and moved the set of latency priority outside the
> priority update. I have removed the reviewed tag because the patches
> are 2 years old.
>
> This aims to be a generic interface and the following patches is one use
> of it to improve the scheduling latency of cfs tasks.
>
> The patch [4] uses latency nice priority to define a latency offset
> and then decide if a cfs task can or should preempt the current
> running task. The patch gives some tests results with cyclictests and
> hackbench to highlight the benefit of latency priority for short
> interactive task or long intensive tasks.
>
> Patch [5] adds the support of latency nice priority to task group by
> adding a cpu.latency.nice field. The range is [-20:19] as for setting task
> latency priority.
>
> Patch [6] makes sched_core taking into account the latency offset.
>
> Patch [7] adds a rb tree to cover some corner cases where the latency
> sensitive task (priority < 0) is preempted by high priority task (RT/DL)
> or fails to preempt them. This patch ensures that tasks will have at least
> a slice of sched_min_granularity in priority at wakeup. The patch gives
> results to show the benefit in addition to patch 4.
>
> I have also backported the patchset on a dragonboard RB3 with an android
> mainline kernel based on v5.18 for a quick test. I have used the
> TouchLatency app which is part of AOSP and described to be a very good
> test to highlight jitter and jank frame sources of a system [1].
> In addition to the app, I have added some short running tasks waking-up
> regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> without overloading it (and disabling EAS). The 1st results shows that the
> patchset helps to reduce the missed deadline frames from 5% to less than
> 0.1% when the cpu.latency.nice of task group are set.
>
> I have also tested the patchset with the modified version of the alsa
> latency test that has been shared by Tim. The test quickly xruns with
> default latency nice priority 0 but is able to run without underuns with
> a latency -20 and hackbench running simultaneously.
>
>
> [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over the 2 sockets.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over the 2 sockets.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip: 5.19.0 tip sched/core
- latency_nice: 5.19.0 tip sched/core + this series

When we started testing, the tip was at:
commit 7e9518baed4c ("sched/fair: Move call to list_last_entry() in detach_tasks")

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ hackbench - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Test: tip latency_nice
1-groups: 4.23 (0.00 pct) 4.06 (4.01 pct)
2-groups: 4.93 (0.00 pct) 4.89 (0.81 pct)
4-groups: 5.32 (0.00 pct) 5.31 (0.18 pct)
8-groups: 5.46 (0.00 pct) 5.54 (-1.46 pct)
16-groups: 7.31 (0.00 pct) 7.33 (-0.27 pct)

NPS2

Test: tip latency_nice
1-groups: 4.19 (0.00 pct) 4.12 (1.67 pct)
2-groups: 4.77 (0.00 pct) 4.82 (-1.04 pct)
4-groups: 5.15 (0.00 pct) 5.17 (-0.38 pct)
8-groups: 5.47 (0.00 pct) 5.48 (-0.18 pct)
16-groups: 6.63 (0.00 pct) 6.65 (-0.30 pct)

NPS4

Test: tip latency_nice
1-groups: 4.23 (0.00 pct) 4.31 (-1.89 pct)
2-groups: 4.78 (0.00 pct) 4.75 (0.62 pct)
4-groups: 5.17 (0.00 pct) 5.24 (-1.35 pct)
8-groups: 5.63 (0.00 pct) 5.59 (0.71 pct)
16-groups: 7.88 (0.00 pct) 7.09 (10.02 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ schbench - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

#workers: tip latency_nice
1: 22.00 (0.00 pct) 21.00 (4.54 pct)
2: 34.00 (0.00 pct) 34.00 (0.00 pct)
4: 37.00 (0.00 pct) 40.00 (-8.10 pct)
8: 55.00 (0.00 pct) 49.00 (10.90 pct)
16: 69.00 (0.00 pct) 66.00 (4.34 pct)
32: 113.00 (0.00 pct) 117.00 (-3.53 pct)
64: 219.00 (0.00 pct) 242.00 (-10.50 pct) *
64: 219.00 (0.00 pct) 194.00 (11.41 pct) [Verification Run]
128: 506.00 (0.00 pct) 513.00 (-1.38 pct)
256: 45440.00 (0.00 pct) 44992.00 (0.98 pct)
512: 76672.00 (0.00 pct) 83328.00 (-8.68 pct)

NPS2

#workers: tip latency_nice
1: 31.00 (0.00 pct) 20.00 (35.48 pct)
2: 36.00 (0.00 pct) 28.00 (22.22 pct)
4: 45.00 (0.00 pct) 37.00 (17.77 pct)
8: 47.00 (0.00 pct) 51.00 (-8.51 pct)
16: 66.00 (0.00 pct) 69.00 (-4.54 pct)
32: 114.00 (0.00 pct) 113.00 (0.87 pct)
64: 215.00 (0.00 pct) 215.00 (0.00 pct)
128: 495.00 (0.00 pct) 529.00 (-6.86 pct) *
128: 495.00 (0.00 pct) 416.00 (15.95 pct) [Verification Run]
256: 48576.00 (0.00 pct) 46912.00 (3.42 pct)
512: 79232.00 (0.00 pct) 82560.00 (-4.20 pct)

NPS4

#workers: tip latency_nice
1: 30.00 (0.00 pct) 34.00 (-13.33 pct)
2: 34.00 (0.00 pct) 42.00 (-23.52 pct)
4: 41.00 (0.00 pct) 42.00 (-2.43 pct)
8: 60.00 (0.00 pct) 55.00 (8.33 pct)
16: 68.00 (0.00 pct) 69.00 (-1.47 pct)
32: 116.00 (0.00 pct) 115.00 (0.86 pct)
64: 224.00 (0.00 pct) 223.00 (0.44 pct)
128: 495.00 (0.00 pct) 677.00 (-36.76 pct) *
128: 495.00 (0.00 pct) 388.00 (21.61 pct) [Verification Run]
256: 45888.00 (0.00 pct) 44608.00 (2.78 pct)
512: 78464.00 (0.00 pct) 81536.00 (-3.91 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ tbench - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Clients: tip latency_nice
1 550.66 (0.00 pct) 546.63 (-0.73 pct)
2 1009.69 (0.00 pct) 1016.40 (0.66 pct)
4 1795.32 (0.00 pct) 1773.95 (-1.19 pct)
8 2971.16 (0.00 pct) 2930.26 (-1.37 pct)
16 4627.98 (0.00 pct) 4727.82 (2.15 pct)
32 8065.15 (0.00 pct) 9019.11 (11.82 pct)
64 14994.32 (0.00 pct) 15100.22 (0.70 pct)
128 5175.73 (0.00 pct) 18223.69 (252.09 pct) *
128 20029.53 (0.00 pct) 20517.17 (2.43 pct) [Verification Run]
256 48763.57 (0.00 pct) 44463.63 (-8.81 pct)
512 43780.78 (0.00 pct) 44170.21 (0.88 pct)
1024 40341.84 (0.00 pct) 40883.10 (1.34 pct)

NPS2

Clients: tip latency_nice
1 551.06 (0.00 pct) 547.43 (-0.65 pct)
2 1000.76 (0.00 pct) 1014.83 (1.40 pct)
4 1737.02 (0.00 pct) 1742.30 (0.30 pct)
8 2992.31 (0.00 pct) 2951.59 (-1.36 pct)
16 4579.29 (0.00 pct) 4558.05 (-0.46 pct)
32 9120.73 (0.00 pct) 8122.06 (-10.94 pct) *
32 8814.62 (0.00 pct) 8965.54 (1.71 pct) [Verification Run]
64 14918.58 (0.00 pct) 14890.93 (-0.18 pct)
128 20830.61 (0.00 pct) 20410.48 (-2.01 pct)
256 47708.18 (0.00 pct) 45312.84 (-5.02 pct) *
256 44941.88 (0.00 pct) 44555.92 (-0.85 pct) [Verification Run]
512 43721.79 (0.00 pct) 43653.43 (-0.15 pct)
1024 40920.49 (0.00 pct) 41162.17 (0.59 pct)

NPS4

Clients: tip latency_nice
1 549.22 (0.00 pct) 539.81 (-1.71 pct)
2 1000.08 (0.00 pct) 1010.12 (1.00 pct)
4 1794.78 (0.00 pct) 1736.06 (-3.27 pct)
8 3008.50 (0.00 pct) 2952.68 (-1.85 pct)
16 4804.71 (0.00 pct) 4454.17 (-7.29 pct) *
16 4391.10 (0.00 pct) 4497.43 (2.42 pct) [Verification Run]
32 9156.57 (0.00 pct) 8820.05 (-3.67 pct)
64 14901.45 (0.00 pct) 14786.25 (-0.77 pct)
128 20771.20 (0.00 pct) 19955.11 (-3.92 pct)
256 47033.88 (0.00 pct) 44937.51 (-4.45 pct)
512 43429.01 (0.00 pct) 42638.81 (-1.81 pct)
1024 39271.27 (0.00 pct) 40044.17 (1.96 pct)


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ stream - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

10 Runs:

Test: tip latency_nice
Copy: 336311.52 (0.00 pct) 326015.98 (-3.06 pct)
Scale: 212955.82 (0.00 pct) 208667.27 (-2.01 pct)
Add: 251518.23 (0.00 pct) 237286.20 (-5.65 pct)
Triad: 262077.88 (0.00 pct) 258949.80 (-1.19 pct)

100 Runs:

Test: tip latency_nice
Copy: 339533.83 (0.00 pct) 335126.73 (-1.29 pct)
Scale: 194736.72 (0.00 pct) 221151.24 (13.56 pct)
Add: 218294.54 (0.00 pct) 251427.43 (15.17 pct)
Triad: 262371.40 (0.00 pct) 260100.85 (-0.86 pct)

NPS2

10 Runs:

Test: tip latency_nice
Copy: 335277.15 (0.00 pct) 339614.38 (1.29 pct)
Scale: 220990.24 (0.00 pct) 221052.78 (0.02 pct)
Add: 264156.13 (0.00 pct) 263684.19 (-0.17 pct)
Triad: 268707.53 (0.00 pct) 272610.96 (1.45 pct)

100 Runs:

Test: tip latency_nice
Copy: 334913.73 (0.00 pct) 339001.88 (1.22 pct)
Scale: 230522.47 (0.00 pct) 229848.86 (-0.29 pct)
Add: 264567.28 (0.00 pct) 264288.34 (-0.10 pct)
Triad: 272974.23 (0.00 pct) 272045.17 (-0.34 pct)

NPS4

10 Runs:

Test: tip latency_nice
Copy: 299432.31 (0.00 pct) 307649.18 (2.74 pct)
Scale: 217998.17 (0.00 pct) 205763.70 (-5.61 pct)
Add: 234305.46 (0.00 pct) 226381.75 (-3.38 pct)
Triad: 244369.15 (0.00 pct) 254225.30 (4.03 pct)

100 Runs:

Test: tip latency_nice
Copy: 344421.25 (0.00 pct) 322189.81 (-6.45 pct)
Scale: 237998.44 (0.00 pct) 227709.58 (-4.32 pct)
Add: 257501.82 (0.00 pct) 244009.58 (-5.23 pct)
Triad: 267686.50 (0.00 pct) 251840.25 (-5.91 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Test cases for Latency Nice ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note: Latency Nice might be referred to as LN in the data below. The
Latency Nice value was set using a wrapper script for all the workload
threads during the testing.
All the test results reported below are for the NPS1 configuration.

o Hackbench Pipes (100000 loops, threads)

Test: tip Latency Nice: -20 Latency Nice: 0 Latency Nice: 19
1-groups: 4.23 (0.00 pct) 4.39 (-3.78 pct) 3.99 (5.67 pct) 3.88 (8.27 pct)
2-groups: 4.93 (0.00 pct) 4.91 (0.40 pct) 4.69 (4.86 pct) 4.59 (6.89 pct)
4-groups: 5.32 (0.00 pct) 5.37 (-0.93 pct) 5.19 (2.44 pct) 5.05 (5.07 pct)
8-groups: 5.46 (0.00 pct) 5.90 (-8.05 pct) 5.34 (2.19 pct) 5.17 (5.31 pct)
16-groups: 7.31 (0.00 pct) 7.99 (-9.30 pct) 6.96 (4.78 pct) 6.51 (10.94 pct)

o Only Hackbench with different Latency Nice Values

> Loops: 100000

- Pipe (Process)

Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
1-groups: 3.77 (0.00 pct) 4.23 (-12.20 pct) 3.83 (-1.59 pct)
2-groups: 4.39 (0.00 pct) 4.73 (-7.74 pct) 4.31 (1.82 pct)
4-groups: 4.80 (0.00 pct) 5.07 (-5.62 pct) 4.68 (2.50 pct)
8-groups: 4.95 (0.00 pct) 5.68 (-14.74 pct) 4.76 (3.83 pct)
16-groups: 6.47 (0.00 pct) 7.87 (-21.63 pct) 6.08 (6.02 pct)

- Socket (Thread)

Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
1-groups: 6.08 (0.00 pct) 5.99 (1.48 pct) 6.08 (0.00 pct)
2-groups: 6.15 (0.00 pct) 6.25 (-1.62 pct) 6.14 (0.16 pct)
4-groups: 6.39 (0.00 pct) 6.42 (-0.46 pct) 6.44 (-0.78 pct)
8-groups: 8.51 (0.00 pct) 9.01 (-5.87 pct) 8.36 (1.76 pct)
16-groups: 12.48 (0.00 pct) 15.32 (-22.75 pct) 12.72 (-1.92 pct)

- Socket (Process)

Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
1-groups: 6.44 (0.00 pct) 5.50 (14.59 pct) ^ 6.43 (0.15 pct)
2-groups: 6.55 (0.00 pct) 5.56 (15.11 pct) ^ 6.36 (2.90 pct)
4-groups: 6.74 (0.00 pct) 6.19 (8.16 pct) ^ 6.69 (0.74 pct)
8-groups: 8.03 (0.00 pct) 8.29 (-3.23 pct) 8.02 (0.12 pct)
16-groups: 12.25 (0.00 pct) 14.11 (-15.18 pct) 12.41 (-1.30 pct)

> Loops: 2160 (Same as in testing)

- Pipe (Thread)

Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
1-groups: 0.10 (0.00 pct) 0.12 (-20.00 pct) 0.10 (0.00 pct)
2-groups: 0.12 (0.00 pct) 0.15 (-25.00 pct) 0.11 (8.33 pct)
4-groups: 0.14 (0.00 pct) 0.18 (-28.57 pct) 0.15 (-7.14 pct)
8-groups: 0.17 (0.00 pct) 0.24 (-41.17 pct) 0.17 (0.00 pct)
16-groups: 0.26 (0.00 pct) 0.33 (-26.92 pct) 0.21 (19.23 pct)

- Pipe (Process)

Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
1-groups: 0.10 (0.00 pct) 0.12 (-20.00 pct) 0.10 (0.00 pct)
2-groups: 0.12 (0.00 pct) 0.16 (-33.33 pct) 0.12 (0.00 pct)
4-groups: 0.14 (0.00 pct) 0.17 (-21.42 pct) 0.13 (7.14 pct)
8-groups: 0.16 (0.00 pct) 0.24 (-50.00 pct) 0.16 (0.00 pct)
16-groups: 0.23 (0.00 pct) 0.33 (-43.47 pct) 0.19 (17.39 pct)

- Socket (Thread)

Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
1-groups: 0.19 (0.00 pct) 0.18 (5.26 pct) 0.18 (5.26 pct)
2-groups: 0.21 (0.00 pct) 0.21 (0.00 pct) 0.20 (4.76 pct)
4-groups: 0.22 (0.00 pct) 0.25 (-13.63 pct) 0.22 (0.00 pct)
8-groups: 0.27 (0.00 pct) 0.36 (-33.33 pct) 0.27 (0.00 pct)
16-groups: 0.42 (0.00 pct) 0.55 (-30.95 pct) 0.40 (4.76 pct)

- Socket (Process)

Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
1-groups: 0.17 (0.00 pct) 0.17 (0.00 pct) 0.17 (0.00 pct)
2-groups: 0.19 (0.00 pct) 0.20 (-5.26 pct) 0.19 (0.00 pct)
4-groups: 0.20 (0.00 pct) 0.22 (-10.00 pct) 0.20 (0.00 pct)
8-groups: 0.25 (0.00 pct) 0.32 (-28.00 pct) 0.25 (0.00 pct)
16-groups: 0.40 (0.00 pct) 0.51 (-27.50 pct) 0.39 (2.50 pct)

o Hackbench and Cyclictest in NPS1 configuration

perf bench sched messaging -p -t -l 100000 -g 16&
cyclictest --policy other -D 5 -q -n -H 20000

-----------------------------------------------------------------------------------------------------------------
|Hackbench | Cyclictest LN = 19 | Cyclictest LN = 0 | Cyclictest LN = -20 |
|LN |--------------------------------|---------------------------------|-----------------------------|
|v | Min | Avg | Max | Min | Avg | Max | Min | Avg | Max |
|--------------|--------|---------|-------------|----------|---------|------------|----------|---------|--------|
|0 | 54.00 | 117.00 | 3021.67 | 53.67 | 65.33 | 133.00 | 53.67 | 65.00 | 201.33 | ^
|19 | 50.00 | 100.67 | 3099.33 | 41.00 | 64.33 | 1014.33 | 54.00 | 63.67 | 213.33 |
|-20 | 53.00 | 169.00 | 11661.67 | 53.67 | 217.33 | 14313.67 | 46.00 | 61.33 | 236.00 | ^
-----------------------------------------------------------------------------------------------------------------

o Hackbench and schbench in NPS1 configuration

perf bench sched messaging -p -t -l 1000000 -g 16&
schbench -m 1 -t 64 -s 30s

------------------------------------------------------------------------------------------------------------
|Hackbench | schbench LN = 19 | schbench LN = 0 | schbench LN = -20 |
|LN |----------------------------|--------------------------------|-----------------------------|
|v | 90th | 95th | 99th | 90th | 95th | 99th | 90th | 95th | 99th |
|--------------|--------|--------|----------|---------|---------|------------|---------|----------|--------|
|0 | 4264 | 6744 | 15664 | 17952 | 32672 | 55488 | 15088 | 25312 | 50112 |
|19 | 288 | 613 | 2332 | 274 | 1015 | 3628 | 374 | 1394 | 4424 |
|-20 | 35904 | 47680 | 79744 | 87168 | 113536 | 176896 | 13008 | 21216 | 42560 | ^
------------------------------------------------------------------------------------------------------------

o SpecJBB Multi-JVM

---------------------------------------------
| Latency Nice | 0 | 19 |
---------------------------------------------
| max-jOPS | 100% | 109.92% |
| critical-jOPS | 100% | 153.70% |
---------------------------------------------

In most cases, latency nice delivers what it promises.
Some cases marked with "^" have shown anomalies or non-linear behavior
that is yet to be root caused. If you've seen something similar during
your testing, I would love to know what could lead to such behavior.

If you would like more details on the benchmark results reported above,
or if there is any specific workload you would like me to test on the
Zen3 machine, please do let me know.

>
> [..snip..]
>

--
Thanks and Regards,
Prateek

2022-10-13 15:51:18

by Vincent Guittot

Subject: Re: [PATCH v5 0/7] Add latency priority for CFS class

Hi Prateek,

Thanks for testing the patchset on AMD and for the test report below.

On Wed, 12 Oct 2022 at 16:54, K Prateek Nayak <[email protected]> wrote:
>
> Hello Vincent,
>
> Sharing results from testing on dual socket Zen3 system (2 x 64C/128T)
>
> tl;dr
>
> o I don't see any regression when workloads are running with
> DEFAULT_LATENCY_NICE
> o I can reproduce similar results as one reported in Patch 4 for
> hackbench with latency nice 19 and hackbench and cyclictest
> with various combination of latency nice values.
> o I can see improvements to tail latency for schbench with hackbench
> running in the background.
> o There is an unexpected non-linear behavior observed for couple of
> cases that I cannot explain yet. (Marked with "^" in detailed results)
> I have not yet gotten to the bottom of it but if I've missed
> something, please do let me know.
>
> Detailed results are shared below:
>
> On 9/25/2022 8:09 PM, Vincent Guittot wrote:
> > This patchset restarts the work about adding a latency priority to describe
> > the latency tolerance of cfs tasks.
> >
> > The patches [1-3] have been done by Parth:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > I have just rebased and moved the set of latency priority outside the
> > priority update. I have removed the reviewed tag because the patches
> > are 2 years old.
> >
> > This aims to be a generic interface and the following patches is one use
> > of it to improve the scheduling latency of cfs tasks.
> >
> > The patch [4] uses latency nice priority to define a latency offset
> > and then decide if a cfs task can or should preempt the current
> > running task. The patch gives some tests results with cyclictests and
> > hackbench to highlight the benefit of latency priority for short
> > interactive task or long intensive tasks.
> >
> > Patch [5] adds the support of latency nice priority to task group by
> > adding a cpu.latency.nice field. The range is [-20:19] as for setting task
> > latency priority.
> >
> > Patch [6] makes sched_core taking into account the latency offset.
> >
> > Patch [7] adds a rb tree to cover some corner cases where the latency
> > sensitive task (priority < 0) is preempted by high priority task (RT/DL)
> > or fails to preempt them. This patch ensures that tasks will have at least
> > a slice of sched_min_granularity in priority at wakeup. The patch gives
> > results to show the benefit in addition to patch 4.
> >
> > I have also backported the patchset on a dragonboard RB3 with an android
> > mainline kernel based on v5.18 for a quick test. I have used the
> > TouchLatency app which is part of AOSP and described to be a very good
> > test to highlight jitter and jank frame sources of a system [1].
> > In addition to the app, I have added some short running tasks waking-up
> > regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> > without overloading it (and disabling EAS). The 1st results shows that the
> > patchset helps to reduce the missed deadline frames from 5% to less than
> > 0.1% when the cpu.latency.nice of task group are set.
> >
> > I have also tested the patchset with the modified version of the alsa
> > latency test that has been shared by Tim. The test quickly xruns with
> > default latency nice priority 0 but is able to run without underuns with
> > a latency -20 and hackbench running simultaneously.
> >
> >
> > [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
>
> Following are the results from running standard benchmarks on a
> dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
>
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 socket.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 223-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 socket.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 223-231
> Node 7: 112-127, 232-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 5.19.0 tip sched/core
> - latency_nice: 5.19.0 tip sched/core + this series
>
> When we started testing, the tip was at:
> commit 7e9518baed4c ("sched/fair: Move call to list_last_entry() in detach_tasks")
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ hackbench - DEFAULT_LATENCY_NICE ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> NPS1
>
> Test: tip latency_nice
> 1-groups: 4.23 (0.00 pct) 4.06 (4.01 pct)
> 2-groups: 4.93 (0.00 pct) 4.89 (0.81 pct)
> 4-groups: 5.32 (0.00 pct) 5.31 (0.18 pct)
> 8-groups: 5.46 (0.00 pct) 5.54 (-1.46 pct)
> 16-groups: 7.31 (0.00 pct) 7.33 (-0.27 pct)
>
> NPS2
>
> Test: tip latency_nice
> 1-groups: 4.19 (0.00 pct) 4.12 (1.67 pct)
> 2-groups: 4.77 (0.00 pct) 4.82 (-1.04 pct)
> 4-groups: 5.15 (0.00 pct) 5.17 (-0.38 pct)
> 8-groups: 5.47 (0.00 pct) 5.48 (-0.18 pct)
> 16-groups: 6.63 (0.00 pct) 6.65 (-0.30 pct)
>
> NPS4
>
> Test: tip latency_nice
> 1-groups: 4.23 (0.00 pct) 4.31 (-1.89 pct)
> 2-groups: 4.78 (0.00 pct) 4.75 (0.62 pct)
> 4-groups: 5.17 (0.00 pct) 5.24 (-1.35 pct)
> 8-groups: 5.63 (0.00 pct) 5.59 (0.71 pct)
> 16-groups: 7.88 (0.00 pct) 7.09 (10.02 pct)
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ schbench - DEFAULT_LATENCY_NICE ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> NPS1
>
> #workers: tip latency_nice
> 1: 22.00 (0.00 pct) 21.00 (4.54 pct)
> 2: 34.00 (0.00 pct) 34.00 (0.00 pct)
> 4: 37.00 (0.00 pct) 40.00 (-8.10 pct)
> 8: 55.00 (0.00 pct) 49.00 (10.90 pct)
> 16: 69.00 (0.00 pct) 66.00 (4.34 pct)
> 32: 113.00 (0.00 pct) 117.00 (-3.53 pct)
> 64: 219.00 (0.00 pct) 242.00 (-10.50 pct) *
> 64: 219.00 (0.00 pct) 194.00 (11.41 pct) [Verification Run]
> 128: 506.00 (0.00 pct) 513.00 (-1.38 pct)
> 256: 45440.00 (0.00 pct) 44992.00 (0.98 pct)
> 512: 76672.00 (0.00 pct) 83328.00 (-8.68 pct)
>
> NPS2
>
> #workers: tip latency_nice
> 1: 31.00 (0.00 pct) 20.00 (35.48 pct)
> 2: 36.00 (0.00 pct) 28.00 (22.22 pct)
> 4: 45.00 (0.00 pct) 37.00 (17.77 pct)
> 8: 47.00 (0.00 pct) 51.00 (-8.51 pct)
> 16: 66.00 (0.00 pct) 69.00 (-4.54 pct)
> 32: 114.00 (0.00 pct) 113.00 (0.87 pct)
> 64: 215.00 (0.00 pct) 215.00 (0.00 pct)
> 128: 495.00 (0.00 pct) 529.00 (-6.86 pct) *
> 128: 495.00 (0.00 pct) 416.00 (15.95 pct) [Verification Run]
> 256: 48576.00 (0.00 pct) 46912.00 (3.42 pct)
> 512: 79232.00 (0.00 pct) 82560.00 (-4.20 pct)
>
> NPS4
>
> #workers: tip latency_nice
> 1: 30.00 (0.00 pct) 34.00 (-13.33 pct)
> 2: 34.00 (0.00 pct) 42.00 (-23.52 pct)
> 4: 41.00 (0.00 pct) 42.00 (-2.43 pct)
> 8: 60.00 (0.00 pct) 55.00 (8.33 pct)
> 16: 68.00 (0.00 pct) 69.00 (-1.47 pct)
> 32: 116.00 (0.00 pct) 115.00 (0.86 pct)
> 64: 224.00 (0.00 pct) 223.00 (0.44 pct)
> 128: 495.00 (0.00 pct) 677.00 (-36.76 pct) *
> 128: 495.00 (0.00 pct) 388.00 (21.61 pct) [Verification Run]
> 256: 45888.00 (0.00 pct) 44608.00 (2.78 pct)
> 512: 78464.00 (0.00 pct) 81536.00 (-3.91 pct)
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ tbench - DEFAULT_LATENCY_NICE ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> NPS1
>
> Clients: tip latency_nice
> 1 550.66 (0.00 pct) 546.63 (-0.73 pct)
> 2 1009.69 (0.00 pct) 1016.40 (0.66 pct)
> 4 1795.32 (0.00 pct) 1773.95 (-1.19 pct)
> 8 2971.16 (0.00 pct) 2930.26 (-1.37 pct)
> 16 4627.98 (0.00 pct) 4727.82 (2.15 pct)
> 32 8065.15 (0.00 pct) 9019.11 (11.82 pct)
> 64 14994.32 (0.00 pct) 15100.22 (0.70 pct)
> 128 5175.73 (0.00 pct) 18223.69 (252.09 pct) *
> 128 20029.53 (0.00 pct) 20517.17 (2.43 pct) [Verification Run]
> 256 48763.57 (0.00 pct) 44463.63 (-8.81 pct)
> 512 43780.78 (0.00 pct) 44170.21 (0.88 pct)
> 1024 40341.84 (0.00 pct) 40883.10 (1.34 pct)
>
> NPS2
>
> Clients: tip latency_nice
> 1 551.06 (0.00 pct) 547.43 (-0.65 pct)
> 2 1000.76 (0.00 pct) 1014.83 (1.40 pct)
> 4 1737.02 (0.00 pct) 1742.30 (0.30 pct)
> 8 2992.31 (0.00 pct) 2951.59 (-1.36 pct)
> 16 4579.29 (0.00 pct) 4558.05 (-0.46 pct)
> 32 9120.73 (0.00 pct) 8122.06 (-10.94 pct) *
> 32 8814.62 (0.00 pct) 8965.54 (1.71 pct) [Verification Run]
> 64 14918.58 (0.00 pct) 14890.93 (-0.18 pct)
> 128 20830.61 (0.00 pct) 20410.48 (-2.01 pct)
> 256 47708.18 (0.00 pct) 45312.84 (-5.02 pct) *
> 256 44941.88 (0.00 pct) 44555.92 (-0.85 pct) [Verification Run]
> 512 43721.79 (0.00 pct) 43653.43 (-0.15 pct)
> 1024 40920.49 (0.00 pct) 41162.17 (0.59 pct)
>
> NPS4
>
> Clients: tip latency_nice
> 1 549.22 (0.00 pct) 539.81 (-1.71 pct)
> 2 1000.08 (0.00 pct) 1010.12 (1.00 pct)
> 4 1794.78 (0.00 pct) 1736.06 (-3.27 pct)
> 8 3008.50 (0.00 pct) 2952.68 (-1.85 pct)
> 16 4804.71 (0.00 pct) 4454.17 (-7.29 pct) *
> 16 4391.10 (0.00 pct) 4497.43 (2.42 pct) [Verification Run]
> 32 9156.57 (0.00 pct) 8820.05 (-3.67 pct)
> 64 14901.45 (0.00 pct) 14786.25 (-0.77 pct)
> 128 20771.20 (0.00 pct) 19955.11 (-3.92 pct)
> 256 47033.88 (0.00 pct) 44937.51 (-4.45 pct)
> 512 43429.01 (0.00 pct) 42638.81 (-1.81 pct)
> 1024 39271.27 (0.00 pct) 40044.17 (1.96 pct)
>
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ stream - DEFAULT_LATENCY_NICE ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> NPS1
>
> 10 Runs:
>
> Test: tip latency_nice
> Copy: 336311.52 (0.00 pct) 326015.98 (-3.06 pct)
> Scale: 212955.82 (0.00 pct) 208667.27 (-2.01 pct)
> Add: 251518.23 (0.00 pct) 237286.20 (-5.65 pct)
> Triad: 262077.88 (0.00 pct) 258949.80 (-1.19 pct)
>
> 100 Runs:
>
> Test: tip latency_nice
> Copy: 339533.83 (0.00 pct) 335126.73 (-1.29 pct)
> Scale: 194736.72 (0.00 pct) 221151.24 (13.56 pct)
> Add: 218294.54 (0.00 pct) 251427.43 (15.17 pct)
> Triad: 262371.40 (0.00 pct) 260100.85 (-0.86 pct)
>
> NPS2
>
> 10 Runs:
>
> Test: tip latency_nice
> Copy: 335277.15 (0.00 pct) 339614.38 (1.29 pct)
> Scale: 220990.24 (0.00 pct) 221052.78 (0.02 pct)
> Add: 264156.13 (0.00 pct) 263684.19 (-0.17 pct)
> Triad: 268707.53 (0.00 pct) 272610.96 (1.45 pct)
>
> 100 Runs:
>
> Test: tip latency_nice
> Copy: 334913.73 (0.00 pct) 339001.88 (1.22 pct)
> Scale: 230522.47 (0.00 pct) 229848.86 (-0.29 pct)
> Add: 264567.28 (0.00 pct) 264288.34 (-0.10 pct)
> Triad: 272974.23 (0.00 pct) 272045.17 (-0.34 pct)
>
> NPS4
>
> 10 Runs:
>
> Test: tip latency_nice
> Copy: 299432.31 (0.00 pct) 307649.18 (2.74 pct)
> Scale: 217998.17 (0.00 pct) 205763.70 (-5.61 pct)
> Add: 234305.46 (0.00 pct) 226381.75 (-3.38 pct)
> Triad: 244369.15 (0.00 pct) 254225.30 (4.03 pct)
>
> 100 Runs:
>
> Test: tip latency_nice
> Copy: 344421.25 (0.00 pct) 322189.81 (-6.45 pct)
> Scale: 237998.44 (0.00 pct) 227709.58 (-4.32 pct)
> Add: 257501.82 (0.00 pct) 244009.58 (-5.23 pct)
> Triad: 267686.50 (0.00 pct) 251840.25 (-5.91 pct)
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ Test cases for Latency Nice ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Note: Latency Nice might be referred to as LN in the data below. Latency Nice
> value was set using a wrapper script for all the workload threads during the
> testing.
> All the test results reported below are for NPS1 configuration.
>
> o Hackbench Pipes (100000 loops, threads)
>
> Test: tip Latency Nice: -20 Latency Nice: 0 Latency Nice: 19
> 1-groups: 4.23 (0.00 pct) 4.39 (-3.78 pct) 3.99 (5.67 pct) 3.88 (8.27 pct)
> 2-groups: 4.93 (0.00 pct) 4.91 (0.40 pct) 4.69 (4.86 pct) 4.59 (6.89 pct)
> 4-groups: 5.32 (0.00 pct) 5.37 (-0.93 pct) 5.19 (2.44 pct) 5.05 (5.07 pct)
> 8-groups: 5.46 (0.00 pct) 5.90 (-8.05 pct) 5.34 (2.19 pct) 5.17 (5.31 pct)
> 16-groups: 7.31 (0.00 pct) 7.99 (-9.30 pct) 6.96 (4.78 pct) 6.51 (10.94 pct)
>
> o Only Hackbench with different Latency Nice Values
>
> > Loops: 100000
>
> - Pipe (Process)
>
> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
> 1-groups: 3.77 (0.00 pct) 4.23 (-12.20 pct) 3.83 (-1.59 pct)
> 2-groups: 4.39 (0.00 pct) 4.73 (-7.74 pct) 4.31 (1.82 pct)
> 4-groups: 4.80 (0.00 pct) 5.07 (-5.62 pct) 4.68 (2.50 pct)
> 8-groups: 4.95 (0.00 pct) 5.68 (-14.74 pct) 4.76 (3.83 pct)
> 16-groups: 6.47 (0.00 pct) 7.87 (-21.63 pct) 6.08 (6.02 pct)
>
> - Socket (Thread)
>
> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
> 1-groups: 6.08 (0.00 pct) 5.99 (1.48 pct) 6.08 (0.00 pct)
> 2-groups: 6.15 (0.00 pct) 6.25 (-1.62 pct) 6.14 (0.16 pct)
> 4-groups: 6.39 (0.00 pct) 6.42 (-0.46 pct) 6.44 (-0.78 pct)
> 8-groups: 8.51 (0.00 pct) 9.01 (-5.87 pct) 8.36 (1.76 pct)
> 16-groups: 12.48 (0.00 pct) 15.32 (-22.75 pct) 12.72 (-1.92 pct)
>
> - Socket (Process)
>
> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
> 1-groups: 6.44 (0.00 pct) 5.50 (14.59 pct) ^ 6.43 (0.15 pct)
> 2-groups: 6.55 (0.00 pct) 5.56 (15.11 pct) ^ 6.36 (2.90 pct)
> 4-groups: 6.74 (0.00 pct) 6.19 (8.16 pct) ^ 6.69 (0.74 pct)
> 8-groups: 8.03 (0.00 pct) 8.29 (-3.23 pct) 8.02 (0.12 pct)
> 16-groups: 12.25 (0.00 pct) 14.11 (-15.18 pct) 12.41 (-1.30 pct)

I don't see any improvement with LN:-20, only with LN:19.

How many iterations do you run? Could it be that the results vary
between iterations? For some configurations I have a stddev of 10-20%
for LN:0 and LN:-20.

>
> > Loops: 2160 (Same as in testing)
>
> - Pipe (Thread)
>
> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
> 1-groups: 0.10 (0.00 pct) 0.12 (-20.00 pct) 0.10 (0.00 pct)
> 2-groups: 0.12 (0.00 pct) 0.15 (-25.00 pct) 0.11 (8.33 pct)
> 4-groups: 0.14 (0.00 pct) 0.18 (-28.57 pct) 0.15 (-7.14 pct)
> 8-groups: 0.17 (0.00 pct) 0.24 (-41.17 pct) 0.17 (0.00 pct)
> 16-groups: 0.26 (0.00 pct) 0.33 (-26.92 pct) 0.21 (19.23 pct)
>
> - Pipe (Process)
>
> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
> 1-groups: 0.10 (0.00 pct) 0.12 (-20.00 pct) 0.10 (0.00 pct)
> 2-groups: 0.12 (0.00 pct) 0.16 (-33.33 pct) 0.12 (0.00 pct)
> 4-groups: 0.14 (0.00 pct) 0.17 (-21.42 pct) 0.13 (7.14 pct)
> 8-groups: 0.16 (0.00 pct) 0.24 (-50.00 pct) 0.16 (0.00 pct)
> 16-groups: 0.23 (0.00 pct) 0.33 (-43.47 pct) 0.19 (17.39 pct)
>
> - Socket (Thread)
>
> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
> 1-groups: 0.19 (0.00 pct) 0.18 (5.26 pct) 0.18 (5.26 pct)
> 2-groups: 0.21 (0.00 pct) 0.21 (0.00 pct) 0.20 (4.76 pct)
> 4-groups: 0.22 (0.00 pct) 0.25 (-13.63 pct) 0.22 (0.00 pct)
> 8-groups: 0.27 (0.00 pct) 0.36 (-33.33 pct) 0.27 (0.00 pct)
> 16-groups: 0.42 (0.00 pct) 0.55 (-30.95 pct) 0.40 (4.76 pct)
>
> - Socket (Process)
>
> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
> 1-groups: 0.17 (0.00 pct) 0.17 (0.00 pct) 0.17 (0.00 pct)
> 2-groups: 0.19 (0.00 pct) 0.20 (-5.26 pct) 0.19 (0.00 pct)
> 4-groups: 0.20 (0.00 pct) 0.22 (-10.00 pct) 0.20 (0.00 pct)
> 8-groups: 0.25 (0.00 pct) 0.32 (-28.00 pct) 0.25 (0.00 pct)
> 16-groups: 0.40 (0.00 pct) 0.51 (-27.50 pct) 0.39 (2.50 pct)
>
> o Hackbench and Cyclictest in NPS1 configuration
>
> perf bench sched messaging -p -t -l 100000 -g 16&
> cyclictest --policy other -D 5 -q -n -H 20000
>
> -----------------------------------------------------------------------------------------------------------------
> |Hackbench | Cyclictest LN = 19 | Cyclictest LN = 0 | Cyclictest LN = -20 |
> |LN |--------------------------------|---------------------------------|-----------------------------|
> |v | Min | Avg | Max | Min | Avg | Max | Min | Avg | Max |
> |--------------|--------|---------|-------------|----------|---------|------------|----------|---------|--------|
> |0 | 54.00 | 117.00 | 3021.67 | 53.67 | 65.33 | 133.00 | 53.67 | 65.00 | 201.33 | ^
> |19 | 50.00 | 100.67 | 3099.33 | 41.00 | 64.33 | 1014.33 | 54.00 | 63.67 | 213.33 |
> |-20 | 53.00 | 169.00 | 11661.67 | 53.67 | 217.33 | 14313.67 | 46.00 | 61.33 | 236.00 | ^
> -----------------------------------------------------------------------------------------------------------------

The latency results look good with Cyclictest LN:0 and hackbench LN:0.
133us max latency. This suggests that your system is not overloaded
and cyclictest doesn't really compete with others to run.

>
> o Hackbench and schbench in NPS1 configuration
>
> perf bench sched messaging -p -t -l 1000000 -g 16&
> schebcnh -m 1 -t 64 -s 30s
>
> ------------------------------------------------------------------------------------------------------------
> |Hackbench | schbench LN = 19 | schbench LN = 0 | schbench LN = -20 |
> |LN |----------------------------|--------------------------------|-----------------------------|
> |v | 90th | 95th | 99th | 90th | 95th | 99th | 90th | 95th | 99th |
> |--------------|--------|--------|----------|---------|---------|------------|---------|----------|--------|
> |0 | 4264 | 6744 | 15664 | 17952 | 32672 | 55488 | 15088 | 25312 | 50112 |
> |19 | 288 | 613 | 2332 | 274 | 1015 | 3628 | 374 | 1394 | 4424 |
> |-20 | 35904 | 47680 | 79744 | 87168 | 113536 | 176896 | 13008 | 21216 | 42560 | ^
> ------------------------------------------------------------------------------------------------------------

For schbench, your test is 30 seconds long, which is longer than
the duration of perf bench sched messaging -p -t -l 1000000 -g 16&

The duration of the latter varies depending on the latency nice value,
so schbench is disturbed for more time in some cases.
>
> o SpecJBB Multi-JVM
>
> ---------------------------------------------
> | Latency Nice | 0 | 19 |
> ---------------------------------------------
> | max-jOPS | 100% | 109.92% |
> | critical-jOPS | 100% | 153.70% |
> ---------------------------------------------
>
> In most cases, latency nice delivers what it promises.
> Some cases marked with "^" have shown anomalies or non-linear behavior
> that is yet to be root caused. If you've seen something similar during
> your testing, I would love to know what could lead to such a behavior.

I haven't seen anything like the results that you tagged with ^. As a
side note, the numbers of groups (g1 g4 g8 g16) that I used with
hackbench have been chosen according to my 8-core system. Your
system is much larger, and hackbench may not overload it with such a
small number of groups. Maybe you could try with g32 g64 g128 g256?



>
> If you would like more details on the benchmarks results reported above
> or if there is any specific workload you would like me to test on the
> Zen3 machine, please do let me know.
>
> >
> > [..snip..]
> >
>
> --
> Thanks and Regards,
> Prateek

2022-10-17 07:01:28

by K Prateek Nayak

Subject: Re: [PATCH v5 0/7] Add latency priority for CFS class

Hello Vincent,

Thank you for taking a look at the report.

On 10/13/2022 8:54 PM, Vincent Guittot wrote:
> Hi Prateek,
>
> Thanks for testing the patchset on AMD and the test report below.
>
> On Wed, 12 Oct 2022 at 16:54, K Prateek Nayak <[email protected]> wrote:
>>
>> [..snip..]
>>
>> - Socket (Process)
>>
>> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
>> 1-groups: 6.44 (0.00 pct) 5.50 (14.59 pct) ^ 6.43 (0.15 pct)
>> 2-groups: 6.55 (0.00 pct) 5.56 (15.11 pct) ^ 6.36 (2.90 pct)
>> 4-groups: 6.74 (0.00 pct) 6.19 (8.16 pct) ^ 6.69 (0.74 pct)
>> 8-groups: 8.03 (0.00 pct) 8.29 (-3.23 pct) 8.02 (0.12 pct)
>> 16-groups: 12.25 (0.00 pct) 14.11 (-15.18 pct) 12.41 (-1.30 pct)
>
> I don't see any improvement with LN:-20 but only for LN:19
>
> How many iterations do you run ? Could it be that the results vary
> between iterations ? For some configuration I have a stddev of 10-20%
> for LN:0 and LN:-20
>

Yes, I do see a lot of run-to-run variation for the above runs:

For 1-group:

LN: : 0 -20 19
Min : 6.26 4.97 6.28
Max : 6.54 6.71 6.55
Median : 6.45 5.28 6.43
AMean : 6.44 5.50 6.43
GMean : 6.44 5.47 6.43
HMean : 6.44 5.44 6.43
AMean Stddev : 0.08 0.60 0.08
AMean CoefVar : 1.18 pct 10.89 pct 1.28 pct

For 2-group:

LN: : 0 -20 19
Min : 5.80 5.38 5.28
Max : 6.80 6.70 6.32
Median : 6.66 6.53 5.48
AMean : 6.55 6.36 5.56
GMean : 6.55 6.35 5.55
HMean : 6.54 6.33 5.54
AMean Stddev : 0.29 0.41 0.33
AMean CoefVar : 4.38 pct 6.48 pct 5.99 pct

I've rerun this data point and following are the results:

- Socket (Process) (Loop: 100000)

Test: LN:0 LN:-20 LN:19
1-groups: 6.81 (0.00 pct) 6.62 (2.79 pct) 6.62 (2.79 pct)
2-groups: 6.76 (0.00 pct) 6.69 (1.03 pct) 6.65 (1.62 pct)
4-groups: 6.62 (0.00 pct) 6.65 (-0.45 pct) 6.63 (-0.15 pct)
8-groups: 7.84 (0.00 pct) 7.81 (0.38 pct) 7.78 (0.76 pct)
16-groups: 12.87 (0.00 pct) 12.40 (3.65 pct) 12.35 (4.04 pct)

Results are more stable in these runs, but the runs with LN: -20
have a comparatively larger stddev than those with LN: 0 and LN: 19.

>>
>>> Loops: 2160 (Same as in testing)
>>
>> - Pipe (Thread)
>>
>> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
>> 1-groups: 0.10 (0.00 pct) 0.12 (-20.00 pct) 0.10 (0.00 pct)
>> 2-groups: 0.12 (0.00 pct) 0.15 (-25.00 pct) 0.11 (8.33 pct)
>> 4-groups: 0.14 (0.00 pct) 0.18 (-28.57 pct) 0.15 (-7.14 pct)
>> 8-groups: 0.17 (0.00 pct) 0.24 (-41.17 pct) 0.17 (0.00 pct)
>> 16-groups: 0.26 (0.00 pct) 0.33 (-26.92 pct) 0.21 (19.23 pct)
>>
>> - Pipe (Process)
>>
>> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
>> 1-groups: 0.10 (0.00 pct) 0.12 (-20.00 pct) 0.10 (0.00 pct)
>> 2-groups: 0.12 (0.00 pct) 0.16 (-33.33 pct) 0.12 (0.00 pct)
>> 4-groups: 0.14 (0.00 pct) 0.17 (-21.42 pct) 0.13 (7.14 pct)
>> 8-groups: 0.16 (0.00 pct) 0.24 (-50.00 pct) 0.16 (0.00 pct)
>> 16-groups: 0.23 (0.00 pct) 0.33 (-43.47 pct) 0.19 (17.39 pct)
>>
>> - Socket (Thread)
>>
>> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
>> 1-groups: 0.19 (0.00 pct) 0.18 (5.26 pct) 0.18 (5.26 pct)
>> 2-groups: 0.21 (0.00 pct) 0.21 (0.00 pct) 0.20 (4.76 pct)
>> 4-groups: 0.22 (0.00 pct) 0.25 (-13.63 pct) 0.22 (0.00 pct)
>> 8-groups: 0.27 (0.00 pct) 0.36 (-33.33 pct) 0.27 (0.00 pct)
>> 16-groups: 0.42 (0.00 pct) 0.55 (-30.95 pct) 0.40 (4.76 pct)
>>
>> - Socket (Process)
>>
>> Test: Latency Nice: 0 Latency Nice: -20 Latency Nice: 19
>> 1-groups: 0.17 (0.00 pct) 0.17 (0.00 pct) 0.17 (0.00 pct)
>> 2-groups: 0.19 (0.00 pct) 0.20 (-5.26 pct) 0.19 (0.00 pct)
>> 4-groups: 0.20 (0.00 pct) 0.22 (-10.00 pct) 0.20 (0.00 pct)
>> 8-groups: 0.25 (0.00 pct) 0.32 (-28.00 pct) 0.25 (0.00 pct)
>> 16-groups: 0.40 (0.00 pct) 0.51 (-27.50 pct) 0.39 (2.50 pct)
>>
>> o Hackbench and Cyclictest in NPS1 configuration
>>
>> perf bench sched messaging -p -t -l 100000 -g 16&
>> cyclictest --policy other -D 5 -q -n -H 20000
>>
>> -----------------------------------------------------------------------------------------------------------------
>> |Hackbench | Cyclictest LN = 19 | Cyclictest LN = 0 | Cyclictest LN = -20 |
>> |LN |--------------------------------|---------------------------------|-----------------------------|
>> |v | Min | Avg | Max | Min | Avg | Max | Min | Avg | Max |
>> |--------------|--------|---------|-------------|----------|---------|------------|----------|---------|--------|
>> |0 | 54.00 | 117.00 | 3021.67 | 53.67 | 65.33 | 133.00 | 53.67 | 65.00 | 201.33 | ^
>> |19 | 50.00 | 100.67 | 3099.33 | 41.00 | 64.33 | 1014.33 | 54.00 | 63.67 | 213.33 |
>> |-20 | 53.00 | 169.00 | 11661.67 | 53.67 | 217.33 | 14313.67 | 46.00 | 61.33 | 236.00 | ^
>> -----------------------------------------------------------------------------------------------------------------
>
> The latency results look good with Cyclictest LN:0 and hackbench LN:0.
> 133us max latency. This suggests that your system is not overloaded
> and cyclictest doesn't really compete with others to run.

I'll get data while running hackbench with a larger number
of groups. I'll look out for larger latency in the LN: (0, 0)
case to check for CPU contention.

>
>>
>> o Hackbench and schbench in NPS1 configuration
>>
>> perf bench sched messaging -p -t -l 1000000 -g 16&
>> schebcnh -m 1 -t 64 -s 30s
>>
>> ------------------------------------------------------------------------------------------------------------
>> |Hackbench | schbench LN = 19 | schbench LN = 0 | schbench LN = -20 |
>> |LN |----------------------------|--------------------------------|-----------------------------|
>> |v | 90th | 95th | 99th | 90th | 95th | 99th | 90th | 95th | 99th |
>> |--------------|--------|--------|----------|---------|---------|------------|---------|----------|--------|
>> |0 | 4264 | 6744 | 15664 | 17952 | 32672 | 55488 | 15088 | 25312 | 50112 |
>> |19 | 288 | 613 | 2332 | 274 | 1015 | 3628 | 374 | 1394 | 4424 |
>> |-20 | 35904 | 47680 | 79744 | 87168 | 113536 | 176896 | 13008 | 21216 | 42560 | ^
>> ------------------------------------------------------------------------------------------------------------
>
> For the schbench, your test is 30 seconds long which is longer than
> the duration of perf bench sched messaging -p -t -l 1000000 -g 16&

With a loop size of 1 million, I see schbench complete before
hackbench in all the cases. I'll rerun this with a larger group
size too, to get more data and make sure hackbench runs longer
than schbench in all cases.

>
> The duration of the latter varies depending of latency nice value so
> schbench is disturb more time in some cases
>>
>> o SpecJBB Multi-JVM
>>
>> ---------------------------------------------
>> | Latency Nice | 0 | 19 |
>> ---------------------------------------------
>> | max-jOPS | 100% | 109.92% |
>> | critical-jOPS | 100% | 153.70% |
>> ---------------------------------------------
>>
>> In most cases, latency nice delivers what it promises.
>> Some cases marked with "^" have shown anomalies or non-linear behavior
>> that is yet to be root caused. If you've seen something similar during
>> your testing, I would love to know what could lead to such a behavior.
>
> I haven't seen anything like the results that you tagged with ^. As a
> side note, the numbers of groups (g1 g4 g8 g1) that I used with
> hackbench, have been chosen according to my 8 cores system. Your
> system is much larger and hackbench may not overload it with such a
> small number of groups. Maybe you could try with g32 g64 g128 g256 ?
>

I agree. I'll get the data for cyclictest and schbench with hackbench
running a larger number of groups alongside.

>
>>
>> If you would like more details on the benchmarks results reported above
>> or if there is any specific workload you would like me to test on the
>> Zen3 machine, please do let me know.
>>
>>>
>>> [..snip..]
>>>
--
Thanks and Regards,
Prateek

2022-10-25 07:31:28

by K Prateek Nayak

Subject: Re: [PATCH v5 0/7] Add latency priority for CFS class

Hello Vincent,

I've rerun some tests with a different configuration with more
contention for the CPU, and I can see linear behavior. Sharing the
results below.

On 10/13/2022 8:54 PM, Vincent Guittot wrote:
>
> [..snip..]
>>
>> o Hackbench and Cyclictest in NPS1 configuration
>>
>> perf bench sched messaging -p -t -l 100000 -g 16&
>> cyclictest --policy other -D 5 -q -n -H 20000
>>
>> -----------------------------------------------------------------------------------------------------------------
>> |Hackbench | Cyclictest LN = 19 | Cyclictest LN = 0 | Cyclictest LN = -20 |
>> |LN |--------------------------------|---------------------------------|-----------------------------|
>> |v | Min | Avg | Max | Min | Avg | Max | Min | Avg | Max |
>> |--------------|--------|---------|-------------|----------|---------|------------|----------|---------|--------|
>> |0 | 54.00 | 117.00 | 3021.67 | 53.67 | 65.33 | 133.00 | 53.67 | 65.00 | 201.33 | ^
>> |19 | 50.00 | 100.67 | 3099.33 | 41.00 | 64.33 | 1014.33 | 54.00 | 63.67 | 213.33 |
>> |-20 | 53.00 | 169.00 | 11661.67 | 53.67 | 217.33 | 14313.67 | 46.00 | 61.33 | 236.00 | ^
>> -----------------------------------------------------------------------------------------------------------------
>
> The latency results look good with Cyclictest LN:0 and hackbench LN:0.
> 133us max latency. This suggests that your system is not overloaded
> and cyclictest doesn't really compete with others to run.

Following is the result of running cyclictest alongside hackbench with 32 groups:

perf bench sched messaging -p -l 100000 -g 32&
cyclictest --policy other -D 5 -q -n -H 20000

----------------------------------------------------------------------------------------------------------
| Hackbench | Cyclictest LN = 19 | Cyclictest LN = 0 | Cyclictest LN = -20 |
| LN |------------------------------|-------------------------------|---------------------------|
| | Min | Avg | Max | Min | Avg | Max | Min | Avg | Max |
|-------------|--------|---------|-----------|--------|---------|------------|--------|-------|----------|
| 0 | 54.00 | 165.00 | 6899.00 | 22.00 | 85.00 | 3294.00 | 23.00 | 64.00 | 276.00 |
| 19 | 53.00 | 173.00 | 3275.00 | 40.00 | 60.00 | 2276.00 | 13.00 | 59.00 | 94.00 |
| -20 | 52.00 | 293.00 | 19980.00 | 52.00 | 280.00 | 14305.00 | 53.00 | 95.00 | 5713.00 |
----------------------------------------------------------------------------------------------------------

I see a spike for Max in the (0, 0) configuration, and the latency decreases
monotonically with lower latency nice values.

>
>>
>> o Hackbench and schbench in NPS1 configuration
>>
>> perf bench sched messaging -p -t -l 1000000 -g 16&
>> schebcnh -m 1 -t 64 -s 30s
>>
>> ------------------------------------------------------------------------------------------------------------
>> |Hackbench | schbench LN = 19 | schbench LN = 0 | schbench LN = -20 |
>> |LN |----------------------------|--------------------------------|-----------------------------|
>> |v | 90th | 95th | 99th | 90th | 95th | 99th | 90th | 95th | 99th |
>> |--------------|--------|--------|----------|---------|---------|------------|---------|----------|--------|
>> |0 | 4264 | 6744 | 15664 | 17952 | 32672 | 55488 | 15088 | 25312 | 50112 |
>> |19 | 288 | 613 | 2332 | 274 | 1015 | 3628 | 374 | 1394 | 4424 |
>> |-20 | 35904 | 47680 | 79744 | 87168 | 113536 | 176896 | 13008 | 21216 | 42560 | ^
>> ------------------------------------------------------------------------------------------------------------
>
> For the schbench, your test is 30 seconds long which is longer than
> the duration of perf bench sched messaging -p -t -l 1000000 -g 16&
>
> The duration of the latter varies depending of latency nice value so
> schbench is disturb more time in some cases

I've rerun this with hackbench running 128 groups alongside schbench
with 2 message threads and 1 worker each. With a larger worker count, I
still see non-monotonic behavior in the 99th percentile latency of schbench.
I also see the number of latency samples collected by schbench vary
over the 30 second run for different latency nice values, which could
also play a part in the unexpected behavior. For a lower worker
count, I see that the number of samples collected is similar. Following
is the configuration and the latency reported by schbench:

perf bench sched messaging -p -t -l 150000 -g 128&
schbench -m 2 -t 1 -s 30s

Note: In all cases, hackbench runs longer than schbench.

-------------------------------------------------------------------------------------------------
| Hackbench | schbench LN = 19 | schbench LN = 0 | schbench LN = -20 |
| LN |----------------------------|---------------------------|--------------------------|
| | 90th | 95th | 99th | 90th | 95th | 99th | 90th | 95th | 99th |
|-----------|--------|--------|----------|--------|--------|---------|--------|--------|--------|
| 0 | 42 | 92 | 2972 | 26 | 49 | 2356 | 9 | 11 | 20 |
| 19 | 35 | 424 | 4984 | 13 | 390 | 5096 | 8 | 10 | 14 | ^
| -19 | 144 | 3516 | 110208 | 61 | 807 | 34880 | 25 | 39 | 295 |
-------------------------------------------------------------------------------------------------

I see the 90th and 95th percentile latencies decrease monotonically with
the latency nice value of schbench (for a fixed latency nice value of
hackbench), but there are cases where the 99th percentile latency
reported by schbench may not strictly decrease with a lower latency
nice value (marked with ^).

Note: Even a small number of bad samples can affect the 99th
percentile latency for the above configuration. The monotonic
behavior in the 90th and 95th percentile latencies is a good data point
to show that latency nice is indeed working as expected.

If there is any specific workload you would like me to run on the
test system, or any additional data you would like for above
workloads, please do let me know.

--
Thanks and Regards,
Prateek

2022-10-27 16:54:42

by Vincent Guittot

Subject: Re: [PATCH v5 0/7] Add latency priority for CFS class

Hi Prateek,

On Tue, 25 Oct 2022 at 08:36, K Prateek Nayak <[email protected]> wrote:
>
> Hello Vincent,
>
> I've rerun some tests with a different configuration with more
> contention for CPU and I can see a linear behavior. Sharing the
> results below.
>
> On 10/13/2022 8:54 PM, Vincent Guittot wrote:
> >
> > [..snip..]
> >>
> >> o Hackbench and Cyclictest in NPS1 configuration
> >>
> >> perf bench sched messaging -p -t -l 100000 -g 16&
> >> cyclictest --policy other -D 5 -q -n -H 20000
> >>
> >> -----------------------------------------------------------------------------------------------------------------
> >> |Hackbench | Cyclictest LN = 19 | Cyclictest LN = 0 | Cyclictest LN = -20 |
> >> |LN |--------------------------------|---------------------------------|-----------------------------|
> >> |v | Min | Avg | Max | Min | Avg | Max | Min | Avg | Max |
> >> |--------------|--------|---------|-------------|----------|---------|------------|----------|---------|--------|
> >> |0 | 54.00 | 117.00 | 3021.67 | 53.67 | 65.33 | 133.00 | 53.67 | 65.00 | 201.33 | ^
> >> |19 | 50.00 | 100.67 | 3099.33 | 41.00 | 64.33 | 1014.33 | 54.00 | 63.67 | 213.33 |
> >> |-20 | 53.00 | 169.00 | 11661.67 | 53.67 | 217.33 | 14313.67 | 46.00 | 61.33 | 236.00 | ^
> >> -----------------------------------------------------------------------------------------------------------------
> >
> > The latency results look good with Cyclictest LN:0 and hackbench LN:0.
> > 133us max latency. This suggests that your system is not overloaded
> > and cyclictest doesn't really compete with others to run.
>
> Following is the result of running cyclictest alongside hackbench with 32 groups:
>
> perf bench sched messaging -p -l 100000 -g 32&
> cyclictest --policy other -D 5 -q -n -H 20000
>
> ----------------------------------------------------------------------------------------------------------
> | Hackbench | Cyclictest LN = 19 | Cyclictest LN = 0 | Cyclictest LN = -20 |
> | LN |------------------------------|-------------------------------|---------------------------|
> | | Min | Avg | Max | Min | Avg | Max | Min | Avg | Max |
> |-------------|--------|---------|-----------|--------|---------|------------|--------|-------|----------|
> | 0 | 54.00 | 165.00 | 6899.00 | 22.00 | 85.00 | 3294.00 | 23.00 | 64.00 | 276.00 |
> | 19 | 53.00 | 173.00 | 3275.00 | 40.00 | 60.00 | 2276.00 | 13.00 | 59.00 | 94.00 |
> | -20 | 52.00 | 293.00 | 19980.00 | 52.00 | 280.00 | 14305.00 | 53.00 | 95.00 | 5713.00 |
> ----------------------------------------------------------------------------------------------------------
>
> I see a spike for Max in (0, 0) configuration and the latency decreases
> monotonically with lower latency nice value.

Your results look good.

>
> >
> >>
> >> o Hackbench and schbench in NPS1 configuration
> >>
> >> perf bench sched messaging -p -t -l 1000000 -g 16&
> >> schebcnh -m 1 -t 64 -s 30s
> >>
> >> ------------------------------------------------------------------------------------------------------------
> >> |Hackbench | schbench LN = 19 | schbench LN = 0 | schbench LN = -20 |
> >> |LN |----------------------------|--------------------------------|-----------------------------|
> >> |v | 90th | 95th | 99th | 90th | 95th | 99th | 90th | 95th | 99th |
> >> |--------------|--------|--------|----------|---------|---------|------------|---------|----------|--------|
> >> |0 | 4264 | 6744 | 15664 | 17952 | 32672 | 55488 | 15088 | 25312 | 50112 |
> >> |19 | 288 | 613 | 2332 | 274 | 1015 | 3628 | 374 | 1394 | 4424 |
> >> |-20 | 35904 | 47680 | 79744 | 87168 | 113536 | 176896 | 13008 | 21216 | 42560 | ^
> >> ------------------------------------------------------------------------------------------------------------
> >
> > For the schbench, your test is 30 seconds long which is longer than
> > the duration of perf bench sched messaging -p -t -l 1000000 -g 16&
> >
> > The duration of the latter varies depending of latency nice value so
> > schbench is disturb more time in some cases
>
> I've rerun this with hackbench running 128 groups alongside schbench
> with 2 messenger and 1 worker each. With larger worker count, I still
> see non-monotonic behavior in 99th percentile latency of schbench.
> I also see number of latency samples collected by schbench to vary
> over the 30 second run for different latency nice values which could
> also pay a part in seeing the unexpected behavior. For lower worker
> count, I see the number of samples collected is similar. Following
> is the configuration and the latency reported by schbench:
>
> perf bench sched messaging -p -t -l 150000 -g 128&
> schbench -m 2 -t 1 -s 30s
>
> Note: In all cases, hackbench runs longer than schbench.
>
> -------------------------------------------------------------------------------------------------
> | Hackbench | schbench LN = 19 | schbench LN = 0 | schbench LN = -20 |
> | LN |----------------------------|---------------------------|--------------------------|
> | | 90th | 95th | 99th | 90th | 95th | 99th | 90th | 95th | 99th |
> |-----------|--------|--------|----------|--------|--------|---------|--------|--------|--------|
> | 0 | 42 | 92 | 2972 | 26 | 49 | 2356 | 9 | 11 | 20 |
> | 19 | 35 | 424 | 4984 | 13 | 390 | 5096 | 8 | 10 | 14 | ^
> | -19 | 144 | 3516 | 110208 | 61 | 807 | 34880 | 25 | 39 | 295 |
> -------------------------------------------------------------------------------------------------
>
> I see 90th and 95th percentile latency decrease monotonically with
> latency nice value of schbench (for a fixed latency nice value of
> hackbench) but there are cases where 99th percentile latency
> reported by schbench may not strictly decrease with lower latency
> nice value (Marked with ^)
>
> Note: Only a small number of bad samples can affect the 99th
> percentile latency for the above configuration. The monotonic
> behavior in 90th and 95th percentile latency is a good data point
> to show latency nice is indeed working as expected.

Yes, I think you are right that the 99th percentile is not stable
enough because it can be impacted by a small number of bad samples.

>
> If there is any specific workload you would like me to run on the
> test system, or any additional data you would like for above
> workloads, please do let me know.

Thanks a lot for your tests.
I'm about to send v6.

>
> --
> Thanks and Regards,
> Prateek