2020-01-08 19:49:54

by Ivan Babrou

Subject: Lower than expected CPU pressure in PSI

We added reporting for PSI in cgroups and results are somewhat surprising.

My test setup consists of 3 services:

* stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
* stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1
* stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1

First service runs unconstrained, the other two compete for CPU.

As expected, I can see 500ms/s sched delay for the latter two and
aggregated 1000ms/s delay for /system.slice, no surprises here.

However, CPU pressure reported by PSI says that none of my services
have any pressure on them. I can see around 434ms/s pressure on
/unified/system.slice and 425ms/s pressure on /unified cgroup, which
is surprising for three reasons:

* Pressure is absent for my services (I expect it to match sched delay)
* Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
* Pressure on root cgroup is lower than on system.slice
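
For reference, these pressure numbers are read from each cgroup's
cpu.pressure file under the cgroup2 mount, which systemd places at
/sys/fs/cgroup/unified in hybrid mode - hence the /unified prefix
above. The file looks roughly like this (values made up for
illustration):

some avg10=43.40 avg60=43.41 avg300=42.50 total=434000000

The ms/s figures quoted here correspond to deltas of the "total"
counter, which accumulates stall time in microseconds; the avg*
fields are running averages in percent.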

I'm running Linux 5.4.8 with a hybrid cgroup hierarchy under systemd.

P.S.: ./scripts/get_maintainer.pl kernel/sched/psi.c does not say
Johannes Weiner is a maintainer


2020-01-09 19:51:53

by Johannes Weiner

Subject: Re: Lower than expected CPU pressure in PSI

On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> P.S.: ./scripts/get_maintainer.pl kernel/sched/psi.c does not say
> Johannes Weiner is a maintainer

Mostly because psi is so intertwined with the scheduler that it's been
maintained through the scheduler tree. But let's add an entry for it;
the sched/ directory matching will ensure it shows all pertinent CCs:

$ ./scripts/get_maintainer.pl kernel/sched/psi.c
Johannes Weiner <[email protected]> (maintainer:PRESSURE STALL INFORMATION (PSI))
Ingo Molnar <[email protected]> (maintainer:SCHEDULER)
Peter Zijlstra <[email protected]> (maintainer:SCHEDULER)
Juri Lelli <[email protected]> (maintainer:SCHEDULER)
Vincent Guittot <[email protected]> (maintainer:SCHEDULER)
Dietmar Eggemann <[email protected]> (reviewer:SCHEDULER)
Steven Rostedt <[email protected]> (reviewer:SCHEDULER)
Ben Segall <[email protected]> (reviewer:SCHEDULER)
Mel Gorman <[email protected]> (reviewer:SCHEDULER)
[email protected] (open list:SCHEDULER)

---
From b865e10129c97c81fb83c368456d2c24e54badab Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Thu, 9 Jan 2020 10:36:58 -0500
Subject: [PATCH] MAINTAINERS: add maintenance information for psi

Signed-off-by: Johannes Weiner <[email protected]>
---
MAINTAINERS | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index bd5847e802de..228baf0e40a6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13367,6 +13367,12 @@ F: net/psample
F: include/net/psample.h
F: include/uapi/linux/psample.h

+PRESSURE STALL INFORMATION (PSI)
+M: Johannes Weiner <[email protected]>
+S: Maintained
+F: kernel/sched/psi.c
+F: include/linux/psi*
+
PSTORE FILESYSTEM
M: Kees Cook <[email protected]>
M: Anton Vorontsov <[email protected]>
--
2.24.1

2020-01-09 20:29:07

by Johannes Weiner

Subject: Re: Lower than expected CPU pressure in PSI

On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> We added reporting for PSI in cgroups and results are somewhat surprising.
>
> My test setup consists of 3 services:
>
> * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> * stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1
> * stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1
>
> First service runs unconstrained, the other two compete for CPU.
>
> As expected, I can see 500ms/s sched delay for the latter two and
> aggregated 1000ms/s delay for /system.slice, no surprises here.
>
> However, CPU pressure reported by PSI says that none of my services
> have any pressure on them. I can see around 434ms/s pressure on
> /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> is surprising for three reasons:
>
> * Pressure is absent for my services (I expect it to match sched delay)
> * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> * Pressure on root cgroup is lower than on system.slice

CPU pressure is currently implemented based only on the number of
*runnable* tasks, not on who gets to actively use the CPU. This works
for contention within cgroups or at the global scope, but it doesn't
correctly reflect competition between cgroups. It also doesn't show
the effects of e.g. cpu cycle limiting through cpu.max where there
might *be* only one runnable task, but it's not getting the CPU.
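
To illustrate, here is a paraphrase (not a literal copy) of the
current PSI_CPU_SOME test in test_state(), reduced to a
self-contained snippet:

#include <stdbool.h>

/* mirrors enum psi_task_count from include/linux/psi_types.h (pre-patch) */
enum { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, NR_PSI_TASK_COUNTS };

/*
 * CPU SOME is only reported when a group has more than one runnable
 * task on a CPU. A service whose lone task is runnable but preempted
 * by a *sibling* cgroup keeps tasks[NR_RUNNING] == 1 in its own
 * group, so that group reports zero CPU pressure even though the
 * task isn't getting the CPU.
 */
static bool cpu_some_stalled(const unsigned int tasks[NR_PSI_TASK_COUNTS])
{
	return tasks[NR_RUNNING] > 1;
}

That's why the leaf services above show no pressure at all even
though the pinned CPU is clearly contended.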

I've been working on fixing this, but hadn't gotten around to sending
the patch upstream. Attaching it below. Would you mind testing it?

Peter, what would you think of the below?

---
From 98c233aae05b7d43e465d4c382a3a20905235296 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Thu, 14 Nov 2019 15:53:45 -0500
Subject: [PATCH] psi: fix cpu.pressure for cpu.max and competing cgroups

For simplicity, cpu pressure is defined as having more than one
runnable task on a given CPU. This works on the system-level, but it
has limitations in a cgrouped reality: When cpu.max is in use, it
doesn't capture the time in which a task is not executing on the CPU
due to throttling. Likewise, it doesn't capture the time in which a
competing cgroup is occupying the CPU - meaning it only reflects
cgroup-internal competitive pressure, not outside pressure.

Enable tracking of currently executing tasks, and then change the
definition of cpu pressure in a cgroup from

NR_RUNNING > 1

to

NR_RUNNING > ON_CPU

which will capture the effects of cpu.max as well as competition from
outside the cgroup.

After this patch, a cgroup running `stress -c 1` with a cpu.max
setting of 5000 10000 shows ~50% continuous CPU pressure.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/psi_types.h | 10 +++++++++-
kernel/sched/core.c | 2 ++
kernel/sched/psi.c | 12 +++++++-----
kernel/sched/stats.h | 28 ++++++++++++++++++++++++++++
4 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b82241..4b7258495a04 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -14,13 +14,21 @@ enum psi_task_count {
NR_IOWAIT,
NR_MEMSTALL,
NR_RUNNING,
- NR_PSI_TASK_COUNTS = 3,
+ /*
+ * This can't have values other than 0 or 1 and could be
+ * implemented as a bit flag. But for now we still have room
+ * in the first cacheline of psi_group_cpu, and this way we
+ * don't have to special case any state tracking for it.
+ */
+ NR_ONCPU,
+ NR_PSI_TASK_COUNTS = 4,
};

/* Task state bitmasks */
#define TSK_IOWAIT (1 << NR_IOWAIT)
#define TSK_MEMSTALL (1 << NR_MEMSTALL)
#define TSK_RUNNING (1 << NR_RUNNING)
+#define TSK_ONCPU (1 << NR_ONCPU)

/* Resources that workloads could be stalled on */
enum psi_res {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44123b4d14e8..9296e0de7b72 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4072,6 +4072,8 @@ static void __sched notrace __schedule(bool preempt)
*/
++*switch_count;

+ psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+
trace_sched_switch(preempt, prev, next);

/* Also unlocks the rq: */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 517e3719027e..fad91da54aab 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -224,7 +224,7 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
case PSI_MEM_FULL:
return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
case PSI_CPU_SOME:
- return tasks[NR_RUNNING] > 1;
+ return tasks[NR_RUNNING] > tasks[NR_ONCPU];
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -694,10 +694,10 @@ static u32 psi_group_change(struct psi_group *group, int cpu,
if (!(m & (1 << t)))
continue;
if (groupc->tasks[t] == 0 && !psi_bug) {
- printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n",
+ printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
cpu, t, groupc->tasks[0],
groupc->tasks[1], groupc->tasks[2],
- clear, set);
+ groupc->tasks[3], clear, set);
psi_bug = 1;
}
groupc->tasks[t]--;
@@ -915,9 +915,11 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)

rq = task_rq_lock(task, &rf);

- if (task_on_rq_queued(task))
+ if (task_on_rq_queued(task)) {
task_flags = TSK_RUNNING;
- else if (task->in_iowait)
+ if (task_current(rq, task))
+ task_flags |= TSK_ONCPU;
+ } else if (task->in_iowait)
task_flags = TSK_IOWAIT;

if (task->flags & PF_MEMSTALL)
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index ba683fe81a6e..6ff0ac1a803f 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -93,6 +93,14 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep)
if (p->flags & PF_MEMSTALL)
clear |= TSK_MEMSTALL;
} else {
+ /*
+ * When a task sleeps, schedule() dequeues it before
+ * switching to the next one. Merge the clearing of
+ * TSK_RUNNING and TSK_ONCPU to save an unnecessary
+ * psi_task_change() call in psi_sched_switch().
+ */
+ clear |= TSK_ONCPU;
+
if (p->in_iowait)
set |= TSK_IOWAIT;
}
@@ -126,6 +134,23 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
}
}

+static inline void psi_sched_switch(struct task_struct *prev,
+ struct task_struct *next,
+ bool sleep)
+{
+ if (static_branch_likely(&psi_disabled))
+ return;
+
+ /*
+ * Clear the TSK_ONCPU state if the task was preempted. If
+ * it's a voluntary sleep, dequeue will have taken care of it.
+ */
+ if (!sleep)
+ psi_task_change(prev, TSK_ONCPU, 0);
+
+ psi_task_change(next, 0, TSK_ONCPU);
+}
+
static inline void psi_task_tick(struct rq *rq)
{
if (static_branch_likely(&psi_disabled))
@@ -138,6 +163,9 @@ static inline void psi_task_tick(struct rq *rq)
static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline void psi_sched_switch(struct task_struct *prev,
+ struct task_struct *next,
+ bool sleep) {}
static inline void psi_task_tick(struct rq *rq) {}
#endif /* CONFIG_PSI */

--
2.24.1
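
A quick back-of-the-envelope for the ~50% figure in the changelog
(rough arithmetic, not exact accounting): cpu.max "5000 10000" allows
5ms of CPU time per 10ms period, so the lone stress task spends
roughly the other half of every period runnable but throttled off the
CPU. With the new NR_RUNNING > NR_ONCPU test, that throttled window
counts as SOME stall:

	(10000us - 5000us) / 10000us = 50% of wallclock time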

2020-01-10 19:30:56

by Ivan Babrou

Subject: Re: Lower than expected CPU pressure in PSI

I applied the patch on top of 5.5.0-rc3 and it's definitely better
now, both competing cgroups report 500ms/s delay. Feel free to add
Tested-by from me.

I'm still seeing /unified/system.slice at 385ms/s and /unified.slice
at 372ms/s. Do you have an explanation for that part? Maybe it's
totally reasonable, but it warrants a documentation patch.

On Thu, Jan 9, 2020 at 8:16 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> > We added reporting for PSI in cgroups and results are somewhat surprising.
> >
> > My test setup consists of 3 services:
> >
> > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> > * stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1
> > * stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1
> >
> > First service runs unconstrained, the other two compete for CPU.
> >
> > As expected, I can see 500ms/s sched delay for the latter two and
> > aggregated 1000ms/s delay for /system.slice, no surprises here.
> >
> > However, CPU pressure reported by PSI says that none of my services
> > have any pressure on them. I can see around 434ms/s pressure on
> > /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> > is surprising for three reasons:
> >
> > * Pressure is absent for my services (I expect it to match sched delay)
> > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> > * Pressure on root cgroup is lower than on system.slice
>
> CPU pressure is currently implemented based only on the number of
> *runnable* tasks, not on who gets to actively use the CPU. This works
> for contention within cgroups or at the global scope, but it doesn't
> correctly reflect competition between cgroups. It also doesn't show
> the effects of e.g. cpu cycle limiting through cpu.max where there
> might *be* only one runnable task, but it's not getting the CPU.
>
> I've been working on fixing this, but hadn't gotten around to sending
> the patch upstream. Attaching it below. Would you mind testing it?
>
> Peter, what would you think of the below?
>
> ---
> From 98c233aae05b7d43e465d4c382a3a20905235296 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Thu, 14 Nov 2019 15:53:45 -0500
> Subject: [PATCH] psi: fix cpu.pressure for cpu.max and competing cgroups
>
> For simplicity, cpu pressure is defined as having more than one
> runnable task on a given CPU. This works on the system-level, but it
> has limitations in a cgrouped reality: When cpu.max is in use, it
> doesn't capture the time in which a task is not executing on the CPU
> due to throttling. Likewise, it doesn't capture the time in which a
> competing cgroup is occupying the CPU - meaning it only reflects
> cgroup-internal competitive pressure, not outside pressure.
>
> Enable tracking of currently executing tasks, and then change the
> definition of cpu pressure in a cgroup from
>
> NR_RUNNING > 1
>
> to
>
> NR_RUNNING > ON_CPU
>
> which will capture the effects of cpu.max as well as competition from
> outside the cgroup.
>
> After this patch, a cgroup running `stress -c 1` with a cpu.max
> setting of 5000 10000 shows ~50% continuous CPU pressure.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/psi_types.h | 10 +++++++++-
> kernel/sched/core.c | 2 ++
> kernel/sched/psi.c | 12 +++++++-----
> kernel/sched/stats.h | 28 ++++++++++++++++++++++++++++
> 4 files changed, 46 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index 07aaf9b82241..4b7258495a04 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -14,13 +14,21 @@ enum psi_task_count {
> NR_IOWAIT,
> NR_MEMSTALL,
> NR_RUNNING,
> - NR_PSI_TASK_COUNTS = 3,
> + /*
> + * This can't have values other than 0 or 1 and could be
> + * implemented as a bit flag. But for now we still have room
> + * in the first cacheline of psi_group_cpu, and this way we
> + * don't have to special case any state tracking for it.
> + */
> + NR_ONCPU,
> + NR_PSI_TASK_COUNTS = 4,
> };
>
> /* Task state bitmasks */
> #define TSK_IOWAIT (1 << NR_IOWAIT)
> #define TSK_MEMSTALL (1 << NR_MEMSTALL)
> #define TSK_RUNNING (1 << NR_RUNNING)
> +#define TSK_ONCPU (1 << NR_ONCPU)
>
> /* Resources that workloads could be stalled on */
> enum psi_res {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 44123b4d14e8..9296e0de7b72 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4072,6 +4072,8 @@ static void __sched notrace __schedule(bool preempt)
> */
> ++*switch_count;
>
> + psi_sched_switch(prev, next, !task_on_rq_queued(prev));
> +
> trace_sched_switch(preempt, prev, next);
>
> /* Also unlocks the rq: */
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 517e3719027e..fad91da54aab 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -224,7 +224,7 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
> case PSI_MEM_FULL:
> return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
> case PSI_CPU_SOME:
> - return tasks[NR_RUNNING] > 1;
> + return tasks[NR_RUNNING] > tasks[NR_ONCPU];
> case PSI_NONIDLE:
> return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
> tasks[NR_RUNNING];
> @@ -694,10 +694,10 @@ static u32 psi_group_change(struct psi_group *group, int cpu,
> if (!(m & (1 << t)))
> continue;
> if (groupc->tasks[t] == 0 && !psi_bug) {
> - printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n",
> + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
> cpu, t, groupc->tasks[0],
> groupc->tasks[1], groupc->tasks[2],
> - clear, set);
> + groupc->tasks[3], clear, set);
> psi_bug = 1;
> }
> groupc->tasks[t]--;
> @@ -915,9 +915,11 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
>
> rq = task_rq_lock(task, &rf);
>
> - if (task_on_rq_queued(task))
> + if (task_on_rq_queued(task)) {
> task_flags = TSK_RUNNING;
> - else if (task->in_iowait)
> + if (task_current(rq, task))
> + task_flags |= TSK_ONCPU;
> + } else if (task->in_iowait)
> task_flags = TSK_IOWAIT;
>
> if (task->flags & PF_MEMSTALL)
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index ba683fe81a6e..6ff0ac1a803f 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -93,6 +93,14 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep)
> if (p->flags & PF_MEMSTALL)
> clear |= TSK_MEMSTALL;
> } else {
> + /*
> + * When a task sleeps, schedule() dequeues it before
> + * switching to the next one. Merge the clearing of
> + * TSK_RUNNING and TSK_ONCPU to save an unnecessary
> + * psi_task_change() call in psi_sched_switch().
> + */
> + clear |= TSK_ONCPU;
> +
> if (p->in_iowait)
> set |= TSK_IOWAIT;
> }
> @@ -126,6 +134,23 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
> }
> }
>
> +static inline void psi_sched_switch(struct task_struct *prev,
> + struct task_struct *next,
> + bool sleep)
> +{
> + if (static_branch_likely(&psi_disabled))
> + return;
> +
> + /*
> + * Clear the TSK_ONCPU state if the task was preempted. If
> + * it's a voluntary sleep, dequeue will have taken care of it.
> + */
> + if (!sleep)
> + psi_task_change(prev, TSK_ONCPU, 0);
> +
> + psi_task_change(next, 0, TSK_ONCPU);
> +}
> +
> static inline void psi_task_tick(struct rq *rq)
> {
> if (static_branch_likely(&psi_disabled))
> @@ -138,6 +163,9 @@ static inline void psi_task_tick(struct rq *rq)
> static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
> static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
> static inline void psi_ttwu_dequeue(struct task_struct *p) {}
> +static inline void psi_sched_switch(struct task_struct *prev,
> + struct task_struct *next,
> + bool sleep) {}
> static inline void psi_task_tick(struct rq *rq) {}
> #endif /* CONFIG_PSI */
>
> --
> 2.24.1

2020-01-15 16:57:54

by Johannes Weiner

Subject: Re: Lower than expected CPU pressure in PSI

On Fri, Jan 10, 2020 at 11:28:32AM -0800, Ivan Babrou wrote:
> I applied the patch on top of 5.5.0-rc3 and it's definitely better
> now, both competing cgroups report 500ms/s delay. Feel free to add
> Tested-by from me.

Thanks, Ivan!

> I'm still seeing /unified/system.slice at 385ms/s and /unified.slice
> at 372ms/s. Do you have an explanation for that part? Maybe it's
> totally reasonable, but it warrants a documentation patch.

Yes, this is a combination of CPU pinning and how pressure is
calculated in SMP systems.

The stall times are defined as lost compute potential - which scales
with the number of concurrent threads - normalized to wallclock
time. See the "Multiple CPUs" section in kernel/sched/psi.c.

By restricting the CPUs in system.slice, there is less compute
available in that group than in the parent, which means that the
relative loss of potential can be higher.

It's a bit unintuitive because most cgroup metrics are plain numbers
that add up to bigger numbers as you go up the tree. If we exported
both the numerator (waste) and denominator (compute potential) here,
the numbers would act more conventionally, with parent numbers always
bigger than the child's. But because pressure is normalized to
wallclock time, you only see the ratio at each level, and that can
shrink as you go up the tree if lower levels are CPU-constrained.

We could have exported both numbers, but for most usecases that would
be more confusing than helpful. And in practice it's the ratio that
really matters: the pressure in the leaf cgroups is high due to the
CPU restriction; but when you go higher up the tree and look at not
just the pinned tasks, but also include tasks in other groups that
have more CPUs available to them, the aggregate productivity at that
level *is* actually higher.
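
To put rough numbers on it (a back-of-the-envelope using the
aggregation sketched in that comment; the figures for "other"
activity below are invented for illustration, not measured): per
CPU, psi tracks how long a group had a delayed task (tSOME) and how
long it was non-idle at all (tNONIDLE), then aggregates

	tSOME = sum(tSOME[i] * tNONIDLE[i]) / sum(tNONIDLE[i])

For one of the competing leaf services only CPU 2 matters: the group
is non-idle there for the whole second and delayed for about half of
it, giving 0.5 * 1.0 / 1.0 = ~500ms/s. For system.slice, CPU 2
contributes a full second of SOME time (one of its two tasks is
always waiting), CPU 1 contributes a second of non-idle time with no
stall, and other services elsewhere might add, say, another 0.6s of
non-idle time:

	(1.0*1.0 + 0*1.0 + 0*0.6) / (1.0 + 1.0 + 0.6) = ~385ms/s

At the root, the same stall time is divided by an even larger
non-idle sum (user session, kworkers, etc.), so the ratio shrinks
further.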

I hope that helps!

2020-01-17 00:14:34

by Ivan Babrou

Subject: Re: Lower than expected CPU pressure in PSI

This definitely helps! It would be nice to add this as a section here:

* https://www.kernel.org/doc/html/latest/accounting/psi.html


On Wed, Jan 15, 2020 at 8:55 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Jan 10, 2020 at 11:28:32AM -0800, Ivan Babrou wrote:
> > I applied the patch on top of 5.5.0-rc3 and it's definitely better
> > now, both competing cgroups report 500ms/s delay. Feel free to add
> > Tested-by from me.
>
> Thanks, Ivan!
>
> > I'm still seeing /unified/system.slice at 385ms/s and /unified.slice
> > at 372ms/s. Do you have an explanation for that part? Maybe it's
> > totally reasonable, but it warrants a documentation patch.
>
> Yes, this is a combination of CPU pinning and how pressure is
> calculated in SMP systems.
>
> The stall times are defined as lost compute potential - which scales
> with the number of concurrent threads - normalized to wallclock
> time. See the "Multiple CPUs" section in kernel/sched/psi.c.
>
> By restricting the CPUs in system.slice, there is less compute
> available in that group than in the parent, which means that the
> relative loss of potential can be higher.
>
> It's a bit unintuitive because most cgroup metrics are plain numbers
> that add up to bigger numbers as you go up the tree. If we exported
> both the numerator (waste) and denominator (compute potential) here,
> the numbers would act more conventionally, with parent numbers always
> bigger than the child's. But because pressure is normalized to
> wallclock time, you only see the ratio at each level, and that can
> shrink as you go up the tree if lower levels are CPU-constrained.
>
> We could have exported both numbers, but for most usecases that would
> be more confusing than helpful. And in practice it's the ratio that
> really matters: the pressure in the leaf cgroups is high due to the
> CPU restriction; but when you go higher up the tree and look at not
> just the pinned tasks, but also include tasks in other groups that
> have more CPUs available to them, the aggregate productivity at that
> level *is* actually higher.
>
> I hope that helps!

2020-02-07 13:10:12

by Peter Zijlstra

Subject: Re: Lower than expected CPU pressure in PSI

On Thu, Jan 09, 2020 at 11:16:32AM -0500, Johannes Weiner wrote:
> On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> > We added reporting for PSI in cgroups and results are somewhat surprising.
> >
> > My test setup consists of 3 services:
> >
> > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> > * stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1
> > * stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1
> >
> > First service runs unconstrained, the other two compete for CPU.
> >
> > As expected, I can see 500ms/s sched delay for the latter two and
> > aggregated 1000ms/s delay for /system.slice, no surprises here.
> >
> > However, CPU pressure reported by PSI says that none of my services
> > have any pressure on them. I can see around 434ms/s pressure on
> > /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> > is surprising for three reasons:
> >
> > * Pressure is absent for my services (I expect it to match sched delay)
> > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> > * Pressure on root cgroup is lower than on system.slice
>
> CPU pressure is currently implemented based only on the number of
> *runnable* tasks, not on who gets to actively use the CPU. This works
> for contention within cgroups or at the global scope, but it doesn't
> correctly reflect competition between cgroups. It also doesn't show
> the effects of e.g. cpu cycle limiting through cpu.max where there
> might *be* only one runnable task, but it's not getting the CPU.
>
> I've been working on fixing this, but hadn't gotten around to sending
> the patch upstream. Attaching it below. Would you mind testing it?
>
> Peter, what would you think of the below?

I'm not loving it; but I see what it does and I can't quickly see an
alternative.

My main gripe is doing even more of those cgroup traversals.

One thing pick_next_task_fair() does is try and limit the cgroup
traversal to the sub-tree that contains both prev and next. Not sure
that is immediately applicable here, but it might be worth looking into.

2020-02-08 10:22:09

by Peter Zijlstra

Subject: Re: Lower than expected CPU pressure in PSI

On Fri, Feb 07, 2020 at 02:08:29PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 09, 2020 at 11:16:32AM -0500, Johannes Weiner wrote:
> > On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> > > We added reporting for PSI in cgroups and results are somewhat surprising.
> > >
> > > My test setup consists of 3 services:
> > >
> > > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> > > * stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1
> > > * stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1
> > >
> > > First service runs unconstrained, the other two compete for CPU.
> > >
> > > As expected, I can see 500ms/s sched delay for the latter two and
> > > aggregated 1000ms/s delay for /system.slice, no surprises here.
> > >
> > > However, CPU pressure reported by PSI says that none of my services
> > > have any pressure on them. I can see around 434ms/s pressure on
> > > /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> > > is surprising for three reasons:
> > >
> > > * Pressure is absent for my services (I expect it to match sched delay)
> > > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> > > * Pressure on root cgroup is lower than on system.slice
> >
> > CPU pressure is currently implemented based only on the number of
> > *runnable* tasks, not on who gets to actively use the CPU. This works
> > for contention within cgroups or at the global scope, but it doesn't
> > correctly reflect competition between cgroups. It also doesn't show
> > the effects of e.g. cpu cycle limiting through cpu.max where there
> > might *be* only one runnable task, but it's not getting the CPU.
> >
> > I've been working on fixing this, but hadn't gotten around to sending
> > the patch upstream. Attaching it below. Would you mind testing it?
> >
> > Peter, what would you think of the below?
>
> I'm not loving it; but I see what it does and I can't quickly see an
> alternative.
>
> My main gripe is doing even more of those cgroup traversals.
>
> One thing pick_next_task_fair() does is try and limit the cgroup
> traversal to the sub-tree that contains both prev and next. Not sure
> that is immediately applicable here, but it might be worth looking into.

One option, I suppose, would be to replace this:

+static inline void psi_sched_switch(struct task_struct *prev,
+ struct task_struct *next,
+ bool sleep)
+{
+ if (static_branch_likely(&psi_disabled))
+ return;
+
+ /*
+ * Clear the TSK_ONCPU state if the task was preempted. If
+ * it's a voluntary sleep, dequeue will have taken care of it.
+ */
+ if (!sleep)
+ psi_task_change(prev, TSK_ONCPU, 0);
+
+ psi_task_change(next, 0, TSK_ONCPU);
+}

With something like:

static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep)
{
struct psi_group *g, *p = NULL;

set = TSK_ONCPU;
clear = 0;

while ((g = iterate_group(next, &g))) {
u32 nr_running = per_cpu_ptr(g->pcpu, cpu)->tasks[NR_RUNNING];
if (nr_running) {
/* if set, we hit the subtree @prev lives in, terminate */
p = g;
break;
}

/* the rest of psi_task_change */
}

if (sleep)
return;

set = 0;
clear = TSK_ONCPU;

while ((g = iterate_group(prev, &g))) {
if (g == p)
break;

/* the rest of psi_task_change */
}
}

That way we avoid clearing and setting the common parents.

2020-02-10 18:06:22

by Johannes Weiner

Subject: Re: Lower than expected CPU pressure in PSI

On Sat, Feb 08, 2020 at 11:19:57AM +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2020 at 02:08:29PM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 09, 2020 at 11:16:32AM -0500, Johannes Weiner wrote:
> > > On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> > > > We added reporting for PSI in cgroups and results are somewhat surprising.
> > > >
> > > > My test setup consists of 3 services:
> > > >
> > > > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> > > > * stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1
> > > > * stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1
> > > >
> > > > First service runs unconstrained, the other two compete for CPU.
> > > >
> > > > As expected, I can see 500ms/s sched delay for the latter two and
> > > > aggregated 1000ms/s delay for /system.slice, no surprises here.
> > > >
> > > > However, CPU pressure reported by PSI says that none of my services
> > > > have any pressure on them. I can see around 434ms/s pressure on
> > > > /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> > > > is surprising for three reasons:
> > > >
> > > > * Pressure is absent for my services (I expect it to match sched delay)
> > > > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> > > > * Pressure on root cgroup is lower than on system.slice
> > >
> > > CPU pressure is currently implemented based only on the number of
> > > *runnable* tasks, not on who gets to actively use the CPU. This works
> > > for contention within cgroups or at the global scope, but it doesn't
> > > correctly reflect competition between cgroups. It also doesn't show
> > > the effects of e.g. cpu cycle limiting through cpu.max where there
> > > might *be* only one runnable task, but it's not getting the CPU.
> > >
> > > I've been working on fixing this, but hadn't gotten around to sending
> > > the patch upstream. Attaching it below. Would you mind testing it?
> > >
> > > Peter, what would you think of the below?
> >
> > I'm not loving it; but I see what it does and I can't quickly see an
> > alternative.
> >
> > My main gripe is doing even more of those cgroup traversals.
> >
> > One thing pick_next_task_fair() does is try and limit the cgroup
> > traversal to the sub-tree that contains both prev and next. Not sure
> > that is immediately applicable here, but it might be worth looking into.
>
> One option, I suppose, would be to replace this:

Thanks for looking closer at this, this is a cool idea.

> +static inline void psi_sched_switch(struct task_struct *prev,
> + struct task_struct *next,
> + bool sleep)
> +{
> + if (static_branch_likely(&psi_disabled))
> + return;
> +
> + /*
> + * Clear the TSK_ONCPU state if the task was preempted. If
> + * it's a voluntary sleep, dequeue will have taken care of it.
> + */
> + if (!sleep)
> + psi_task_change(prev, TSK_ONCPU, 0);
> +
> + psi_task_change(next, 0, TSK_ONCPU);
> +}
>
> With something like:
>
> static inline void psi_sched_switch(struct task_struct *prev,
> struct task_struct *next,
> bool sleep)
> {
> struct psi_group *g, *p = NULL;
>
> set = TSK_ONCPU;
> clear = 0;
>
> while ((g = iterate_group(next, &g))) {
> u32 nr_running = per_cpu_ptr(g->pcpu, cpu)->tasks[NR_RUNNING];

[ I'm assuming you meant NR_ONCPU instead of NR_RUNNING since the
incoming task will already be runnable and all its groups will
always have NR_RUNNING elevated.

Would switching this to NR_RUNNABLE / TSK_RUNNABLE be better? ]

Anyway, I implemented this and it seems to be working quite well. When
cgroup siblings contend over a CPU, i.e. context switching doesn't
change the group state, no group updates are performed at all:

# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c2.scope
# stress -c 64
stress: info: [216] dispatching hogs: 64 cpu, 0 io, 0 vm, 0 hdd

stress-238 [001] d..2 50.077379: psi_task_switch: stress->[stress] 0/4 cgroups updated
stress-238 [001] d..2 50.077380: psi_task_switch: [stress]->stress 0/4 cgroups updated
stress-265 [003] d..2 50.078379: psi_task_switch: stress->[stress] 0/4 cgroups updated
stress-245 [000] d..2 50.078379: psi_task_switch: stress->[stress] 0/4 cgroups updated
stress-265 [003] d..2 50.078380: psi_task_switch: [stress]->stress 0/4 cgroups updated
stress-245 [000] d..2 50.078380: psi_task_switch: [stress]->stress 0/4 cgroups updated

But even when there is otherwise no overlap in the user-created
hierarchy, at least the root group updates are avoided:

stress-261 [003] d..2 50.075265: psi_task_switch: stress->[kworker/u8:1] 0/1 cgroups updated
stress-261 [003] d..2 50.075266: psi_task_switch: [stress]->kworker/u8:1 3/4 cgroups updated

---
From 1cfa3601865b33f98415993a5774c509c6585900 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Mon, 10 Feb 2020 11:14:40 -0500
Subject: [PATCH] psi: optimize switching tasks inside shared cgroups

Unlike other task state changes, when merely switching tasks running
on a CPU, the aggregate state of the group that contains both tasks
does not change. We can exploit that. Update the group state only up
to the first shared ancestor.

We can identify the first shared ancestor by walking the groups of the
incoming task until we see TSK_ONCPU set on the local CPU; that's the
first group that also contains the outgoing task.

The new psi_task_switch() is similar to psi_task_change(). To allow
code reuse, move the task flag maintenance code into a new function
and the poll/avg worker wakeups into the shared psi_group_change().

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/psi.h | 2 +
kernel/sched/psi.c | 87 ++++++++++++++++++++++++++++++++++----------
kernel/sched/stats.h | 9 +----
3 files changed, 70 insertions(+), 28 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b3de7321219..7361023f3fdd 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -17,6 +17,8 @@ extern struct psi_group psi_system;
void psi_init(void);

void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+ bool sleep);

void psi_memstall_tick(struct task_struct *task, int cpu);
void psi_memstall_enter(unsigned long *flags);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 245aec187e4f..a053304b932b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -667,13 +667,14 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
groupc->times[PSI_NONIDLE] += delta;
}

-static u32 psi_group_change(struct psi_group *group, int cpu,
- unsigned int clear, unsigned int set)
+static void psi_group_change(struct psi_group *group, int cpu,
+ unsigned int clear, unsigned int set,
+ bool wake_clock)
{
struct psi_group_cpu *groupc;
+ u32 state_mask = 0;
unsigned int t, m;
enum psi_states s;
- u32 state_mask = 0;

groupc = per_cpu_ptr(group->pcpu, cpu);

@@ -715,7 +716,11 @@ static u32 psi_group_change(struct psi_group *group, int cpu,

write_seqcount_end(&groupc->seq);

- return state_mask;
+ if (state_mask & group->poll_states)
+ psi_schedule_poll_work(group, 1);
+
+ if (wake_clock && !delayed_work_pending(&group->avgs_work))
+ schedule_delayed_work(&group->avgs_work, PSI_FREQ);
}

static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
@@ -742,27 +747,32 @@ static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
return &psi_system;
}

-void psi_task_change(struct task_struct *task, int clear, int set)
+static void psi_flags_change(struct task_struct *task, int clear, int set)
{
- int cpu = task_cpu(task);
- struct psi_group *group;
- bool wake_clock = true;
- void *iter = NULL;
-
- if (!task->pid)
- return;
-
if (((task->psi_flags & set) ||
(task->psi_flags & clear) != clear) &&
!psi_bug) {
printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
- task->pid, task->comm, cpu,
+ task->pid, task->comm, task_cpu(task),
task->psi_flags, clear, set);
psi_bug = 1;
}

task->psi_flags &= ~clear;
task->psi_flags |= set;
+}
+
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
+ int cpu = task_cpu(task);
+ struct psi_group *group;
+ bool wake_clock = true;
+ void *iter = NULL;
+
+ if (!task->pid)
+ return;
+
+ psi_flags_change(task, clear, set);

/*
* Periodic aggregation shuts off if there is a period of no
@@ -775,14 +785,51 @@ void psi_task_change(struct task_struct *task, int clear, int set)
wq_worker_last_func(task) == psi_avgs_work))
wake_clock = false;

- while ((group = iterate_groups(task, &iter))) {
- u32 state_mask = psi_group_change(group, cpu, clear, set);
+ while ((group = iterate_groups(task, &iter)))
+ psi_group_change(group, cpu, clear, set, wake_clock);
+}
+
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+ bool sleep)
+{
+ struct psi_group *group, *common = NULL;
+ int cpu = task_cpu(prev);
+ void *iter;
+
+ if (next->pid) {
+ psi_flags_change(next, 0, TSK_ONCPU);
+ /*
+ * When moving state between tasks, the group that
+ * contains them both does not change: we can stop
+ * updating the tree once we reach the first common
+ * ancestor. Iterate @next's ancestors until we
+ * encounter @prev's state.
+ */
+ iter = NULL;
+ while ((group = iterate_groups(next, &iter))) {
+ if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ common = group;
+ break;
+ }
+
+ psi_group_change(group, cpu, 0, TSK_ONCPU, true);
+ }
+ }
+
+ /*
+ * If this is a voluntary sleep, dequeue will have taken care
+ * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
+ * only need to deal with it during preemption.
+ */
+ if (sleep)
+ return;

- if (state_mask & group->poll_states)
- psi_schedule_poll_work(group, 1);
+ if (prev->pid) {
+ psi_flags_change(prev, TSK_ONCPU, 0);

- if (wake_clock && !delayed_work_pending(&group->avgs_work))
- schedule_delayed_work(&group->avgs_work, PSI_FREQ);
+ iter = NULL;
+ while ((group = iterate_groups(prev, &iter)) && group != common)
+ psi_group_change(group, cpu, TSK_ONCPU, 0, true);
}
}

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 6ff0ac1a803f..1339f5bfe513 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -141,14 +141,7 @@ static inline void psi_sched_switch(struct task_struct *prev,
if (static_branch_likely(&psi_disabled))
return;

- /*
- * Clear the TSK_ONCPU state if the task was preempted. If
- * it's a voluntary sleep, dequeue will have taken care of it.
- */
- if (!sleep)
- psi_task_change(prev, TSK_ONCPU, 0);
-
- psi_task_change(next, 0, TSK_ONCPU);
+ psi_task_switch(prev, next, sleep);
}

static inline void psi_task_tick(struct rq *rq)
--
2.24.1