2021-03-04 08:17:06

by Chengming Zhou

Subject: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization

This patch series is a RESEND of the previous patches on the psi subsystem. A few
weeks have passed since the last review, so I have put them together and am
resending them for more convenient review and merging.

Patch 1 adds the PSI_CPU_FULL state, which means all non-idle tasks in a cgroup
are delayed on the CPU resource, either because it is used by others outside of
the cgroup or because the cgroup is throttled by its cpu.max configuration.

Patch 2 uses the ONCPU state and the current task's in_memstall flag to detect
reclaim, removing the hook in the timer tick to make the code more concise and
maintainable. Patch 3 adds unlikely() annotations to move the pressure-state
branches out of line, eliminating undesirable jumps during wakeup and sleep.

Patch 4 optimizes the voluntary sleep switch by removing one call of
psi_group_change() for every common cgroup ancestor of the two tasks.
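
As an illustration (not part of the patches), the CPU conditions added in patch 1
boil down to comparing per-cpu task counts. The sketch below is a standalone
model of the SOME/FULL tests, with plain counters standing in for the kernel's
groupc->tasks[] array:

/* Standalone sketch of the CPU pressure conditions from patch 1; plain
 * counters model groupc->tasks[NR_RUNNING] and groupc->tasks[NR_ONCPU]. */
#include <stdbool.h>
#include <stdio.h>

static bool cpu_some(unsigned int nr_running, unsigned int nr_oncpu)
{
	/* at least one runnable task is waiting for the CPU */
	return nr_running > nr_oncpu;
}

static bool cpu_full(unsigned int nr_running, unsigned int nr_oncpu)
{
	/* tasks are runnable, but none of them owns the CPU: everyone in
	 * the cgroup is delayed by outside competition or by cpu.max */
	return nr_running && !nr_oncpu;
}

int main(void)
{
	/* two runnable tasks, one of them on the CPU: SOME but not FULL */
	printf("SOME=%d FULL=%d\n", cpu_some(2, 1), cpu_full(2, 1));
	/* one runnable task, none on the CPU: both SOME and FULL */
	printf("SOME=%d FULL=%d\n", cpu_some(1, 0), cpu_full(1, 0));
	return 0;
}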

Chengming Zhou (3):
psi: Add PSI_CPU_FULL state
psi: Use ONCPU state tracking machinery to detect reclaim
psi: Optimize task switch inside shared cgroups

Johannes Weiner (1):
psi: pressure states are unlikely

include/linux/psi.h | 1 -
include/linux/psi_types.h | 3 +-
kernel/sched/core.c | 1 -
kernel/sched/psi.c | 122 ++++++++++++++++++++++++----------------------
kernel/sched/stats.h | 37 +++++---------
5 files changed, 78 insertions(+), 86 deletions(-)

--
2.11.0


2021-03-04 08:17:06

by Chengming Zhou

Subject: [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups

Commit 36b238d57172 ("psi: Optimize switching tasks inside shared cgroups")
only updates cgroups whose state actually changes during a task switch in the
preemption case, not in the sleep case.

In the sleep case we don't actually need to clear and set the TSK_ONCPU state
for the cgroups common to the next and prev task; skipping that saves many
psi_group_change() calls, especially when most activity comes from one leaf
cgroup.

sleep before:
psi_dequeue()
while ((group = iterate_groups(prev))) # all ancestors
psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
while ((group = iterate_groups(next))) # all ancestors
psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
nop
psi_task_switch()
while ((group = iterate_groups(next))) # until (prev & next)
psi_group_change(next, .set=TSK_ONCPU)
while ((group = iterate_groups(prev))) # all ancestors
psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.
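
As an illustration only (not part of the patch), the toy model below walks a
made-up hierarchy root -> A -> {A/B, A/C}, with prev in A/B and next in A/C; the
per-group counter stands in for calls to psi_group_change(). With the combined
walk, every group, including the common ancestors A and root, is updated exactly
once, whereas the old scheme walked A and root twice (once in psi_dequeue() and
once in psi_task_switch()):

/* Toy model of the ancestor walks on a voluntary sleep; not kernel code.
 * Groups live in a static array and "parent" indices stand in for the
 * cgroup hierarchy. */
#include <stdio.h>

struct group {
	const char *name;
	int parent;		/* index of the parent group, -1 for root */
	int changes;		/* stand-in for psi_group_change() calls */
};

static struct group groups[] = {
	{ "root", -1, 0 },
	{ "A",     0, 0 },	/* common ancestor of prev and next */
	{ "A/B",   1, 0 },	/* prev lives here */
	{ "A/C",   1, 0 },	/* next lives here */
};

int main(void)
{
	int prev = 2, next = 3, common = -1, g;

	/* set TSK_ONCPU on @next's ancestors, but stop at the first group
	 * that is shared with @prev (it already has an ONCPU task) */
	for (g = next; g != -1; g = groups[g].parent) {
		if (g == 1 || g == 0) {		/* shared with prev */
			common = g;
			break;
		}
		groups[g].changes++;
	}

	/* clear TSK_ONCPU|TSK_RUNNING on @prev's ancestors below the
	 * common ancestor ... */
	for (g = prev; g != -1 && g != common; g = groups[g].parent)
		groups[g].changes++;
	/* ... and only TSK_RUNNING for the rest of the hierarchy */
	for (; g != -1; g = groups[g].parent)
		groups[g].changes++;

	for (g = 0; g < 4; g++)
		printf("%-4s updated %d time(s)\n",
		       groups[g].name, groups[g].changes);
	return 0;
}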

Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
---
Updates since v1:
- Many improvements in the comments and code from Johannes Weiner.

kernel/sched/psi.c | 35 +++++++++++++++++++++++++----------
kernel/sched/stats.h | 28 ++++++++++++----------------
2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 3907a6b847aa..ee3c5b48622f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -840,20 +840,35 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}

- /*
- * If this is a voluntary sleep, dequeue will have taken care
- * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
- * only need to deal with it during preemption.
- */
- if (sleep)
- return;
-
if (prev->pid) {
- psi_flags_change(prev, TSK_ONCPU, 0);
+ int clear = TSK_ONCPU, set = 0;
+
+ /*
+ * When we're going to sleep, psi_dequeue() lets us handle
+ * TSK_RUNNING and TSK_IOWAIT here, where we can combine it
+ * with TSK_ONCPU and save walking common ancestors twice.
+ */
+ if (sleep) {
+ clear |= TSK_RUNNING;
+ if (prev->in_iowait)
+ set |= TSK_IOWAIT;
+ }
+
+ psi_flags_change(prev, clear, set);

iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
- psi_group_change(group, cpu, TSK_ONCPU, 0, true);
+ psi_group_change(group, cpu, clear, set, true);
+
+ /*
+ * TSK_ONCPU is handled up to the common ancestor. If we're tasked
+ * with dequeuing too, finish that for the rest of the hierarchy.
+ */
+ if (sleep) {
+ clear &= ~TSK_ONCPU;
+ for (; group; group = iterate_groups(prev, &iter))
+ psi_group_change(group, cpu, clear, set, true);
+ }
}
}

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 9e4e67a94731..dc218e9f4558 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -84,28 +84,24 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)

static inline void psi_dequeue(struct task_struct *p, bool sleep)
{
- int clear = TSK_RUNNING, set = 0;
+ int clear = TSK_RUNNING;

if (static_branch_likely(&psi_disabled))
return;

- if (!sleep) {
- if (p->in_memstall)
- clear |= TSK_MEMSTALL;
- } else {
- /*
- * When a task sleeps, schedule() dequeues it before
- * switching to the next one. Merge the clearing of
- * TSK_RUNNING and TSK_ONCPU to save an unnecessary
- * psi_task_change() call in psi_sched_switch().
- */
- clear |= TSK_ONCPU;
+ /*
+ * A voluntary sleep is a dequeue followed by a task switch. To
+ * avoid walking all ancestors twice, psi_task_switch() handles
+ * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
+ * Do nothing here.
+ */
+ if (sleep)
+ return;

- if (p->in_iowait)
- set |= TSK_IOWAIT;
- }
+ if (p->in_memstall)
+ clear |= TSK_MEMSTALL;

- psi_task_change(p, clear, set);
+ psi_task_change(p, clear, 0);
}

static inline void psi_ttwu_dequeue(struct task_struct *p)
--
2.11.0

2021-03-04 08:17:07

by Chengming Zhou

Subject: [PATCH v2 3/4] psi: pressure states are unlikely

From: Johannes Weiner <[email protected]>

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner <[email protected]>
---
kernel/sched/psi.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0fe6ff6a6a15..3907a6b847aa 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -219,17 +219,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
{
switch (state) {
case PSI_IO_SOME:
- return tasks[NR_IOWAIT];
+ return unlikely(tasks[NR_IOWAIT]);
case PSI_IO_FULL:
- return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+ return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
case PSI_MEM_SOME:
- return tasks[NR_MEMSTALL];
+ return unlikely(tasks[NR_MEMSTALL]);
case PSI_MEM_FULL:
- return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+ return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
case PSI_CPU_SOME:
- return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+ return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
case PSI_CPU_FULL:
- return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
+ return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -729,7 +729,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
* task in a cgroup is in_memstall, the corresponding groupc
* on that cpu is in PSI_MEM_FULL state.
*/
- if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+ if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);

groupc->state_mask = state_mask;
--
2.11.0

2021-03-04 08:17:07

by Chengming Zhou

Subject: [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim

Move the reclaim detection from the timer tick to the task state tracking
machinery, using the recently added ONCPU state. We also add a check for task
psi_flags changes to the psi_task_switch() optimization so that the parents
are updated properly.

In terms of performance and cost, this ONCPU task state tracking is not
cheaper than the previous timer tick in aggregate. But the code is simpler and
shorter this way, so it's a maintainability win. Johannes did some testing
with perf bench; the performance and cost changes should be acceptable for
real workloads.
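
As an illustration (not part of the patch), the sketch below is a standalone
model of what the reclaim detection now keys off: memory pressure on a CPU is
FULL either when all of that CPU's tasks are stalled, or when the task
currently running on it is itself in reclaim, even if other tasks are
runnable. The struct and field names are made up for the example:

/* Toy model of the reclaim detection; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct cpu_state {
	unsigned int nr_memstall;	/* tasks stalled on memory */
	unsigned int nr_running;	/* runnable tasks */
	unsigned int nr_oncpu;		/* 0 or 1 */
	bool curr_in_memstall;		/* the running task is reclaiming */
};

static bool mem_full(const struct cpu_state *c)
{
	if (c->nr_memstall && !c->nr_running)
		return true;	/* nothing productive is runnable */
	if (c->nr_oncpu && c->curr_in_memstall)
		return true;	/* the CPU itself is busy reclaiming */
	return false;
}

int main(void)
{
	/* other tasks are runnable, but the CPU is spent on reclaim */
	struct cpu_state c = { .nr_memstall = 1, .nr_running = 2,
			       .nr_oncpu = 1, .curr_in_memstall = true };

	printf("MEM FULL: %d\n", mem_full(&c));		/* prints 1 */
	return 0;
}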

Thanks to Johannes Weiner for pointing out the psi_task_switch() optimization
and for the clearer changelog.

Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
---
Updates since v1:
- Fold changes from Johannes that compare task psi_flags instead of
in_memstall in the psi_task_switch() optimization and move it out
of the loop
- Include some comments about the performance and cost from Johannes
too, and the detailed bench results can be seen from here:
https://lore.kernel.org/patchwork/patch/1378653/#1577607

include/linux/psi.h | 1 -
kernel/sched/core.c | 1 -
kernel/sched/psi.c | 65 +++++++++++++++++++---------------------------------
kernel/sched/stats.h | 9 --------
4 files changed, 24 insertions(+), 52 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7361023f3fdd..65eb1476ac70 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -20,7 +20,6 @@ void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);

-void psi_memstall_tick(struct task_struct *task, int cpu);
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb629595f..860b006a72bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4544,7 +4544,6 @@ void scheduler_tick(void)
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
calc_global_load_tick(rq);
- psi_task_tick(rq);

rq_unlock(rq, &rf);

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2293c45d289d..0fe6ff6a6a15 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -644,8 +644,7 @@ static void poll_timer_fn(struct timer_list *t)
wake_up_interruptible(&group->poll_wait);
}

-static void record_times(struct psi_group_cpu *groupc, int cpu,
- bool memstall_tick)
+static void record_times(struct psi_group_cpu *groupc, int cpu)
{
u32 delta;
u64 now;
@@ -664,23 +663,6 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
groupc->times[PSI_MEM_SOME] += delta;
if (groupc->state_mask & (1 << PSI_MEM_FULL))
groupc->times[PSI_MEM_FULL] += delta;
- else if (memstall_tick) {
- u32 sample;
- /*
- * Since we care about lost potential, a
- * memstall is FULL when there are no other
- * working tasks, but also when the CPU is
- * actively reclaiming and nothing productive
- * could run even if it were runnable.
- *
- * When the timer tick sees a reclaiming CPU,
- * regardless of runnable tasks, sample a FULL
- * tick (or less if it hasn't been a full tick
- * since the last state change).
- */
- sample = min(delta, (u32)jiffies_to_nsecs(1));
- groupc->times[PSI_MEM_FULL] += sample;
- }
}

if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
@@ -714,7 +696,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
*/
write_seqcount_begin(&groupc->seq);

- record_times(groupc, cpu, false);
+ record_times(groupc, cpu);

for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
@@ -738,6 +720,18 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (test_state(groupc->tasks, s))
state_mask |= (1 << s);
}
+
+ /*
+ * Since we care about lost potential, a memstall is FULL
+ * when there are no other working tasks, but also when
+ * the CPU is actively reclaiming and nothing productive
+ * could run even if it were runnable. So when the current
+ * task in a cgroup is in_memstall, the corresponding groupc
+ * on that cpu is in PSI_MEM_FULL state.
+ */
+ if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+ state_mask |= (1 << PSI_MEM_FULL);
+
groupc->state_mask = state_mask;

write_seqcount_end(&groupc->seq);
@@ -823,17 +817,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
void *iter;

if (next->pid) {
+ bool identical_state;
+
psi_flags_change(next, 0, TSK_ONCPU);
/*
- * When moving state between tasks, the group that
- * contains them both does not change: we can stop
- * updating the tree once we reach the first common
- * ancestor. Iterate @next's ancestors until we
- * encounter @prev's state.
+ * When switching between tasks that have an identical
+ * runtime state, the cgroup that contains both tasks
+ * does not change: we can stop updating the tree once
+ * we reach the first common ancestor. Iterate @next's
+ * ancestors only until we encounter @prev's ONCPU.
*/
+ identical_state = prev->psi_flags == next->psi_flags;
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
- if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ if (identical_state &&
+ per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
common = group;
break;
}
@@ -859,21 +857,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}

-void psi_memstall_tick(struct task_struct *task, int cpu)
-{
- struct psi_group *group;
- void *iter = NULL;
-
- while ((group = iterate_groups(task, &iter))) {
- struct psi_group_cpu *groupc;
-
- groupc = per_cpu_ptr(group->pcpu, cpu);
- write_seqcount_begin(&groupc->seq);
- record_times(groupc, cpu, true);
- write_seqcount_end(&groupc->seq);
- }
-}
-
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 33d0daf83842..9e4e67a94731 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -144,14 +144,6 @@ static inline void psi_sched_switch(struct task_struct *prev,
psi_task_switch(prev, next, sleep);
}

-static inline void psi_task_tick(struct rq *rq)
-{
- if (static_branch_likely(&psi_disabled))
- return;
-
- if (unlikely(rq->curr->in_memstall))
- psi_memstall_tick(rq->curr, cpu_of(rq));
-}
#else /* CONFIG_PSI */
static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -159,7 +151,6 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
-static inline void psi_task_tick(struct rq *rq) {}
#endif /* CONFIG_PSI */

#ifdef CONFIG_SCHED_INFO
--
2.11.0

2021-03-04 11:16:24

by Johannes Weiner

Subject: Re: [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups

On Wed, Mar 03, 2021 at 11:46:59AM +0800, Chengming Zhou wrote:
> The commit 36b238d57172 ("psi: Optimize switching tasks inside shared
> cgroups") only update cgroups whose state actually changes during a
> task switch only in task preempt case, not in task sleep case.
>
> We actually don't need to clear and set TSK_ONCPU state for common cgroups
> of next and prev task in sleep case, that can save many psi_group_change
> especially when most activity comes from one leaf cgroup.
>
> sleep before:
> psi_dequeue()
> while ((group = iterate_groups(prev))) # all ancestors
> psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
> psi_task_switch()
> while ((group = iterate_groups(next))) # all ancestors
> psi_group_change(next, .set=TSK_ONCPU)
>
> sleep after:
> psi_dequeue()
> nop
> psi_task_switch()
> while ((group = iterate_groups(next))) # until (prev & next)
> psi_group_change(next, .set=TSK_ONCPU)
> while ((group = iterate_groups(prev))) # all ancestors
> psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
>
> When a voluntary sleep switches to another task, we remove one call of
> psi_group_change() for every common cgroup ancestor of the two tasks.
>
> Signed-off-by: Muchun Song <[email protected]>
> Signed-off-by: Chengming Zhou <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2021-03-04 11:16:51

by Johannes Weiner

Subject: Re: [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim

On Wed, Mar 03, 2021 at 11:46:57AM +0800, Chengming Zhou wrote:
> Move the reclaim detection from the timer tick to the task state
> tracking machinery using the recently added ONCPU state. And we
> also add task psi_flags changes checking in the psi_task_switch()
> optimization to update the parents properly.
>
> In terms of performance and cost, this ONCPU task state tracking
> is not cheaper than previous timer tick in aggregate. But the code is
> simpler and shorter this way, so it's a maintainability win. And
> Johannes did some testing with perf bench, the performace and cost
> changes would be acceptable for real workloads.
>
> Thanks to Johannes Weiner for pointing out the psi_task_switch()
> optimization things and the clearer changelog.
>
> Signed-off-by: Muchun Song <[email protected]>
> Signed-off-by: Chengming Zhou <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2021-03-04 11:42:02

by Peter Zijlstra

Subject: Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization

On Wed, Mar 03, 2021 at 09:59:11AM -0500, Johannes Weiner wrote:
> On Wed, Mar 03, 2021 at 11:46:55AM +0800, Chengming Zhou wrote:
> > This patch series is RESEND of the previous patches on psi subsystem. A few
> > weeks passed since the last review, so I put them together and resend for
> > more convenient review and merge.
> >
> > Patch 1 add PSI_CPU_FULL state means all non-idle tasks in a cgroup are delayed
> > on the CPU resource which used by others outside of the cgroup or throttled
> > by the cgroup cpu.max configuration.
> >
> > Patch 2 use ONCPU state and the current in_memstall flag to detect reclaim,
> > remove the hook in timer tick to make code more concise and maintainable.
> > And patch 3 adds unlikely() annotations to move the pressure state branches
> > out of line to eliminate undesirable jumps during wakeup and sleeps.
> >
> > Patch 4 optimize the voluntary sleep switch by remove one call of
> > psi_group_change() for every common cgroup ancestor of the two tasks.
> >
> > Chengming Zhou (3):
> > psi: Add PSI_CPU_FULL state
> > psi: Use ONCPU state tracking machinery to detect reclaim
> > psi: Optimize task switch inside shared cgroups
> >
> > Johannes Weiner (1):
> > psi: pressure states are unlikely
>
> Peter, would you mind routing these through the sched tree for 5.13?

Yes, I can do that. Thanks!

2021-03-04 12:05:30

by Peter Zijlstra

Subject: Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization

On Wed, Mar 03, 2021 at 04:32:18PM +0100, Peter Zijlstra wrote:

> Yes, I can do that. Thanks!

Please double check the patches as found here:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core

I've manually edited the tags.

2021-03-04 12:11:11

by Peter Zijlstra

Subject: Re: [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim

On Wed, Mar 03, 2021 at 11:46:57AM +0800, Chengming Zhou wrote:
> Move the reclaim detection from the timer tick to the task state
> tracking machinery using the recently added ONCPU state. And we
> also add task psi_flags changes checking in the psi_task_switch()
> optimization to update the parents properly.
>
> In terms of performance and cost, this ONCPU task state tracking
> is not cheaper than previous timer tick in aggregate. But the code is
> simpler and shorter this way, so it's a maintainability win. And
> Johannes did some testing with perf bench, the performace and cost
> changes would be acceptable for real workloads.
>
> Thanks to Johannes Weiner for pointing out the psi_task_switch()
> optimization things and the clearer changelog.
>
Co-developed-by: Muchun ?
> Signed-off-by: Muchun Song <[email protected]>
> Signed-off-by: Chengming Zhou <[email protected]>

2021-03-04 12:12:55

by Peter Zijlstra

Subject: Re: [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups

On Wed, Mar 03, 2021 at 11:46:59AM +0800, Chengming Zhou wrote:
> The commit 36b238d57172 ("psi: Optimize switching tasks inside shared
> cgroups") only update cgroups whose state actually changes during a
> task switch only in task preempt case, not in task sleep case.
>
> We actually don't need to clear and set TSK_ONCPU state for common cgroups
> of next and prev task in sleep case, that can save many psi_group_change
> especially when most activity comes from one leaf cgroup.
>
> sleep before:
> psi_dequeue()
> while ((group = iterate_groups(prev))) # all ancestors
> psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
> psi_task_switch()
> while ((group = iterate_groups(next))) # all ancestors
> psi_group_change(next, .set=TSK_ONCPU)
>
> sleep after:
> psi_dequeue()
> nop
> psi_task_switch()
> while ((group = iterate_groups(next))) # until (prev & next)
> psi_group_change(next, .set=TSK_ONCPU)
> while ((group = iterate_groups(prev))) # all ancestors
> psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
>
> When a voluntary sleep switches to another task, we remove one call of
> psi_group_change() for every common cgroup ancestor of the two tasks.
>
Co-developed-by: Muchun ?
> Signed-off-by: Muchun Song <[email protected]>
> Signed-off-by: Chengming Zhou <[email protected]>

2021-03-04 12:41:32

by Johannes Weiner

Subject: Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization

On Wed, Mar 03, 2021 at 05:05:15PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 03, 2021 at 04:32:18PM +0100, Peter Zijlstra wrote:
>
> > Yes, I can do that. Thanks!
>
> Please double check the patches as found here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
>
> I've manually edited the tags.

Looks good to me, thanks!

2021-03-04 13:52:57

by Chengming Zhou

Subject: Re: [External] Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization

On 2021/3/4 12:05 AM, Peter Zijlstra wrote:

> On Wed, Mar 03, 2021 at 04:32:18PM +0100, Peter Zijlstra wrote:
>
>> Yes, I can do that. Thanks!
> Please double check the patches as found here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
>
> I've manually edited the tags.
Looks good to me, thanks!

2021-03-04 20:58:23

by Chengming Zhou

Subject: [PATCH v2 1/4] psi: Add PSI_CPU_FULL state

The FULL state doesn't exist for the CPU resource at the system level, but it
does exist at the cgroup level: it means all non-idle tasks in a cgroup are
delayed on the CPU resource, which is either used by others outside of the
cgroup or throttled by the cgroup's cpu.max configuration.

Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
include/linux/psi_types.h | 3 ++-
kernel/sched/psi.c | 14 +++++++++++---
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index b95f3211566a..0a23300d49af 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -50,9 +50,10 @@ enum psi_states {
PSI_MEM_SOME,
PSI_MEM_FULL,
PSI_CPU_SOME,
+ PSI_CPU_FULL,
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 6,
+ NR_PSI_STATES = 7,
};

enum psi_aggregators {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c0766c..2293c45d289d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -34,7 +34,10 @@
* delayed on that resource such that nobody is advancing and the CPU
* goes idle. This leaves both workload and CPU unproductive.
*
- * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ * Naturally, the FULL state doesn't exist for the CPU resource at the
+ * system level, but exist at the cgroup level, means all non-idle tasks
+ * in a cgroup are delayed on the CPU resource which used by others outside
+ * of the cgroup or throttled by the cgroup cpu.max configuration.
*
* SOME = nr_delayed_tasks != 0
* FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
@@ -225,6 +228,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
case PSI_CPU_SOME:
return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+ case PSI_CPU_FULL:
+ return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -678,8 +683,11 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
}
}

- if (groupc->state_mask & (1 << PSI_CPU_SOME))
+ if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
groupc->times[PSI_CPU_SOME] += delta;
+ if (groupc->state_mask & (1 << PSI_CPU_FULL))
+ groupc->times[PSI_CPU_FULL] += delta;
+ }

if (groupc->state_mask & (1 << PSI_NONIDLE))
groupc->times[PSI_NONIDLE] += delta;
@@ -1018,7 +1026,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);

- for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+ for (full = 0; full < 2; full++) {
unsigned long avg[3];
u64 total;
int w;
--
2.11.0

2021-03-04 23:23:23

by Johannes Weiner

Subject: Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization

On Wed, Mar 03, 2021 at 11:46:55AM +0800, Chengming Zhou wrote:
> This patch series is RESEND of the previous patches on psi subsystem. A few
> weeks passed since the last review, so I put them together and resend for
> more convenient review and merge.
>
> Patch 1 add PSI_CPU_FULL state means all non-idle tasks in a cgroup are delayed
> on the CPU resource which used by others outside of the cgroup or throttled
> by the cgroup cpu.max configuration.
>
> Patch 2 use ONCPU state and the current in_memstall flag to detect reclaim,
> remove the hook in timer tick to make code more concise and maintainable.
> And patch 3 adds unlikely() annotations to move the pressure state branches
> out of line to eliminate undesirable jumps during wakeup and sleeps.
>
> Patch 4 optimize the voluntary sleep switch by remove one call of
> psi_group_change() for every common cgroup ancestor of the two tasks.
>
> Chengming Zhou (3):
> psi: Add PSI_CPU_FULL state
> psi: Use ONCPU state tracking machinery to detect reclaim
> psi: Optimize task switch inside shared cgroups
>
> Johannes Weiner (1):
> psi: pressure states are unlikely

Peter, would you mind routing these through the sched tree for 5.13?

Subject: [tip: sched/core] psi: Add PSI_CPU_FULL state

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 311b293811a31929c72c790eff48cf767561589f
Gitweb: https://git.kernel.org/tip/311b293811a31929c72c790eff48cf767561589f
Author: Chengming Zhou <[email protected]>
AuthorDate: Wed, 03 Mar 2021 11:46:56 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 04 Mar 2021 09:56:01 +01:00

psi: Add PSI_CPU_FULL state

The FULL state doesn't exist for the CPU resource at the system level,
but exist at the cgroup level, means all non-idle tasks in a cgroup are
delayed on the CPU resource which used by others outside of the cgroup
or throttled by the cgroup cpu.max configuration.

Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/psi_types.h | 3 ++-
kernel/sched/psi.c | 14 +++++++++++---
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index b95f321..0a23300 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -50,9 +50,10 @@ enum psi_states {
PSI_MEM_SOME,
PSI_MEM_FULL,
PSI_CPU_SOME,
+ PSI_CPU_FULL,
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 6,
+ NR_PSI_STATES = 7,
};

enum psi_aggregators {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c..2293c45 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -34,7 +34,10 @@
* delayed on that resource such that nobody is advancing and the CPU
* goes idle. This leaves both workload and CPU unproductive.
*
- * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ * Naturally, the FULL state doesn't exist for the CPU resource at the
+ * system level, but exist at the cgroup level, means all non-idle tasks
+ * in a cgroup are delayed on the CPU resource which used by others outside
+ * of the cgroup or throttled by the cgroup cpu.max configuration.
*
* SOME = nr_delayed_tasks != 0
* FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
@@ -225,6 +228,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
case PSI_CPU_SOME:
return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+ case PSI_CPU_FULL:
+ return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -678,8 +683,11 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
}
}

- if (groupc->state_mask & (1 << PSI_CPU_SOME))
+ if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
groupc->times[PSI_CPU_SOME] += delta;
+ if (groupc->state_mask & (1 << PSI_CPU_FULL))
+ groupc->times[PSI_CPU_FULL] += delta;
+ }

if (groupc->state_mask & (1 << PSI_NONIDLE))
groupc->times[PSI_NONIDLE] += delta;
@@ -1018,7 +1026,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);

- for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+ for (full = 0; full < 2; full++) {
unsigned long avg[3];
u64 total;
int w;

Subject: [tip: sched/core] psi: Pressure states are unlikely

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 24f3cb558f59debe0e9159459bb9627b51b47c17
Gitweb: https://git.kernel.org/tip/24f3cb558f59debe0e9159459bb9627b51b47c17
Author: Johannes Weiner <[email protected]>
AuthorDate: Wed, 03 Mar 2021 11:46:58 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 04 Mar 2021 09:56:02 +01:00

psi: Pressure states are unlikely

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/psi.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0fe6ff6..3907a6b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -219,17 +219,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
{
switch (state) {
case PSI_IO_SOME:
- return tasks[NR_IOWAIT];
+ return unlikely(tasks[NR_IOWAIT]);
case PSI_IO_FULL:
- return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+ return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
case PSI_MEM_SOME:
- return tasks[NR_MEMSTALL];
+ return unlikely(tasks[NR_MEMSTALL]);
case PSI_MEM_FULL:
- return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+ return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
case PSI_CPU_SOME:
- return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+ return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
case PSI_CPU_FULL:
- return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
+ return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -729,7 +729,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
* task in a cgroup is in_memstall, the corresponding groupc
* on that cpu is in PSI_MEM_FULL state.
*/
- if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+ if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);

groupc->state_mask = state_mask;

Subject: [tip: sched/core] psi: Use ONCPU state tracking machinery to detect reclaim

The following commit has been merged into the sched/core branch of tip:

Commit-ID: f3f7feec57b9141dfed9825874d0191b1ac18ad2
Gitweb: https://git.kernel.org/tip/f3f7feec57b9141dfed9825874d0191b1ac18ad2
Author: Chengming Zhou <[email protected]>
AuthorDate: Wed, 03 Mar 2021 11:46:57 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 04 Mar 2021 09:56:01 +01:00

psi: Use ONCPU state tracking machinery to detect reclaim

Move the reclaim detection from the timer tick to the task state
tracking machinery using the recently added ONCPU state. And we
also add task psi_flags changes checking in the psi_task_switch()
optimization to update the parents properly.

In terms of performance and cost, this ONCPU task state tracking
is not cheaper than previous timer tick in aggregate. But the code is
simpler and shorter this way, so it's a maintainability win. And
Johannes did some testing with perf bench, the performance and cost
changes would be acceptable for real workloads.

Thanks to Johannes Weiner for pointing out the psi_task_switch()
optimization things and the clearer changelog.

Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/psi.h | 1 +-
kernel/sched/core.c | 1 +-
kernel/sched/psi.c | 65 +++++++++++++++----------------------------
kernel/sched/stats.h | 9 +------
4 files changed, 24 insertions(+), 52 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7361023..65eb147 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -20,7 +20,6 @@ void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);

-void psi_memstall_tick(struct task_struct *task, int cpu);
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 361974e..d2629fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4551,7 +4551,6 @@ void scheduler_tick(void)
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
calc_global_load_tick(rq);
- psi_task_tick(rq);

rq_unlock(rq, &rf);

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2293c45..0fe6ff6 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -644,8 +644,7 @@ static void poll_timer_fn(struct timer_list *t)
wake_up_interruptible(&group->poll_wait);
}

-static void record_times(struct psi_group_cpu *groupc, int cpu,
- bool memstall_tick)
+static void record_times(struct psi_group_cpu *groupc, int cpu)
{
u32 delta;
u64 now;
@@ -664,23 +663,6 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
groupc->times[PSI_MEM_SOME] += delta;
if (groupc->state_mask & (1 << PSI_MEM_FULL))
groupc->times[PSI_MEM_FULL] += delta;
- else if (memstall_tick) {
- u32 sample;
- /*
- * Since we care about lost potential, a
- * memstall is FULL when there are no other
- * working tasks, but also when the CPU is
- * actively reclaiming and nothing productive
- * could run even if it were runnable.
- *
- * When the timer tick sees a reclaiming CPU,
- * regardless of runnable tasks, sample a FULL
- * tick (or less if it hasn't been a full tick
- * since the last state change).
- */
- sample = min(delta, (u32)jiffies_to_nsecs(1));
- groupc->times[PSI_MEM_FULL] += sample;
- }
}

if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
@@ -714,7 +696,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
*/
write_seqcount_begin(&groupc->seq);

- record_times(groupc, cpu, false);
+ record_times(groupc, cpu);

for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
@@ -738,6 +720,18 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (test_state(groupc->tasks, s))
state_mask |= (1 << s);
}
+
+ /*
+ * Since we care about lost potential, a memstall is FULL
+ * when there are no other working tasks, but also when
+ * the CPU is actively reclaiming and nothing productive
+ * could run even if it were runnable. So when the current
+ * task in a cgroup is in_memstall, the corresponding groupc
+ * on that cpu is in PSI_MEM_FULL state.
+ */
+ if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+ state_mask |= (1 << PSI_MEM_FULL);
+
groupc->state_mask = state_mask;

write_seqcount_end(&groupc->seq);
@@ -823,17 +817,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
void *iter;

if (next->pid) {
+ bool identical_state;
+
psi_flags_change(next, 0, TSK_ONCPU);
/*
- * When moving state between tasks, the group that
- * contains them both does not change: we can stop
- * updating the tree once we reach the first common
- * ancestor. Iterate @next's ancestors until we
- * encounter @prev's state.
+ * When switching between tasks that have an identical
+ * runtime state, the cgroup that contains both tasks
+ * does not change: we can stop updating the tree once
+ * we reach the first common ancestor. Iterate @next's
+ * ancestors only until we encounter @prev's ONCPU.
*/
+ identical_state = prev->psi_flags == next->psi_flags;
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
- if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ if (identical_state &&
+ per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
common = group;
break;
}
@@ -859,21 +857,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}

-void psi_memstall_tick(struct task_struct *task, int cpu)
-{
- struct psi_group *group;
- void *iter = NULL;
-
- while ((group = iterate_groups(task, &iter))) {
- struct psi_group_cpu *groupc;
-
- groupc = per_cpu_ptr(group->pcpu, cpu);
- write_seqcount_begin(&groupc->seq);
- record_times(groupc, cpu, true);
- write_seqcount_end(&groupc->seq);
- }
-}
-
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 33d0daf..9e4e67a 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -144,14 +144,6 @@ static inline void psi_sched_switch(struct task_struct *prev,
psi_task_switch(prev, next, sleep);
}

-static inline void psi_task_tick(struct rq *rq)
-{
- if (static_branch_likely(&psi_disabled))
- return;
-
- if (unlikely(rq->curr->in_memstall))
- psi_memstall_tick(rq->curr, cpu_of(rq));
-}
#else /* CONFIG_PSI */
static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -159,7 +151,6 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
-static inline void psi_task_tick(struct rq *rq) {}
#endif /* CONFIG_PSI */

#ifdef CONFIG_SCHED_INFO

Subject: [tip: sched/core] psi: Optimize task switch inside shared cgroups

The following commit has been merged into the sched/core branch of tip:

Commit-ID: e6560d58334ca463061ade733674abc8dd0df9bd
Gitweb: https://git.kernel.org/tip/e6560d58334ca463061ade733674abc8dd0df9bd
Author: Chengming Zhou <[email protected]>
AuthorDate: Wed, 03 Mar 2021 11:46:59 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 04 Mar 2021 09:56:02 +01:00

psi: Optimize task switch inside shared cgroups

The commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") only update cgroups whose state actually changes during a
task switch only in task preempt case, not in task sleep case.

We actually don't need to clear and set TSK_ONCPU state for common cgroups
of next and prev task in sleep case, that can save many psi_group_change
especially when most activity comes from one leaf cgroup.

sleep before:
psi_dequeue()
while ((group = iterate_groups(prev))) # all ancestors
psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
while ((group = iterate_groups(next))) # all ancestors
psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
nop
psi_task_switch()
while ((group = iterate_groups(next))) # until (prev & next)
psi_group_change(next, .set=TSK_ONCPU)
while ((group = iterate_groups(prev))) # all ancestors
psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/psi.c | 35 +++++++++++++++++++++++++----------
kernel/sched/stats.h | 28 ++++++++++++----------------
2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 3907a6b..ee3c5b4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -840,20 +840,35 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}

- /*
- * If this is a voluntary sleep, dequeue will have taken care
- * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
- * only need to deal with it during preemption.
- */
- if (sleep)
- return;
-
if (prev->pid) {
- psi_flags_change(prev, TSK_ONCPU, 0);
+ int clear = TSK_ONCPU, set = 0;
+
+ /*
+ * When we're going to sleep, psi_dequeue() lets us handle
+ * TSK_RUNNING and TSK_IOWAIT here, where we can combine it
+ * with TSK_ONCPU and save walking common ancestors twice.
+ */
+ if (sleep) {
+ clear |= TSK_RUNNING;
+ if (prev->in_iowait)
+ set |= TSK_IOWAIT;
+ }
+
+ psi_flags_change(prev, clear, set);

iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
- psi_group_change(group, cpu, TSK_ONCPU, 0, true);
+ psi_group_change(group, cpu, clear, set, true);
+
+ /*
+ * TSK_ONCPU is handled up to the common ancestor. If we're tasked
+ * with dequeuing too, finish that for the rest of the hierarchy.
+ */
+ if (sleep) {
+ clear &= ~TSK_ONCPU;
+ for (; group; group = iterate_groups(prev, &iter))
+ psi_group_change(group, cpu, clear, set, true);
+ }
}
}

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 9e4e67a..dc218e9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -84,28 +84,24 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)

static inline void psi_dequeue(struct task_struct *p, bool sleep)
{
- int clear = TSK_RUNNING, set = 0;
+ int clear = TSK_RUNNING;

if (static_branch_likely(&psi_disabled))
return;

- if (!sleep) {
- if (p->in_memstall)
- clear |= TSK_MEMSTALL;
- } else {
- /*
- * When a task sleeps, schedule() dequeues it before
- * switching to the next one. Merge the clearing of
- * TSK_RUNNING and TSK_ONCPU to save an unnecessary
- * psi_task_change() call in psi_sched_switch().
- */
- clear |= TSK_ONCPU;
+ /*
+ * A voluntary sleep is a dequeue followed by a task switch. To
+ * avoid walking all ancestors twice, psi_task_switch() handles
+ * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
+ * Do nothing here.
+ */
+ if (sleep)
+ return;

- if (p->in_iowait)
- set |= TSK_IOWAIT;
- }
+ if (p->in_memstall)
+ clear |= TSK_MEMSTALL;

- psi_task_change(p, clear, set);
+ psi_task_change(p, clear, 0);
}

static inline void psi_ttwu_dequeue(struct task_struct *p)

Subject: [tip: sched/core] psi: Optimize task switch inside shared cgroups

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 4117cebf1a9fcbf35b9aabf0e37b6c5eea296798
Gitweb: https://git.kernel.org/tip/4117cebf1a9fcbf35b9aabf0e37b6c5eea296798
Author: Chengming Zhou <[email protected]>
AuthorDate: Wed, 03 Mar 2021 11:46:59 +08:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sat, 06 Mar 2021 12:40:23 +01:00

psi: Optimize task switch inside shared cgroups

The commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") only update cgroups whose state actually changes during a
task switch only in task preempt case, not in task sleep case.

We actually don't need to clear and set TSK_ONCPU state for common cgroups
of next and prev task in sleep case, that can save many psi_group_change
especially when most activity comes from one leaf cgroup.

sleep before:
psi_dequeue()
while ((group = iterate_groups(prev))) # all ancestors
psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
while ((group = iterate_groups(next))) # all ancestors
psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
nop
psi_task_switch()
while ((group = iterate_groups(next))) # until (prev & next)
psi_group_change(next, .set=TSK_ONCPU)
while ((group = iterate_groups(prev))) # all ancestors
psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/psi.c | 35 +++++++++++++++++++++++++----------
kernel/sched/stats.h | 28 ++++++++++++----------------
2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 3907a6b..ee3c5b4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -840,20 +840,35 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}

- /*
- * If this is a voluntary sleep, dequeue will have taken care
- * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
- * only need to deal with it during preemption.
- */
- if (sleep)
- return;
-
if (prev->pid) {
- psi_flags_change(prev, TSK_ONCPU, 0);
+ int clear = TSK_ONCPU, set = 0;
+
+ /*
+ * When we're going to sleep, psi_dequeue() lets us handle
+ * TSK_RUNNING and TSK_IOWAIT here, where we can combine it
+ * with TSK_ONCPU and save walking common ancestors twice.
+ */
+ if (sleep) {
+ clear |= TSK_RUNNING;
+ if (prev->in_iowait)
+ set |= TSK_IOWAIT;
+ }
+
+ psi_flags_change(prev, clear, set);

iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
- psi_group_change(group, cpu, TSK_ONCPU, 0, true);
+ psi_group_change(group, cpu, clear, set, true);
+
+ /*
+ * TSK_ONCPU is handled up to the common ancestor. If we're tasked
+ * with dequeuing too, finish that for the rest of the hierarchy.
+ */
+ if (sleep) {
+ clear &= ~TSK_ONCPU;
+ for (; group; group = iterate_groups(prev, &iter))
+ psi_group_change(group, cpu, clear, set, true);
+ }
}
}

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 9e4e67a..dc218e9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -84,28 +84,24 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)

static inline void psi_dequeue(struct task_struct *p, bool sleep)
{
- int clear = TSK_RUNNING, set = 0;
+ int clear = TSK_RUNNING;

if (static_branch_likely(&psi_disabled))
return;

- if (!sleep) {
- if (p->in_memstall)
- clear |= TSK_MEMSTALL;
- } else {
- /*
- * When a task sleeps, schedule() dequeues it before
- * switching to the next one. Merge the clearing of
- * TSK_RUNNING and TSK_ONCPU to save an unnecessary
- * psi_task_change() call in psi_sched_switch().
- */
- clear |= TSK_ONCPU;
+ /*
+ * A voluntary sleep is a dequeue followed by a task switch. To
+ * avoid walking all ancestors twice, psi_task_switch() handles
+ * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
+ * Do nothing here.
+ */
+ if (sleep)
+ return;

- if (p->in_iowait)
- set |= TSK_IOWAIT;
- }
+ if (p->in_memstall)
+ clear |= TSK_MEMSTALL;

- psi_task_change(p, clear, set);
+ psi_task_change(p, clear, 0);
}

static inline void psi_ttwu_dequeue(struct task_struct *p)

Subject: [tip: sched/core] psi: Add PSI_CPU_FULL state

The following commit has been merged into the sched/core branch of tip:

Commit-ID: e7fcd762282332f765af2035a9568fb126fa3c01
Gitweb: https://git.kernel.org/tip/e7fcd762282332f765af2035a9568fb126fa3c01
Author: Chengming Zhou <[email protected]>
AuthorDate: Wed, 03 Mar 2021 11:46:56 +08:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

psi: Add PSI_CPU_FULL state

The FULL state doesn't exist for the CPU resource at the system level,
but exist at the cgroup level, means all non-idle tasks in a cgroup are
delayed on the CPU resource which used by others outside of the cgroup
or throttled by the cgroup cpu.max configuration.

Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/psi_types.h | 3 ++-
kernel/sched/psi.c | 14 +++++++++++---
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index b95f321..0a23300 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -50,9 +50,10 @@ enum psi_states {
PSI_MEM_SOME,
PSI_MEM_FULL,
PSI_CPU_SOME,
+ PSI_CPU_FULL,
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 6,
+ NR_PSI_STATES = 7,
};

enum psi_aggregators {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c..2293c45 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -34,7 +34,10 @@
* delayed on that resource such that nobody is advancing and the CPU
* goes idle. This leaves both workload and CPU unproductive.
*
- * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ * Naturally, the FULL state doesn't exist for the CPU resource at the
+ * system level, but exist at the cgroup level, means all non-idle tasks
+ * in a cgroup are delayed on the CPU resource which used by others outside
+ * of the cgroup or throttled by the cgroup cpu.max configuration.
*
* SOME = nr_delayed_tasks != 0
* FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
@@ -225,6 +228,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
case PSI_CPU_SOME:
return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+ case PSI_CPU_FULL:
+ return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -678,8 +683,11 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
}
}

- if (groupc->state_mask & (1 << PSI_CPU_SOME))
+ if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
groupc->times[PSI_CPU_SOME] += delta;
+ if (groupc->state_mask & (1 << PSI_CPU_FULL))
+ groupc->times[PSI_CPU_FULL] += delta;
+ }

if (groupc->state_mask & (1 << PSI_NONIDLE))
groupc->times[PSI_NONIDLE] += delta;
@@ -1018,7 +1026,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);

- for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+ for (full = 0; full < 2; full++) {
unsigned long avg[3];
u64 total;
int w;

Subject: [tip: sched/core] psi: Use ONCPU state tracking machinery to detect reclaim

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 7fae6c8171d20ac55402930ee8ae760cf85dff7b
Gitweb: https://git.kernel.org/tip/7fae6c8171d20ac55402930ee8ae760cf85dff7b
Author: Chengming Zhou <[email protected]>
AuthorDate: Wed, 03 Mar 2021 11:46:57 +08:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

psi: Use ONCPU state tracking machinery to detect reclaim

Move the reclaim detection from the timer tick to the task state
tracking machinery using the recently added ONCPU state. And we
also add task psi_flags changes checking in the psi_task_switch()
optimization to update the parents properly.

In terms of performance and cost, this ONCPU task state tracking
is not cheaper than previous timer tick in aggregate. But the code is
simpler and shorter this way, so it's a maintainability win. And
Johannes did some testing with perf bench, the performance and cost
changes would be acceptable for real workloads.

Thanks to Johannes Weiner for pointing out the psi_task_switch()
optimization things and the clearer changelog.

Co-developed-by: Muchun Song <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/psi.h | 1 +-
kernel/sched/core.c | 1 +-
kernel/sched/psi.c | 65 +++++++++++++++----------------------------
kernel/sched/stats.h | 9 +------
4 files changed, 24 insertions(+), 52 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7361023..65eb147 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -20,7 +20,6 @@ void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);

-void psi_memstall_tick(struct task_struct *task, int cpu);
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 361974e..d2629fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4551,7 +4551,6 @@ void scheduler_tick(void)
update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
calc_global_load_tick(rq);
- psi_task_tick(rq);

rq_unlock(rq, &rf);

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2293c45..0fe6ff6 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -644,8 +644,7 @@ static void poll_timer_fn(struct timer_list *t)
wake_up_interruptible(&group->poll_wait);
}

-static void record_times(struct psi_group_cpu *groupc, int cpu,
- bool memstall_tick)
+static void record_times(struct psi_group_cpu *groupc, int cpu)
{
u32 delta;
u64 now;
@@ -664,23 +663,6 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
groupc->times[PSI_MEM_SOME] += delta;
if (groupc->state_mask & (1 << PSI_MEM_FULL))
groupc->times[PSI_MEM_FULL] += delta;
- else if (memstall_tick) {
- u32 sample;
- /*
- * Since we care about lost potential, a
- * memstall is FULL when there are no other
- * working tasks, but also when the CPU is
- * actively reclaiming and nothing productive
- * could run even if it were runnable.
- *
- * When the timer tick sees a reclaiming CPU,
- * regardless of runnable tasks, sample a FULL
- * tick (or less if it hasn't been a full tick
- * since the last state change).
- */
- sample = min(delta, (u32)jiffies_to_nsecs(1));
- groupc->times[PSI_MEM_FULL] += sample;
- }
}

if (groupc->state_mask & (1 << PSI_CPU_SOME)) {
@@ -714,7 +696,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
*/
write_seqcount_begin(&groupc->seq);

- record_times(groupc, cpu, false);
+ record_times(groupc, cpu);

for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
@@ -738,6 +720,18 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (test_state(groupc->tasks, s))
state_mask |= (1 << s);
}
+
+ /*
+ * Since we care about lost potential, a memstall is FULL
+ * when there are no other working tasks, but also when
+ * the CPU is actively reclaiming and nothing productive
+ * could run even if it were runnable. So when the current
+ * task in a cgroup is in_memstall, the corresponding groupc
+ * on that cpu is in PSI_MEM_FULL state.
+ */
+ if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+ state_mask |= (1 << PSI_MEM_FULL);
+
groupc->state_mask = state_mask;

write_seqcount_end(&groupc->seq);
@@ -823,17 +817,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
void *iter;

if (next->pid) {
+ bool identical_state;
+
psi_flags_change(next, 0, TSK_ONCPU);
/*
- * When moving state between tasks, the group that
- * contains them both does not change: we can stop
- * updating the tree once we reach the first common
- * ancestor. Iterate @next's ancestors until we
- * encounter @prev's state.
+ * When switching between tasks that have an identical
+ * runtime state, the cgroup that contains both tasks
+ * does not change: we can stop updating the tree once
+ * we reach the first common ancestor. Iterate @next's
+ * ancestors only until we encounter @prev's ONCPU.
*/
+ identical_state = prev->psi_flags == next->psi_flags;
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
- if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ if (identical_state &&
+ per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
common = group;
break;
}
@@ -859,21 +857,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}

-void psi_memstall_tick(struct task_struct *task, int cpu)
-{
- struct psi_group *group;
- void *iter = NULL;
-
- while ((group = iterate_groups(task, &iter))) {
- struct psi_group_cpu *groupc;
-
- groupc = per_cpu_ptr(group->pcpu, cpu);
- write_seqcount_begin(&groupc->seq);
- record_times(groupc, cpu, true);
- write_seqcount_end(&groupc->seq);
- }
-}
-
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 33d0daf..9e4e67a 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -144,14 +144,6 @@ static inline void psi_sched_switch(struct task_struct *prev,
psi_task_switch(prev, next, sleep);
}

-static inline void psi_task_tick(struct rq *rq)
-{
- if (static_branch_likely(&psi_disabled))
- return;
-
- if (unlikely(rq->curr->in_memstall))
- psi_memstall_tick(rq->curr, cpu_of(rq));
-}
#else /* CONFIG_PSI */
static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -159,7 +151,6 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
-static inline void psi_task_tick(struct rq *rq) {}
#endif /* CONFIG_PSI */

#ifdef CONFIG_SCHED_INFO

Subject: [tip: sched/core] psi: Pressure states are unlikely

The following commit has been merged into the sched/core branch of tip:

Commit-ID: fddc8bab531e217806b84906681324377d741c6c
Gitweb: https://git.kernel.org/tip/fddc8bab531e217806b84906681324377d741c6c
Author: Johannes Weiner <[email protected]>
AuthorDate: Wed, 03 Mar 2021 11:46:58 +08:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sat, 06 Mar 2021 12:40:23 +01:00

psi: Pressure states are unlikely

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/psi.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0fe6ff6..3907a6b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -219,17 +219,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
{
switch (state) {
case PSI_IO_SOME:
- return tasks[NR_IOWAIT];
+ return unlikely(tasks[NR_IOWAIT]);
case PSI_IO_FULL:
- return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+ return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
case PSI_MEM_SOME:
- return tasks[NR_MEMSTALL];
+ return unlikely(tasks[NR_MEMSTALL]);
case PSI_MEM_FULL:
- return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+ return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
case PSI_CPU_SOME:
- return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+ return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
case PSI_CPU_FULL:
- return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
+ return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -729,7 +729,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
* task in a cgroup is in_memstall, the corresponding groupc
* on that cpu is in PSI_MEM_FULL state.
*/
- if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+ if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);

groupc->state_mask = state_mask;