Hi all,
This patch series contains some optimizations and extensions for PSI.
Patch 1/10 fixes the periodic aggregation shut-off problem introduced by
the earlier commit 4117cebf1a9f ("psi: Optimize task switch inside shared
cgroups").
Patches 2-4 are some misc optimizations, so they are placed at the front
of this series.
Patch 5/10 optimizes task switching inside shared cgroups when the
in_memstall status of the prev and next tasks differs.
Patch 6/10 removes the NR_ONCPU task accounting to save 4 bytes in the
first cacheline, which are then used by the following patch 7/10, which
introduces the new PSI resource PSI_IRQ to track IRQ/SOFTIRQ pressure
stall information.
Patches 8-9 cache the parent psi_group in struct psi_group to speed up
the hot iteration path.
Patch 10/10 introduces a per-cgroup interface "cgroup.pressure" to
disable or re-enable PSI at the cgroup level; hiding and unhiding of the
pressure files is implemented per Tejun's suggestion[1], which depends on
his work[2].
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
Performance test using mmtests/config-scheduler-perfpipe in
/user.slice/user-0.slice/session-4.scope:
next patched patched/only-leaf
Min Time 8.82 ( 0.00%) 8.49 ( 3.74%) 8.00 ( 9.32%)
1st-qrtle Time 8.90 ( 0.00%) 8.58 ( 3.63%) 8.05 ( 9.58%)
2nd-qrtle Time 8.94 ( 0.00%) 8.61 ( 3.65%) 8.09 ( 9.50%)
3rd-qrtle Time 8.99 ( 0.00%) 8.65 ( 3.75%) 8.15 ( 9.35%)
Max-1 Time 8.82 ( 0.00%) 8.49 ( 3.74%) 8.00 ( 9.32%)
Max-5 Time 8.82 ( 0.00%) 8.49 ( 3.74%) 8.00 ( 9.32%)
Max-10 Time 8.84 ( 0.00%) 8.55 ( 3.20%) 8.04 ( 9.05%)
Max-90 Time 9.04 ( 0.00%) 8.67 ( 4.10%) 8.18 ( 9.51%)
Max-95 Time 9.04 ( 0.00%) 8.68 ( 4.03%) 8.20 ( 9.26%)
Max-99 Time 9.07 ( 0.00%) 8.73 ( 3.82%) 8.25 ( 9.11%)
Max Time 9.12 ( 0.00%) 8.89 ( 2.54%) 8.27 ( 9.29%)
Amean Time 8.95 ( 0.00%) 8.62 * 3.67%* 8.11 * 9.43%*
Big thanks to Johannes Weiner, Tejun Heo and Michal Koutný for your
suggestions and review!
Changes in v4:
- Collect Acked-by tags from Johannes Weiner.
- Add clearer comments and changelogs per Johannes Weiner.
- Replace for_each_psi_group() with clearer open-coded iteration.
- Use the better names cgroup_pressure_show() and
  cgroup_pressure_write().
- Use the better name psi_cgroup_restart() and only call it on
  enabling.
Changes in v3:
- Rebase on linux-next and reorder patches to put the misc optimization
  patches at the front of this series.
- Drop patch "sched/psi: don't change task psi_flags when migrate CPU/group"
  since it caused a small performance regression and was only code
  refactoring.
- Don't define PSI_IRQ and PSI_IRQ_FULL when !CONFIG_IRQ_TIME_ACCOUNTING,
  in which case they are not used.
- Add patch 8/10 "sched/psi: consolidate cgroup_psi()" so that cgroup_psi()
  can handle all cgroups including the root cgroup, making patch 9/10
  simpler.
- Rename the interface to "cgroup.pressure" and add some explanation
  per Michal's suggestion.
- Hide and unhide the pressure files when disabling/re-enabling cgroup
  PSI, which depends on Tejun's work.
Changes in v2:
- Add Acked-by tags from Johannes Weiner. Thanks for the review!
- Fix periodic aggregation wakeup for common ancestors in
  psi_task_switch().
- Add patch 7/10 from Johannes Weiner, which removes the NR_ONCPU
  task accounting to save 4 bytes in the first cacheline.
- Remove the "psi_irq=" kernel cmdline parameter from the last version.
- Add the per-cgroup interface "cgroup.psi" to disable/re-enable
  PSI stats accounting at the cgroup level.
Chengming Zhou (9):
sched/psi: fix periodic aggregation shut off
sched/psi: don't create cgroup PSI files when psi_disabled
sched/psi: save percpu memory when !psi_cgroups_enabled
sched/psi: move private helpers to sched/stats.h
sched/psi: optimize task switch inside shared cgroups again
sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
sched/psi: consolidate cgroup_psi()
sched/psi: cache parent psi_group to speed up groups iterate
sched/psi: per-cgroup PSI accounting disable/re-enable interface
Johannes Weiner (1):
sched/psi: remove NR_ONCPU task accounting
Documentation/admin-guide/cgroup-v2.rst | 23 ++
include/linux/cgroup-defs.h | 3 +
include/linux/cgroup.h | 5 -
include/linux/psi.h | 12 +-
include/linux/psi_types.h | 29 ++-
kernel/cgroup/cgroup.c | 106 ++++++++-
kernel/sched/core.c | 1 +
kernel/sched/psi.c | 280 +++++++++++++++++-------
kernel/sched/stats.h | 6 +
9 files changed, 362 insertions(+), 103 deletions(-)
--
2.37.2
We don't want to wake periodic aggregation work back up if the
task change is the aggregation worker itself going to sleep, or
we'll ping-pong forever.
Previously, we would use psi_task_change() in psi_dequeue() when a task
was going to sleep, so this check was put in psi_task_change(). But
commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred task sleep handling to psi_task_switch(), which doesn't go
through psi_task_change() anymore.
So this patch moves the check to psi_task_switch().
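To illustrate, a rough sketch of the ping-pong this check prevents
(wq_worker_last_func() is the existing workqueue helper reporting which
function the kworker last ran):

  psi_avgs_work() finishes, the kworker goes to sleep
    -> psi_task_switch(prev=kworker, ..., sleep=true)
       -> psi_group_change(..., wake_clock=true)
          -> schedule_delayed_work(&group->avgs_work, PSI_FREQ)
             -> psi_avgs_work() runs again with no task activity, ...

With wake_clock forced false for the avgs worker itself, the clock can
shut off once there really are no task changes.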
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
kernel/sched/psi.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ecb4b4ff4ce0..39463dcc16bb 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,7 +796,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- bool wake_clock = true;
void *iter = NULL;
u64 now;
@@ -806,19 +805,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
psi_flags_change(task, clear, set);
now = cpu_clock(cpu);
- /*
- * Periodic aggregation shuts off if there is a period of no
- * task changes, so we wake it back up if necessary. However,
- * don't do this if the task change is the aggregation worker
- * itself going to sleep, or we'll ping-pong forever.
- */
- if (unlikely((clear & TSK_RUNNING) &&
- (task->flags & PF_WQ_WORKER) &&
- wq_worker_last_func(task) == psi_avgs_work))
- wake_clock = false;
while ((group = iterate_groups(task, &iter)))
- psi_group_change(group, cpu, clear, set, now, wake_clock);
+ psi_group_change(group, cpu, clear, set, now, true);
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -854,6 +843,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (prev->pid) {
int clear = TSK_ONCPU, set = 0;
+ bool wake_clock = true;
/*
* When we're going to sleep, psi_dequeue() lets us
@@ -867,13 +857,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
clear |= TSK_MEMSTALL_RUNNING;
if (prev->in_iowait)
set |= TSK_IOWAIT;
+
+ /*
+ * Periodic aggregation shuts off if there is a period of no
+ * task changes, so we wake it back up if necessary. However,
+ * don't do this if the task change is the aggregation worker
+ * itself going to sleep, or we'll ping-pong forever.
+ */
+ if (unlikely((prev->flags & PF_WQ_WORKER) &&
+ wq_worker_last_func(prev) == psi_avgs_work))
+ wake_clock = false;
}
psi_flags_change(prev, clear, set);
iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
/*
* TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -882,7 +882,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (sleep) {
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
}
}
}
--
2.37.2
cgroup_psi() can't return the psi_group for the root cgroup, so we have
many open-coded instances of "psi = cgroup_ino(cgrp) == 1 ? &psi_system :
cgrp->psi" (the root cgroup has inode number 1).
This patch moves the cgroup_psi() definition to <linux/psi.h>, where it
can return psi_system for the root cgroup and therefore handle all
cgroups.
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
include/linux/cgroup.h | 5 -----
include/linux/psi.h | 6 ++++++
kernel/cgroup/cgroup.c | 10 +++++-----
3 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7ed1fa7a6fc8..3c48753f2949 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -682,11 +682,6 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
}
-static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
-{
- return cgrp->psi;
-}
-
bool cgroup_psi_enabled(void);
static inline void cgroup_init_kthreadd(void)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index fffd229fbf19..362a74ca1d3b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/poll.h>
#include <linux/cgroup-defs.h>
+#include <linux/cgroup.h>
struct seq_file;
struct css_set;
@@ -30,6 +31,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
poll_table *wait);
#ifdef CONFIG_CGROUPS
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+}
+
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 371131a8b6f8..1d392c91ef95 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3657,21 +3657,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_IO);
}
static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_MEM);
}
static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_CPU);
}
@@ -3697,7 +3697,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
return -EBUSY;
}
- psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ psi = cgroup_psi(cgrp);
new = psi_trigger_create(psi, buf, res);
if (IS_ERR(new)) {
cgroup_put(cgrp);
@@ -3735,7 +3735,7 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_IRQ);
}
--
2.37.2
Way back when PSI_MEM_FULL was accounted from the timer tick, task
switching could simply iterate next and prev to the common ancestor to
update TSK_ONCPU and be done.
Then memstall ticks were replaced with checking curr->in_memstall
directly in psi_group_change(). That meant that now if the task switch
was between a memstall and a !memstall task, we had to iterate through
the common ancestors at least ONCE to fix up their state_masks.
We added the identical_state filter to make sure the common ancestor
elimination was skipped in that case. It seems that was always a
little too eager, because it caused us to walk the common ancestors
*twice* instead of the required once: the iteration for next could
have stopped at the common ancestor; prev could have updated TSK_ONCPU
up to the common ancestor, then finish to the root without changing
any flags, just to get the new curr->in_memstall into the state_masks.
This patch recognizes this and makes it so that we walk to the root
exactly once if state_mask needs updating, which is simply catching up
on a missed optimization that could have been done in commit 7fae6c8171d2
("psi: Use ONCPU state tracking machinery to detect reclaim") directly.
Apart from this, it's also necessary for the next patch, "sched/psi:
remove NR_ONCPU task accounting". Suppose we walk the common ancestors
twice:
(1) psi_group_change(.clear = 0, .set = TSK_ONCPU)
(2) psi_group_change(.clear = TSK_ONCPU, .set = 0)
We previously used tasks[NR_ONCPU] to record TSK_ONCPU: tasks[NR_ONCPU]++
in (1), then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] remains correct.
The next patch switches to one bit in the state mask to record TSK_ONCPU:
the PSI_ONCPU bit would be set in (1) but then cleared in (2), leaving the
psi_group_cpu with a task running on the CPU but without the PSI_ONCPU
bit set!
With this patch, we never walk the common ancestors twice, so we won't
have the above problem.
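As a hedged illustration, suppose prev runs in cgroup /a/b and next in
/a/c, so /a is their first common ancestor:

      root (psi_system)
        |
        a          <- common: prev already holds TSK_ONCPU here
       / \
      b   c
   (prev) (next)

TSK_ONCPU only needs flipping on b and c. With this patch, a and root
are walked at most once more, and only when prev and next differ in
flags other than TSK_ONCPU (e.g. only one of them is in memstall, or
prev is going to sleep).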
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
---
kernel/sched/psi.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 77d53c03a76f..d71dbc2356ff 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -820,20 +820,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
u64 now = cpu_clock(cpu);
if (next->pid) {
- bool identical_state;
-
psi_flags_change(next, 0, TSK_ONCPU);
/*
- * When switching between tasks that have an identical
- * runtime state, the cgroup that contains both tasks
- * we reach the first common ancestor. Iterate @next's
- * ancestors only until we encounter @prev's ONCPU.
+ * Set TSK_ONCPU on @next's cgroups. If @next shares any
+ * ancestors with @prev, those will already have @prev's
+ * TSK_ONCPU bit set, and we can stop the iteration there.
*/
- identical_state = prev->psi_flags == next->psi_flags;
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
- if (identical_state &&
- per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
common = group;
break;
}
@@ -877,10 +872,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_group_change(group, cpu, clear, set, now, wake_clock);
/*
- * TSK_ONCPU is handled up to the common ancestor. If we're tasked
- * with dequeuing too, finish that for the rest of the hierarchy.
+ * TSK_ONCPU is handled up to the common ancestor. If there are
+ * any other differences between the two tasks (e.g. prev goes
+ * to sleep, or only one task is memstall), finish propagating
+ * those differences all the way up to the root.
*/
- if (sleep) {
+ if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
psi_group_change(group, cpu, clear, set, now, wake_clock);
--
2.37.2
We use iterate_groups() to iterate over each level's psi_group to update
PSI stats, which is a very hot path.
In the current code, iterate_groups() has to use multiple branches and
cgroup_parent() to get the parent psi_group of each level, which is not
very efficient.
This patch caches the parent psi_group in struct psi_group, so we only
need to get the task's own psi_group first and can then iterate using
group->parent.
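A sketch of the resulting parent chain for a task in a cgroup /a/b (the
root cgroup maps to psi_system via cgroup_psi(), and the static
psi_system has a NULL parent, which terminates the do-while loops):

  task_psi_group(task) -> b's psi_group
         ->parent      -> a's psi_group
         ->parent      -> &psi_system
         ->parent      -> NULL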
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
include/linux/psi_types.h | 2 ++
kernel/sched/psi.c | 49 +++++++++++++++------------------------
2 files changed, 21 insertions(+), 30 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 40c28171cd91..a0b746258c68 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -151,6 +151,8 @@ struct psi_trigger {
};
struct psi_group {
+ struct psi_group *parent;
+
/* Protects data used by the aggregator */
struct mutex avgs_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2545a78f82d8..9a8aee80a087 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -772,27 +772,12 @@ static void psi_group_change(struct psi_group *group, int cpu,
schedule_delayed_work(&group->avgs_work, PSI_FREQ);
}
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
{
- if (*iter == &psi_system)
- return NULL;
-
#ifdef CONFIG_CGROUPS
- if (static_branch_likely(&psi_cgroups_enabled)) {
- struct cgroup *cgroup = NULL;
-
- if (!*iter)
- cgroup = task->cgroups->dfl_cgrp;
- else
- cgroup = cgroup_parent(*iter);
-
- if (cgroup && cgroup_parent(cgroup)) {
- *iter = cgroup;
- return cgroup_psi(cgroup);
- }
- }
+ if (static_branch_likely(&psi_cgroups_enabled))
+ return cgroup_psi(task_dfl_cgroup(task));
#endif
- *iter = &psi_system;
return &psi_system;
}
@@ -815,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- void *iter = NULL;
u64 now;
if (!task->pid)
@@ -825,8 +809,10 @@ void psi_task_change(struct task_struct *task, int clear, int set)
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter)))
+ group = task_psi_group(task);
+ do {
psi_group_change(group, cpu, clear, set, now, true);
+ } while ((group = group->parent));
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -834,7 +820,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
{
struct psi_group *group, *common = NULL;
int cpu = task_cpu(prev);
- void *iter;
u64 now = cpu_clock(cpu);
if (next->pid) {
@@ -844,8 +829,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
* ancestors with @prev, those will already have @prev's
* TSK_ONCPU bit set, and we can stop the iteration there.
*/
- iter = NULL;
- while ((group = iterate_groups(next, &iter))) {
+ group = task_psi_group(next);
+ do {
if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
PSI_ONCPU) {
common = group;
@@ -853,7 +838,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
- }
+ } while ((group = group->parent));
}
if (prev->pid) {
@@ -886,9 +871,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_flags_change(prev, clear, set);
- iter = NULL;
- while ((group = iterate_groups(prev, &iter)) && group != common)
+ group = task_psi_group(prev);
+ do {
+ if (group == common)
+ break;
psi_group_change(group, cpu, clear, set, now, wake_clock);
+ } while ((group = group->parent));
/*
* TSK_ONCPU is handled up to the common ancestor. If there are
@@ -898,7 +886,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
clear &= ~TSK_ONCPU;
- for (; group; group = iterate_groups(prev, &iter))
+ for (; group; group = group->parent)
psi_group_change(group, cpu, clear, set, now, wake_clock);
}
}
@@ -908,7 +896,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
void psi_account_irqtime(struct task_struct *task, u32 delta)
{
int cpu = task_cpu(task);
- void *iter = NULL;
struct psi_group *group;
struct psi_group_cpu *groupc;
u64 now;
@@ -918,7 +905,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter))) {
+ group = task_psi_group(task);
+ do {
groupc = per_cpu_ptr(group->pcpu, cpu);
write_seqcount_begin(&groupc->seq);
@@ -930,7 +918,7 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
if (group->poll_states & (1 << PSI_IRQ_FULL))
psi_schedule_poll_work(group, 1);
- }
+ } while ((group = group->parent));
}
#endif
@@ -1010,6 +998,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
return -ENOMEM;
}
group_init(cgroup->psi);
+ cgroup->psi->parent = cgroup_psi(cgroup_parent(cgroup));
return 0;
}
--
2.37.2
We won't use the cgroup psi_group when !psi_cgroups_enabled, so don't
bother allocating percpu memory for it and initializing it.
We also don't need to migrate task PSI stats between cgroups in
cgroup_move_task().
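For reference, the resulting static key states, assuming the existing
"cgroup_disable=pressure" handling stays as-is:

  psi=0 boot parameter    : psi_disabled on,  psi_cgroups_enabled off
  cgroup_disable=pressure : psi_disabled off, psi_cgroups_enabled off
  default                 : psi_disabled off, psi_cgroups_enabled on

so testing psi_cgroups_enabled alone covers both ways that cgroup PSI
can be off.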
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
kernel/sched/psi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 39463dcc16bb..77d53c03a76f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -201,6 +201,7 @@ void __init psi_init(void)
{
if (!psi_enable) {
static_branch_enable(&psi_disabled);
+ static_branch_disable(&psi_cgroups_enabled);
return;
}
@@ -950,7 +951,7 @@ void psi_memstall_leave(unsigned long *flags)
#ifdef CONFIG_CGROUPS
int psi_cgroup_alloc(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return 0;
cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL);
@@ -968,7 +969,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
void psi_cgroup_free(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return;
cancel_delayed_work_sync(&cgroup->psi->avgs_work);
@@ -996,7 +997,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
struct rq_flags rf;
struct rq *rq;
- if (static_branch_likely(&psi_disabled)) {
+ if (!static_branch_likely(&psi_cgroups_enabled)) {
/*
* Lame to do this here, but the scheduler cannot be locked
* from the outside, so we move cgroups from inside sched/.
--
2.37.2
Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking
configurable") made it possible to configure PSI to skip per-cgroup
stall accounting, in which case the PSI files are also not exposed in
the cgroup hierarchy.
This patch does the same thing when psi_disabled.
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
kernel/cgroup/cgroup.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 03dbbf8a8c28..2f79ddf9a85d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3748,6 +3748,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
bool cgroup_psi_enabled(void)
{
+ if (static_branch_likely(&psi_disabled))
+ return false;
+
return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
}
--
2.37.2
From: Johannes Weiner <[email protected]>
We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance.
Since we want to add the new PSI_IRQ_FULL state to track IRQ/SOFTIRQ
pressure, we need to reclaim space first. This patch removes the NR_ONCPU
task accounting from struct psi_group_cpu and uses one bit in state_mask
to track the on-CPU state instead.
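The resulting layout, per the psi_types.h hunk below:

  tasks[]    : NR_IOWAIT, NR_MEMSTALL, NR_RUNNING,
               NR_MEMSTALL_RUNNING      (NR_PSI_TASK_COUNTS = 4)
  TSK_ONCPU  : 1 << NR_PSI_TASK_COUNTS  (a flag with no task count)
  state_mask : bits 0..NR_PSI_STATES-1 for the PSI states, plus
               PSI_ONCPU = 1 << NR_PSI_STATES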
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Tested-by: Chengming Zhou <[email protected]>
---
include/linux/psi_types.h | 16 +++++++--------
kernel/sched/psi.c | 41 ++++++++++++++++++++++++++++-----------
2 files changed, 37 insertions(+), 20 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..54cb74946db4 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -15,13 +15,6 @@ enum psi_task_count {
NR_IOWAIT,
NR_MEMSTALL,
NR_RUNNING,
- /*
- * This can't have values other than 0 or 1 and could be
- * implemented as a bit flag. But for now we still have room
- * in the first cacheline of psi_group_cpu, and this way we
- * don't have to special case any state tracking for it.
- */
- NR_ONCPU,
/*
* For IO and CPU stalls the presence of running/oncpu tasks
* in the domain means a partial rather than a full stall.
@@ -32,16 +25,18 @@ enum psi_task_count {
* threads and memstall ones.
*/
NR_MEMSTALL_RUNNING,
- NR_PSI_TASK_COUNTS = 5,
+ NR_PSI_TASK_COUNTS = 4,
};
/* Task state bitmasks */
#define TSK_IOWAIT (1 << NR_IOWAIT)
#define TSK_MEMSTALL (1 << NR_MEMSTALL)
#define TSK_RUNNING (1 << NR_RUNNING)
-#define TSK_ONCPU (1 << NR_ONCPU)
#define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING)
+/* Only one task can be scheduled, no corresponding task count */
+#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS)
+
/* Resources that workloads could be stalled on */
enum psi_res {
PSI_IO,
@@ -68,6 +63,9 @@ enum psi_states {
NR_PSI_STATES = 7,
};
+/* Use one bit in the state mask to track TSK_ONCPU */
+#define PSI_ONCPU (1 << NR_PSI_STATES)
+
enum psi_aggregators {
PSI_AVGS = 0,
PSI_POLL,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d71dbc2356ff..4702a770e272 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -212,7 +212,7 @@ void __init psi_init(void)
group_init(&psi_system);
}
-static bool test_state(unsigned int *tasks, enum psi_states state)
+static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
{
switch (state) {
case PSI_IO_SOME:
@@ -225,9 +225,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
return unlikely(tasks[NR_MEMSTALL] &&
tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
case PSI_CPU_SOME:
- return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
+ return unlikely(tasks[NR_RUNNING] > oncpu);
case PSI_CPU_FULL:
- return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
+ return unlikely(tasks[NR_RUNNING] && !oncpu);
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -689,9 +689,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
bool wake_clock)
{
struct psi_group_cpu *groupc;
- u32 state_mask = 0;
unsigned int t, m;
enum psi_states s;
+ u32 state_mask;
groupc = per_cpu_ptr(group->pcpu, cpu);
@@ -707,17 +707,36 @@ static void psi_group_change(struct psi_group *group, int cpu,
record_times(groupc, now);
+ /*
+ * Start with TSK_ONCPU, which doesn't have a corresponding
+ * task count - it's just a boolean flag directly encoded in
+ * the state mask. Clear, set, or carry the current state if
+ * no changes are requested.
+ */
+ if (unlikely(clear & TSK_ONCPU)) {
+ state_mask = 0;
+ clear &= ~TSK_ONCPU;
+ } else if (unlikely(set & TSK_ONCPU)) {
+ state_mask = PSI_ONCPU;
+ set &= ~TSK_ONCPU;
+ } else {
+ state_mask = groupc->state_mask & PSI_ONCPU;
+ }
+
+ /*
+ * The rest of the state mask is calculated based on the task
+ * counts. Update those first, then construct the mask.
+ */
for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
continue;
if (groupc->tasks[t]) {
groupc->tasks[t]--;
} else if (!psi_bug) {
- printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u %u] clear=%x set=%x\n",
+ printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
cpu, t, groupc->tasks[0],
groupc->tasks[1], groupc->tasks[2],
- groupc->tasks[3], groupc->tasks[4],
- clear, set);
+ groupc->tasks[3], clear, set);
psi_bug = 1;
}
}
@@ -726,9 +745,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (set & (1 << t))
groupc->tasks[t]++;
- /* Calculate state mask representing active states */
for (s = 0; s < NR_PSI_STATES; s++) {
- if (test_state(groupc->tasks, s))
+ if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
state_mask |= (1 << s);
}
@@ -740,7 +758,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
* task in a cgroup is in_memstall, the corresponding groupc
* on that cpu is in PSI_MEM_FULL state.
*/
- if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
+ if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
groupc->state_mask = state_mask;
@@ -828,7 +846,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
- if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
+ PSI_ONCPU) {
common = group;
break;
}
--
2.37.2
PSI already tracks workload pressure stall information for CPU, memory
and IO. Apart from these, IRQ/SOFTIRQ can have an obvious impact on the
productivity of some workloads, such as web services.
With CONFIG_IRQ_TIME_ACCOUNTING, we can get the IRQ/SOFTIRQ time delta
from update_rq_clock_task(), where we can record that delta against the
cgroups of the CPU's curr task as PSI_IRQ_FULL time.
Note we don't use PSI_IRQ_SOME, since IRQ/SOFTIRQ always runs in the
context of the current task on the CPU, so nothing productive can run
even if it were runnable; hence we only use PSI_IRQ_FULL.
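With this, /proc/pressure/irq and the per-cgroup irq.pressure file
expose only a "full" line; the numbers below are an illustrative
sample, not captured from a real run:

  full avg10=0.12 avg60=0.05 avg300=0.01 total=123456

Triggers keep the usual "<some|full> <stall us> <window us>" format,
e.g. writing "full 150000 1000000" requests a notification when 150ms
of IRQ full stall accumulates within any 1s window; "some" is rejected
for this resource.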
Signed-off-by: Chengming Zhou <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++
include/linux/psi_types.h | 10 +++-
kernel/cgroup/cgroup.c | 27 +++++++++
kernel/sched/core.c | 1 +
kernel/sched/psi.c | 74 ++++++++++++++++++++++++-
kernel/sched/stats.h | 2 +
6 files changed, 116 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index be4a77baf784..971c418bc778 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,12 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 54cb74946db4..40c28171cd91 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -42,7 +42,10 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
- NR_PSI_RESOURCES = 3,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ,
+#endif
+ NR_PSI_RESOURCES,
};
/*
@@ -58,9 +61,12 @@ enum psi_states {
PSI_MEM_FULL,
PSI_CPU_SOME,
PSI_CPU_FULL,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ_FULL,
+#endif
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 7,
+ NR_PSI_STATES,
};
/* Use one bit in the state mask to track TSK_ONCPU */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2f79ddf9a85d..371131a8b6f8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3731,6 +3731,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+
+ return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5150,6 +5167,16 @@ static struct cftype cgroup_base_files[] = {
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ {
+ .name = "irq.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_irq_pressure_show,
+ .write = cgroup_irq_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
+ },
+#endif
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61436b8e0337..178f9836ae96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+ psi_account_irqtime(rq->curr, irq_delta);
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
	if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4702a770e272..2545a78f82d8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -904,6 +904,36 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+ int cpu = task_cpu(task);
+ void *iter = NULL;
+ struct psi_group *group;
+ struct psi_group_cpu *groupc;
+ u64 now;
+
+ if (!task->pid)
+ return;
+
+ now = cpu_clock(cpu);
+
+ while ((group = iterate_groups(task, &iter))) {
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ write_seqcount_begin(&groupc->seq);
+
+ record_times(groupc, now);
+ groupc->times[PSI_IRQ_FULL] += delta;
+
+ write_seqcount_end(&groupc->seq);
+
+ if (group->poll_states & (1 << PSI_IRQ_FULL))
+ psi_schedule_poll_work(group, 1);
+ }
+}
+#endif
+
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
@@ -1065,6 +1095,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
{
+ bool only_full = false;
int full;
u64 now;
@@ -1079,7 +1110,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);
- for (full = 0; full < 2; full++) {
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ only_full = res == PSI_IRQ;
+#endif
+
+ for (full = 0; full < 2 - only_full; full++) {
unsigned long avg[3] = { 0, };
u64 total = 0;
int w;
@@ -1093,7 +1128,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
}
seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
- full ? "full" : "some",
+ full || only_full ? "full" : "some",
LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
@@ -1121,6 +1156,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
else
return ERR_PTR(-EINVAL);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+ return ERR_PTR(-EINVAL);
+#endif
+
if (state >= PSI_NONIDLE)
return ERR_PTR(-EINVAL);
@@ -1405,6 +1445,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
.proc_release = psi_fop_release,
};
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+ return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+ size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+ .proc_open = psi_irq_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_write = psi_irq_write,
+ .proc_poll = psi_fop_poll,
+ .proc_release = psi_fop_release,
+};
+#endif
+
static int __init psi_proc_init(void)
{
if (psi_enable) {
@@ -1412,6 +1479,9 @@ static int __init psi_proc_init(void)
proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
}
return 0;
}
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..84a188913cc9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -110,6 +110,7 @@ __schedstats_from_se(struct sched_entity *se)
void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
+void psi_account_irqtime(struct task_struct *task, u32 delta);
/*
* PSI tracks state that persists across sleeps, such as iowaits and
@@ -205,6 +206,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
#endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
--
2.37.2
PSI accounts stalls for each cgroup separately and aggregates them at
each level of the hierarchy. This may cause non-negligible overhead for
some workloads under a deep hierarchy.
Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking
configurable") made it possible for PSI to skip per-cgroup stall
accounting and only account system-wide, to avoid this per-level
overhead.
But for our use case, we also want leaf cgroup PSI stats accounted, so
that userspace can make adjustments on that cgroup, in addition to
system-wide adjustments.
So this patch introduces a per-cgroup PSI accounting disable/re-enable
interface, "cgroup.pressure": a read-write single value file whose
allowed values are "0" and "1". The default is "1", so per-cgroup PSI
stats are enabled by default.
Implementation details:
It should be relatively straightforward to disable and re-enable state
aggregation, time tracking and averaging on a per-cgroup level, if we
can live with losing history from while it was disabled, i.e. the avgs
will restart from 0 and total= will have gaps.
But it's hard and complex to stop/restart groupc->tasks[] updates, so
that is not implemented in this patch. Instead, we always update
groupc->tasks[] and the PSI_ONCPU bit in psi_group_change() even when
the cgroup's PSI stats are disabled.
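A usage sketch from userspace (the cgroup path is illustrative):

  # echo 0 > /sys/fs/cgroup/a/cgroup.pressure
      (PSI accounting in /a stops; its {cpu,memory,io,irq}.pressure
       files are hidden)
  # echo 1 > /sys/fs/cgroup/a/cgroup.pressure
      (accounting restarts; avgs restart from 0 and total= has a gap)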
Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 17 ++++++
include/linux/cgroup-defs.h | 3 ++
include/linux/psi.h | 2 +
include/linux/psi_types.h | 1 +
kernel/cgroup/cgroup.c | 70 ++++++++++++++++++++++---
kernel/sched/psi.c | 70 ++++++++++++++++++++++---
6 files changed, 150 insertions(+), 13 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 971c418bc778..4cad4e2b31ec 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,23 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ cgroup.pressure
+ A read-write single value file that allowed values are "0" and "1".
+ The default is "1".
+
+ Writing "0" to the file will disable the cgroup PSI accounting.
+ Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+ This control attribute is not hierarchical, so disable or enable PSI
+ accounting in a cgroup does not affect PSI accounting in descendants
+ and doesn't need pass enablement via ancestors from root.
+
+ The reason this control attribute exists is that PSI accounts stalls for
+ each cgroup separately and aggregates it at each level of the hierarchy.
+ This may cause non-negligible overhead for some workloads when under
+ deep level of the hierarchy, in which case this control attribute can
+ be used to disable PSI accounting in the non-leaf cgroups.
+
irq.pressure
A read-write nested-keyed file.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1283993d7ea8..cfdb74a89c5c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -428,6 +428,9 @@ struct cgroup {
struct cgroup_file procs_file; /* handle for "cgroup.procs" */
struct cgroup_file events_file; /* handle for "cgroup.events" */
+ /* handles for "{cpu,memory,io,irq}.pressure" */
+ struct cgroup_file psi_files[NR_PSI_RESOURCES];
+
/*
* The bitmask of subsystems enabled on the child cgroups.
* ->subtree_control is the one configured through
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 362a74ca1d3b..b029a847def1 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -39,6 +39,7 @@ static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);
+void psi_cgroup_restart(struct psi_group *group);
#endif
#else /* CONFIG_PSI */
@@ -60,6 +61,7 @@ static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
{
rcu_assign_pointer(p->cgroups, to);
}
+static inline void psi_cgroup_restart(struct psi_group *group) {}
#endif
#endif /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index a0b746258c68..ab1f9b463df9 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -152,6 +152,7 @@ struct psi_trigger {
struct psi_group {
struct psi_group *parent;
+ bool enabled;
/* Protects data used by the aggregator */
struct mutex avgs_lock;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1d392c91ef95..454c9e64e079 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3676,8 +3676,8 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
return psi_show(seq, psi, PSI_CPU);
}
-static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
- size_t nbytes, enum psi_res res)
+static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, enum psi_res res)
{
struct cgroup_file_ctx *ctx = of->priv;
struct psi_trigger *new;
@@ -3714,21 +3714,21 @@ static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
+ return pressure_write(of, buf, nbytes, PSI_IO);
}
static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
+ return pressure_write(of, buf, nbytes, PSI_MEM);
}
static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
+ return pressure_write(of, buf, nbytes, PSI_CPU);
}
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -3744,10 +3744,58 @@ static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+ return pressure_write(of, buf, nbytes, PSI_IRQ);
}
#endif
+static int cgroup_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_psi(cgrp);
+
+ seq_printf(seq, "%d\n", psi->enabled);
+
+ return 0;
+}
+
+static ssize_t cgroup_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ ssize_t ret;
+ int enable;
+ struct cgroup *cgrp;
+ struct psi_group *psi;
+
+ ret = kstrtoint(strstrip(buf), 0, &enable);
+ if (ret)
+ return ret;
+
+ if (enable < 0 || enable > 1)
+ return -ERANGE;
+
+ cgrp = cgroup_kn_lock_live(of->kn, false);
+ if (!cgrp)
+ return -ENOENT;
+
+ psi = cgroup_psi(cgrp);
+ if (psi->enabled != enable) {
+ int i;
+
+ /* show or hide {cpu,memory,io,irq}.pressure files */
+ for (i = 0; i < NR_PSI_RESOURCES; i++)
+ cgroup_file_show(&cgrp->psi_files[i], enable);
+
+ psi->enabled = enable;
+ if (enable)
+ psi_cgroup_restart(psi);
+ }
+
+ cgroup_kn_unlock(of->kn);
+
+ return nbytes;
+}
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5146,6 +5194,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "io.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
.seq_show = cgroup_io_pressure_show,
.write = cgroup_io_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5154,6 +5203,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "memory.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
.seq_show = cgroup_memory_pressure_show,
.write = cgroup_memory_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5162,6 +5212,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "cpu.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
.seq_show = cgroup_cpu_pressure_show,
.write = cgroup_cpu_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5171,12 +5222,19 @@ static struct cftype cgroup_base_files[] = {
{
.name = "irq.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]),
.seq_show = cgroup_irq_pressure_show,
.write = cgroup_irq_pressure_write,
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
#endif
+ {
+ .name = "cgroup.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_pressure_show,
+ .write = cgroup_pressure_write,
+ },
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 9a8aee80a087..9711827e31e5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
{
int cpu;
+ group->enabled = true;
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
group->avg_last_update = sched_clock();
@@ -696,17 +697,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
groupc = per_cpu_ptr(group->pcpu, cpu);
/*
- * First we assess the aggregate resource states this CPU's
- * tasks have been in since the last change, and account any
- * SOME and FULL time these may have resulted in.
- *
- * Then we update the task counts according to the state
+ * First we update the task counts according to the state
* change requested through the @clear and @set bits.
+ *
+ * Then if the cgroup PSI stats accounting enabled, we
+ * assess the aggregate resource states this CPU's tasks
+ * have been in since the last change, and account any
+ * SOME and FULL time these may have resulted in.
*/
write_seqcount_begin(&groupc->seq);
- record_times(groupc, now);
-
/*
* Start with TSK_ONCPU, which doesn't have a corresponding
* task count - it's just a boolean flag directly encoded in
@@ -745,6 +745,23 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (set & (1 << t))
groupc->tasks[t]++;
+ if (!group->enabled) {
+ /*
+ * On the first group change after disabling PSI, conclude
+ * the current state and flush its time. This is unlikely
+ * to matter to the user, but aggregation (get_recent_times)
+ * may have already incorporated the live state into times_prev;
+ * avoid a delta sample underflow when PSI is later re-enabled.
+ */
+ if (unlikely(groupc->state_mask & (1 << PSI_NONIDLE)))
+ record_times(groupc, now);
+
+ groupc->state_mask = state_mask;
+
+ write_seqcount_end(&groupc->seq);
+ return;
+ }
+
for (s = 0; s < NR_PSI_STATES; s++) {
if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
state_mask |= (1 << s);
@@ -761,6 +778,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
+ record_times(groupc, now);
+
groupc->state_mask = state_mask;
write_seqcount_end(&groupc->seq);
@@ -907,6 +926,9 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
group = task_psi_group(task);
do {
+ if (!group->enabled)
+ continue;
+
groupc = per_cpu_ptr(group->pcpu, cpu);
write_seqcount_begin(&groupc->seq);
@@ -1080,6 +1102,40 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
task_rq_unlock(rq, task, &rf);
}
+
+void psi_cgroup_restart(struct psi_group *group)
+{
+ int cpu;
+
+ /*
+ * After we disable psi_group->enabled, we don't actually
+ * stop percpu tasks accounting in each psi_group_cpu,
+ * instead only stop test_state() loop, record_times()
+ * and averaging worker, see psi_group_change() for details.
+ *
+ * When disable cgroup PSI, this function has nothing to sync
+ * since cgroup pressure files are hidden and percpu psi_group_cpu
+ * would see !psi_group->enabled and only do task accounting.
+ *
+ * When re-enable cgroup PSI, this function use psi_group_change()
+ * to get correct state mask from test_state() loop on tasks[],
+ * and restart groupc->state_start from now, use .clear = .set = 0
+ * here since no task status really changed.
+ */
+ if (!group->enabled)
+ return;
+
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+ struct rq_flags rf;
+ u64 now;
+
+ rq_lock_irq(rq, &rf);
+ now = cpu_clock(cpu);
+ psi_group_change(group, cpu, 0, 0, now, true);
+ rq_unlock_irq(rq, &rf);
+ }
+}
#endif /* CONFIG_CGROUPS */
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
--
2.37.2
This patch moves the psi_task_change()/psi_task_switch() declarations
out of the public PSI header, since they are only needed to implement
the PSI stats tracking in sched/stats.h.
psi_task_switch() is obviously private; psi_task_change() can't be a
public helper either, since it doesn't check the psi_disabled static
key. And it has no external users now, so put it in sched/stats.h too.
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
include/linux/psi.h | 4 ----
kernel/sched/stats.h | 4 ++++
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index dd74411ac21d..fffd229fbf19 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -18,10 +18,6 @@ extern struct psi_group psi_system;
void psi_init(void);
-void psi_task_change(struct task_struct *task, int clear, int set);
-void psi_task_switch(struct task_struct *prev, struct task_struct *next,
- bool sleep);
-
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index baa839c1ba96..c39b467ece43 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se)
}
#ifdef CONFIG_PSI
+void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+ bool sleep);
+
/*
* PSI tracks state that persists across sleeps, such as iowaits and
* memory stalls. As a result, it has to distinguish between sleeps,
--
2.37.2
On Fri, Aug 26, 2022 at 12:41:06AM +0800, Chengming Zhou wrote:
> [...]
Acked-by: Johannes Weiner <[email protected]>
On Fri, Aug 26, 2022 at 12:41:08AM +0800, Chengming Zhou wrote:
> [...]
Acked-by: Johannes Weiner <[email protected]>
Hello,
Could this series be merged into linux-next?
Thanks.
On 2022/8/26 00:41, Chengming Zhou wrote:
> [...]
On Tue, Sep 06, 2022 at 09:13:27PM +0800, Chengming Zhou wrote:
Ah, I see Johannes has acked them all, I missed that.
For future reference:
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
Note all patches violate 1.2.2 by not starting the patch description
with an uppercase letter. I'll go manually fix it up this time.
On 2022/9/6 22:43, Peter Zijlstra wrote:
> On Tue, Sep 06, 2022 at 09:13:27PM +0800, Chengming Zhou wrote:
>
> Ah, I see Johannes has acked them all, I missed that.
>
> For future reference:
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
>
> Note all patches violate 1.2.2 by not starting the patch description
> with an uppercase letter. I'll go manually fix it up this time.
Sorry about that, thanks for the reference and your manual fix up!
PSI accounts stalls for each cgroup separately and aggregates them
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads under a deep hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
made it possible for PSI to skip per-cgroup stall accounting and only
account system-wide, to avoid this per-level overhead.
But for our use case, we also want the leaf cgroup's PSI stats accounted,
so that userspace can make adjustments on that cgroup, not just
system-wide ones.
So this patch introduces a per-cgroup PSI accounting disable/re-enable
interface, "cgroup.pressure": a read-write single value file whose
allowed values are "0" and "1". The default is "1", so per-cgroup
PSI stats are enabled by default.
Implementation details:
It should be relatively straightforward to disable and re-enable
state aggregation, time tracking and averaging on a per-cgroup level,
if we can live with losing the history from while it was disabled,
i.e. the avgs will restart from 0 and total= will have gaps.
But stopping/restarting the groupc->tasks[] updates would be hard or
complex, so that is not implemented in this patch: we always update
groupc->tasks[] and the PSI_ONCPU bit in psi_group_change(), even when
the cgroup's PSI stats are disabled.
Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
Updated version to fix build error when !CONFIG_PSI.
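Not part of the patch, just for illustration: a minimal userspace sketch
of toggling the new interface. The mount point and cgroup name
("/sys/fs/cgroup/test") are assumptions, not anything this patch creates:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/*
	 * Toggle PSI accounting in one cgroup by writing "0" or "1" to
	 * its cgroup.pressure file; the kernel hides or unhides that
	 * cgroup's {cpu,memory,io,irq}.pressure files accordingly.
	 */
	static int set_cgroup_pressure(const char *cgrp, const char *val)
	{
		char path[4096];
		int fd;
		ssize_t ret;

		snprintf(path, sizeof(path), "%s/cgroup.pressure", cgrp);
		fd = open(path, O_WRONLY);
		if (fd < 0) {
			perror("open");
			return -1;
		}
		ret = write(fd, val, strlen(val));
		close(fd);
		return ret < 0 ? -1 : 0;
	}

	int main(void)
	{
		/* hypothetical leaf cgroup under a cgroup2 mount */
		return set_cgroup_pressure("/sys/fs/cgroup/test", "0") ? 1 : 0;
	}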
---
Documentation/admin-guide/cgroup-v2.rst | 17 ++++++
include/linux/cgroup-defs.h | 3 ++
include/linux/psi.h | 2 +
include/linux/psi_types.h | 3 ++
kernel/cgroup/cgroup.c | 70 ++++++++++++++++++++++---
kernel/sched/psi.c | 70 ++++++++++++++++++++++---
6 files changed, 152 insertions(+), 13 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 971c418bc778..4cad4e2b31ec 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,23 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ cgroup.pressure
+ A read-write single value file whose allowed values are "0" and "1".
+ The default is "1".
+
+ Writing "0" to the file will disable the cgroup PSI accounting.
+ Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+ This control attribute is not hierarchical, so disabling or enabling
+ PSI accounting in a cgroup does not affect PSI accounting in its
+ descendants, nor does enablement need to be passed down from the root.
+
+ The reason this control attribute exists is that PSI accounts stalls
+ for each cgroup separately and aggregates them at each level of the
+ hierarchy. This may cause non-negligible overhead for some workloads
+ under a deep hierarchy, in which case this control attribute can be
+ used to disable PSI accounting in the non-leaf cgroups.
+
irq.pressure
A read-write nested-keyed file.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 4bcf56b3491c..7df76b318245 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -428,6 +428,9 @@ struct cgroup {
struct cgroup_file procs_file; /* handle for "cgroup.procs" */
struct cgroup_file events_file; /* handle for "cgroup.events" */
+ /* handles for "{cpu,memory,io,irq}.pressure" */
+ struct cgroup_file psi_files[NR_PSI_RESOURCES];
+
/*
* The bitmask of subsystems enabled on the child cgroups.
* ->subtree_control is the one configured through
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 362a74ca1d3b..b029a847def1 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -39,6 +39,7 @@ static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);
+void psi_cgroup_restart(struct psi_group *group);
#endif
#else /* CONFIG_PSI */
@@ -60,6 +61,7 @@ static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
{
rcu_assign_pointer(p->cgroups, to);
}
+static inline void psi_cgroup_restart(struct psi_group *group) {}
#endif
#endif /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index a0b746258c68..6e4372735068 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -152,6 +152,7 @@ struct psi_trigger {
struct psi_group {
struct psi_group *parent;
+ bool enabled;
/* Protects data used by the aggregator */
struct mutex avgs_lock;
@@ -194,6 +195,8 @@ struct psi_group {
#else /* CONFIG_PSI */
+#define NR_PSI_RESOURCES 0
+
struct psi_group { };
#endif /* CONFIG_PSI */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 4f72a71820db..0bca8f29361d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3708,8 +3708,8 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
return psi_show(seq, psi, PSI_CPU);
}
-static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
- size_t nbytes, enum psi_res res)
+static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, enum psi_res res)
{
struct cgroup_file_ctx *ctx = of->priv;
struct psi_trigger *new;
@@ -3746,21 +3746,21 @@ static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
+ return pressure_write(of, buf, nbytes, PSI_IO);
}
static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
+ return pressure_write(of, buf, nbytes, PSI_MEM);
}
static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
+ return pressure_write(of, buf, nbytes, PSI_CPU);
}
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -3776,10 +3776,58 @@ static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+ return pressure_write(of, buf, nbytes, PSI_IRQ);
}
#endif
+static int cgroup_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_psi(cgrp);
+
+ seq_printf(seq, "%d\n", psi->enabled);
+
+ return 0;
+}
+
+static ssize_t cgroup_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ ssize_t ret;
+ int enable;
+ struct cgroup *cgrp;
+ struct psi_group *psi;
+
+ ret = kstrtoint(strstrip(buf), 0, &enable);
+ if (ret)
+ return ret;
+
+ if (enable < 0 || enable > 1)
+ return -ERANGE;
+
+ cgrp = cgroup_kn_lock_live(of->kn, false);
+ if (!cgrp)
+ return -ENOENT;
+
+ psi = cgroup_psi(cgrp);
+ if (psi->enabled != enable) {
+ int i;
+
+ /* show or hide {cpu,memory,io,irq}.pressure files */
+ for (i = 0; i < NR_PSI_RESOURCES; i++)
+ cgroup_file_show(&cgrp->psi_files[i], enable);
+
+ psi->enabled = enable;
+ if (enable)
+ psi_cgroup_restart(psi);
+ }
+
+ cgroup_kn_unlock(of->kn);
+
+ return nbytes;
+}
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5155,6 +5203,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "io.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
.seq_show = cgroup_io_pressure_show,
.write = cgroup_io_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5163,6 +5212,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "memory.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
.seq_show = cgroup_memory_pressure_show,
.write = cgroup_memory_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5171,6 +5221,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "cpu.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
.seq_show = cgroup_cpu_pressure_show,
.write = cgroup_cpu_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5180,12 +5231,19 @@ static struct cftype cgroup_base_files[] = {
{
.name = "irq.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]),
.seq_show = cgroup_irq_pressure_show,
.write = cgroup_irq_pressure_write,
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
#endif
+ {
+ .name = "cgroup.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_pressure_show,
+ .write = cgroup_pressure_write,
+ },
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 9a8aee80a087..9711827e31e5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
{
int cpu;
+ group->enabled = true;
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
group->avg_last_update = sched_clock();
@@ -696,17 +697,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
groupc = per_cpu_ptr(group->pcpu, cpu);
/*
- * First we assess the aggregate resource states this CPU's
- * tasks have been in since the last change, and account any
- * SOME and FULL time these may have resulted in.
- *
- * Then we update the task counts according to the state
+ * First we update the task counts according to the state
* change requested through the @clear and @set bits.
+ *
+ * Then, if cgroup PSI stats accounting is enabled, we
+ * assess the aggregate resource states this CPU's tasks
+ * have been in since the last change, and account any
+ * SOME and FULL time these may have resulted in.
*/
write_seqcount_begin(&groupc->seq);
- record_times(groupc, now);
-
/*
* Start with TSK_ONCPU, which doesn't have a corresponding
* task count - it's just a boolean flag directly encoded in
@@ -745,6 +745,23 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (set & (1 << t))
groupc->tasks[t]++;
+ if (!group->enabled) {
+ /*
+ * On the first group change after disabling PSI, conclude
+ * the current state and flush its time. This is unlikely
+ * to matter to the user, but aggregation (get_recent_times)
+ * may have already incorporated the live state into times_prev;
+ * avoid a delta sample underflow when PSI is later re-enabled.
+ */
+ if (unlikely(groupc->state_mask & (1 << PSI_NONIDLE)))
+ record_times(groupc, now);
+
+ groupc->state_mask = state_mask;
+
+ write_seqcount_end(&groupc->seq);
+ return;
+ }
+
for (s = 0; s < NR_PSI_STATES; s++) {
if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
state_mask |= (1 << s);
@@ -761,6 +778,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
+ record_times(groupc, now);
+
groupc->state_mask = state_mask;
write_seqcount_end(&groupc->seq);
@@ -907,6 +926,9 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
group = task_psi_group(task);
do {
+ if (!group->enabled)
+ continue;
+
groupc = per_cpu_ptr(group->pcpu, cpu);
write_seqcount_begin(&groupc->seq);
@@ -1080,6 +1102,40 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
task_rq_unlock(rq, task, &rf);
}
+
+void psi_cgroup_restart(struct psi_group *group)
+{
+ int cpu;
+
+ /*
+ * After psi_group->enabled is cleared, we don't actually stop
+ * the per-CPU task accounting in each psi_group_cpu; we only
+ * stop the test_state() loop, record_times() and the averaging
+ * worker, see psi_group_change() for details.
+ *
+ * When disabling cgroup PSI, this function has nothing to sync,
+ * since the cgroup pressure files are hidden and each per-CPU
+ * psi_group_cpu sees !psi_group->enabled and only does task
+ * accounting.
+ *
+ * When re-enabling cgroup PSI, this function uses psi_group_change()
+ * to get the correct state mask from the test_state() loop over
+ * tasks[], restarting groupc->state_start from now; we pass
+ * .clear = .set = 0 since no task state really changed.
+ */
+ if (!group->enabled)
+ return;
+
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+ struct rq_flags rf;
+ u64 now;
+
+ rq_lock_irq(rq, &rf);
+ now = cpu_clock(cpu);
+ psi_group_change(group, cpu, 0, 0, now, true);
+ rq_unlock_irq(rq, &rf);
+ }
+}
#endif /* CONFIG_CGROUPS */
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
--
2.37.2
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: 58d8c2586cedb8a67f6f0dffa5eaed0f89135b39
Gitweb: https://git.kernel.org/tip/58d8c2586cedb8a67f6f0dffa5eaed0f89135b39
Author: Chengming Zhou <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:03 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:31 +02:00
sched/psi: Don't create cgroup PSI files when psi_disabled
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
made it possible to configure PSI to skip per-cgroup stall accounting,
in which case the PSI files are not exposed in the cgroup hierarchy.
This patch does the same thing when psi_disabled.
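For context, pressure files are already filtered out at cftype-add time
based on cgroup_psi_enabled(); roughly, a paraphrase of the existing
check in cgroup_addrm_files() from commit 3958e2d0c34e (that check is
pre-existing, not part of this patch):

	/* in cgroup_addrm_files(): skip pressure files when PSI is off */
	if ((cft->flags & CFTYPE_PRESSURE) && !cgroup_psi_enabled())
		continue;

With the hunk below, cgroup_psi_enabled() also returns false on psi=0
boots, so the same filter now covers the psi_disabled case too.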
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/cgroup/cgroup.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 718a70c..96aefdb 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3780,6 +3780,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
bool cgroup_psi_enabled(void)
{
+ if (static_branch_likely(&psi_disabled))
+ return false;
+
return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
}
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: 57899a6610e67ba26fa3251ebbef4a5ed21efc5d
Gitweb: https://git.kernel.org/tip/57899a6610e67ba26fa3251ebbef4a5ed21efc5d
Author: Chengming Zhou <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:09 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:33 +02:00
sched/psi: Consolidate cgroup_psi()
cgroup_psi() can't return the psi_group for the root cgroup, so we have
the open-coded pattern "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi"
in many places.
This patch moves the cgroup_psi() definition to <linux/psi.h>, where it
can return psi_system for the root cgroup and thus handle all cgroups.
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/cgroup.h | 5 -----
include/linux/psi.h | 6 ++++++
kernel/cgroup/cgroup.c | 10 +++++-----
3 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b0914aa..80cb970 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -673,11 +673,6 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
}
-static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
-{
- return cgrp->psi;
-}
-
bool cgroup_psi_enabled(void);
static inline void cgroup_init_kthreadd(void)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index fffd229..362a74c 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/poll.h>
#include <linux/cgroup-defs.h>
+#include <linux/cgroup.h>
struct seq_file;
struct css_set;
@@ -30,6 +31,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
poll_table *wait);
#ifdef CONFIG_CGROUPS
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+}
+
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b46d39b..772b35d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3689,21 +3689,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_IO);
}
static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_MEM);
}
static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_CPU);
}
@@ -3729,7 +3729,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
return -EBUSY;
}
- psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ psi = cgroup_psi(cgrp);
new = psi_trigger_create(psi, buf, res);
if (IS_ERR(new)) {
cgroup_put(cgrp);
@@ -3767,7 +3767,7 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_IRQ);
}
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: dc86aba751e2867244411adda1562f6664747019
Gitweb: https://git.kernel.org/tip/dc86aba751e2867244411adda1562f6664747019
Author: Chengming Zhou <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:10 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:33 +02:00
sched/psi: Cache parent psi_group to speed up group iteration
We use iterate_groups() to walk each level's psi_group when updating
PSI stats, which is a very hot path.
In the current code, iterate_groups() has to use multiple branches and
cgroup_parent() to find the parent psi_group at each level, which is
not very efficient.
This patch caches the parent psi_group in struct psi_group: we only
need to look up the psi_group of the task itself once, then simply
follow group->parent to iterate.
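The resulting hot-path walk is a plain parent-pointer chase; a minimal
sketch of the pattern the diff below converts the update paths to (one
pointer load per level, no cgroup_parent() or root special-casing):

	struct psi_group *group = task_psi_group(task);

	do {
		psi_group_change(group, cpu, clear, set, now, true);
	} while ((group = group->parent));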
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/psi_types.h | 2 ++-
kernel/sched/psi.c | 49 ++++++++++++++------------------------
2 files changed, 21 insertions(+), 30 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 40c2817..a0b7462 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -151,6 +151,8 @@ struct psi_trigger {
};
struct psi_group {
+ struct psi_group *parent;
+
/* Protects data used by the aggregator */
struct mutex avgs_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2545a78..9a8aee8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -772,27 +772,12 @@ static void psi_group_change(struct psi_group *group, int cpu,
schedule_delayed_work(&group->avgs_work, PSI_FREQ);
}
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
{
- if (*iter == &psi_system)
- return NULL;
-
#ifdef CONFIG_CGROUPS
- if (static_branch_likely(&psi_cgroups_enabled)) {
- struct cgroup *cgroup = NULL;
-
- if (!*iter)
- cgroup = task->cgroups->dfl_cgrp;
- else
- cgroup = cgroup_parent(*iter);
-
- if (cgroup && cgroup_parent(cgroup)) {
- *iter = cgroup;
- return cgroup_psi(cgroup);
- }
- }
+ if (static_branch_likely(&psi_cgroups_enabled))
+ return cgroup_psi(task_dfl_cgroup(task));
#endif
- *iter = &psi_system;
return &psi_system;
}
@@ -815,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- void *iter = NULL;
u64 now;
if (!task->pid)
@@ -825,8 +809,10 @@ void psi_task_change(struct task_struct *task, int clear, int set)
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter)))
+ group = task_psi_group(task);
+ do {
psi_group_change(group, cpu, clear, set, now, true);
+ } while ((group = group->parent));
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -834,7 +820,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
{
struct psi_group *group, *common = NULL;
int cpu = task_cpu(prev);
- void *iter;
u64 now = cpu_clock(cpu);
if (next->pid) {
@@ -844,8 +829,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
* ancestors with @prev, those will already have @prev's
* TSK_ONCPU bit set, and we can stop the iteration there.
*/
- iter = NULL;
- while ((group = iterate_groups(next, &iter))) {
+ group = task_psi_group(next);
+ do {
if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
PSI_ONCPU) {
common = group;
@@ -853,7 +838,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
- }
+ } while ((group = group->parent));
}
if (prev->pid) {
@@ -886,9 +871,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_flags_change(prev, clear, set);
- iter = NULL;
- while ((group = iterate_groups(prev, &iter)) && group != common)
+ group = task_psi_group(prev);
+ do {
+ if (group == common)
+ break;
psi_group_change(group, cpu, clear, set, now, wake_clock);
+ } while ((group = group->parent));
/*
* TSK_ONCPU is handled up to the common ancestor. If there are
@@ -898,7 +886,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
clear &= ~TSK_ONCPU;
- for (; group; group = iterate_groups(prev, &iter))
+ for (; group; group = group->parent)
psi_group_change(group, cpu, clear, set, now, wake_clock);
}
}
@@ -908,7 +896,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
void psi_account_irqtime(struct task_struct *task, u32 delta)
{
int cpu = task_cpu(task);
- void *iter = NULL;
struct psi_group *group;
struct psi_group_cpu *groupc;
u64 now;
@@ -918,7 +905,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter))) {
+ group = task_psi_group(task);
+ do {
groupc = per_cpu_ptr(group->pcpu, cpu);
write_seqcount_begin(&groupc->seq);
@@ -930,7 +918,7 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
if (group->poll_states & (1 << PSI_IRQ_FULL))
psi_schedule_poll_work(group, 1);
- }
+ } while ((group = group->parent));
}
#endif
@@ -1010,6 +998,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
return -ENOMEM;
}
group_init(cgroup->psi);
+ cgroup->psi->parent = cgroup_psi(cgroup_parent(cgroup));
return 0;
}
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: c530a3c716b963625e43aa915e0de6b4d1ce8ad9
Gitweb: https://git.kernel.org/tip/c530a3c716b963625e43aa915e0de6b4d1ce8ad9
Author: Chengming Zhou <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:02 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:30 +02:00
sched/psi: Fix periodic aggregation shut off
We don't want to wake periodic aggregation work back up if the
task change is the aggregation worker itself going to sleep, or
we'll ping-pong forever.
Previously, we would use psi_task_change() in psi_dequeue() when a
task was going to sleep, so this check was put in psi_task_change().
But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred task sleep handling to psi_task_switch(), which doesn't go
through psi_task_change() anymore.
So this patch moves the check to psi_task_switch().
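For reference, the moved check as it appears in the diff below; it now
tests @prev in the sleep branch of psi_task_switch(), the only path a
sleeping task takes after commit 4117cebf1a9f:

	/* the ping-pong this guards against, schematically:
	 *
	 *   psi_avgs_work() finds no task changes and lets the clock
	 *   expire -> the worker itself dequeues (a task change!) ->
	 *   psi_group_change(..., wake_clock=true) re-arms avgs_work
	 *   -> repeat forever, even on an otherwise quiet system.
	 */
	if (unlikely((prev->flags & PF_WQ_WORKER) &&
		     wq_worker_last_func(prev) == psi_avgs_work))
		wake_clock = false;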
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/psi.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ecb4b4f..39463dc 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,7 +796,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- bool wake_clock = true;
void *iter = NULL;
u64 now;
@@ -806,19 +805,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
psi_flags_change(task, clear, set);
now = cpu_clock(cpu);
- /*
- * Periodic aggregation shuts off if there is a period of no
- * task changes, so we wake it back up if necessary. However,
- * don't do this if the task change is the aggregation worker
- * itself going to sleep, or we'll ping-pong forever.
- */
- if (unlikely((clear & TSK_RUNNING) &&
- (task->flags & PF_WQ_WORKER) &&
- wq_worker_last_func(task) == psi_avgs_work))
- wake_clock = false;
while ((group = iterate_groups(task, &iter)))
- psi_group_change(group, cpu, clear, set, now, wake_clock);
+ psi_group_change(group, cpu, clear, set, now, true);
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -854,6 +843,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (prev->pid) {
int clear = TSK_ONCPU, set = 0;
+ bool wake_clock = true;
/*
* When we're going to sleep, psi_dequeue() lets us
@@ -867,13 +857,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
clear |= TSK_MEMSTALL_RUNNING;
if (prev->in_iowait)
set |= TSK_IOWAIT;
+
+ /*
+ * Periodic aggregation shuts off if there is a period of no
+ * task changes, so we wake it back up if necessary. However,
+ * don't do this if the task change is the aggregation worker
+ * itself going to sleep, or we'll ping-pong forever.
+ */
+ if (unlikely((prev->flags & PF_WQ_WORKER) &&
+ wq_worker_last_func(prev) == psi_avgs_work))
+ wake_clock = false;
}
psi_flags_change(prev, clear, set);
iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
/*
* TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -882,7 +882,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (sleep) {
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
}
}
}
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: 34f26a15611afb03c33df6819359d36f5b382589
Gitweb: https://git.kernel.org/tip/34f26a15611afb03c33df6819359d36f5b382589
Author: Chengming Zhou <[email protected]>
AuthorDate: Wed, 07 Sep 2022 17:03:32 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:33 +02:00
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates them
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads under a deep hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
made it possible for PSI to skip per-cgroup stall accounting and only
account system-wide, to avoid this per-level overhead.
But for our use case, we also want the leaf cgroup's PSI stats accounted,
so that userspace can make adjustments on that cgroup, not just
system-wide ones.
So this patch introduces a per-cgroup PSI accounting disable/re-enable
interface, "cgroup.pressure": a read-write single value file whose
allowed values are "0" and "1". The default is "1", so per-cgroup
PSI stats are enabled by default.
Implementation details:
It should be relatively straightforward to disable and re-enable
state aggregation, time tracking and averaging on a per-cgroup level,
if we can live with losing the history from while it was disabled,
i.e. the avgs will restart from 0 and total= will have gaps.
But stopping/restarting the groupc->tasks[] updates would be hard or
complex, so that is not implemented in this patch: we always update
groupc->tasks[] and the PSI_ONCPU bit in psi_group_change(), even when
the cgroup's PSI stats are disabled.
Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
Documentation/admin-guide/cgroup-v2.rst | 17 ++++++-
include/linux/cgroup-defs.h | 3 +-
include/linux/psi.h | 2 +-
include/linux/psi_types.h | 3 +-
kernel/cgroup/cgroup.c | 70 +++++++++++++++++++++---
kernel/sched/psi.c | 70 +++++++++++++++++++++---
6 files changed, 152 insertions(+), 13 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 971c418..4cad4e2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,23 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ cgroup.pressure
+ A read-write single value file whose allowed values are "0" and "1".
+ The default is "1".
+
+ Writing "0" to the file will disable the cgroup PSI accounting.
+ Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+ This control attribute is not hierarchical, so disabling or enabling
+ PSI accounting in a cgroup does not affect PSI accounting in its
+ descendants, nor does enablement need to be passed down from the root.
+
+ The reason this control attribute exists is that PSI accounts stalls
+ for each cgroup separately and aggregates them at each level of the
+ hierarchy. This may cause non-negligible overhead for some workloads
+ under a deep hierarchy, in which case this control attribute can be
+ used to disable PSI accounting in the non-leaf cgroups.
+
irq.pressure
A read-write nested-keyed file.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 4bcf56b..7df76b3 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -428,6 +428,9 @@ struct cgroup {
struct cgroup_file procs_file; /* handle for "cgroup.procs" */
struct cgroup_file events_file; /* handle for "cgroup.events" */
+ /* handles for "{cpu,memory,io,irq}.pressure" */
+ struct cgroup_file psi_files[NR_PSI_RESOURCES];
+
/*
* The bitmask of subsystems enabled on the child cgroups.
* ->subtree_control is the one configured through
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 362a74c..b029a84 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -39,6 +39,7 @@ static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);
+void psi_cgroup_restart(struct psi_group *group);
#endif
#else /* CONFIG_PSI */
@@ -60,6 +61,7 @@ static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
{
rcu_assign_pointer(p->cgroups, to);
}
+static inline void psi_cgroup_restart(struct psi_group *group) {}
#endif
#endif /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index a0b7462..6e43727 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -152,6 +152,7 @@ struct psi_trigger {
struct psi_group {
struct psi_group *parent;
+ bool enabled;
/* Protects data used by the aggregator */
struct mutex avgs_lock;
@@ -194,6 +195,8 @@ struct psi_group {
#else /* CONFIG_PSI */
+#define NR_PSI_RESOURCES 0
+
struct psi_group { };
#endif /* CONFIG_PSI */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 772b35d..fa1cf83 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3708,8 +3708,8 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
return psi_show(seq, psi, PSI_CPU);
}
-static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
- size_t nbytes, enum psi_res res)
+static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, enum psi_res res)
{
struct cgroup_file_ctx *ctx = of->priv;
struct psi_trigger *new;
@@ -3746,21 +3746,21 @@ static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
+ return pressure_write(of, buf, nbytes, PSI_IO);
}
static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
+ return pressure_write(of, buf, nbytes, PSI_MEM);
}
static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
+ return pressure_write(of, buf, nbytes, PSI_CPU);
}
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -3776,10 +3776,58 @@ static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
- return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+ return pressure_write(of, buf, nbytes, PSI_IRQ);
}
#endif
+static int cgroup_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_psi(cgrp);
+
+ seq_printf(seq, "%d\n", psi->enabled);
+
+ return 0;
+}
+
+static ssize_t cgroup_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ ssize_t ret;
+ int enable;
+ struct cgroup *cgrp;
+ struct psi_group *psi;
+
+ ret = kstrtoint(strstrip(buf), 0, &enable);
+ if (ret)
+ return ret;
+
+ if (enable < 0 || enable > 1)
+ return -ERANGE;
+
+ cgrp = cgroup_kn_lock_live(of->kn, false);
+ if (!cgrp)
+ return -ENOENT;
+
+ psi = cgroup_psi(cgrp);
+ if (psi->enabled != enable) {
+ int i;
+
+ /* show or hide {cpu,memory,io,irq}.pressure files */
+ for (i = 0; i < NR_PSI_RESOURCES; i++)
+ cgroup_file_show(&cgrp->psi_files[i], enable);
+
+ psi->enabled = enable;
+ if (enable)
+ psi_cgroup_restart(psi);
+ }
+
+ cgroup_kn_unlock(of->kn);
+
+ return nbytes;
+}
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5175,6 +5223,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "io.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
.seq_show = cgroup_io_pressure_show,
.write = cgroup_io_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5183,6 +5232,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "memory.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
.seq_show = cgroup_memory_pressure_show,
.write = cgroup_memory_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5191,6 +5241,7 @@ static struct cftype cgroup_base_files[] = {
{
.name = "cpu.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
.seq_show = cgroup_cpu_pressure_show,
.write = cgroup_cpu_pressure_write,
.poll = cgroup_pressure_poll,
@@ -5200,12 +5251,19 @@ static struct cftype cgroup_base_files[] = {
{
.name = "irq.pressure",
.flags = CFTYPE_PRESSURE,
+ .file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]),
.seq_show = cgroup_irq_pressure_show,
.write = cgroup_irq_pressure_write,
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
#endif
+ {
+ .name = "cgroup.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_pressure_show,
+ .write = cgroup_pressure_write,
+ },
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 9a8aee8..9711827 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
{
int cpu;
+ group->enabled = true;
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
group->avg_last_update = sched_clock();
@@ -696,17 +697,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
groupc = per_cpu_ptr(group->pcpu, cpu);
/*
- * First we assess the aggregate resource states this CPU's
- * tasks have been in since the last change, and account any
- * SOME and FULL time these may have resulted in.
- *
- * Then we update the task counts according to the state
+ * First we update the task counts according to the state
* change requested through the @clear and @set bits.
+ *
+ * Then, if cgroup PSI stats accounting is enabled, we
+ * assess the aggregate resource states this CPU's tasks
+ * have been in since the last change, and account any
+ * SOME and FULL time these may have resulted in.
*/
write_seqcount_begin(&groupc->seq);
- record_times(groupc, now);
-
/*
* Start with TSK_ONCPU, which doesn't have a corresponding
* task count - it's just a boolean flag directly encoded in
@@ -745,6 +745,23 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (set & (1 << t))
groupc->tasks[t]++;
+ if (!group->enabled) {
+ /*
+ * On the first group change after disabling PSI, conclude
+ * the current state and flush its time. This is unlikely
+ * to matter to the user, but aggregation (get_recent_times)
+ * may have already incorporated the live state into times_prev;
+ * avoid a delta sample underflow when PSI is later re-enabled.
+ */
+ if (unlikely(groupc->state_mask & (1 << PSI_NONIDLE)))
+ record_times(groupc, now);
+
+ groupc->state_mask = state_mask;
+
+ write_seqcount_end(&groupc->seq);
+ return;
+ }
+
for (s = 0; s < NR_PSI_STATES; s++) {
if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
state_mask |= (1 << s);
@@ -761,6 +778,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
+ record_times(groupc, now);
+
groupc->state_mask = state_mask;
write_seqcount_end(&groupc->seq);
@@ -907,6 +926,9 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
group = task_psi_group(task);
do {
+ if (!group->enabled)
+ continue;
+
groupc = per_cpu_ptr(group->pcpu, cpu);
write_seqcount_begin(&groupc->seq);
@@ -1080,6 +1102,40 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
task_rq_unlock(rq, task, &rf);
}
+
+void psi_cgroup_restart(struct psi_group *group)
+{
+ int cpu;
+
+ /*
+ * After psi_group->enabled is cleared, we don't actually stop
+ * the per-CPU task accounting in each psi_group_cpu; we only
+ * stop the test_state() loop, record_times() and the averaging
+ * worker, see psi_group_change() for details.
+ *
+ * When disabling cgroup PSI, this function has nothing to sync,
+ * since the cgroup pressure files are hidden and each per-CPU
+ * psi_group_cpu sees !psi_group->enabled and only does task
+ * accounting.
+ *
+ * When re-enabling cgroup PSI, this function uses psi_group_change()
+ * to get the correct state mask from the test_state() loop over
+ * tasks[], restarting groupc->state_start from now; we pass
+ * .clear = .set = 0 since no task state really changed.
+ */
+ if (!group->enabled)
+ return;
+
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+ struct rq_flags rf;
+ u64 now;
+
+ rq_lock_irq(rq, &rf);
+ now = cpu_clock(cpu);
+ psi_group_change(group, cpu, 0, 0, now, true);
+ rq_unlock_irq(rq, &rf);
+ }
+}
#endif /* CONFIG_CGROUPS */
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: 65176f59a18d888684525658a1d0b8bf749d24f3
Gitweb: https://git.kernel.org/tip/65176f59a18d888684525658a1d0b8bf749d24f3
Author: Chengming Zhou <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:06 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:32 +02:00
sched/psi: Optimize task switch inside shared cgroups again
Way back when PSI_MEM_FULL was accounted from the timer tick, task
switching could simply iterate next and prev to the common ancestor to
update TSK_ONCPU and be done.
Then memstall ticks were replaced with checking curr->in_memstall
directly in psi_group_change(). That meant that now if the task switch
was between a memstall and a !memstall task, we had to iterate through
the common ancestors at least ONCE to fix up their state_masks.
We added the identical_state filter to make sure the common ancestor
elimination was skipped in that case. It seems that was always a
little too eager, because it caused us to walk the common ancestors
*twice* instead of the required once: the iteration for next could
have stopped at the common ancestor; prev could have updated TSK_ONCPU
up to the common ancestor, then finish to the root without changing
any flags, just to get the new curr->in_memstall into the state_masks.
This patch recognizes this and makes it so that we walk to the root
exactly once if state_mask needs updating, which is simply catching up
on a missed optimization that could have been done in commit 7fae6c8171d2
("psi: Use ONCPU state tracking machinery to detect reclaim") directly.
Apart from this, it's also necessary for the next patch "sched/psi: remove
NR_ONCPU task accounting". Suppose we walk the common ancestors twice:
(1) psi_group_change(.clear = 0, .set = TSK_ONCPU)
(2) psi_group_change(.clear = TSK_ONCPU, .set = 0)
We previously used tasks[NR_ONCPU] to record TSK_ONCPU: tasks[NR_ONCPU]++
in (1) and then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] would still
be correct.
The next patch changes this to use one bit in the state mask to record
TSK_ONCPU: the PSI_ONCPU bit would be set in (1) but then cleared in (2),
leaving the psi_group_cpu with a task running on the CPU but without the
PSI_ONCPU bit set!
With this patch, we never walk the common ancestors twice, so we won't
have the above problem.
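Schematically, what a double walk of a shared ancestor would do once
TSK_ONCPU lives in the state mask; a sketch of the failure mode the
single walk avoids, with calls as in psi_task_switch():

	/* (1) @next's walk sets the bit on the shared ancestor: */
	psi_group_change(common, cpu, 0, TSK_ONCPU, now, true);
		/* state_mask |= PSI_ONCPU */

	/* (2) @prev's walk then clears it again on the way to the root: */
	psi_group_change(common, cpu, TSK_ONCPU, 0, now, true);
		/* state_mask &= ~PSI_ONCPU */

	/* net: @next is on this CPU, but the ancestor's PSI_ONCPU is 0 */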
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/psi.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 77d53c0..d71dbc2 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -820,20 +820,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
u64 now = cpu_clock(cpu);
if (next->pid) {
- bool identical_state;
-
psi_flags_change(next, 0, TSK_ONCPU);
/*
- * When switching between tasks that have an identical
- * runtime state, the cgroup that contains both tasks
- * does not change: we can stop updating the tree once
- * we reach the first common ancestor. Iterate @next's
- * ancestors only until we encounter @prev's ONCPU.
+ * Set TSK_ONCPU on @next's cgroups. If @next shares any
+ * ancestors with @prev, those will already have @prev's
+ * TSK_ONCPU bit set, and we can stop the iteration there.
*/
- identical_state = prev->psi_flags == next->psi_flags;
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
- if (identical_state &&
- per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
common = group;
break;
}
@@ -877,10 +872,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_group_change(group, cpu, clear, set, now, wake_clock);
/*
- * TSK_ONCPU is handled up to the common ancestor. If we're tasked
- * with dequeuing too, finish that for the rest of the hierarchy.
+ * TSK_ONCPU is handled up to the common ancestor. If there are
+ * any other differences between the two tasks (e.g. prev goes
+ * to sleep, or only one task is memstall), finish propagating
+ * those differences all the way up to the root.
*/
- if (sleep) {
+ if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
psi_group_change(group, cpu, clear, set, now, wake_clock);
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: 71dbdde7914d32e86f01ac1f6e54e964c9dfdbd9
Gitweb: https://git.kernel.org/tip/71dbdde7914d32e86f01ac1f6e54e964c9dfdbd9
Author: Johannes Weiner <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:07 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:32 +02:00
sched/psi: Remove NR_ONCPU task accounting
We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance.
Since we want to add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
we need to reclaim space first. This patch removes the NR_ONCPU task
accounting in struct psi_group_cpu and uses one bit in state_mask to
track it instead.
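The swap, in short (both definitions are visible in the diff below):
the flag moves out of the tasks[] counter array into a spare bit of the
per-CPU state mask, freeing 4 bytes in the first cacheline:

	/* before: a dedicated counter slot in tasks[] */
	unsigned int tasks[NR_PSI_TASK_COUNTS];	/* ..., NR_ONCPU, ... */

	/* after: a boolean bit in each mask */
	#define TSK_ONCPU	(1 << NR_PSI_TASK_COUNTS)	/* request bit */
	#define PSI_ONCPU	(1 << NR_PSI_STATES)		/* result bit */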
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Tested-by: Chengming Zhou <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/psi_types.h | 16 ++++++---------
kernel/sched/psi.c | 41 +++++++++++++++++++++++++++-----------
2 files changed, 37 insertions(+), 20 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c0..54cb749 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -16,13 +16,6 @@ enum psi_task_count {
NR_MEMSTALL,
NR_RUNNING,
/*
- * This can't have values other than 0 or 1 and could be
- * implemented as a bit flag. But for now we still have room
- * in the first cacheline of psi_group_cpu, and this way we
- * don't have to special case any state tracking for it.
- */
- NR_ONCPU,
- /*
* For IO and CPU stalls the presence of running/oncpu tasks
* in the domain means a partial rather than a full stall.
* For memory it's not so simple because of page reclaimers:
@@ -32,16 +25,18 @@ enum psi_task_count {
* threads and memstall ones.
*/
NR_MEMSTALL_RUNNING,
- NR_PSI_TASK_COUNTS = 5,
+ NR_PSI_TASK_COUNTS = 4,
};
/* Task state bitmasks */
#define TSK_IOWAIT (1 << NR_IOWAIT)
#define TSK_MEMSTALL (1 << NR_MEMSTALL)
#define TSK_RUNNING (1 << NR_RUNNING)
-#define TSK_ONCPU (1 << NR_ONCPU)
#define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING)
+/* Only one task can be scheduled, no corresponding task count */
+#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS)
+
/* Resources that workloads could be stalled on */
enum psi_res {
PSI_IO,
@@ -68,6 +63,9 @@ enum psi_states {
NR_PSI_STATES = 7,
};
+/* Use one bit in the state mask to track TSK_ONCPU */
+#define PSI_ONCPU (1 << NR_PSI_STATES)
+
enum psi_aggregators {
PSI_AVGS = 0,
PSI_POLL,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d71dbc2..4702a77 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -212,7 +212,7 @@ void __init psi_init(void)
group_init(&psi_system);
}
-static bool test_state(unsigned int *tasks, enum psi_states state)
+static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
{
switch (state) {
case PSI_IO_SOME:
@@ -225,9 +225,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
return unlikely(tasks[NR_MEMSTALL] &&
tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
case PSI_CPU_SOME:
- return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
+ return unlikely(tasks[NR_RUNNING] > oncpu);
case PSI_CPU_FULL:
- return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
+ return unlikely(tasks[NR_RUNNING] && !oncpu);
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -689,9 +689,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
bool wake_clock)
{
struct psi_group_cpu *groupc;
- u32 state_mask = 0;
unsigned int t, m;
enum psi_states s;
+ u32 state_mask;
groupc = per_cpu_ptr(group->pcpu, cpu);
@@ -707,17 +707,36 @@ static void psi_group_change(struct psi_group *group, int cpu,
record_times(groupc, now);
+ /*
+ * Start with TSK_ONCPU, which doesn't have a corresponding
+ * task count - it's just a boolean flag directly encoded in
+ * the state mask. Clear, set, or carry the current state if
+ * no changes are requested.
+ */
+ if (unlikely(clear & TSK_ONCPU)) {
+ state_mask = 0;
+ clear &= ~TSK_ONCPU;
+ } else if (unlikely(set & TSK_ONCPU)) {
+ state_mask = PSI_ONCPU;
+ set &= ~TSK_ONCPU;
+ } else {
+ state_mask = groupc->state_mask & PSI_ONCPU;
+ }
+
+ /*
+ * The rest of the state mask is calculated based on the task
+ * counts. Update those first, then construct the mask.
+ */
for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
continue;
if (groupc->tasks[t]) {
groupc->tasks[t]--;
} else if (!psi_bug) {
- printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u %u] clear=%x set=%x\n",
+ printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
cpu, t, groupc->tasks[0],
groupc->tasks[1], groupc->tasks[2],
- groupc->tasks[3], groupc->tasks[4],
- clear, set);
+ groupc->tasks[3], clear, set);
psi_bug = 1;
}
}
@@ -726,9 +745,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (set & (1 << t))
groupc->tasks[t]++;
- /* Calculate state mask representing active states */
for (s = 0; s < NR_PSI_STATES; s++) {
- if (test_state(groupc->tasks, s))
+ if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
state_mask |= (1 << s);
}
@@ -740,7 +758,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
* task in a cgroup is in_memstall, the corresponding groupc
* on that cpu is in PSI_MEM_FULL state.
*/
- if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
+ if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
groupc->state_mask = state_mask;
@@ -828,7 +846,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
- if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
+ PSI_ONCPU) {
common = group;
break;
}
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: 52b1364ba0b105122d6de0e719b36db705011ac1
Gitweb: https://git.kernel.org/tip/52b1364ba0b105122d6de0e719b36db705011ac1
Author: Chengming Zhou <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:08 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:32 +02:00
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
PSI already tracks workload pressure stall information for CPU,
memory and IO. Apart from these, IRQ/SOFTIRQ can have an obvious
impact on the productivity of some workloads, such as web service
workloads.
With CONFIG_IRQ_TIME_ACCOUNTING, we can get the IRQ/SOFTIRQ delta
time from update_rq_clock_task(), where we can record that delta
to the CPU's current task's cgroups as PSI_IRQ_FULL status.
Note we don't use PSI_IRQ_SOME, since IRQ/SOFTIRQ time is always
taken from the task currently on the CPU, so nothing productive
can run even if it were runnable; hence we only use PSI_IRQ_FULL.
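Not part of the patch: a minimal sketch of reading the new system-wide
file. Per the psi_show() change below, IRQ pressure only reports a
"full" line, and the file exists only with CONFIG_IRQ_TIME_ACCOUNTING:

	#include <stdio.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/pressure/irq", "r");

		if (!f) {
			perror("/proc/pressure/irq");
			return 1;
		}
		/* expect: full avg10=0.00 avg60=0.00 avg300=0.00 total=0 */
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
		return 0;
	}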
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++-
include/linux/psi_types.h | 10 ++-
kernel/cgroup/cgroup.c | 27 +++++++++-
kernel/sched/core.c | 1 +-
kernel/sched/psi.c | 74 +++++++++++++++++++++++-
kernel/sched/stats.h | 2 +-
6 files changed, 116 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index be4a77b..971c418 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,12 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 54cb749..40c2817 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -42,7 +42,10 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
- NR_PSI_RESOURCES = 3,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ,
+#endif
+ NR_PSI_RESOURCES,
};
/*
@@ -58,9 +61,12 @@ enum psi_states {
PSI_MEM_FULL,
PSI_CPU_SOME,
PSI_CPU_FULL,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ_FULL,
+#endif
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 7,
+ NR_PSI_STATES,
};
/* Use one bit in the state mask to track TSK_ONCPU */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 96aefdb..b46d39b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3763,6 +3763,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+
+ return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5179,6 +5196,16 @@ static struct cftype cgroup_base_files[] = {
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ {
+ .name = "irq.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_irq_pressure_show,
+ .write = cgroup_irq_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
+ },
+#endif
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253..7d1ea92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+ psi_account_irqtime(rq->curr, irq_delta);
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4702a77..2545a78 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -904,6 +904,36 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+ int cpu = task_cpu(task);
+ void *iter = NULL;
+ struct psi_group *group;
+ struct psi_group_cpu *groupc;
+ u64 now;
+
+ if (!task->pid)
+ return;
+
+ now = cpu_clock(cpu);
+
+ while ((group = iterate_groups(task, &iter))) {
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ write_seqcount_begin(&groupc->seq);
+
+ record_times(groupc, now);
+ groupc->times[PSI_IRQ_FULL] += delta;
+
+ write_seqcount_end(&groupc->seq);
+
+ if (group->poll_states & (1 << PSI_IRQ_FULL))
+ psi_schedule_poll_work(group, 1);
+ }
+}
+#endif
+
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
@@ -1065,6 +1095,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
{
+ bool only_full = false;
int full;
u64 now;
@@ -1079,7 +1110,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);
- for (full = 0; full < 2; full++) {
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ only_full = res == PSI_IRQ;
+#endif
+
+ for (full = 0; full < 2 - only_full; full++) {
unsigned long avg[3] = { 0, };
u64 total = 0;
int w;
@@ -1093,7 +1128,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
}
seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
- full ? "full" : "some",
+ full || only_full ? "full" : "some",
LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
@@ -1121,6 +1156,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
else
return ERR_PTR(-EINVAL);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+ return ERR_PTR(-EINVAL);
+#endif
+
if (state >= PSI_NONIDLE)
return ERR_PTR(-EINVAL);
@@ -1405,6 +1445,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
.proc_release = psi_fop_release,
};
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+ return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+ size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+ .proc_open = psi_irq_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_write = psi_irq_write,
+ .proc_poll = psi_fop_poll,
+ .proc_release = psi_fop_release,
+};
+#endif
+
static int __init psi_proc_init(void)
{
if (psi_enable) {
@@ -1412,6 +1479,9 @@ static int __init psi_proc_init(void)
proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
}
return 0;
}
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467..84a1889 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -110,6 +110,7 @@ __schedstats_from_se(struct sched_entity *se)
void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
+void psi_account_irqtime(struct task_struct *task, u32 delta);
/*
* PSI tracks state that persists across sleeps, such as iowaits and
@@ -205,6 +206,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
#endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
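A note on the resulting output format (values below are illustrative):
because psi_show() sets only_full for PSI_IRQ, /proc/pressure/irq and
the per-cgroup irq.pressure file emit a single "full" line instead of
the usual some/full pair:

  full avg10=0.12 avg60=0.05 avg300=0.01 total=482910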
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: d79ddb069c5257a924456eb99b53fc1ea715c0a3
Gitweb: https://git.kernel.org/tip/d79ddb069c5257a924456eb99b53fc1ea715c0a3
Author: Chengming Zhou <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:05 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:31 +02:00
sched/psi: Move private helpers to sched/stats.h
This patch moves the psi_task_change/psi_task_switch declarations out
of the public PSI header, since they are only needed for implementing
the PSI stats tracking in sched/stats.h.
psi_task_switch is obviously private; psi_task_change can't be a
public helper either, since it doesn't check the psi_disabled static
key. And it has no users outside sched/ now, so put it in
sched/stats.h too.
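For reference, the psi_disabled check lives in the callers; a
simplified sketch of the existing psi_enqueue() wrapper in
sched/stats.h (flag computation elided) shows the guard pattern that
makes psi_task_change unsafe as a public helper:

  static inline void psi_enqueue(struct task_struct *p, bool wakeup)
  {
          int clear = 0, set = TSK_RUNNING;

          if (static_branch_likely(&psi_disabled))
                  return;

          /* ... derive clear/set from p->in_memstall, p->in_iowait ... */

          psi_task_change(p, clear, set);
  }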
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/psi.h | 4 ----
kernel/sched/stats.h | 4 ++++
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index dd74411..fffd229 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -18,10 +18,6 @@ extern struct psi_group psi_system;
void psi_init(void);
-void psi_task_change(struct task_struct *task, int clear, int set);
-void psi_task_switch(struct task_struct *prev, struct task_struct *next,
- bool sleep);
-
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index baa839c..c39b467 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se)
}
#ifdef CONFIG_PSI
+void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+ bool sleep);
+
/*
* PSI tracks state that persists across sleeps, such as iowaits and
* memory stalls. As a result, it has to distinguish between sleeps,
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: e2ad8ab04c5cdfc8dc2f382c45d248ab01dee991
Gitweb: https://git.kernel.org/tip/e2ad8ab04c5cdfc8dc2f382c45d248ab01dee991
Author: Chengming Zhou <[email protected]>
AuthorDate: Fri, 26 Aug 2022 00:41:04 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 09 Sep 2022 11:08:31 +02:00
sched/psi: Save percpu memory when !psi_cgroups_enabled
We won't use a cgroup's psi_group when !psi_cgroups_enabled, so don't
bother allocating and initializing percpu memory for it.
Nor do we need to migrate task PSI stats between cgroups in
cgroup_move_task().
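As a usage note (the boot parameters below already exist and are not
changed by this patch): psi_cgroups_enabled is normally cleared with
cgroup_disable=pressure, while this patch additionally clears it when
PSI is disabled altogether:

  # keep system-wide /proc/pressure/*, skip per-cgroup PSI accounting
  cgroup_disable=pressure
  # disable PSI entirely; now also disables the cgroup static key
  psi=0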
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/psi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 39463dc..77d53c0 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -201,6 +201,7 @@ void __init psi_init(void)
{
if (!psi_enable) {
static_branch_enable(&psi_disabled);
+ static_branch_disable(&psi_cgroups_enabled);
return;
}
@@ -950,7 +951,7 @@ void psi_memstall_leave(unsigned long *flags)
#ifdef CONFIG_CGROUPS
int psi_cgroup_alloc(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return 0;
cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL);
@@ -968,7 +969,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
void psi_cgroup_free(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return;
cancel_delayed_work_sync(&cgroup->psi->avgs_work);
@@ -996,7 +997,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
struct rq_flags rf;
struct rq *rq;
- if (static_branch_likely(&psi_disabled)) {
+ if (!static_branch_likely(&psi_cgroups_enabled)) {
/*
* Lame to do this here, but the scheduler cannot be locked
* from the outside, so we move cgroups from inside sched/.