Hi all,
This patch series contains some optimizations and extensions for PSI.
Patch 1/9 fixes the periodic aggregation shut-off problem introduced by the
earlier commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups").
Patch 2/9 optimizes task switch inside shared cgroups when the in_memstall
status of the prev and next tasks differs.
Patches 3-4 optimize and simplify PSI status tracking by not changing the
task's psi_flags on CPU/cgroup migration.
Patches 7-8 introduce the new kernel cmdline parameter "psi_inner_cgroup=" to
configure whether or not to track PSI stall information for inner cgroups.
Patch 9/9 introduces the new PSI resource PSI_IRQ to track IRQ/SOFTIRQ pressure
stall information when CONFIG_IRQ_TIME_ACCOUNTING is enabled.
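As an illustration only (a hypothetical boot line combining the parameters
added by patches 8 and 9; both take a <bool>):
    psi=1 psi_inner_cgroup=0 psi_irq=1
This keeps per-cgroup and system-wide accounting but skips aggregating stalls
into intermediate ancestor cgroups, while leaving IRQ/SOFTIRQ pressure
tracking enabled.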
Performance test on an Intel Xeon Platinum with 3 levels of cgroup, running
the mmtests config-scheduler-perfpipe workload:
The seven result columns are, left to right:
  (1) tip default
  (2) tip cgroup_disable=pressure
  (3) tip IRQ_TIME_ACCOUNTING
  (4) patched default
  (5) patched psi_inner_cgroup=off
  (6) patched PSI_IRQ
  (7) patched PSI_IRQ + psi_inner_cgroup=off
Min Time 9.89 ( 0.00%) 8.99 ( 9.12%) 10.04 ( -1.53%) 9.63 ( 2.58%) 9.27 ( 6.22%) 10.09 ( -2.04%) 9.45 ( 4.41%)
1st-qrtle Time 10.01 ( 0.00%) 9.15 ( 8.66%) 10.16 ( -1.45%) 9.72 ( 2.89%) 9.35 ( 6.61%) 10.20 ( -1.81%) 9.54 ( 4.77%)
2nd-qrtle Time 10.07 ( 0.00%) 9.25 ( 8.12%) 10.19 ( -1.21%) 9.79 ( 2.73%) 9.38 ( 6.78%) 10.24 ( -1.75%) 9.59 ( 4.68%)
3rd-qrtle Time 10.14 ( 0.00%) 9.30 ( 8.32%) 10.23 ( -0.88%) 9.84 ( 3.00%) 9.44 ( 6.92%) 10.27 ( -1.21%) 9.62 ( 5.18%)
Max-1 Time 9.89 ( 0.00%) 8.99 ( 9.12%) 10.04 ( -1.53%) 9.63 ( 2.58%) 9.27 ( 6.22%) 10.09 ( -2.04%) 9.45 ( 4.41%)
Max-5 Time 9.89 ( 0.00%) 8.99 ( 9.12%) 10.04 ( -1.53%) 9.63 ( 2.58%) 9.27 ( 6.22%) 10.09 ( -2.04%) 9.45 ( 4.41%)
Max-10 Time 9.92 ( 0.00%) 9.09 ( 8.33%) 10.11 ( -1.97%) 9.67 ( 2.51%) 9.29 ( 6.29%) 10.15 ( -2.30%) 9.48 ( 4.46%)
Max-90 Time 10.20 ( 0.00%) 9.33 ( 8.53%) 10.33 ( -1.24%) 9.87 ( 3.29%) 9.49 ( 6.99%) 10.29 ( -0.85%) 9.66 ( 5.32%)
Max-95 Time 10.23 ( 0.00%) 9.34 ( 8.70%) 10.37 ( -1.39%) 9.94 ( 2.83%) 9.53 ( 6.88%) 10.30 ( -0.65%) 9.67 ( 5.51%)
Max-99 Time 10.23 ( 0.00%) 9.37 ( 8.43%) 10.40 ( -1.63%) 9.99 ( 2.41%) 9.76 ( 4.57%) 10.31 ( -0.74%) 9.69 ( 5.25%)
Max Time 10.34 ( 0.00%) 9.46 ( 8.50%) 10.43 ( -0.83%) 17.04 ( -64.80%) 9.79 ( 5.36%) 10.32 ( 0.20%) 9.71 ( 6.07%)
Amean Time 10.08 ( 0.00%) 9.23 * 8.39%* 10.21 * -1.33%* 10.03 ( 0.47%) 9.41 * 6.59%* 10.23 * -1.53%* 9.59 * 4.87%*
Thanks!
Chengming Zhou (9):
sched/psi: fix periodic aggregation shut off
sched/psi: optimize task switch inside shared cgroups again
sched/psi: move private helpers to sched/stats.h
sched/psi: don't change task psi_flags when migrate CPU/group
sched/psi: don't create cgroup PSI files when psi_disabled
sched/psi: save percpu memory when !psi_cgroups_enabled
sched/psi: cache parent psi_group to speed up groups iterate
sched/psi: add kernel cmdline parameter psi_inner_cgroup
sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
.../admin-guide/kernel-parameters.txt | 11 +
include/linux/psi.h | 5 +-
include/linux/psi_types.h | 9 +-
include/linux/sched.h | 3 -
kernel/cgroup/cgroup.c | 30 +++
kernel/sched/core.c | 2 +
kernel/sched/psi.c | 194 +++++++++++++-----
kernel/sched/stats.h | 71 ++++---
8 files changed, 232 insertions(+), 93 deletions(-)
--
2.36.1
We don't want to wake periodic aggregation work back up if the
task change is the aggregation worker itself going to sleep, or
we'll ping-pong forever.
Previously, psi_dequeue() used psi_task_change() when a task was going to
sleep, so this check was placed in psi_task_change().
But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred the task sleep handling to psi_task_switch(), which no longer goes
through psi_task_change().
So this patch moves the check to psi_task_switch(). Note that for the deferred
sleep case, we should still wake the periodic avgs work for the common
ancestor groups, since those groups have the next task scheduled in.
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <[email protected]>
---
kernel/sched/psi.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a337f3e35997..c8a4e644cd2c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -800,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- bool wake_clock = true;
void *iter = NULL;
u64 now;
@@ -810,19 +809,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
psi_flags_change(task, clear, set);
now = cpu_clock(cpu);
- /*
- * Periodic aggregation shuts off if there is a period of no
- * task changes, so we wake it back up if necessary. However,
- * don't do this if the task change is the aggregation worker
- * itself going to sleep, or we'll ping-pong forever.
- */
- if (unlikely((clear & TSK_RUNNING) &&
- (task->flags & PF_WQ_WORKER) &&
- wq_worker_last_func(task) == psi_avgs_work))
- wake_clock = false;
while ((group = iterate_groups(task, &iter)))
- psi_group_change(group, cpu, clear, set, now, wake_clock);
+ psi_group_change(group, cpu, clear, set, now, true);
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -858,6 +847,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (prev->pid) {
int clear = TSK_ONCPU, set = 0;
+ bool wake_clock = true;
/*
* When we're going to sleep, psi_dequeue() lets us
@@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
clear |= TSK_MEMSTALL_RUNNING;
if (prev->in_iowait)
set |= TSK_IOWAIT;
+
+ /*
+ * Periodic aggregation shuts off if there is a period of no
+ * task changes, so we wake it back up if necessary. However,
+ * don't do this if the task change is the aggregation worker
+ * itself going to sleep, or we'll ping-pong forever.
+ */
+ if (unlikely((prev->flags & PF_WQ_WORKER) &&
+ wq_worker_last_func(prev) == psi_avgs_work))
+ wake_clock = false;
}
psi_flags_change(prev, clear, set);
iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
/*
* TSK_ONCPU is handled up to the common ancestor. If we're tasked
--
2.36.1
This patch moves the psi_task_change()/psi_task_switch() declarations out of
the public PSI header, since they are only needed for implementing the PSI
stats tracking in sched/stats.h.
psi_task_switch() is obviously internal; psi_task_change() can't be a public
helper either, since it doesn't check the psi_disabled static key. And there
are no external users now, so put it in sched/stats.h too.
Signed-off-by: Chengming Zhou <[email protected]>
---
include/linux/psi.h | 4 ----
kernel/sched/stats.h | 4 ++++
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 89784763d19e..aa168a038242 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -18,10 +18,6 @@ extern struct psi_group psi_system;
void psi_init(void);
-void psi_task_change(struct task_struct *task, int clear, int set);
-void psi_task_switch(struct task_struct *prev, struct task_struct *next,
- bool sleep);
-
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index baa839c1ba96..c39b467ece43 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se)
}
#ifdef CONFIG_PSI
+void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+ bool sleep);
+
/*
* PSI tracks state that persists across sleeps, such as iowaits and
* memory stalls. As a result, it has to distinguish between sleeps,
--
2.36.1
The current code uses psi_task_change() at every scheduling point, which
changes the task's psi_flags and then updates all of its psi_groups.
So we have to rely heavily on the task scheduling states to calculate what to
set and what to clear at every scheduling point, which makes the PSI stats
tracking code complex and error prone.
In fact, the task's psi_flags only change at wakeup and sleep (apart from the
ONCPU state, which changes at context switch); they don't change at all on
CPU/cgroup migration.
If we keep psi_flags unchanged across migration, we can simply use
task->psi_flags as the clear mask (migrate out) or the set mask (migrate in),
which makes PSI stats tracking much simpler and more efficient.
Note: ENQUEUE_WAKEUP only means waking a task up from sleep; it does not cover
waking up a new task, so add psi_enqueue() in wake_up_new_task().
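To illustrate the resulting model for a runnable task migrating between CPUs
or cgroups (this is just a restatement of the change above, not an extra
hunk):
    /*
     * migrate out:  psi_change_groups(p, p->psi_flags, 0)   <- old CPU/cgroup
     * migrate in:   psi_change_groups(p, 0, p->psi_flags)   <- new CPU/cgroup
     *
     * p->psi_flags itself is left untouched, so the old and new groups are
     * updated symmetrically from the same value; only wakeup and sleep
     * actually modify p->psi_flags.
     */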
Performance test on an Intel Xeon Platinum with 3 levels of cgroup:
1. before the patch:
$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.034 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 8.210 [sec]
8.210600 usecs/op
121793 ops/sec
2. after the patch:
$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.032 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 8.077 [sec]
8.077648 usecs/op
123798 ops/sec
Signed-off-by: Chengming Zhou <[email protected]>
---
include/linux/sched.h | 3 ---
kernel/sched/core.c | 1 +
kernel/sched/psi.c | 24 ++++++++++---------
kernel/sched/stats.h | 54 +++++++++++++++++++++----------------------
4 files changed, 40 insertions(+), 42 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88b8817b827d..20a94786cad8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -879,9 +879,6 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
-#ifdef CONFIG_PSI
- unsigned sched_psi_wake_requeue:1;
-#endif
/* Force alignment to the next boundary: */
unsigned :0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a463dbc92fcd..f5f2d3542b05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4642,6 +4642,7 @@ void wake_up_new_task(struct task_struct *p)
post_init_entity_util_avg(p);
activate_task(rq, p, ENQUEUE_NOCLOCK);
+ psi_enqueue(p, true);
trace_sched_wakeup_new(p);
check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index e04041d8251b..6ba159fe2a4f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,22 +796,24 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
task->psi_flags |= set;
}
-void psi_task_change(struct task_struct *task, int clear, int set)
+void psi_change_groups(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
void *iter = NULL;
- u64 now;
+ u64 now = cpu_clock(cpu);
+
+ while ((group = iterate_groups(task, &iter)))
+ psi_group_change(group, cpu, clear, set, now, true);
+}
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
if (!task->pid)
return;
psi_flags_change(task, clear, set);
-
- now = cpu_clock(cpu);
-
- while ((group = iterate_groups(task, &iter)))
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_change_groups(task, clear, set);
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -1015,9 +1017,9 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
* pick_next_task()
* rq_unlock()
* rq_lock()
- * psi_task_change() // old cgroup
+ * psi_change_groups() // old cgroup
* task->cgroups = to
- * psi_task_change() // new cgroup
+ * psi_change_groups() // new cgroup
* rq_unlock()
* rq_lock()
* psi_sched_switch() // does deferred updates in new cgroup
@@ -1027,13 +1029,13 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
task_flags = task->psi_flags;
if (task_flags)
- psi_task_change(task, task_flags, 0);
+ psi_change_groups(task, task_flags, 0);
/* See comment above */
rcu_assign_pointer(task->cgroups, to);
if (task_flags)
- psi_task_change(task, 0, task_flags);
+ psi_change_groups(task, 0, task_flags);
task_rq_unlock(rq, task, &rf);
}
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..e930b8fa6253 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,7 @@ __schedstats_from_se(struct sched_entity *se)
}
#ifdef CONFIG_PSI
+void psi_change_groups(struct task_struct *task, int clear, int set);
void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
@@ -124,42 +125,46 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
if (static_branch_likely(&psi_disabled))
return;
- if (p->in_memstall)
- set |= TSK_MEMSTALL_RUNNING;
+ if (!wakeup) {
+ if (p->psi_flags)
+ psi_change_groups(p, 0, p->psi_flags);
+ return;
+ }
- if (!wakeup || p->sched_psi_wake_requeue) {
- if (p->in_memstall)
+ /*
+ * wakeup (including wakeup migrate) need to change task psi_flags,
+ * specifically need to set TSK_RUNNING or TSK_MEMSTALL_RUNNING.
+ * Since we clear task->psi_flags for wakeup migrated task, we need
+ * to check task->psi_flags to see what should be set and clear.
+ */
+ if (unlikely(p->in_memstall)) {
+ set |= TSK_MEMSTALL_RUNNING;
+ if (!(p->psi_flags & TSK_MEMSTALL))
set |= TSK_MEMSTALL;
- if (p->sched_psi_wake_requeue)
- p->sched_psi_wake_requeue = 0;
- } else {
- if (p->in_iowait)
- clear |= TSK_IOWAIT;
}
+ if (p->psi_flags & TSK_IOWAIT)
+ clear |= TSK_IOWAIT;
psi_task_change(p, clear, set);
}
static inline void psi_dequeue(struct task_struct *p, bool sleep)
{
- int clear = TSK_RUNNING;
-
if (static_branch_likely(&psi_disabled))
return;
+ if (!sleep) {
+ if (p->psi_flags)
+ psi_change_groups(p, p->psi_flags, 0);
+ return;
+ }
+
/*
* A voluntary sleep is a dequeue followed by a task switch. To
* avoid walking all ancestors twice, psi_task_switch() handles
* TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
* Do nothing here.
*/
- if (sleep)
- return;
-
- if (p->in_memstall)
- clear |= (TSK_MEMSTALL | TSK_MEMSTALL_RUNNING);
-
- psi_task_change(p, clear, 0);
}
static inline void psi_ttwu_dequeue(struct task_struct *p)
@@ -169,21 +174,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
/*
* Is the task being migrated during a wakeup? Make sure to
* deregister its sleep-persistent psi states from the old
- * queue, and let psi_enqueue() know it has to requeue.
+ * queue.
*/
- if (unlikely(p->in_iowait || p->in_memstall)) {
+ if (unlikely(p->psi_flags)) {
struct rq_flags rf;
struct rq *rq;
- int clear = 0;
-
- if (p->in_iowait)
- clear |= TSK_IOWAIT;
- if (p->in_memstall)
- clear |= TSK_MEMSTALL;
rq = __task_rq_lock(p, &rf);
- psi_task_change(p, clear, 0);
- p->sched_psi_wake_requeue = 1;
+ psi_task_change(p, p->psi_flags, 0);
__task_rq_unlock(rq, &rf);
}
}
--
2.36.1
PSI accounts stalls for each cgroup separately and aggregates them at each
level of the hierarchy. This may cause non-negligible overhead for some
workloads running deep in the hierarchy.
Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
made it possible to skip per-cgroup stall accounting and only account
system-wide, to avoid this per-level overhead.
For our use case, we also want leaf cgroup PSI accounted for userspace
adjustment of that cgroup, in addition to system-wide management.
So this patch adds the kernel cmdline parameter "psi_inner_cgroup" to control
whether or not to account for inner cgroups; it defaults to true for
compatibility.
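As an illustration (assuming a 3-level hierarchy /A/B/C under the cgroup
root), the psi_group parent chains set up by psi_cgroup_alloc() would be:
    /*
     * psi_inner_cgroup=1 (default):   C.psi.parent -> B.psi
     *                                 B.psi.parent -> A.psi
     *                                 A.psi.parent -> psi_system
     *
     * psi_inner_cgroup=0:             C.psi.parent -> psi_system
     *                                 B.psi.parent -> psi_system
     *                                 A.psi.parent -> psi_system
     *
     * so a task's updates still hit its own cgroup and the system group,
     * but skip the intermediate ancestor cgroups.
     */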
Performance test on an Intel Xeon Platinum with 3 levels of cgroup:
1. default (psi_inner_cgroup=true)
$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.032 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 7.758 [sec]
7.758354 usecs/op
128893 ops/sec
2. psi_inner_cgroup=false
$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.032 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 7.309 [sec]
7.309436 usecs/op
136809 ops/sec
Signed-off-by: Chengming Zhou <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 6 ++++++
kernel/sched/psi.c | 11 ++++++++++-
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8090130b544b..6beef5b8bc36 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4419,6 +4419,12 @@
tracking.
Format: <bool>
+ psi_inner_cgroup=
+ [KNL] Enable or disable pressure stall information
+ tracking for the inner cgroups.
+ Format: <bool>
+ default: enabled
+
psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to
probe for; one of (bare|imps|exps|lifebook|any).
psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2228cbf3bdd3..8d76920f47b3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -147,12 +147,21 @@ static bool psi_enable;
#else
static bool psi_enable = true;
#endif
+
+static bool psi_inner_cgroup __read_mostly = true;
+
static int __init setup_psi(char *str)
{
return kstrtobool(str, &psi_enable) == 0;
}
__setup("psi=", setup_psi);
+static int __init setup_psi_inner_cgroup(char *str)
+{
+ return kstrtobool(str, &psi_inner_cgroup) == 0;
+}
+__setup("psi_inner_cgroup=", setup_psi_inner_cgroup);
+
/* Running averages - we need to be higher-res than loadavg */
#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
@@ -958,7 +967,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
group_init(&cgroup->psi);
parent = cgroup_parent(cgroup);
- if (parent && cgroup_parent(parent))
+ if (parent && cgroup_parent(parent) && psi_inner_cgroup)
cgroup->psi.parent = cgroup_psi(parent);
else
cgroup->psi.parent = &psi_system;
--
2.36.1
PSI already tracks workload pressure stall information for CPU, memory
and IO. Apart from these, IRQ/SOFTIRQ can have an obvious impact on the
productivity of some workloads, such as web service workloads.
With CONFIG_IRQ_TIME_ACCOUNTING, we can get the IRQ/SOFTIRQ delta time from
update_rq_clock_task(), where we can record that delta to the CPU's current
task's cgroups as PSI_IRQ_FULL status.
Note we don't use PSI_IRQ_SOME, since IRQ/SOFTIRQ time always hits the current
task on the CPU, so nothing productive could run even if it were runnable;
therefore we only use PSI_IRQ_FULL.
For performance reasons this is enabled by default when
CONFIG_IRQ_TIME_ACCOUNTING, but it can be disabled with the kernel cmdline
parameter "psi_irq=".
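For reference, since only the FULL state is tracked for IRQ, the new
/proc/pressure/irq and cgroup irq.pressure files would expose a single "full"
line in the usual PSI format (the values below are illustrative only):
    # cat /proc/pressure/irq
    full avg10=0.00 avg60=0.00 avg300=0.00 total=0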
Signed-off-by: Chengming Zhou <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 5 ++
include/linux/psi.h | 1 +
include/linux/psi_types.h | 7 +-
kernel/cgroup/cgroup.c | 27 +++++++
kernel/sched/core.c | 1 +
kernel/sched/psi.c | 76 ++++++++++++++++++-
kernel/sched/stats.h | 13 ++++
7 files changed, 126 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6beef5b8bc36..1067dde299a0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4425,6 +4425,11 @@
Format: <bool>
default: enabled
+ psi_irq= [KNL] Enable or disable IRQ/SOFTIRQ pressure stall
+ information tracking.
+ Format: <bool>
+ default: enabled when CONFIG_IRQ_TIME_ACCOUNTING.
+
psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to
probe for; one of (bare|imps|exps|lifebook|any).
psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports
diff --git a/include/linux/psi.h b/include/linux/psi.h
index aa168a038242..f5cf3e45d5a5 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -14,6 +14,7 @@ struct css_set;
#ifdef CONFIG_PSI
extern struct static_key_false psi_disabled;
+extern struct static_key_true psi_irq_enabled;
extern struct psi_group psi_system;
void psi_init(void);
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c124f7d186d0..195f123b1cd1 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -47,7 +47,8 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
- NR_PSI_RESOURCES = 3,
+ PSI_IRQ,
+ NR_PSI_RESOURCES = 4,
};
/*
@@ -63,9 +64,11 @@ enum psi_states {
PSI_MEM_FULL,
PSI_CPU_SOME,
PSI_CPU_FULL,
+ PSI_IRQ_SOME,
+ PSI_IRQ_FULL,
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 7,
+ NR_PSI_STATES = 9,
};
enum psi_aggregators {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1424da7ed2c4..cf61df0ac892 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3683,6 +3683,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
+
+ return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5079,6 +5096,16 @@ static struct cftype cgroup_base_files[] = {
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ {
+ .name = "irq.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_irq_pressure_show,
+ .write = cgroup_irq_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
+ },
+#endif
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5f2d3542b05..08637cfb7ed9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+ psi_account_irqtime(rq->curr, irq_delta);
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8d76920f47b3..6a0894e28780 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -141,6 +141,7 @@ static int psi_bug __read_mostly;
DEFINE_STATIC_KEY_FALSE(psi_disabled);
DEFINE_STATIC_KEY_TRUE(psi_cgroups_enabled);
+DEFINE_STATIC_KEY_TRUE(psi_irq_enabled);
#ifdef CONFIG_PSI_DEFAULT_DISABLED
static bool psi_enable;
@@ -150,6 +151,12 @@ static bool psi_enable = true;
static bool psi_inner_cgroup __read_mostly = true;
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static bool psi_irq_enable = true;
+#else
+static bool psi_irq_enable;
+#endif
+
static int __init setup_psi(char *str)
{
return kstrtobool(str, &psi_enable) == 0;
@@ -162,6 +169,12 @@ static int __init setup_psi_inner_cgroup(char *str)
}
__setup("psi_inner_cgroup=", setup_psi_inner_cgroup);
+static int __init setup_psi_irq(char *str)
+{
+ return kstrtobool(str, &psi_irq_enable) == 0;
+}
+__setup("psi_irq=", setup_psi_irq);
+
/* Running averages - we need to be higher-res than loadavg */
#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
@@ -215,12 +228,16 @@ void __init psi_init(void)
if (!psi_enable) {
static_branch_enable(&psi_disabled);
static_branch_disable(&psi_cgroups_enabled);
+ static_branch_disable(&psi_irq_enabled);
return;
}
if (!cgroup_psi_enabled())
static_branch_disable(&psi_cgroups_enabled);
+ if (!psi_irq_enable)
+ static_branch_disable(&psi_irq_enabled);
+
psi_period = jiffies_to_nsecs(PSI_FREQ);
group_init(&psi_system);
}
@@ -893,6 +910,28 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}
+void psi_groups_account_irqtime(struct task_struct *task, u32 delta)
+{
+ struct psi_group *group = task_psi_group(task);
+ int cpu = task_cpu(task);
+ u64 now = cpu_clock(cpu);
+ struct psi_group_cpu *groupc;
+
+ for_each_psi_group(group) {
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ write_seqcount_begin(&groupc->seq);
+
+ record_times(groupc, now);
+ groupc->times[PSI_IRQ_FULL] += delta;
+
+ write_seqcount_end(&groupc->seq);
+
+ if (group->poll_states & (1 << PSI_IRQ_FULL))
+ psi_schedule_poll_work(group, 1);
+ }
+}
+
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
@@ -1069,7 +1108,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);
- for (full = 0; full < 2; full++) {
+ for (full = (res == PSI_IRQ); full < 2; full++) {
unsigned long avg[3] = { 0, };
u64 total = 0;
int w;
@@ -1111,9 +1150,12 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
else
return ERR_PTR(-EINVAL);
- if (state >= PSI_NONIDLE)
+ if (state >= PSI_NONIDLE || state == PSI_IRQ_SOME)
return ERR_PTR(-EINVAL);
+ if (!static_branch_likely(&psi_irq_enabled) && state == PSI_IRQ_FULL)
+ return ERR_PTR(-EOPNOTSUPP);
+
if (window_us < WINDOW_MIN_US ||
window_us > WINDOW_MAX_US)
return ERR_PTR(-EINVAL);
@@ -1395,6 +1437,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
.proc_release = psi_fop_release,
};
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+ return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+ size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+ .proc_open = psi_irq_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_write = psi_irq_write,
+ .proc_poll = psi_fop_poll,
+ .proc_release = psi_fop_release,
+};
+#endif
+
static int __init psi_proc_init(void)
{
if (psi_enable) {
@@ -1402,6 +1471,9 @@ static int __init psi_proc_init(void)
proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
}
return 0;
}
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index e930b8fa6253..10926cdaccc8 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,6 +111,7 @@ void psi_change_groups(struct task_struct *task, int clear, int set);
void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
+void psi_groups_account_irqtime(struct task_struct *task, u32 delta);
/*
* PSI tracks state that persists across sleeps, such as iowaits and
@@ -196,6 +197,17 @@ static inline void psi_sched_switch(struct task_struct *prev,
psi_task_switch(prev, next, sleep);
}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+ if (!static_branch_likely(&psi_irq_enabled))
+ return;
+
+ if (!task->pid)
+ return;
+
+ psi_groups_account_irqtime(task, delta);
+}
+
#else /* CONFIG_PSI */
static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -203,6 +215,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *curr, u32 delta) {}
#endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
--
2.36.1
Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
made it possible to configure PSI to skip per-cgroup stall accounting, in
which case the PSI files are not exposed in the cgroup hierarchy.
This patch does the same thing when psi_disabled.
Signed-off-by: Chengming Zhou <[email protected]>
---
kernel/cgroup/cgroup.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..1424da7ed2c4 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3700,6 +3700,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
bool cgroup_psi_enabled(void)
{
+ if (static_branch_likely(&psi_disabled))
+ return false;
+
return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
}
--
2.36.1
We use iterate_groups() to iterate over each level's psi_group when updating
PSI stats, which is a very hot path.
In the current code, iterate_groups() has to use multiple branches and
cgroup_parent() to get the parent psi_group at each level, which is not very
efficient.
This patch caches the parent psi_group: we only need to get the psi_group of
the task itself first, then simply follow group->parent to iterate.
This patch is also preparation for the following patch, in which we can
configure PSI to only account for leaf cgroups and system-wide.
Performance test on an Intel Xeon Platinum with 3 levels of cgroup:
1. before the patch:
$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.032 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 8.077 [sec]
8.077648 usecs/op
123798 ops/sec
2. after the patch:
$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.032 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 7.758 [sec]
7.758354 usecs/op
128893 ops/sec
Signed-off-by: Chengming Zhou <[email protected]>
---
include/linux/psi_types.h | 2 ++
kernel/sched/psi.c | 48 ++++++++++++++++++++-------------------
2 files changed, 27 insertions(+), 23 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..c124f7d186d0 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -147,6 +147,8 @@ struct psi_trigger {
};
struct psi_group {
+ struct psi_group *parent;
+
/* Protects data used by the aggregator */
struct mutex avgs_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index aa40bf888102..2228cbf3bdd3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -758,30 +758,22 @@ static void psi_group_change(struct psi_group *group, int cpu,
schedule_delayed_work(&group->avgs_work, PSI_FREQ);
}
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
{
- if (*iter == &psi_system)
- return NULL;
-
#ifdef CONFIG_CGROUPS
if (static_branch_likely(&psi_cgroups_enabled)) {
- struct cgroup *cgroup = NULL;
-
- if (!*iter)
- cgroup = task->cgroups->dfl_cgrp;
- else
- cgroup = cgroup_parent(*iter);
+ struct cgroup *cgroup = task_dfl_cgroup(task);
- if (cgroup && cgroup_parent(cgroup)) {
- *iter = cgroup;
+ if (cgroup && cgroup_parent(cgroup))
return cgroup_psi(cgroup);
- }
}
#endif
- *iter = &psi_system;
return &psi_system;
}
+#define for_each_psi_group(group) \
+ for (; group; group = group->parent)
+
static void psi_flags_change(struct task_struct *task, int clear, int set)
{
if (((task->psi_flags & set) ||
@@ -799,12 +791,11 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
void psi_change_groups(struct task_struct *task, int clear, int set)
{
+ struct psi_group *group = task_psi_group(task);
int cpu = task_cpu(task);
- struct psi_group *group;
- void *iter = NULL;
u64 now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter)))
+ for_each_psi_group(group)
psi_group_change(group, cpu, clear, set, now, true);
}
@@ -822,7 +813,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
{
struct psi_group *group, *common = NULL;
int cpu = task_cpu(prev);
- void *iter;
u64 now = cpu_clock(cpu);
if (next->pid) {
@@ -833,8 +823,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
* we reach the first common ancestor. Iterate @next's
* ancestors only until we encounter @prev's ONCPU.
*/
- iter = NULL;
- while ((group = iterate_groups(next, &iter))) {
+ group = task_psi_group(next);
+ for_each_psi_group(group) {
if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
common = group;
break;
@@ -874,9 +864,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_flags_change(prev, clear, set);
- iter = NULL;
- while ((group = iterate_groups(prev, &iter)) && group != common)
+ group = task_psi_group(prev);
+ for_each_psi_group(group) {
+ if (group == common)
+ break;
psi_group_change(group, cpu, clear, set, now, wake_clock);
+ }
/*
* TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -884,7 +877,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
clear &= ~TSK_ONCPU;
- for (; group; group = iterate_groups(prev, &iter))
+
+ for_each_psi_group(group)
psi_group_change(group, cpu, clear, set, now, true);
}
}
@@ -953,6 +947,8 @@ void psi_memstall_leave(unsigned long *flags)
#ifdef CONFIG_CGROUPS
int psi_cgroup_alloc(struct cgroup *cgroup)
{
+ struct cgroup *parent;
+
if (!static_branch_likely(&psi_cgroups_enabled))
return 0;
@@ -960,6 +956,12 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
if (!cgroup->psi.pcpu)
return -ENOMEM;
group_init(&cgroup->psi);
+
+ parent = cgroup_parent(cgroup);
+ if (parent && cgroup_parent(parent))
+ cgroup->psi.parent = cgroup_psi(parent);
+ else
+ cgroup->psi.parent = &psi_system;
return 0;
}
--
2.36.1
Commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred the prev task's sleep handling to psi_task_switch(), so we don't
need to clear and set the TSK_ONCPU state for common cgroups.
          A
          |
          B
        /   \
       C     D
      /       \
   prev       next
After that commit, psi_task_switch() does:
1. psi_group_change(next, .set=TSK_ONCPU) for D
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C
3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A
But there is a limitation: if "prev->psi_flags == next->psi_flags" is not
satisfied, this cgroup optimization becomes unusable for both the sleep switch
and the running switch cases. For example:
prev->in_memstall != next->in_memstall when sleep switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A
prev->in_memstall != next->in_memstall when running switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A
The reason this limitation exists is that we consider a group to be
PSI_MEM_FULL if the CPU is actively reclaiming and nothing productive could
run even if it were runnable. So when the CPU's curr changes from prev to next
and their in_memstall status differs, we have to update the PSI_MEM_FULL
status of their common cgroups.
This patch removes that limitation by making psi_group_change() update the
PSI_MEM_FULL status based on the in_memstall status of the CPU's curr task.
Signed-off-by: Chengming Zhou <[email protected]>
---
kernel/sched/psi.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index c8a4e644cd2c..e04041d8251b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
u64 now = cpu_clock(cpu);
if (next->pid) {
- bool identical_state;
-
psi_flags_change(next, 0, TSK_ONCPU);
/*
* When switching between tasks that have an identical
@@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
* we reach the first common ancestor. Iterate @next's
* ancestors only until we encounter @prev's ONCPU.
*/
- identical_state = prev->psi_flags == next->psi_flags;
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
- if (identical_state &&
- per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
common = group;
break;
}
@@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
* TSK_ONCPU is handled up to the common ancestor. If we're tasked
* with dequeuing too, finish that for the rest of the hierarchy.
*/
- if (sleep) {
+ if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
psi_group_change(group, cpu, clear, set, now, true);
--
2.36.1
We won't use the cgroup psi_group when !psi_cgroups_enabled, so don't bother
allocating percpu memory for it and initializing it.
We also don't need to migrate task PSI stats between cgroups in
cgroup_move_task() in that case.
Signed-off-by: Chengming Zhou <[email protected]>
---
kernel/sched/psi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 6ba159fe2a4f..aa40bf888102 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -205,6 +205,7 @@ void __init psi_init(void)
{
if (!psi_enable) {
static_branch_enable(&psi_disabled);
+ static_branch_disable(&psi_cgroups_enabled);
return;
}
@@ -952,7 +953,7 @@ void psi_memstall_leave(unsigned long *flags)
#ifdef CONFIG_CGROUPS
int psi_cgroup_alloc(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return 0;
cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
@@ -964,7 +965,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
void psi_cgroup_free(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return;
cancel_delayed_work_sync(&cgroup->psi.avgs_work);
@@ -991,7 +992,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
struct rq_flags rf;
struct rq *rq;
- if (static_branch_likely(&psi_disabled)) {
+ if (!static_branch_likely(&psi_cgroups_enabled)) {
/*
* Lame to do this here, but the scheduler cannot be locked
* from the outside, so we move cgroups from inside sched/.
--
2.36.1
Hi Chengming,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on akpm-mm/mm-everything linus/master v5.19-rc7]
[cannot apply to tj-cgroup/for-next next-20220720]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 401e4963bf45c800e3e9ea0d3a0289d738005fd4
config: riscv-randconfig-s032-20220718 (https://download.01.org/0day-ci/archive/20220721/[email protected]/config)
compiler: riscv64-linux-gcc (GCC) 12.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/d14f2a9ff31fefc5b28a16addaa832dc80d84189
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833
git checkout d14f2a9ff31fefc5b28a16addaa832dc80d84189
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=riscv SHELL=/bin/bash kernel/sched/
If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>
sparse warnings: (new ones prefixed by >>)
>> kernel/sched/core.c:711:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *task @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:711:31: sparse: expected struct task_struct *task
kernel/sched/core.c:711:31: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:1028:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:1028:38: sparse: expected struct task_struct *curr
kernel/sched/core.c:1028:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:2192:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:2192:33: sparse: expected struct task_struct *p
kernel/sched/core.c:2192:33: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:2192:68: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:2192:68: sparse: expected struct task_struct *tsk
kernel/sched/core.c:2192:68: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:3592:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/core.c:3592:17: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/core.c:3592:17: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/core.c:3789:28: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct const *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:3789:28: sparse: expected struct task_struct const *p
kernel/sched/core.c:3789:28: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:9089:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *push_task @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:9089:43: sparse: expected struct task_struct *push_task
kernel/sched/core.c:9089:43: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:5376:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:5376:38: sparse: expected struct task_struct *curr
kernel/sched/core.c:5376:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:6322:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *prev @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:6322:14: sparse: expected struct task_struct *prev
kernel/sched/core.c:6322:14: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:6848:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/core.c:6848:17: sparse: struct task_struct *
kernel/sched/core.c:6848:17: sparse: struct task_struct [noderef] __rcu *
kernel/sched/core.c:7064:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/core.c:7064:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/core.c:7064:22: sparse: struct task_struct *
kernel/sched/core.c:11121:25: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:11121:25: sparse: expected struct task_struct *p
kernel/sched/core.c:11121:25: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:537:6: sparse: sparse: context imbalance in 'raw_spin_rq_lock_nested' - wrong count at exit
kernel/sched/core.c:562:6: sparse: sparse: context imbalance in 'raw_spin_rq_trylock' - wrong count at exit
kernel/sched/core.c:586:6: sparse: sparse: context imbalance in 'raw_spin_rq_unlock' - unexpected unlock
kernel/sched/core.c: note: in included file:
kernel/sched/sched.h:1580:9: sparse: sparse: context imbalance in '__task_rq_lock' - wrong count at exit
kernel/sched/sched.h:1580:9: sparse: sparse: context imbalance in 'task_rq_lock' - wrong count at exit
kernel/sched/core.c: note: in included file:
kernel/sched/pelt.h:97:13: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct const *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/pelt.h:97:13: sparse: expected struct task_struct const *p
kernel/sched/pelt.h:97:13: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:2183:33: sparse: sparse: dereference of noderef expression
kernel/sched/core.c:2184:19: sparse: sparse: dereference of noderef expression
kernel/sched/core.c:2185:18: sparse: sparse: dereference of noderef expression
kernel/sched/core.c: note: in included file:
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/core.c:2158:38: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/core.c:2158:38: sparse: struct task_struct [noderef] __rcu *
kernel/sched/core.c:2158:38: sparse: struct task_struct const *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
vim +711 kernel/sched/core.c
675
676 /*
677 * RQ-clock updating methods:
678 */
679
680 static void update_rq_clock_task(struct rq *rq, s64 delta)
681 {
682 /*
683 * In theory, the compile should just see 0 here, and optimize out the call
684 * to sched_rt_avg_update. But I don't trust it...
685 */
686 s64 __maybe_unused steal = 0, irq_delta = 0;
687
688 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
689 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
690
691 /*
692 * Since irq_time is only updated on {soft,}irq_exit, we might run into
693 * this case when a previous update_rq_clock() happened inside a
694 * {soft,}irq region.
695 *
696 * When this happens, we stop ->clock_task and only update the
697 * prev_irq_time stamp to account for the part that fit, so that a next
698 * update will consume the rest. This ensures ->clock_task is
699 * monotonic.
700 *
701 * It does however cause some slight miss-attribution of {soft,}irq
702 * time, a more accurate solution would be to update the irq_time using
703 * the current rq->clock timestamp, except that would require using
704 * atomic ops.
705 */
706 if (irq_delta > delta)
707 irq_delta = delta;
708
709 rq->prev_irq_time += irq_delta;
710 delta -= irq_delta;
> 711 psi_account_irqtime(rq->curr, irq_delta);
712 #endif
713 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
714 if (static_key_false((&paravirt_steal_rq_enabled))) {
715 steal = paravirt_steal_clock(cpu_of(rq));
716 steal -= rq->prev_steal_time_rq;
717
718 if (unlikely(steal > delta))
719 steal = delta;
720
721 rq->prev_steal_time_rq += steal;
722 delta -= steal;
723 }
724 #endif
725
726 rq->clock_task += delta;
727
--
0-DAY CI Kernel Test Service
https://01.org/lkp
Hi Chengming,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on akpm-mm/mm-everything linus/master v5.19-rc7]
[cannot apply to tj-cgroup/for-next next-20220721]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 401e4963bf45c800e3e9ea0d3a0289d738005fd4
config: x86_64-randconfig-s022-20220718 (https://download.01.org/0day-ci/archive/20220722/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/d14f2a9ff31fefc5b28a16addaa832dc80d84189
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833
git checkout d14f2a9ff31fefc5b28a16addaa832dc80d84189
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>
sparse warnings: (new ones prefixed by >>)
>> kernel/sched/core.c:711:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:711:31: sparse: expected struct task_struct *curr
kernel/sched/core.c:711:31: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:781:48: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:781:48: sparse: expected struct task_struct *p
kernel/sched/core.c:781:48: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:1028:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:1028:38: sparse: expected struct task_struct *curr
kernel/sched/core.c:1028:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:2192:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:2192:33: sparse: expected struct task_struct *p
kernel/sched/core.c:2192:33: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:2192:68: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:2192:68: sparse: expected struct task_struct *tsk
kernel/sched/core.c:2192:68: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:5376:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:5376:38: sparse: expected struct task_struct *curr
kernel/sched/core.c:5376:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:6322:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *prev @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:6322:14: sparse: expected struct task_struct *prev
kernel/sched/core.c:6322:14: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:6848:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/core.c:6848:17: sparse: struct task_struct *
kernel/sched/core.c:6848:17: sparse: struct task_struct [noderef] __rcu *
kernel/sched/core.c:7064:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/core.c:7064:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/core.c:7064:22: sparse: struct task_struct *
kernel/sched/core.c:11121:25: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/core.c:11121:25: sparse: expected struct task_struct *p
kernel/sched/core.c:11121:25: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/core.c:537:6: sparse: sparse: context imbalance in 'raw_spin_rq_lock_nested' - wrong count at exit
kernel/sched/core.c:570:23: sparse: sparse: context imbalance in 'raw_spin_rq_trylock' - wrong count at exit
kernel/sched/core.c:586:6: sparse: sparse: context imbalance in 'raw_spin_rq_unlock' - unexpected unlock
kernel/sched/core.c:624:36: sparse: sparse: context imbalance in '__task_rq_lock' - wrong count at exit
kernel/sched/core.c:665:36: sparse: sparse: context imbalance in 'task_rq_lock' - wrong count at exit
kernel/sched/core.c:781:11: sparse: sparse: dereference of noderef expression
kernel/sched/core.c:2183:33: sparse: sparse: dereference of noderef expression
kernel/sched/core.c:2184:19: sparse: sparse: dereference of noderef expression
kernel/sched/core.c:2185:18: sparse: sparse: dereference of noderef expression
kernel/sched/core.c: note: in included file:
kernel/sched/sched.h:1592:9: sparse: sparse: context imbalance in 'ttwu_runnable' - wrong count at exit
kernel/sched/core.c:4262:9: sparse: sparse: context imbalance in 'task_call_func' - wrong count at exit
kernel/sched/sched.h:1592:9: sparse: sparse: context imbalance in 'wake_up_new_task' - wrong count at exit
kernel/sched/core.c:5035:9: sparse: sparse: context imbalance in 'finish_task_switch' - wrong count at exit
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2210:9: sparse: struct task_struct *
kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2053:25: sparse: struct task_struct *
vim +711 kernel/sched/core.c
675
676 /*
677 * RQ-clock updating methods:
678 */
679
680 static void update_rq_clock_task(struct rq *rq, s64 delta)
681 {
682 /*
683 * In theory, the compile should just see 0 here, and optimize out the call
684 * to sched_rt_avg_update. But I don't trust it...
685 */
686 s64 __maybe_unused steal = 0, irq_delta = 0;
687
688 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
689 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
690
691 /*
692 * Since irq_time is only updated on {soft,}irq_exit, we might run into
693 * this case when a previous update_rq_clock() happened inside a
694 * {soft,}irq region.
695 *
696 * When this happens, we stop ->clock_task and only update the
697 * prev_irq_time stamp to account for the part that fit, so that a next
698 * update will consume the rest. This ensures ->clock_task is
699 * monotonic.
700 *
701 * It does however cause some slight miss-attribution of {soft,}irq
702 * time, a more accurate solution would be to update the irq_time using
703 * the current rq->clock timestamp, except that would require using
704 * atomic ops.
705 */
706 if (irq_delta > delta)
707 irq_delta = delta;
708
709 rq->prev_irq_time += irq_delta;
710 delta -= irq_delta;
> 711 psi_account_irqtime(rq->curr, irq_delta);
712 #endif
713 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
714 if (static_key_false((&paravirt_steal_rq_enabled))) {
715 steal = paravirt_steal_clock(cpu_of(rq));
716 steal -= rq->prev_steal_time_rq;
717
718 if (unlikely(steal > delta))
719 steal = delta;
720
721 rq->prev_steal_time_rq += steal;
722 delta -= steal;
723 }
724 #endif
725
726 rq->clock_task += delta;
727
--
0-DAY CI Kernel Test Service
https://01.org/lkp
Hi Chengming,
On 7/21/22 12:04 PM, Chengming Zhou Wrote:
> Now PSI already tracked workload pressure stall information for
> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
> obvious impact on some workload productivity, such as web service
> workload.
>
> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
> from update_rq_clock_task(), in which we can record that delta
> to CPU curr task's cgroups as PSI_IRQ_FULL status.
The {soft,}irq impact should apply equally to all the runnable tasks
on that cpu, not only rq->curr. Further, I think irq stall is per-cpu
rather than per-cgroup.
Abel
On 2022/7/22 11:30, Abel Wu wrote:
> Hi Chengming,
>
> On 7/21/22 12:04 PM, Chengming Zhou Wrote:
>> Now PSI already tracked workload pressure stall information for
>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
>> obvious impact on some workload productivity, such as web service
>> workload.
>>
>> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
>> from update_rq_clock_task(), in which we can record that delta
>> to CPU curr task's cgroups as PSI_IRQ_FULL status.
>
> The {soft,}irq affection should be equal to all the runnable tasks
> on that cpu, not only rq->curr. Further I think irqstall is per-cpu
> rather than per-cgroup.
Although IRQ/SOFTIRQ is per-cpu, it's rq->curr who owns the CPU at that time
and pays for it, while the other groups would be seen as PSI_CPU_FULL anyway.
So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups
as PSI_IRQ_FULL pressure stall. The per-cpu IRQ stall can still be read from psi_system.
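For illustration only, here is a minimal sketch of what such accounting could look like,
modeled on the existing iterate_groups()/record_times() helpers in kernel/sched/psi.c.
The body below is an assumption based on the psi_account_irqtime() call visible in the
0-day report above, not the code actually posted in the patch:

void psi_account_irqtime(struct task_struct *task, u32 delta)
{
	int cpu = task_cpu(task);
	struct psi_group *group;
	struct psi_group_cpu *groupc;
	void *iter = NULL;
	u64 now;

	if (!task->pid)		/* nothing to account against the idle task */
		return;

	now = cpu_clock(cpu);

	/* Charge the irq/softirq delta to every psi_group of rq->curr */
	while ((group = iterate_groups(task, &iter))) {
		groupc = per_cpu_ptr(group->pcpu, cpu);

		write_seqcount_begin(&groupc->seq);

		record_times(groupc, now);
		groupc->times[PSI_IRQ_FULL] += delta;	/* new state proposed by this series */

		write_seqcount_end(&groupc->seq);
	}
}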
Thanks.
>
> Abel
On 7/22/22 2:13 PM, Chengming Zhou Wrote:
> On 2022/7/22 11:30, Abel Wu wrote:
>> Hi Chengming,
>>
>> On 7/21/22 12:04 PM, Chengming Zhou Wrote:
>>> Now PSI already tracked workload pressure stall information for
>>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
>>> obvious impact on some workload productivity, such as web service
>>> workload.
>>>
>>> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
>>> from update_rq_clock_task(), in which we can record that delta
>>> to CPU curr task's cgroups as PSI_IRQ_FULL status.
>>
>> The {soft,}irq affection should be equal to all the runnable tasks
>> on that cpu, not only rq->curr. Further I think irqstall is per-cpu
>> rather than per-cgroup.
>
> Although IRQ/SOFTIRQ is per-cpu, it's the rq->curr who own the CPU at the time
> and pay for it, meanwhile other groups would be thought as PSI_CPU_FULL.
I don't think rq->curr pays for it, if by that you mean consuming quota here.
And it doesn't seem appropriate to let other groups treat it as cpu
stall, because rq->curr is also a victim rather than the one that
causes the stall (so it's different from rq->curr causing a memstall
that is observed as a cpustall by others).
>
> So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups
> as PSI_IRQ_FULL pressure stall. And per-cpu IRQ stall can also get from psi_system.
>
On 2022/7/22 15:14, Abel Wu wrote:
> On 7/22/22 2:13 PM, Chengming Zhou Wrote:
>> On 2022/7/22 11:30, Abel Wu wrote:
>>> Hi Chengming,
>>>
>>> On 7/21/22 12:04 PM, Chengming Zhou Wrote:
>>>> Now PSI already tracked workload pressure stall information for
>>>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
>>>> obvious impact on some workload productivity, such as web service
>>>> workload.
>>>>
>>>> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
>>>> from update_rq_clock_task(), in which we can record that delta
>>>> to CPU curr task's cgroups as PSI_IRQ_FULL status.
>>>
>>> The {soft,}irq affection should be equal to all the runnable tasks
>>> on that cpu, not only rq->curr. Further I think irqstall is per-cpu
>>> rather than per-cgroup.
>>
>> Although IRQ/SOFTIRQ is per-cpu, it's the rq->curr who own the CPU at the time
>> and pay for it, meanwhile other groups would be thought as PSI_CPU_FULL.
>
> I don't think rq->curr pays for it if you mean consuming quota here.
Yes, it makes rq->curr's groups look more productive than they actually are,
which is clearly different from the other groups.
> And it doesn't seem appropriate to let other groups treat it as cpu
> stall because the rq->curr is also the victim rather than the one
> causes stall (so it's different from rq->curr causing memstall and
> observed as cpustall by others).
IMHO, we don't care who causes the stall; instead, we care about what affects
workload productivity.
>
>>
>> So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups
>> as PSI_IRQ_FULL pressure stall. And per-cpu IRQ stall can also get from psi_system.
>>
>
On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote:
> @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> clear |= TSK_MEMSTALL_RUNNING;
> if (prev->in_iowait)
> set |= TSK_IOWAIT;
> +
> + /*
> + * Periodic aggregation shuts off if there is a period of no
> + * task changes, so we wake it back up if necessary. However,
> + * don't do this if the task change is the aggregation worker
> + * itself going to sleep, or we'll ping-pong forever.
> + */
> + if (unlikely((prev->flags & PF_WQ_WORKER) &&
> + wq_worker_last_func(prev) == psi_avgs_work))
> + wake_clock = false;
> }
>
> psi_flags_change(prev, clear, set);
>
> iter = NULL;
> while ((group = iterate_groups(prev, &iter)) && group != common)
> - psi_group_change(group, cpu, clear, set, now, true);
> + psi_group_change(group, cpu, clear, set, now, wake_clock);
>
> /*
> * TSK_ONCPU is handled up to the common ancestor. If we're tasked
Wait, there is another psi_group_change() below this, which handles
the clearing of TSK_RUNNING for common ancestors. We don't want to
wake those either, so it needs s/true/wake_clock/ as well.
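That is, the defer-sleep block further down in psi_task_switch() presumably needs the
same treatment - roughly like this (a sketch of the requested change, not the actual
follow-up patch):

	/*
	 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
	 * with dequeuing too, finish that for the rest of the hierarchy.
	 */
	if (sleep) {
		clear &= ~TSK_ONCPU;
		for (; group; group = iterate_groups(prev, &iter))
			psi_group_change(group, cpu, clear, set, now, wake_clock);
	}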
On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote:
> We don't want to wake periodic aggregation work back up if the
> task change is the aggregation worker itself going to sleep, or
> we'll ping-pong forever.
>
> Previously, we would use psi_task_change() in psi_dequeue() when
> task going to sleep, so this check was put in psi_task_change().
>
> But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> defer task sleep handling to psi_task_switch(), won't go through
> psi_task_change() anymore.
>
> So this patch move this check to psi_task_switch(). Note for defer sleep
> case, we should wake periodic avgs work for common ancestors groups,
> since those groups have next task sched_in.
>
> Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> Signed-off-by: Chengming Zhou <[email protected]>
Good catch!
Acked-by: Johannes Weiner <[email protected]>
On Thu, Jul 21, 2022 at 12:04:33PM +0800, Chengming Zhou wrote:
> This patch move psi_task_change/psi_task_switch declarations out of
> PSI public header, since they are only needed for implementing the
> PSI stats tracking in sched/stats.h
>
> psi_task_switch is obvious, psi_task_change can't be public helper
> since it doesn't check psi_disabled static key. And there is no
> any user now, so put it in sched/stats.h too.
>
> Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Thu, Jul 21, 2022 at 12:04:36PM +0800, Chengming Zhou wrote:
> We won't use cgroup psi_group when !psi_cgroups_enabled, so don't
> bother to alloc percpu memory and init for it.
>
> Also don't need to migrate task PSI stats between cgroups in
> cgroup_move_task().
>
> Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
> PSI accounts stalls for each cgroup separately and aggregates it
> at each level of the hierarchy. This may case non-negligible overhead
> for some workloads when under deep level of the hierarchy.
>
> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
> make PSI to skip per-cgroup stall accounting, only account system-wide
> to avoid this each level overhead.
>
> For our use case, we also want leaf cgroup PSI accounted for userspace
> adjustment on that cgroup, apart from only system-wide management.
I hear the overhead argument. But skipping accounting in intermediate
levels is a bit odd and unprecedented in the cgroup interface. Once we
do this, it's conceivable people would like to do the same thing for
other stats and accounting, like for instance memory.stat.
Tejun, what are your thoughts on this?
On Thu, Jul 21, 2022 at 12:04:35PM +0800, Chengming Zhou wrote:
> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
> make PSI can be configured to skip per-cgroup stall accounting. And
> doesn't expose PSI files in cgroup hierarchy.
>
> This patch do the same thing when psi_disabled.
>
> Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote:
> Now PSI already tracked workload pressure stall information for
> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
> obvious impact on some workload productivity, such as web service
> workload.
>
> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
> from update_rq_clock_task(), in which we can record that delta
> to CPU curr task's cgroups as PSI_IRQ_FULL status.
>
> Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
> the current task on the CPU, make nothing productive could run
> even if it were runnable, so we only use PSI_IRQ_FULL.
That sounds reasonable.
> For performance impact consideration, this is enabled by default when
> CONFIG_IRQ_TIME_ACCOUNTING, but can be disabled by kernel cmdline
> parameter "psi_irq=".
If there isn't a concrete usecase already, let's not add another
commandline parameter for now.
> @@ -63,9 +64,11 @@ enum psi_states {
> PSI_MEM_FULL,
> PSI_CPU_SOME,
> PSI_CPU_FULL,
> + PSI_IRQ_SOME,
> + PSI_IRQ_FULL,
> /* Only per-CPU, to weigh the CPU in the global average: */
> PSI_NONIDLE,
> - NR_PSI_STATES = 7,
> + NR_PSI_STATES = 9,
> };
Unfortunately, this grows the psi state touched by the scheduler into
a second cacheline. :( Please reclaim space first.
I think we can remove the NR_ONCPU task count, which frees up one
u32. Something like the below diff should work (untested!)
And you should be able to remove PSI_IRQ_SOME, since it's not used
anyway. Then we'd be good to go.
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..31dc76e2d8ea 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -15,13 +15,6 @@ enum psi_task_count {
NR_IOWAIT,
NR_MEMSTALL,
NR_RUNNING,
- /*
- * This can't have values other than 0 or 1 and could be
- * implemented as a bit flag. But for now we still have room
- * in the first cacheline of psi_group_cpu, and this way we
- * don't have to special case any state tracking for it.
- */
- NR_ONCPU,
/*
* For IO and CPU stalls the presence of running/oncpu tasks
* in the domain means a partial rather than a full stall.
@@ -39,9 +32,11 @@ enum psi_task_count {
#define TSK_IOWAIT (1 << NR_IOWAIT)
#define TSK_MEMSTALL (1 << NR_MEMSTALL)
#define TSK_RUNNING (1 << NR_RUNNING)
-#define TSK_ONCPU (1 << NR_ONCPU)
#define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING)
+/* Only one task can be scheduled, no corresponding task count */
+#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS)
+
/* Resources that workloads could be stalled on */
enum psi_res {
PSI_IO,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a4fa3aadfcba..232e1dbfad46 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -215,7 +215,7 @@ void __init psi_init(void)
group_init(&psi_system);
}
-static bool test_state(unsigned int *tasks, enum psi_states state)
+static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
{
switch (state) {
case PSI_IO_SOME:
@@ -228,9 +228,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
return unlikely(tasks[NR_MEMSTALL] &&
tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
case PSI_CPU_SOME:
- return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
+ return unlikely(tasks[NR_RUNNING] > oncpu);
case PSI_CPU_FULL:
- return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
+ return unlikely(tasks[NR_RUNNING] && !oncpu);
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -692,9 +692,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
bool wake_clock)
{
struct psi_group_cpu *groupc;
- u32 state_mask = 0;
unsigned int t, m;
enum psi_states s;
+ u32 state_mask;
groupc = per_cpu_ptr(group->pcpu, cpu);
@@ -710,6 +710,26 @@ static void psi_group_change(struct psi_group *group, int cpu,
record_times(groupc, now);
+ /*
+ * Start with TSK_ONCPU, which doesn't have a corresponding
+ * task count - it's just a boolean flag directly encoded in
+ * the state mask. Clear, set, or carry the current state if
+ * no changes are requested.
+ */
+ if (clear & TSK_ONCPU) {
+ state_mask = 0;
+ clear &= ~TSK_ONCPU;
+ } else if (set & TSK_ONCPU) {
+ state_mask = TSK_ONCPU;
+ set &= ~TSK_ONCPU;
+ } else {
+ state_mask = groupc->state_mask & TSK_ONCPU;
+ }
+
+ /*
+ * The rest of the state mask is calculated based on the task
+ * counts. Update those first, then construct the mask.
+ */
for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
continue;
@@ -729,9 +749,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (set & (1 << t))
groupc->tasks[t]++;
- /* Calculate state mask representing active states */
for (s = 0; s < NR_PSI_STATES; s++) {
- if (test_state(groupc->tasks, s))
+ if (test_state(groupc->tasks, s, state_mask & TSK_ONCPU))
state_mask |= (1 << s);
}
@@ -743,7 +762,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
* task in a cgroup is in_memstall, the corresponding groupc
* on that cpu is in PSI_MEM_FULL state.
*/
- if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
+ if (unlikely((state_mask & TSK_ONCPU) && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
groupc->state_mask = state_mask;
@@ -847,7 +866,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
iter = NULL;
while ((group = iterate_groups(next, &iter))) {
if (identical_state &&
- per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+ (per_cpu_ptr(group->pcpu, cpu)->state_mask &
+ TSK_ONCPU)) {
common = group;
break;
}
On 2022/7/25 23:39, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote:
>> @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>> clear |= TSK_MEMSTALL_RUNNING;
>> if (prev->in_iowait)
>> set |= TSK_IOWAIT;
>> +
>> + /*
>> + * Periodic aggregation shuts off if there is a period of no
>> + * task changes, so we wake it back up if necessary. However,
>> + * don't do this if the task change is the aggregation worker
>> + * itself going to sleep, or we'll ping-pong forever.
>> + */
>> + if (unlikely((prev->flags & PF_WQ_WORKER) &&
>> + wq_worker_last_func(prev) == psi_avgs_work))
>> + wake_clock = false;
>> }
>>
>> psi_flags_change(prev, clear, set);
>>
>> iter = NULL;
>> while ((group = iterate_groups(prev, &iter)) && group != common)
>> - psi_group_change(group, cpu, clear, set, now, true);
>> + psi_group_change(group, cpu, clear, set, now, wake_clock);
>>
>> /*
>> * TSK_ONCPU is handled up to the common ancestor. If we're tasked
>
> Wait, there is another psi_group_change() below this, which handles
> the clearing of TSK_RUNNING for common ancestors. We don't want to
> wake those either, so it needs s/true/wake_clock/ as well.
Yes, I was wrong, will fix.
Thanks!
On 2022/7/26 02:26, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote:
>> Now PSI already tracked workload pressure stall information for
>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
>> obvious impact on some workload productivity, such as web service
>> workload.
>>
>> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
>> from update_rq_clock_task(), in which we can record that delta
>> to CPU curr task's cgroups as PSI_IRQ_FULL status.
>>
>> Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
>> the current task on the CPU, make nothing productive could run
>> even if it were runnable, so we only use PSI_IRQ_FULL.
>
> That sounds reasonable.
>
>> For performance impact consideration, this is enabled by default when
>> CONFIG_IRQ_TIME_ACCOUNTING, but can be disabled by kernel cmdline
>> parameter "psi_irq=".
>
> If there isn't a concrete usecase already, let's not add another
> commandline parameter for now.
Ok, will remove it.
>
>> @@ -63,9 +64,11 @@ enum psi_states {
>> PSI_MEM_FULL,
>> PSI_CPU_SOME,
>> PSI_CPU_FULL,
>> + PSI_IRQ_SOME,
>> + PSI_IRQ_FULL,
>> /* Only per-CPU, to weigh the CPU in the global average: */
>> PSI_NONIDLE,
>> - NR_PSI_STATES = 7,
>> + NR_PSI_STATES = 9,
>> };
>
> Unfortunately, this grows the psi state touched by the scheduler into
> a second cacheline. :( Please reclaim space first.
Thank you for pointing this out!
>
> I think we can remove the NR_CPU task count, which frees up one
> u32. Something like the below diff should work (untested!)
>
> And you should be able to remove PSI_IRQ_SOME, since it's not used
> anyway. Then we'd be good to go.
Very good solution for this, I will test it later.
Thanks!
On 2022/7/26 00:52, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
>> PSI accounts stalls for each cgroup separately and aggregates it
>> at each level of the hierarchy. This may case non-negligible overhead
>> for some workloads when under deep level of the hierarchy.
>>
>> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
>> make PSI to skip per-cgroup stall accounting, only account system-wide
>> to avoid this each level overhead.
>>
>> For our use case, we also want leaf cgroup PSI accounted for userspace
>> adjustment on that cgroup, apart from only system-wide management.
>
> I hear the overhead argument. But skipping accounting in intermediate
> levels is a bit odd and unprecedented in the cgroup interface. Once we
> do this, it's conceivable people would like to do the same thing for
> other stats and accounting, like for instance memory.stat.
Right, it's a bit odd... We don't use PSI stats at intermediate levels
in our use case, but we don't know what other use scenarios exist. If they are
useful for other people, this patch can be dropped.
Thanks.
>
> Tejun, what are your thoughts on this?
Hello,
On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
> > PSI accounts stalls for each cgroup separately and aggregates it
> > at each level of the hierarchy. This may case non-negligible overhead
> > for some workloads when under deep level of the hierarchy.
> >
> > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
> > make PSI to skip per-cgroup stall accounting, only account system-wide
> > to avoid this each level overhead.
> >
> > For our use case, we also want leaf cgroup PSI accounted for userspace
> > adjustment on that cgroup, apart from only system-wide management.
>
> I hear the overhead argument. But skipping accounting in intermediate
> levels is a bit odd and unprecedented in the cgroup interface. Once we
> do this, it's conceivable people would like to do the same thing for
> other stats and accounting, like for instance memory.stat.
>
> Tejun, what are your thoughts on this?
Given that PSI requires on-the-spot recursive accumulation unlike other
stats, it can add quite a bit of overhead, so I'm sympathetic to the
argument, because PSI can't be made cheaper by the kernel being better (or at
least we don't know how to yet).
That said, "leaf-only" feels really hacky to me. My memory is hazy but
there's nothing preventing any cgroup from being skipped over when updating
PSI states, right? The state count propagation is recursive but it's each
task's state being propagated upwards not the child cgroup's, so we can skip
over any cgroup arbitrarily. ie. we can at least turn off PSI reporting on
any given cgroup without worrying about affecting others. Am I correct?
Assuming the above isn't wrong, if we can figure out how we can re-enable
it, which is more difficult as the counters need to be resynchronized with
the current state, that'd be ideal. Then, we can just allow each cgroup to
enable / disable PSI reporting dynamically as they see fit.
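In code terms, the "skip over any cgroup arbitrarily" point would roughly amount to
something like the following in the ancestor walk (sketch only; the per-group 'enabled'
flag is hypothetical and not in the current code):

	iter = NULL;
	while ((group = iterate_groups(task, &iter))) {
		/* hypothetical per-cgroup switch: disabled groups are simply skipped */
		if (!READ_ONCE(group->enabled))
			continue;

		psi_group_change(group, cpu, clear, set, now, wake_clock);
	}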
Thanks.
--
tejun
On 2022/7/26 02:26, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote:
>> Now PSI already tracked workload pressure stall information for
>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
>> obvious impact on some workload productivity, such as web service
>> workload.
>>
>> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
>> from update_rq_clock_task(), in which we can record that delta
>> to CPU curr task's cgroups as PSI_IRQ_FULL status.
>>
>> Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
>> the current task on the CPU, make nothing productive could run
>> even if it were runnable, so we only use PSI_IRQ_FULL.
>
> That sounds reasonable.
>
>> For performance impact consideration, this is enabled by default when
>> CONFIG_IRQ_TIME_ACCOUNTING, but can be disabled by kernel cmdline
>> parameter "psi_irq=".
>
> If there isn't a concrete usecase already, let's not add another
> commandline parameter for now.
>
>> @@ -63,9 +64,11 @@ enum psi_states {
>> PSI_MEM_FULL,
>> PSI_CPU_SOME,
>> PSI_CPU_FULL,
>> + PSI_IRQ_SOME,
>> + PSI_IRQ_FULL,
>> /* Only per-CPU, to weigh the CPU in the global average: */
>> PSI_NONIDLE,
>> - NR_PSI_STATES = 7,
>> + NR_PSI_STATES = 9,
>> };
>
> Unfortunately, this grows the psi state touched by the scheduler into
> a second cacheline. :( Please reclaim space first.
>
> I think we can remove the NR_CPU task count, which frees up one
> u32. Something like the below diff should work (untested!)
Hi, I tested it and it works fine. Would you mind if I put this patch in this series?
Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting
We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance.
Since we want to add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
we need to reclaim space first. This patch removes the NR_ONCPU task accounting
in struct psi_group_cpu and uses TSK_ONCPU in state_mask to track it instead.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Tested-by: Chengming Zhou <[email protected]>
On Wed, Jul 27, 2022 at 07:28:37PM +0800, Chengming Zhou wrote:
> On 2022/7/26 02:26, Johannes Weiner wrote:
> > I think we can remove the NR_CPU task count, which frees up one
> > u32. Something like the below diff should work (untested!)
>
> Hi, I tested ok, would you mind if I put this patch in this series?
>
> Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting
>
> We put all fields updated by the scheduler in the first cacheline of
> struct psi_group_cpu for performance.
>
> Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
> we need to reclaim space first. This patch remove NR_ONCPU task accounting
> in struct psi_group_cpu, use TSK_ONCPU in state_mask to track instead.
Thanks for testing it, that sounds good.
> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Chengming Zhou <[email protected]>
> Tested-by: Chengming Zhou <[email protected]>
Since you're handling the patch, you need to add your own
Signed-off-by: as well. And keep From: Johannes (git commit --author).
Thanks!
On 2022/7/27 21:00, Johannes Weiner wrote:
> On Wed, Jul 27, 2022 at 07:28:37PM +0800, Chengming Zhou wrote:
>> On 2022/7/26 02:26, Johannes Weiner wrote:
>>> I think we can remove the NR_CPU task count, which frees up one
>>> u32. Something like the below diff should work (untested!)
>>
>> Hi, I tested ok, would you mind if I put this patch in this series?
>>
>> Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting
>>
>> We put all fields updated by the scheduler in the first cacheline of
>> struct psi_group_cpu for performance.
>>
>> Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
>> we need to reclaim space first. This patch remove NR_ONCPU task accounting
>> in struct psi_group_cpu, use TSK_ONCPU in state_mask to track instead.
>
> Thanks for testing it, that sounds good.
>
>> Signed-off-by: Johannes Weiner <[email protected]>
>> Reviewed-by: Chengming Zhou <[email protected]>
>> Tested-by: Chengming Zhou <[email protected]>
>
> Since you're handling the patch, you need to add your own
> Signed-off-by: as well. And keep From: Johannes (git commit --author).
Got it. Thanks!
On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote:
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index c124f7d186d0..195f123b1cd1 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -47,7 +47,8 @@ enum psi_res {
> PSI_IO,
> PSI_MEM,
> PSI_CPU,
> - NR_PSI_RESOURCES = 3,
> + PSI_IRQ,
> + NR_PSI_RESOURCES = 4,
> };
>
> /*
> @@ -63,9 +64,11 @@ enum psi_states {
> PSI_MEM_FULL,
> PSI_CPU_SOME,
> PSI_CPU_FULL,
> + PSI_IRQ_SOME,
> + PSI_IRQ_FULL,
> /* Only per-CPU, to weigh the CPU in the global average: */
> PSI_NONIDLE,
> - NR_PSI_STATES = 7,
> + NR_PSI_STATES = 9,
> };
>
> enum psi_aggregators {
$ pahole -EC psi_group_cpu defconfig-build/kernel/sched/build_utility.o
struct psi_group_cpu {
/* typedef seqcount_t */ struct seqcount {
unsigned int sequence; /* 0 4 */
} seq __attribute__((__aligned__(64))); /* 0 4 */
unsigned int tasks[5]; /* 4 20 */
/* typedef u32 -> __u32 */ unsigned int state_mask; /* 24 4 */
/* typedef u32 -> __u32 */ unsigned int times[7]; /* 28 28 */
/* typedef u64 -> __u64 */ long long unsigned int state_start; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
/* typedef u32 -> __u32 */ unsigned int times_prev[2][7] __attribute__((__aligned__(64))); /* 64 56 */
/* size: 128, cachelines: 2, members: 6 */
/* padding: 8 */
/* forced alignments: 2 */
} __attribute__((__aligned__(64)));
$ pahole -EC psi_group_cpu defconfig-build/kernel/sched/build_utility.o
struct psi_group_cpu {
/* typedef seqcount_t */ struct seqcount {
unsigned int sequence; /* 0 4 */
} seq __attribute__((__aligned__(64))); /* 0 4 */
unsigned int tasks[5]; /* 4 20 */
/* typedef u32 -> __u32 */ unsigned int state_mask; /* 24 4 */
/* typedef u32 -> __u32 */ unsigned int times[9]; /* 28 36 */
/* --- cacheline 1 boundary (64 bytes) --- */
/* typedef u64 -> __u64 */ long long unsigned int state_start; /* 64 8 */
/* XXX 56 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
/* typedef u32 -> __u32 */ unsigned int times_prev[2][9] __attribute__((__aligned__(64))); /* 128 72 */
/* size: 256, cachelines: 4, members: 6 */
/* sum members: 144, holes: 1, sum holes: 56 */
/* padding: 56 */
/* forced alignments: 2, forced holes: 1, sum forced holes: 56 */
} __attribute__((__aligned__(64)));
So yeah, I think not.
On 2022/7/27 01:54, Tejun Heo wrote:
> Hello,
>
> On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote:
>> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
>>> PSI accounts stalls for each cgroup separately and aggregates it
>>> at each level of the hierarchy. This may case non-negligible overhead
>>> for some workloads when under deep level of the hierarchy.
>>>
>>> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
>>> make PSI to skip per-cgroup stall accounting, only account system-wide
>>> to avoid this each level overhead.
>>>
>>> For our use case, we also want leaf cgroup PSI accounted for userspace
>>> adjustment on that cgroup, apart from only system-wide management.
>>
>> I hear the overhead argument. But skipping accounting in intermediate
>> levels is a bit odd and unprecedented in the cgroup interface. Once we
>> do this, it's conceivable people would like to do the same thing for
>> other stats and accounting, like for instance memory.stat.
>>
>> Tejun, what are your thoughts on this?
>
> Given that PSI requires on-the-spot recursive accumulation unlike other
> stats, it can add quite a bit of overhead, so I'm sympathetic to the
> argument because PSI can't be made cheaper by kernel being better (or at
> least we don't know how to yet).
>
> That said, "leaf-only" feels really hacky to me. My memory is hazy but
> there's nothing preventing any cgroup from being skipped over when updating
> PSI states, right? The state count propagation is recursive but it's each
> task's state being propagated upwards not the child cgroup's, so we can skip
> over any cgroup arbitrarily. ie. we can at least turn off PSI reporting on
> any given cgroup without worrying about affecting others. Am I correct?
Yes, I think it's correct.
>
> Assuming the above isn't wrong, if we can figure out how we can re-enable
> it, which is more difficult as the counters need to be resynchronized with
> the current state, that'd be ideal. Then, we can just allow each cgroup to
> enable / disable PSI reporting dynamically as they see fit.
This method is more fine-grained but more difficult, like you said above.
I think just disabling PSI stats in intermediate cgroups may meet most needs?
Thanks!
>
> Thanks.
>
Hello,
On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
> > Assuming the above isn't wrong, if we can figure out how we can re-enable
> > it, which is more difficult as the counters need to be resynchronized with
> > the current state, that'd be ideal. Then, we can just allow each cgroup to
> > enable / disable PSI reporting dynamically as they see fit.
>
> This method is more fine-grained but more difficult like you said above.
> I think it may meet most needs to disable PSI stats in intermediate cgroups?
So, I'm not necessarily against implementing something easier but we at
least wanna get the interface right, so that if we decide to do the full
thing later we can easily expand on the existing interface. ie. let's please
not be too hacky. I don't think it'd be that difficult to implement a
per-cgroup disable-only operation that we can later expand to allow
re-enabling, right?
Thanks.
--
tejun
On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
> > > Assuming the above isn't wrong, if we can figure out how we can re-enable
> > > it, which is more difficult as the counters need to be resynchronized with
> > > the current state, that'd be ideal. Then, we can just allow each cgroup to
> > > enable / disable PSI reporting dynamically as they see fit.
> >
> > This method is more fine-grained but more difficult like you said above.
> > I think it may meet most needs to disable PSI stats in intermediate cgroups?
>
> So, I'm not necessarily against implementing something easier but we at
> least wanna get the interface right, so that if we decide to do the full
> thing later we can easily expand on the existing interface. ie. let's please
> not be too hacky. I don't think it'd be that difficult to implement
> per-cgroup disable-only operation that we can later expand to allow
> re-enabling, right?
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level, if
we can live with losing history from while it was disabled. I.e. the
avgs will restart from 0, total= will have gaps - should be okay, IMO.
Where it gets trickier is also stopping the tracking of task counts in
a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
and cgroup state and find all tasks of interest across all CPUs for
the given cgroup to recreate the counts. I'm not quite sure whether
that's feasible, and if so, whether it's worth the savings.
It might be good to benchmark the two disabling steps independently.
Maybe stopping aggregation while keeping task counts is good enough,
and we can commit to a disable/re-enable interface from the start.
Or maybe it's all in the cachelines and iteration, and stopping the
aggregation while still writing task counts isn't saving much. In that
case we'd have to look closer at reconstructing task counts, to see if
later re-enabling is actually a practical option or whether a one-off
kill switch is more realistic.
Chengming, can you experiment with disabling: record_times(), the
test_state() loop and state_mask construction, and the averaging
worker - while keeping the groupc->tasks updates?
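Something along these lines inside psi_group_change(), as a throwaway benchmarking hack
(the psi_aggregation_off switch is made up for the experiment; the groupc->tasks[]
clear/set loops are elided below and stay exactly as they are today):

	groupc = per_cpu_ptr(group->pcpu, cpu);

	write_seqcount_begin(&groupc->seq);

	if (!psi_aggregation_off)		/* hypothetical kill switch */
		record_times(groupc, now);

	/* ... groupc->tasks[] clear/set loops unchanged ... */

	if (!psi_aggregation_off) {
		for (s = 0; s < NR_PSI_STATES; s++)
			if (test_state(groupc->tasks, s))
				state_mask |= (1 << s);

		groupc->state_mask = state_mask;
	}

	write_seqcount_end(&groupc->seq);

	/* and don't kick the averaging worker either */
	if (!psi_aggregation_off && wake_clock && !delayed_work_pending(&group->avgs_work))
		schedule_delayed_work(&group->avgs_work, PSI_FREQ);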
Hello,
On Wed, Aug 03, 2022 at 03:22:16PM -0400, Johannes Weiner wrote:
> Where it gets trickier is also stopping the tracking of task counts in
> a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
> and cgroup state and find all tasks of interest across all CPUs for
> the given cgroup to recreate the counts. I'm not quite sure whether
> that's feasible, and if so, whether it's worth the savings.
If this turns out to be necessary, I wonder whether we can just be
opportunistic. ie. don't bother with walking all the tasks but only remember
whether a task is accounted at a given level or not (this can be a bitmap
which is allocated at cgroup attach time and in most cases will be pretty
small). Then, maybe we can just start accounting them as they cycle through
state transitions - we ignore the ones leaving states that they weren't
accounted for and start remembering the new states they enter.
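A very rough sketch of that idea, with all names (the per-task psi_acct bitmap, the
per-group enabled flag) made up for illustration, and details such as TSK_ONCPU handling
and resynchronizing counts that already exist glossed over:

	/* hypothetical, in struct task_struct: unsigned long psi_acct; */

	level = 0;
	iter = NULL;
	while ((group = iterate_groups(task, &iter))) {
		unsigned int lclear = clear;

		/*
		 * Never accounted at this level (e.g. PSI was off when the
		 * task entered its current states): don't decrement what was
		 * never added, only start counting the states it enters now.
		 */
		if (!(task->psi_acct & BIT(level)))
			lclear = 0;

		psi_group_change(group, cpu, lclear, set, now, wake_clock);

		if (set)
			task->psi_acct |= BIT(level);
		level++;
	}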
Thanks.
--
tejun
On 2022/8/4 01:58, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
>>> Assuming the above isn't wrong, if we can figure out how we can re-enable
>>> it, which is more difficult as the counters need to be resynchronized with
>>> the current state, that'd be ideal. Then, we can just allow each cgroup to
>>> enable / disable PSI reporting dynamically as they see fit.
>>
>> This method is more fine-grained but more difficult like you said above.
>> I think it may meet most needs to disable PSI stats in intermediate cgroups?
>
> So, I'm not necessarily against implementing something easier but we at
> least wanna get the interface right, so that if we decide to do the full
> thing later we can easily expand on the existing interface. ie. let's please
> not be too hacky. I don't think it'd be that difficult to implement
> per-cgroup disable-only operation that we can later expand to allow
> re-enabling, right?
Agreed, the interface is important; a per-cgroup disable-only operation may be easier
to implement. I will look into this more.
Thanks!
On 2022/8/4 03:22, Johannes Weiner wrote:
> On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
>>>> Assuming the above isn't wrong, if we can figure out how we can re-enable
>>>> it, which is more difficult as the counters need to be resynchronized with
>>>> the current state, that'd be ideal. Then, we can just allow each cgroup to
>>>> enable / disable PSI reporting dynamically as they see fit.
>>>
>>> This method is more fine-grained but more difficult like you said above.
>>> I think it may meet most needs to disable PSI stats in intermediate cgroups?
>>
>> So, I'm not necessarily against implementing something easier but we at
>> least wanna get the interface right, so that if we decide to do the full
>> thing later we can easily expand on the existing interface. ie. let's please
>> not be too hacky. I don't think it'd be that difficult to implement
>> per-cgroup disable-only operation that we can later expand to allow
>> re-enabling, right?
>
> It should be relatively straight-forward to disable and re-enable
> state aggregation, time tracking, averaging on a per-cgroup level, if
> we can live with losing history from while it was disabled. I.e. the
> avgs will restart from 0, total= will have gaps - should be okay, IMO.
>
> Where it gets trickier is also stopping the tracking of task counts in
> a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
> and cgroup state and find all tasks of interest across all CPUs for
> the given cgroup to recreate the counts. I'm not quite sure whether
> that's feasible, and if so, whether it's worth the savings.
>
> It might be good to benchmark the two disabling steps independently.
> Maybe stopping aggregation while keeping task counts is good enough,
> and we can commit to a disable/re-enable interface from the start.
>
> Or maybe it's all in the cachelines and iteration, and stopping the
> aggregation while still writing task counts isn't saving much. In that
> case we'd have to look closer at reconstructing task counts, to see if
> later re-enabling is actually a practical option or whether a one-off
> kill switch is more realistic.
>
> Chengming, can you experiment with disabling: record_times(), the
> test_state() loop and state_mask construction, and the averaging
> worker - while keeping the groupc->tasks updates?
Hello,
I did this experiment today, disabling record_times(), the test_state()
loop and the averaging worker while keeping only the groupc->tasks[] updates,
and the results look promising.
mmtests/config-scheduler-perfpipe on Intel Xeon Platinum with 3 levels of cgroup:
perfpipe
tip tip patched
psi=off psi=on only groupc->tasks[]
Min Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%)
1st-qrtle Time 8.11 ( 0.00%) 8.94 ( -10.22%) 8.39 ( -3.46%)
2nd-qrtle Time 8.17 ( 0.00%) 9.02 ( -10.42%) 8.44 ( -3.37%)
3rd-qrtle Time 8.20 ( 0.00%) 9.08 ( -10.72%) 8.48 ( -3.43%)
Max-1 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%)
Max-5 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%)
Max-10 Time 8.09 ( 0.00%) 8.89 ( -9.96%) 8.35 ( -3.22%)
Max-90 Time 8.31 ( 0.00%) 9.13 ( -9.90%) 8.55 ( -2.95%)
Max-95 Time 8.32 ( 0.00%) 9.14 ( -9.88%) 8.55 ( -2.81%)
Max-99 Time 8.39 ( 0.00%) 9.26 ( -10.30%) 8.57 ( -2.09%)
Max Time 8.56 ( 0.00%) 9.26 ( -8.23%) 8.72 ( -1.90%)
Amean Time 8.19 ( 0.00%) 9.03 * -10.26%* 8.45 * -3.27%*
Tejun suggested using a bitmap in the task to remember whether the task is accounted
at a given level or not, which I think is also a very good idea, but I haven't
clearly figured out how to do it yet.
The above performance test result looks good to me, so I think we can implement this
per-cgroup "cgroup.psi" interface to disable/re-enable PSI stats from the start,
and we can change to a better implementation if needed later?
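As a usage illustration only - the "cgroup.psi" file name is just the one floated in
this thread, and the final interface may well differ - toggling PSI accounting off for
an intermediate cgroup from userspace could then look like:

#include <stdio.h>

int main(void)
{
	/* path and file name are assumptions based on the proposal above */
	const char *path = "/sys/fs/cgroup/mid/cgroup.psi";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* "0": stop PSI aggregation for this cgroup, "1": re-enable it */
	if (fputs("0\n", f) == EOF)
		perror("fputs");

	fclose(f);
	return 0;
}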
Thanks!
On Thu, Aug 04, 2022 at 09:51:31PM +0800, Chengming Zhou wrote:
> On 2022/8/4 03:22, Johannes Weiner wrote:
> > On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
> >>>> Assuming the above isn't wrong, if we can figure out how we can re-enable
> >>>> it, which is more difficult as the counters need to be resynchronized with
> >>>> the current state, that'd be ideal. Then, we can just allow each cgroup to
> >>>> enable / disable PSI reporting dynamically as they see fit.
> >>>
> >>> This method is more fine-grained but more difficult like you said above.
> >>> I think it may meet most needs to disable PSI stats in intermediate cgroups?
> >>
> >> So, I'm not necessarily against implementing something easier but we at
> >> least wanna get the interface right, so that if we decide to do the full
> >> thing later we can easily expand on the existing interface. ie. let's please
> >> not be too hacky. I don't think it'd be that difficult to implement
> >> per-cgroup disable-only operation that we can later expand to allow
> >> re-enabling, right?
> >
> > It should be relatively straight-forward to disable and re-enable
> > state aggregation, time tracking, averaging on a per-cgroup level, if
> > we can live with losing history from while it was disabled. I.e. the
> > avgs will restart from 0, total= will have gaps - should be okay, IMO.
> >
> > Where it gets trickier is also stopping the tracking of task counts in
> > a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
> > and cgroup state and find all tasks of interest across all CPUs for
> > the given cgroup to recreate the counts. I'm not quite sure whether
> > that's feasible, and if so, whether it's worth the savings.
> >
> > It might be good to benchmark the two disabling steps independently.
> > Maybe stopping aggregation while keeping task counts is good enough,
> > and we can commit to a disable/re-enable interface from the start.
> >
> > Or maybe it's all in the cachelines and iteration, and stopping the
> > aggregation while still writing task counts isn't saving much. In that
> > case we'd have to look closer at reconstructing task counts, to see if
> > later re-enabling is actually a practical option or whether a one-off
> > kill switch is more realistic.
> >
> > Chengming, can you experiment with disabling: record_times(), the
> > test_state() loop and state_mask construction, and the averaging
> > worker - while keeping the groupc->tasks updates?
>
> Hello,
>
> I did this experiment today with disabling record_times(), test_state()
> loop and averaging worker, while only keeping groupc->tasks[] updates,
> the results look promising.
>
> mmtests/config-scheduler-perfpipe on Intel Xeon Platinum with 3 levels of cgroup:
>
> perfpipe
> tip tip patched
> psi=off psi=on only groupc->tasks[]
> Min Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%)
> 1st-qrtle Time 8.11 ( 0.00%) 8.94 ( -10.22%) 8.39 ( -3.46%)
> 2nd-qrtle Time 8.17 ( 0.00%) 9.02 ( -10.42%) 8.44 ( -3.37%)
> 3rd-qrtle Time 8.20 ( 0.00%) 9.08 ( -10.72%) 8.48 ( -3.43%)
> Max-1 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%)
> Max-5 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%)
> Max-10 Time 8.09 ( 0.00%) 8.89 ( -9.96%) 8.35 ( -3.22%)
> Max-90 Time 8.31 ( 0.00%) 9.13 ( -9.90%) 8.55 ( -2.95%)
> Max-95 Time 8.32 ( 0.00%) 9.14 ( -9.88%) 8.55 ( -2.81%)
> Max-99 Time 8.39 ( 0.00%) 9.26 ( -10.30%) 8.57 ( -2.09%)
> Max Time 8.56 ( 0.00%) 9.26 ( -8.23%) 8.72 ( -1.90%)
> Amean Time 8.19 ( 0.00%) 9.03 * -10.26%* 8.45 * -3.27%*
Fantastic!
> Tejun suggested using a bitmap in task to remember whether the task is accounted
> at a given level or not, which I think also is a very good idea, but I haven't
> clearly figure out how to do it.
>
> The above performance test result looks good to me, so I think we can implement this
> per-cgroup "cgroup.psi" interface to disable/re-enable PSI stats from the start,
> and we can change to a better implementation if needed later?
Yes, that sounds good to me.