2021-09-09 14:06:22

by Zhaoyang Huang

Subject: [RFC PATCH] psi: calc psi memstall time more precisely

From: Zhaoyang Huang <[email protected]>

So far, psi's memstall time is simply counted as exit - entry, which
ignores the task's off-CPU time during the stall. Fix it by estimating
the proportion of off-CPU time from the task's and the rq's utilization
and runqueue load.
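
As a rough, hypothetical illustration of the scaling below: if the
stalled task's se.avg.util_avg is 100 while the rq's CFS plus RT
util_avg sums to 400, a 10ms stall window is scaled to roughly
(100 + 1) / (400 + 1) * 10ms, i.e. about 2.5ms - the share of the
window the task is estimated to have actually spent on the CPU.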

Signed-off-by: Zhaoyang Huang <[email protected]>
---
kernel/sched/psi.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3c..6812c7e 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -182,6 +182,8 @@ struct psi_group psi_system = {

static void psi_avgs_work(struct work_struct *work);

+static unsigned long psi_memtime_fixup(u32 growth);
+
static void group_init(struct psi_group *group)
{
int cpu;
@@ -492,6 +494,23 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
return growth;
}

+static unsigned long psi_memtime_fixup(u32 growth)
+{
+ struct rq *rq = task_rq(current);
+ unsigned long growth_fixed = (unsigned long)growth;
+
+ if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
+ return growth_fixed;
+
+ if ((current->in_memstall)
+ && (current->se.avg.util_avg <= rq->cfs.avg.util_avg)
+ && current->se.avg.util_avg != 0)
+ growth_fixed = div64_ul((current->se.avg.util_avg + 1) * growth,
+ rq->cfs.avg.util_avg + rq->avg_rt.util_avg + 1);
+
+ return growth_fixed;
+}
+
static void init_triggers(struct psi_group *group, u64 now)
{
struct psi_trigger *t;
@@ -658,6 +677,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
}

if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
+ delta = psi_memtime_fixup(delta);
groupc->times[PSI_MEM_SOME] += delta;
if (groupc->state_mask & (1 << PSI_MEM_FULL))
groupc->times[PSI_MEM_FULL] += delta;
@@ -928,8 +948,8 @@ void psi_memstall_leave(unsigned long *flags)
*/
rq = this_rq_lock_irq(&rf);

- current->in_memstall = 0;
psi_task_change(current, TSK_MEMSTALL, 0);
+ current->in_memstall = 0;

rq_unlock_irq(rq, &rf);
}
--
1.9.1


2021-09-09 15:55:58

by Johannes Weiner

Subject: Re: [RFC PATCH] psi: calc psi memstall time more precisely

On Thu, Sep 09, 2021 at 08:00:24PM +0800, Huangzhaoyang wrote:
> From: Zhaoyang Huang <[email protected]>
>
> So far, psi's memstall time is simply counted as exit - entry, which
> ignores the task's off-CPU time during the stall. Fix it by estimating
> the proportion of off-CPU time from the task's and the rq's utilization
> and runqueue load.
>
> Signed-off-by: Zhaoyang Huang <[email protected]>

Can you please explain what practical problem you are trying to solve?

If a reclaimer gets preempted and has to wait for CPU, should that
stall be attributed to a lack of memory? Some of it should, since page
reclaim consumed CPU budget that would've otherwise been available for
doing real work. The application of course may still have experienced
a CPU wait outside of reclaim, but potentially a shorter one. Memory
pressure can definitely increase CPU pressure (as it can IO pressure).

Proportional and transitive accounting - how much of total CPU load is
page reclaim, and thus how much of each runq wait is due to memory
pressure - would give more precise answers. But generally discounting
off-CPU time in a stall is not any more correct than including it all.
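
(As a hypothetical illustration of that proportional accounting, not
something the patch above does: if page reclaim were responsible for,
say, 20% of the CPU load on a runqueue, then 20% of each task's runq
wait on that CPU would be attributed to memory pressure, and that share
would in turn feed into the memory stall totals.)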

This is doable, but I think there needs to be better justification for
providing this level of precision, since it comes with code complexity
that has performance and maintenance overhead.

2021-09-14 01:32:52

by Zhaoyang Huang

Subject: Re: [RFC PATCH] psi: calc psi memstall time more precisely

On Thu, Sep 9, 2021 at 11:54 PM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 08:00:24PM +0800, Huangzhaoyang wrote:
> > From: Zhaoyang Huang <[email protected]>
> >
> > So far, psi's memstall time is simply counted as exit - entry, which
> > ignores the task's off-CPU time during the stall. Fix it by estimating
> > the proportion of off-CPU time from the task's and the rq's utilization
> > and runqueue load.
> >
> > Signed-off-by: Zhaoyang Huang <[email protected]>
>
> Can you please explain what practical problem you are trying to solve?
>
> If a reclaimer gets preempted and has to wait for CPU, should that
> stall be attributed to a lack of memory? Some of it should, since page
> reclaim consumed CPU budget that would've otherwise been available for
> doing real work. The application of course may still have experienced
> a CPU wait outside of reclaim, but potentially a shorter one. Memory
> pressure can definitely increase CPU pressure (as it can IO pressure).
The preempted time mentioned here can be separated into two categories.
The first is a CFS task being preempted because it has used up its
share of the scheduling latency period; the second is being preempted
by RT, DL and IRQs. IMO, the former is reasonable to count as stall
time, while the latter is NOT.
>
> Proportional and transitive accounting - how much of total CPU load is
> page reclaim, and thus how much of each runq wait is due to memory
> pressure - would give more precise answers. But generally discounting
> off-CPU time in a stall is not any more correct than including it all.
>
> This is doable, but I think there needs to be better justification for
> providing this level of precision, since it comes with code complexity
> that has performance and maintenance overhead.
The rq's load-tracking (PELT) utilization provides an easy way to
compute such a proportion. A new commit has been sent out which mainly
deals with the second scenario described above, and statistics on the
resulting precision are provided along with it.
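
To make the second scenario concrete, here is a minimal sketch of what
such a fixup could look like, scaling the stall window only by the
share of CPU capacity consumed by RT, DL and IRQ. It is an illustration
built on the existing rq-level PELT fields, not the new commit itself:

static u64 memstall_scale_out_rt(u64 delta, struct rq *rq)
{
	/* utilization consumed by RT, DL and (if tracked) IRQ on this CPU */
	unsigned long other = rq->avg_rt.util_avg + rq->avg_dl.util_avg;
	unsigned long max = arch_scale_cpu_capacity(cpu_of(rq));

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
	other += rq->avg_irq.util_avg;
#endif
	if (other >= max)
		return 0;

	/* keep only the fraction of the window not stolen by RT/DL/IRQ */
	return div64_ul(delta * (max - other), max);
}

Because rq->avg_rt, rq->avg_dl and rq->avg_irq are already maintained
by the scheduler, a fixup along these lines would need no extra
bookkeeping in record_times(); the cost is one multiplication and one
division per memstall update.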