Hi
Since psi_avgs_work()->collect_percpu_times()->get_recent_times()
runs from a kworker thread, the PSI_NONIDLE condition is always observed,
as there is a RUNNING task. So we always end up re-arming the work.

If the work is re-armed from psi_avgs_work() itself, the backing-off
logic in psi_task_change() (which will be moved to psi_task_switch() soon)
can't help: the work is already scheduled, so we don't do anything there.

I am probably missing something here. Can you please clarify how we
shut off re-arming of the psi avgs work?
Thanks,
Pavan
On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> Hi
>
> The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> run from a kworker thread, PSI_NONIDLE condition would be observed as
> there is a RUNNING task. So we would always end up re-arming the work.
>
> If the work is re-armed from the psi_avgs_work() it self, the backing off
> logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> help. The work is already scheduled. so we don't do anything there.
>
> Probably I am missing some thing here. Can you please clarify how we
> shut off re-arming the psi avg work?
>
I have collected traces on an idle system (running android12-5.10 with a
minimal user space). This is an older kernel; however, the issue remains on
the latest kernel as per code inspection.

I have eliminated the noise created by other work items. For example,
vmstat_work is deferrable work, but it gets executed because it is queued on
the same CPU on which the PSI work timer is queued. So I have increased
sysctl_stat_interval to 60 * HZ to suppress this work.
As we can see from the traces, CPU#7 comes out of idle only to execute the
PSI work every 2 seconds. The work is always re-armed from psi_avgs_work()
as it finds the PSI_NONIDLE condition. The non-idle time is essentially

non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)

The first term accounts for the non-idle time from when the task is woken up
(queued) to the execution of the work item. It is around ~4 usec
(54.119420 - 54.119416).

The second term accounts for the previous update: ~2 usec
(52.135424 - 52.135422).
The PSI work needs to run only when there has been some activity after the
last update, i.e. the last time the work ran. Since we use a non-deferrable
timer, the other deferrable timers get woken up along with it; they might
queue work or wake up other threads, creating activity which in turn causes
the PSI work to be scheduled again.

The PSI work can't simply be made deferrable, because it is system-level
work: if the CPU on which it is queued is idle for a long duration while the
other CPUs are active, we miss PSI updates. What we probably need is a
global deferrable timer [1], i.e. a timer that is not bound to any CPU but
fires when any CPU comes out of idle. As long as one CPU is busy, we keep
running the PSI work; if the whole system is idle, we never wake up.
<idle>-0 [007] 52.135402: cpu_idle: state=4294967295 cpu_id=7
<idle>-0 [007] 52.135415: workqueue_activate_work: work struct 0xffffffc011bd5010
<idle>-0 [007] 52.135417: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
<idle>-0 [007] 52.135421: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
kworker/7:3-196 [007] 52.135421: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
kworker/7:3-196 [007] 52.135422: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294905814 [timeout=494] cpu=7 idx=123 flags=D|P|I
kworker/7:3-196 [007] 52.135422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
kworker/7:3-196 [007] 52.135424: sched_switch: prev_comm=kworker/7:3 prev_pid=196 prev_prio=120 prev_state=I ==> next_comm=swapper/7 next_pid=0 next_prio=120
<idle>-0 [007] 52.135428: cpu_idle: state=0 cpu_id=7
<system is idle and gets woken up after 2 seconds due to PSI work>
<idle>-0 [007] 54.119402: cpu_idle: state=4294967295 cpu_id=7
<idle>-0 [007] 54.119414: workqueue_activate_work: work struct 0xffffffc011bd5010
<idle>-0 [007] 54.119416: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
<idle>-0 [007] 54.119420: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
kworker/7:3-196 [007] 54.119420: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
kworker/7:3-196 [007] 54.119421: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294906315 [timeout=499] cpu=7 idx=122 flags=D|P|I
kworker/7:3-196 [007] 54.119422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
[1]
https://lore.kernel.org/lkml/[email protected]/
Thanks,
Pavan
On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
<[email protected]> wrote:
>
> On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> > Hi
> >
> > The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> > run from a kworker thread, PSI_NONIDLE condition would be observed as
> > there is a RUNNING task. So we would always end up re-arming the work.
> >
> > If the work is re-armed from the psi_avgs_work() it self, the backing off
> > logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> > help. The work is already scheduled. so we don't do anything there.
Hi Pavan,
Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
issue. At the time it was written I tested it and it seemed to work.
Maybe I missed something or some other change introduced afterwards
affected the shutoff logic. I'll take a closer look next week when I'm
back at my computer and will consult with Johannes.
Thanks,
Suren.
[1] 1b69ac6b40eb "psi: fix aggregation idle shut-off"
> >
> > Probably I am missing some thing here. Can you please clarify how we
> > shut off re-arming the psi avg work?
> >
>
> I have collected traces on an idle system (running android12-5.10 with minimal
> user space). This is a older kernel, however the issue remain on latest kernel
> as per code inspection.
>
> I have eliminated noise created by other work items. For example, vmstat_work.
> This is a deferrable work but gets executed since this is queued on the same
> CPU on which PSI work timer is queued. So I have increased
> sysctl_stat_interval to 60 * HZ to supress this work.
>
> As we can see from the traces, CPU#7 comes out of idle only to execute PSI
> work for every 2 seconds. The work is always re-armed from the psi_avgs_work()
> as it finds PSI_NONIDLE condition. The non-idle time is essentially
>
> non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)
>
> The first term accounts the non-idle time since the task woken up (queued) to
> the execution of the work item. It is around ~4 usec (54.119420 - 54.119416)
>
> The second term account for the previous update. ~2 usec (52.135424 -
> 52.135422).
>
> PSI work needs to be run when there is some activity after the last update is done
> i.e last time the work is run. Since we use non-deferrable timer, the other
> deferrable timers gets woken up and they might queue work or wakeup other threads
> and creates activity which inturn makes PSI work to be scheduled.
>
> PSI work can't just be made deferrable work. Because, it is a system level
> work and if the CPU on which it is queued is idle for longer duration but the
> other CPUs are active, we miss PSI updates. What we probably need is a global
> deferrable timers [1] i.e this timer should not be bound to any CPU but
> run when any of the CPU comes out of idle. As long as one CPU is busy, we keep
> running the PSI but if the whole system is idle, we never wakeup.
>
> <idle>-0 [007] 52.135402: cpu_idle: state=4294967295 cpu_id=7
> <idle>-0 [007] 52.135415: workqueue_activate_work: work struct 0xffffffc011bd5010
> <idle>-0 [007] 52.135417: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
> <idle>-0 [007] 52.135421: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
> kworker/7:3-196 [007] 52.135421: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
> kworker/7:3-196 [007] 52.135422: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294905814 [timeout=494] cpu=7 idx=123 flags=D|P|I
> kworker/7:3-196 [007] 52.135422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
> kworker/7:3-196 [007] 52.135424: sched_switch: prev_comm=kworker/7:3 prev_pid=196 prev_prio=120 prev_state=I ==> next_comm=swapper/7 next_pid=0 next_prio=120
> <idle>-0 [007] 52.135428: cpu_idle: state=0 cpu_id=7
>
> <system is idle and gets woken up after 2 seconds due to PSI work>
>
> <idle>-0 [007] 54.119402: cpu_idle: state=4294967295 cpu_id=7
> <idle>-0 [007] 54.119414: workqueue_activate_work: work struct 0xffffffc011bd5010
> <idle>-0 [007] 54.119416: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
> <idle>-0 [007] 54.119420: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
> kworker/7:3-196 [007] 54.119420: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
> kworker/7:3-196 [007] 54.119421: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294906315 [timeout=499] cpu=7 idx=122 flags=D|P|I
> kworker/7:3-196 [007] 54.119422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/
>
> Thanks,
> Pavan
On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
> <[email protected]> wrote:
> >
> > On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> > > Hi
> > >
> > > The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> > > run from a kworker thread, PSI_NONIDLE condition would be observed as
> > > there is a RUNNING task. So we would always end up re-arming the work.
> > >
> > > If the work is re-armed from the psi_avgs_work() it self, the backing off
> > > logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> > > help. The work is already scheduled. so we don't do anything there.
>
> Hi Pavan,
> Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
> issue. At the time it was written I tested it and it seemed to work.
> Maybe I missed something or some other change introduced afterwards
> affected the shutoff logic. I'll take a closer look next week when I'm
> back at my computer and will consult with Johannes.
Sorry for the delay. I had some time to look into this and test psi
shutoff on my device, and I think you are right. The patch I mentioned
prevents a new psi_avgs_work from being scheduled when the only non-idle
task is psi_avgs_work itself; however, the regular 2-second averaging work
will still go on. I think we could record the fact that the only active
task is psi_avgs_work in record_times() using a new
psi_group_cpu.state_mask flag, and then prevent psi_avgs_work() from
rescheduling itself if that flag is set for all non-idle CPUs. I'll
test this approach and post a patch for review if it works.
Thanks,
Suren.
> Thanks,
> Suren.
>
> [1] 1b69ac6b40eb "psi: fix aggregation idle shut-off"
>
> > >
> > > Probably I am missing some thing here. Can you please clarify how we
> > > shut off re-arming the psi avg work?
> > >
> >
> > I have collected traces on an idle system (running android12-5.10 with minimal
> > user space). This is a older kernel, however the issue remain on latest kernel
> > as per code inspection.
> >
> > I have eliminated noise created by other work items. For example, vmstat_work.
> > This is a deferrable work but gets executed since this is queued on the same
> > CPU on which PSI work timer is queued. So I have increased
> > sysctl_stat_interval to 60 * HZ to supress this work.
> >
> > As we can see from the traces, CPU#7 comes out of idle only to execute PSI
> > work for every 2 seconds. The work is always re-armed from the psi_avgs_work()
> > as it finds PSI_NONIDLE condition. The non-idle time is essentially
> >
> > non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)
> >
> > The first term accounts the non-idle time since the task woken up (queued) to
> > the execution of the work item. It is around ~4 usec (54.119420 - 54.119416)
> >
> > The second term account for the previous update. ~2 usec (52.135424 -
> > 52.135422).
> >
> > PSI work needs to be run when there is some activity after the last update is done
> > i.e last time the work is run. Since we use non-deferrable timer, the other
> > deferrable timers gets woken up and they might queue work or wakeup other threads
> > and creates activity which inturn makes PSI work to be scheduled.
> >
> > PSI work can't just be made deferrable work. Because, it is a system level
> > work and if the CPU on which it is queued is idle for longer duration but the
> > other CPUs are active, we miss PSI updates. What we probably need is a global
> > deferrable timers [1] i.e this timer should not be bound to any CPU but
> > run when any of the CPU comes out of idle. As long as one CPU is busy, we keep
> > running the PSI but if the whole system is idle, we never wakeup.
> >
> > <idle>-0 [007] 52.135402: cpu_idle: state=4294967295 cpu_id=7
> > <idle>-0 [007] 52.135415: workqueue_activate_work: work struct 0xffffffc011bd5010
> > <idle>-0 [007] 52.135417: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
> > <idle>-0 [007] 52.135421: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
> > kworker/7:3-196 [007] 52.135421: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
> > kworker/7:3-196 [007] 52.135422: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294905814 [timeout=494] cpu=7 idx=123 flags=D|P|I
> > kworker/7:3-196 [007] 52.135422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
> > kworker/7:3-196 [007] 52.135424: sched_switch: prev_comm=kworker/7:3 prev_pid=196 prev_prio=120 prev_state=I ==> next_comm=swapper/7 next_pid=0 next_prio=120
> > <idle>-0 [007] 52.135428: cpu_idle: state=0 cpu_id=7
> >
> > <system is idle and gets woken up after 2 seconds due to PSI work>
> >
> > <idle>-0 [007] 54.119402: cpu_idle: state=4294967295 cpu_id=7
> > <idle>-0 [007] 54.119414: workqueue_activate_work: work struct 0xffffffc011bd5010
> > <idle>-0 [007] 54.119416: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
> > <idle>-0 [007] 54.119420: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
> > kworker/7:3-196 [007] 54.119420: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
> > kworker/7:3-196 [007] 54.119421: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294906315 [timeout=499] cpu=7 idx=122 flags=D|P|I
> > kworker/7:3-196 [007] 54.119422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
> >
> > [1]
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > Thanks,
> > Pavan
On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
> > <[email protected]> wrote:
> > >
> > > On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> > > > Hi
> > > >
> > > > The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> > > > run from a kworker thread, PSI_NONIDLE condition would be observed as
> > > > there is a RUNNING task. So we would always end up re-arming the work.
> > > >
> > > > If the work is re-armed from the psi_avgs_work() it self, the backing off
> > > > logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> > > > help. The work is already scheduled. so we don't do anything there.
> >
> > Hi Pavan,
> > Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
> > issue. At the time it was written I tested it and it seemed to work.
> > Maybe I missed something or some other change introduced afterwards
> > affected the shutoff logic. I'll take a closer look next week when I'm
> > back at my computer and will consult with Johannes.
>
> Sorry for the delay. I had some time to look into this and test psi
> shutoff on my device and I think you are right. The patch I mentioned
> prevents new psi_avgs_work from being scheduled when the only non-idle
> task is psi_avgs_work itself, however the regular 2sec averaging work
> will still go on. I think we could record the fact that the only
> active task is psi_avgs_work in record_times() using a new
> psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
> rescheduling itself if that flag is set for all non-idle cpus. I'll
> test this approach and will post a patch for review if that works.
Hi Pavan,
Testing PSI shutoff on Android proved more difficult than I expected:
there are lots of tasks to silence, and I keep encountering new ones.
The approach I was thinking about is something like this:
---
include/linux/psi_types.h | 3 +++
kernel/sched/psi.c | 12 +++++++++---
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..8d936f22cb5b 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -68,6 +68,9 @@ enum psi_states {
NR_PSI_STATES = 7,
};
+/* state_mask flag to keep re-arming averaging work */
+#define PSI_STATE_WAKE_CLOCK (1 << NR_PSI_STATES)
+
enum psi_aggregators {
PSI_AVGS = 0,
PSI_POLL,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ecb4b4ff4ce0..dd62ad28bacd 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group *group, int cpu,
if (delta)
*pchanged_states |= (1 << s);
}
+ *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
}
static void calc_avgs(unsigned long avg[3], int missed_periods,
@@ -413,7 +414,7 @@ static void psi_avgs_work(struct work_struct *work)
struct delayed_work *dwork;
struct psi_group *group;
u32 changed_states;
- bool nonidle;
+ bool wake_clock;
u64 now;
dwork = to_delayed_work(work);
@@ -424,7 +425,7 @@ static void psi_avgs_work(struct work_struct *work)
now = sched_clock();
collect_percpu_times(group, PSI_AVGS, &changed_states);
- nonidle = changed_states & (1 << PSI_NONIDLE);
+ wake_clock = changed_states & PSI_STATE_WAKE_CLOCK;
/*
* If there is task activity, periodically fold the per-cpu
* times and feed samples into the running averages. If things
@@ -435,7 +436,7 @@ static void psi_avgs_work(struct work_struct *work)
if (now >= group->avg_next_update)
group->avg_next_update = update_averages(group, now);
- if (nonidle) {
+ if (wake_clock) {
schedule_delayed_work(dwork, nsecs_to_jiffies(
group->avg_next_update - now) + 1);
}
@@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
+ if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
+ /* psi_avgs_work was not the only task on the CPU */
+ state_mask |= PSI_STATE_WAKE_CLOCK;
+ }
+
groupc->state_mask = state_mask;
write_seqcount_end(&groupc->seq);
--
This should detect the activity caused by psi_avgs_work() itself and
ignore it when deciding whether to reschedule the averaging work. In the
formula you posted:

non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)

the first term is calculated only if the PSI state is still active
(https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L271).
psi_group_change() will reset that state if psi_avgs_work() was the only
task on that CPU, so it won't affect non_idle_time. The code above takes
care of the second term. Could you please check whether this approach
helps? As I mentioned, I'm having trouble getting all the tasks silent on
Android for a clear test.
Regarding the issue with deferrable timers that you mentioned: how often
does that happen? If it happens only occasionally and prevents PSI shutoff
for a couple of update cycles, then I don't think that's a huge problem.
Once PSI shutoff happens, it should stay shut. Is that the case?
Thanks,
Suren.
> Thanks,
> Suren.
>
> > Thanks,
> > Suren.
> >
> > [1] 1b69ac6b40eb "psi: fix aggregation idle shut-off"
> >
> > > >
> > > > Probably I am missing some thing here. Can you please clarify how we
> > > > shut off re-arming the psi avg work?
> > > >
> > >
> > > I have collected traces on an idle system (running android12-5.10 with minimal
> > > user space). This is a older kernel, however the issue remain on latest kernel
> > > as per code inspection.
> > >
> > > I have eliminated noise created by other work items. For example, vmstat_work.
> > > This is a deferrable work but gets executed since this is queued on the same
> > > CPU on which PSI work timer is queued. So I have increased
> > > sysctl_stat_interval to 60 * HZ to supress this work.
> > >
> > > As we can see from the traces, CPU#7 comes out of idle only to execute PSI
> > > work for every 2 seconds. The work is always re-armed from the psi_avgs_work()
> > > as it finds PSI_NONIDLE condition. The non-idle time is essentially
> > >
> > > non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)
> > >
> > > The first term accounts the non-idle time since the task woken up (queued) to
> > > the execution of the work item. It is around ~4 usec (54.119420 - 54.119416)
> > >
> > > The second term account for the previous update. ~2 usec (52.135424 -
> > > 52.135422).
> > >
> > > PSI work needs to be run when there is some activity after the last update is done
> > > i.e last time the work is run. Since we use non-deferrable timer, the other
> > > deferrable timers gets woken up and they might queue work or wakeup other threads
> > > and creates activity which inturn makes PSI work to be scheduled.
> > >
> > > PSI work can't just be made deferrable work. Because, it is a system level
> > > work and if the CPU on which it is queued is idle for longer duration but the
> > > other CPUs are active, we miss PSI updates. What we probably need is a global
> > > deferrable timers [1] i.e this timer should not be bound to any CPU but
> > > run when any of the CPU comes out of idle. As long as one CPU is busy, we keep
> > > running the PSI but if the whole system is idle, we never wakeup.
> > >
> > > <idle>-0 [007] 52.135402: cpu_idle: state=4294967295 cpu_id=7
> > > <idle>-0 [007] 52.135415: workqueue_activate_work: work struct 0xffffffc011bd5010
> > > <idle>-0 [007] 52.135417: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
> > > <idle>-0 [007] 52.135421: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
> > > kworker/7:3-196 [007] 52.135421: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
> > > kworker/7:3-196 [007] 52.135422: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294905814 [timeout=494] cpu=7 idx=123 flags=D|P|I
> > > kworker/7:3-196 [007] 52.135422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
> > > kworker/7:3-196 [007] 52.135424: sched_switch: prev_comm=kworker/7:3 prev_pid=196 prev_prio=120 prev_state=I ==> next_comm=swapper/7 next_pid=0 next_prio=120
> > > <idle>-0 [007] 52.135428: cpu_idle: state=0 cpu_id=7
> > >
> > > <system is idle and gets woken up after 2 seconds due to PSI work>
> > >
> > > <idle>-0 [007] 54.119402: cpu_idle: state=4294967295 cpu_id=7
> > > <idle>-0 [007] 54.119414: workqueue_activate_work: work struct 0xffffffc011bd5010
> > > <idle>-0 [007] 54.119416: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
> > > <idle>-0 [007] 54.119420: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
> > > kworker/7:3-196 [007] 54.119420: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
> > > kworker/7:3-196 [007] 54.119421: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294906315 [timeout=499] cpu=7 idx=122 flags=D|P|I
> > > kworker/7:3-196 [007] 54.119422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
> > >
> > > [1]
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Thanks,
> > > Pavan
Hello,
I just saw these emails, sorry for the late reply.
On 2022/10/6 00:32, Suren Baghdasaryan wrote:
> On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <[email protected]> wrote:
>>
>> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
>>>
>>> On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
>>> <[email protected]> wrote:
>>>>
>>>> On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
>>>>> Hi
>>>>>
>>>>> The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
>>>>> run from a kworker thread, PSI_NONIDLE condition would be observed as
>>>>> there is a RUNNING task. So we would always end up re-arming the work.
>>>>>
>>>>> If the work is re-armed from the psi_avgs_work() it self, the backing off
>>>>> logic in psi_task_change() (will be moved to psi_task_switch soon) can't
>>>>> help. The work is already scheduled. so we don't do anything there.
>>>
>>> Hi Pavan,
>>> Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
>>> issue. At the time it was written I tested it and it seemed to work.
>>> Maybe I missed something or some other change introduced afterwards
>>> affected the shutoff logic. I'll take a closer look next week when I'm
>>> back at my computer and will consult with Johannes.
>>
>> Sorry for the delay. I had some time to look into this and test psi
>> shutoff on my device and I think you are right. The patch I mentioned
>> prevents new psi_avgs_work from being scheduled when the only non-idle
>> task is psi_avgs_work itself, however the regular 2sec averaging work
>> will still go on. I think we could record the fact that the only
>> active task is psi_avgs_work in record_times() using a new
>> psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
>> rescheduling itself if that flag is set for all non-idle cpus. I'll
>> test this approach and will post a patch for review if that works.
>
> Hi Pavan,
> Testing PSI shutoff on Android proved more difficult than I expected.
> Lots of tasks to silence and I keep encountering new ones.
> The approach I was thinking about is something like this:
>
> ---
> include/linux/psi_types.h | 3 +++
> kernel/sched/psi.c | 12 +++++++++---
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index c7fe7c089718..8d936f22cb5b 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -68,6 +68,9 @@ enum psi_states {
> NR_PSI_STATES = 7,
> };
>
> +/* state_mask flag to keep re-arming averaging work */
> +#define PSI_STATE_WAKE_CLOCK (1 << NR_PSI_STATES)
> +
> enum psi_aggregators {
> PSI_AVGS = 0,
> PSI_POLL,
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index ecb4b4ff4ce0..dd62ad28bacd 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group
> *group, int cpu,
> if (delta)
> *pchanged_states |= (1 << s);
> }
> + *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
If the avgs_work kworker is running on this CPU, it will still see
PSI_STATE_WAKE_CLOCK set in state_mask, so the work will be re-armed,
right?

Maybe I missed something... but I have a different idea, which I want to
implement later, just for discussion.

Thanks.
> }
>
> static void calc_avgs(unsigned long avg[3], int missed_periods,
> @@ -413,7 +414,7 @@ static void psi_avgs_work(struct work_struct *work)
> struct delayed_work *dwork;
> struct psi_group *group;
> u32 changed_states;
> - bool nonidle;
> + bool wake_clock;
> u64 now;
>
> dwork = to_delayed_work(work);
> @@ -424,7 +425,7 @@ static void psi_avgs_work(struct work_struct *work)
> now = sched_clock();
>
> collect_percpu_times(group, PSI_AVGS, &changed_states);
> - nonidle = changed_states & (1 << PSI_NONIDLE);
> + wake_clock = changed_states & PSI_STATE_WAKE_CLOCK;
> /*
> * If there is task activity, periodically fold the per-cpu
> * times and feed samples into the running averages. If things
> @@ -435,7 +436,7 @@ static void psi_avgs_work(struct work_struct *work)
> if (now >= group->avg_next_update)
> group->avg_next_update = update_averages(group, now);
>
> - if (nonidle) {
> + if (wake_clock) {
> schedule_delayed_work(dwork, nsecs_to_jiffies(
> group->avg_next_update - now) + 1);
> }
> @@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group
> *group, int cpu,
> if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> state_mask |= (1 << PSI_MEM_FULL);
>
> + if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
> + /* psi_avgs_work was not the only task on the CPU */
> + state_mask |= PSI_STATE_WAKE_CLOCK;
> + }
> +
> groupc->state_mask = state_mask;
>
> write_seqcount_end(&groupc->seq);
On 2022/10/9 20:41, Chengming Zhou wrote:
> Hello,
>
> I just saw these emails, sorry for late.
>
> On 2022/10/6 00:32, Suren Baghdasaryan wrote:
>> On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <[email protected]> wrote:
>>>
>>> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
>>>>
>>>> On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
>>>> <[email protected]> wrote:
>>>>>
>>>>> On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
>>>>>> Hi
>>>>>>
>>>>>> The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
>>>>>> run from a kworker thread, PSI_NONIDLE condition would be observed as
>>>>>> there is a RUNNING task. So we would always end up re-arming the work.
>>>>>>
>>>>>> If the work is re-armed from the psi_avgs_work() it self, the backing off
>>>>>> logic in psi_task_change() (will be moved to psi_task_switch soon) can't
>>>>>> help. The work is already scheduled. so we don't do anything there.
>>>>
>>>> Hi Pavan,
>>>> Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
>>>> issue. At the time it was written I tested it and it seemed to work.
>>>> Maybe I missed something or some other change introduced afterwards
>>>> affected the shutoff logic. I'll take a closer look next week when I'm
>>>> back at my computer and will consult with Johannes.
>>>
>>> Sorry for the delay. I had some time to look into this and test psi
>>> shutoff on my device and I think you are right. The patch I mentioned
>>> prevents new psi_avgs_work from being scheduled when the only non-idle
>>> task is psi_avgs_work itself, however the regular 2sec averaging work
>>> will still go on. I think we could record the fact that the only
>>> active task is psi_avgs_work in record_times() using a new
>>> psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
>>> rescheduling itself if that flag is set for all non-idle cpus. I'll
>>> test this approach and will post a patch for review if that works.
>>
>> Hi Pavan,
>> Testing PSI shutoff on Android proved more difficult than I expected.
>> Lots of tasks to silence and I keep encountering new ones.
>> The approach I was thinking about is something like this:
>>
>> ---
>> include/linux/psi_types.h | 3 +++
>> kernel/sched/psi.c | 12 +++++++++---
>> 2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
>> index c7fe7c089718..8d936f22cb5b 100644
>> --- a/include/linux/psi_types.h
>> +++ b/include/linux/psi_types.h
>> @@ -68,6 +68,9 @@ enum psi_states {
>> NR_PSI_STATES = 7,
>> };
>>
>> +/* state_mask flag to keep re-arming averaging work */
>> +#define PSI_STATE_WAKE_CLOCK (1 << NR_PSI_STATES)
>> +
>> enum psi_aggregators {
>> PSI_AVGS = 0,
>> PSI_POLL,
>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>> index ecb4b4ff4ce0..dd62ad28bacd 100644
>> --- a/kernel/sched/psi.c
>> +++ b/kernel/sched/psi.c
>> @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group
>> *group, int cpu,
>> if (delta)
>> *pchanged_states |= (1 << s);
>> }
>> + *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
>
> If the avgs_work kworker is running on this CPU, it will still see
> PSI_STATE_WAKE_CLOCK set in state_mask? So the work will be re-armed?
>
> Maybe I missed something... but I have another different idea which
> I want to implement later only for discussion.
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ee2ecc081422..f322e8fd8d41 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -241,11 +241,13 @@ static void get_recent_times(struct psi_group *group, int cpu,
enum psi_aggregators aggregator, u32 *times,
u32 *pchanged_states)
{
+ int current_cpu = raw_smp_processor_id();
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
u64 now, state_start;
enum psi_states s;
unsigned int seq;
u32 state_mask;
+ bool only_avgs_work = false;
*pchanged_states = 0;
@@ -256,6 +258,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
memcpy(times, groupc->times, sizeof(groupc->times));
state_mask = groupc->state_mask;
state_start = groupc->state_start;
+ /*
+ * This CPU has only avgs_work kworker running, snapshot the
+ * newest times then don't need to re-arm work for this groupc.
+ * Normally this kworker will sleep soon and won't
+ * wake_clock in psi_group_change().
+ */
+ if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1)
+ only_avgs_work = true;
} while (read_seqcount_retry(&groupc->seq, seq));
/* Calculate state time deltas against the previous snapshot */
@@ -280,6 +290,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
if (delta)
*pchanged_states |= (1 << s);
}
+
+ /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
+ if (only_avgs_work)
+ *pchanged_states &= ~(1 << PSI_NONIDLE);
}
static void calc_avgs(unsigned long avg[3], int missed_periods,
>
> Thanks.
>
>> }
>>
>> static void calc_avgs(unsigned long avg[3], int missed_periods,
>> @@ -413,7 +414,7 @@ static void psi_avgs_work(struct work_struct *work)
>> struct delayed_work *dwork;
>> struct psi_group *group;
>> u32 changed_states;
>> - bool nonidle;
>> + bool wake_clock;
>> u64 now;
>>
>> dwork = to_delayed_work(work);
>> @@ -424,7 +425,7 @@ static void psi_avgs_work(struct work_struct *work)
>> now = sched_clock();
>>
>> collect_percpu_times(group, PSI_AVGS, &changed_states);
>> - nonidle = changed_states & (1 << PSI_NONIDLE);
>> + wake_clock = changed_states & PSI_STATE_WAKE_CLOCK;
>> /*
>> * If there is task activity, periodically fold the per-cpu
>> * times and feed samples into the running averages. If things
>> @@ -435,7 +436,7 @@ static void psi_avgs_work(struct work_struct *work)
>> if (now >= group->avg_next_update)
>> group->avg_next_update = update_averages(group, now);
>>
>> - if (nonidle) {
>> + if (wake_clock) {
>> schedule_delayed_work(dwork, nsecs_to_jiffies(
>> group->avg_next_update - now) + 1);
>> }
>> @@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group
>> *group, int cpu,
>> if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
>> state_mask |= (1 << PSI_MEM_FULL);
>>
>> + if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
>> + /* psi_avgs_work was not the only task on the CPU */
>> + state_mask |= PSI_STATE_WAKE_CLOCK;
>> + }
>> +
>> groupc->state_mask = state_mask;
>>
>> write_seqcount_end(&groupc->seq);
On Sun, Oct 09, 2022 at 09:17:34PM +0800, Chengming Zhou wrote:
> On 2022/10/9 20:41, Chengming Zhou wrote:
> > Hello,
> >
> > I just saw these emails, sorry for late.
> >
> > On 2022/10/6 00:32, Suren Baghdasaryan wrote:
<snip>
> >> @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group
> >> *group, int cpu,
> >> if (delta)
> >> *pchanged_states |= (1 << s);
> >> }
> >> + *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
> >
> > If the avgs_work kworker is running on this CPU, it will still see
> > PSI_STATE_WAKE_CLOCK set in state_mask? So the work will be re-armed?
> >
> > Maybe I missed something... but I have another different idea which
> > I want to implement later only for discussion.
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index ee2ecc081422..f322e8fd8d41 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -241,11 +241,13 @@ static void get_recent_times(struct psi_group *group, int cpu,
> enum psi_aggregators aggregator, u32 *times,
> u32 *pchanged_states)
> {
> + int current_cpu = raw_smp_processor_id();
> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> u64 now, state_start;
> enum psi_states s;
> unsigned int seq;
> u32 state_mask;
> + bool only_avgs_work = false;
>
> *pchanged_states = 0;
>
> @@ -256,6 +258,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
> memcpy(times, groupc->times, sizeof(groupc->times));
> state_mask = groupc->state_mask;
> state_start = groupc->state_start;
> + /*
> + * This CPU has only avgs_work kworker running, snapshot the
> + * newest times then don't need to re-arm work for this groupc.
> + * Normally this kworker will sleep soon and won't
> + * wake_clock in psi_group_change().
> + */
> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1)
> + only_avgs_work = true;
> } while (read_seqcount_retry(&groupc->seq, seq));
>
> /* Calculate state time deltas against the previous snapshot */
> @@ -280,6 +290,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
> if (delta)
> *pchanged_states |= (1 << s);
> }
> +
> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
> + if (only_avgs_work)
> + *pchanged_states &= ~(1 << PSI_NONIDLE);
> }
>
Thanks Chengming for the patch. I will test this patch and report my
observations. It makes sense to consider this CPU as idle if the PSI kworker
is the only task running. It could run other works, but that decision is now
deferred to the schedule-out path. Ideally, if this is the only (or last) work
running, the PSI work should not re-arm itself.
Thanks,
Pavan
On Wed, Oct 05, 2022 at 09:32:44AM -0700, Suren Baghdasaryan wrote:
<snip>
>
> Hi Pavan,
> Testing PSI shutoff on Android proved more difficult than I expected.
> Lots of tasks to silence and I keep encountering new ones.
> The approach I was thinking about is something like this:
>
<snip>
> @@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group
> *group, int cpu,
> if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> state_mask |= (1 << PSI_MEM_FULL);
>
> + if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
> + /* psi_avgs_work was not the only task on the CPU */
> + state_mask |= PSI_STATE_WAKE_CLOCK;
> + }
> +
Thanks Suren for taking a look. I was not at work last week, so I could not
reply earlier.
As Chengming pointed out already, when the kworker is woken up (some tasks ran
in the last window, so we scheduled the work; now the kworker wakes up on an
otherwise idle system), the PSI_NONIDLE condition will be true here, so we end
up setting the PSI_STATE_WAKE_CLOCK flag. Correct? Anyway, I am going to test
this patch and report my observations.
Thanks,
Pavan
On Wed, Oct 05, 2022 at 09:32:44AM -0700, Suren Baghdasaryan wrote:
<snip>
>
> This should detect the activity caused by psi_avgs_work() itself and
> ignore it when deciding to reschedule the averaging work. In the
> formula you posted:
>
> non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)
>
> the first term is calculated only if the PSI state is still active
> (https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L271).
> psi_group_change() will reset that state if psi_avgs_work() was the
> only task on that CPU, so it won't affect non_idle_time. The code
> above is to take care of the second term. Could you please check if
> this approach helps? As I mentioned I'm having trouble getting all the
> tasks silent on Android for a clear test.
>
> The issue with deferrable timers that you mentioned, how often does
> that happen? If it happens only occasionally and prevents PSI shutoff
> for a couple of update cycles then I don't think that's a huge
> problem. Once PSI shutoff happens it should stay shut. Is that the
> case?
In our system, there are periodic but deferrable timers/works (DCVS and QoS
related) whose period is shorter than the PSI period. So once the PSI averaging
work runs, I see other unrelated work getting scheduled, due to which PSI runs
again. I actually added a hack not to re-arm the PSI work if the system has
only one runnable task (nr_running() == 1) and still saw the problem. The
moment I made the PSI work deferrable, all the problems were gone. I know that
is wrong, as it could simply fail to run the PSI work while other CPUs are busy.
Thanks,
Pavan
On Mon, Oct 10, 2022 at 11:48:49AM +0530, Pavan Kondeti wrote:
> On Sun, Oct 09, 2022 at 09:17:34PM +0800, Chengming Zhou wrote:
<snip>
>> Thanks Chengming for the patch. I will test this patch and report my
>> observations. It makes sense to consider this CPU as idle if the PSI kworker
>> is the only task running. It could run other works, but that decision is now
>> deferred to the schedule-out path. Ideally, if this is the only (or last)
>> work running, the PSI work should not re-arm itself.
>
Is the condition groupc->tasks[NR_RUNNING] == 1 alone sufficient to clear NONIDLE,
or should we also make sure that NR_IOWAIT and NR_MEMSTALL are zero on this CPU?
Thanks,
Pavan
On 2022/10/10 14:43, Pavan Kondeti wrote:
> On Mon, Oct 10, 2022 at 11:48:49AM +0530, Pavan Kondeti wrote:
<snip>
> Is the condition groupc->tasks[NR_RUNNING] == 1 alone sufficient to clear NONIDLE,
> or should we also make sure that NR_IOWAIT and NR_MEMSTALL are zero on this CPU?
Yes, I think you're correct, we should check !NR_IOWAIT && !NR_MEMSTALL too.
Thanks!
On 2022/10/10 14:57, Chengming Zhou wrote:
> On 2022/10/10 14:43, Pavan Kondeti wrote:
<snip>
>>>> enum psi_aggregators aggregator, u32 *times,
>>>> u32 *pchanged_states)
>>>> {
>>>> + int current_cpu = raw_smp_processor_id();
>>>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>>>> u64 now, state_start;
>>>> enum psi_states s;
>>>> unsigned int seq;
>>>> u32 state_mask;
>>>> + bool only_avgs_work = false;
>>>>
>>>> *pchanged_states = 0;
>>>>
>>>> @@ -256,6 +258,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>> memcpy(times, groupc->times, sizeof(groupc->times));
>>>> state_mask = groupc->state_mask;
>>>> state_start = groupc->state_start;
>>>> + /*
>>>> + * This CPU has only avgs_work kworker running, snapshot the
>>>> + * newest times then don't need to re-arm work for this groupc.
>>>> + * Normally this kworker will sleep soon and won't
>>>> + * wake_clock in psi_group_change().
>>>> + */
>>>> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1)
>>>> + only_avgs_work = true;
>>>> } while (read_seqcount_retry(&groupc->seq, seq));
>>>>
>>>> /* Calculate state time deltas against the previous snapshot */
>>>> @@ -280,6 +290,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>> if (delta)
>>>> *pchanged_states |= (1 << s);
>>>> }
>>>> +
>>>> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
>>>> + if (only_avgs_work)
>>>> + *pchanged_states &= ~(1 << PSI_NONIDLE);
>>>> }
>>>>
>>> Thanks Chengming for the patch. I will test this patch and report my
>>> observations. It makes sense to consider this CPU as idle if the PSI kworker
>>> is the only task running. It could be running other works too, but that
>>> decision is now deferred to the schedule-out path. Ideally, if this is the
>>> only (or last) work running, we should not see the PSI work re-arming itself.
>>>
>>
>> is condition groupc->tasks[NR_RUNNING] == 1 alone sufficient to clear NONIDLE?
>> or should we also make sure that !NR_IOWAIT && !NR_MEMSTALL condition on this CPU?
>
> Yes, I think you're correct, we should check !NR_IOWAIT && !NR_MEMSTALL too.
>
Hello,
Looks like this patch plus the !NR_IOWAIT && !NR_MEMSTALL condition can solve the
avgs_work re-arm problem.
This is hard to reproduce on a real machine, so I set up a QEMU VM with only one
CPU, hacked PSI_FREQ to 100ms, and used the "cgroup_disable=pressure" cmdline to
disable the cgroup levels of PSI.
After launching the VM, I ssh into it and enable some tracepoints, then exit to
let it idle.
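For reference, the tracepoint setup was along these lines (a sketch reconstructed
from the event names appearing in the trace; the standard tracefs paths are
assumed, so treat this as an approximation rather than the exact commands used):

```sh
# Assumed tracefs mount point; event names match those in the captured trace.
cd /sys/kernel/tracing
echo 1 > events/power/cpu_idle/enable
echo 1 > events/sched/sched_switch/enable
echo 1 > events/sched/sched_wakeup/enable
echo 1 > events/timer/timer_start/enable
echo 1 > events/workqueue/enable
echo 1 > tracing_on
# ... let the system idle for a while, then:
cat trace
```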
<idle>-0 [000] d..1. 77.102839: cpu_idle: state=1 cpu_id=0
<idle>-0 [000] d.s2. 77.142840: workqueue_queue_work: work struct=00000000cd023738 function=psi_avgs_work workqueue=events req_cpu=512 cpu=0
<idle>-0 [000] d.s2. 77.142841: workqueue_activate_work: work struct 00000000cd023738
<idle>-0 [000] dNs5. 77.142849: sched_wakeup: comm=kworker/0:5 pid=164 prio=120 target_cpu=000
<idle>-0 [000] dNs2. 77.142852: timer_start: timer=00000000f41058a8 function=tcp_orphan_update expires=4294744860 [timeout=100] cpu=0 idx=100 flags=D
<idle>-0 [000] .N.1. 77.142854: cpu_idle: state=4294967295 cpu_id=0
<idle>-0 [000] d..2. 77.142860: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:5 next_pid=164 next_prio=120
kworker/0:5-164 [000] ..... 77.142864: workqueue_execute_start: work struct 00000000cd023738: function psi_avgs_work
kworker/0:5-164 [000] ..... 77.142866: workqueue_execute_end: work struct 00000000cd023738: function psi_avgs_work
[avgs_work won't be re-armed again]
kworker/0:5-164 [000] d..2. 77.142872: sched_switch: prev_comm=kworker/0:5 prev_pid=164 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d..1. 77.142878: cpu_idle: state=1 cpu_id=0
[avgs_work kworker sleep, won't wake_clock]
<idle>-0 [000] dNs4. 77.470857: sched_wakeup: comm=jbd2/vda3-8 pid=73 prio=120 target_cpu=000
[new activity again, avgs_work scheduled delay PSI_FREQ=100ms ...]
<idle>-0 [000] d.s2. 77.574816: workqueue_queue_work: work struct=00000000cd023738 function=psi_avgs_work workqueue=events req_cpu=512 cpu=0
<idle>-0 [000] d.s2. 77.574817: workqueue_activate_work: work struct 00000000cd023738
<idle>-0 [000] dNs5. 77.574824: sched_wakeup: comm=kworker/0:5 pid=164 prio=120 target_cpu=000
<idle>-0 [000] dNs2. 77.574826: timer_start: timer=00000000f41058a8 function=tcp_orphan_update expires=4294745292 [timeout=100] cpu=0 idx=90 flags=D
<idle>-0 [000] .N.1. 77.574827: cpu_idle: state=4294967295 cpu_id=0
<idle>-0 [000] d..2. 77.574833: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:5 next_pid=164 next_prio=120
kworker/0:5-164 [000] ..... 77.574837: workqueue_execute_start: work struct 00000000cd023738: function psi_avgs_work
kworker/0:5-164 [000] ..... 77.574839: workqueue_execute_end: work struct 00000000cd023738: function psi_avgs_work
[avgs_work run as expected at 77.470857+100ms, then idle again, won't be re-armed]
kworker/0:5-164 [000] d..2. 77.574844: sched_switch: prev_comm=kworker/0:5 prev_pid=164 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
<idle>-0 [000] d..1. 77.574850: cpu_idle: state=1 cpu_id=0
<idle>-0 [000] d.s5. 77.606824: timer_start: timer=000000005f4b8bb9 function=delayed_work_timer_fn expires=4294745324 [timeout=100] cpu=0 idx=94 flags=I
[...]
On Mon, Oct 10, 2022 at 11:27:14AM +0530, Pavan Kondeti wrote:
> On Wed, Oct 05, 2022 at 09:32:44AM -0700, Suren Baghdasaryan wrote:
> > On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> > > > > > Hi
> > > > > >
> > > > > > The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> > > > > > run from a kworker thread, PSI_NONIDLE condition would be observed as
> > > > > > there is a RUNNING task. So we would always end up re-arming the work.
> > > > > >
> > > > > > If the work is re-armed from the psi_avgs_work() it self, the backing off
> > > > > > logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> > > > > > help. The work is already scheduled. so we don't do anything there.
> > > >
> > > > Hi Pavan,
> > > > Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
> > > > issue. At the time it was written I tested it and it seemed to work.
> > > > Maybe I missed something or some other change introduced afterwards
> > > > affected the shutoff logic. I'll take a closer look next week when I'm
> > > > back at my computer and will consult with Johannes.
> > >
> > > Sorry for the delay. I had some time to look into this and test psi
> > > shutoff on my device and I think you are right. The patch I mentioned
> > > prevents new psi_avgs_work from being scheduled when the only non-idle
> > > task is psi_avgs_work itself, however the regular 2sec averaging work
> > > will still go on. I think we could record the fact that the only
> > > active task is psi_avgs_work in record_times() using a new
> > > psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
> > > rescheduling itself if that flag is set for all non-idle cpus. I'll
> > > test this approach and will post a patch for review if that works.
> >
> > Hi Pavan,
> > Testing PSI shutoff on Android proved more difficult than I expected.
> > Lots of tasks to silence and I keep encountering new ones.
> > The approach I was thinking about is something like this:
> >
> > ---
> > include/linux/psi_types.h | 3 +++
> > kernel/sched/psi.c | 12 +++++++++---
> > 2 files changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> > index c7fe7c089718..8d936f22cb5b 100644
> > --- a/include/linux/psi_types.h
> > +++ b/include/linux/psi_types.h
> > @@ -68,6 +68,9 @@ enum psi_states {
> > NR_PSI_STATES = 7,
> > };
> >
> > +/* state_mask flag to keep re-arming averaging work */
> > +#define PSI_STATE_WAKE_CLOCK (1 << NR_PSI_STATES)
> > +
> > enum psi_aggregators {
> > PSI_AVGS = 0,
> > PSI_POLL,
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index ecb4b4ff4ce0..dd62ad28bacd 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group
> > *group, int cpu,
> > if (delta)
> > *pchanged_states |= (1 << s);
> > }
> > + *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
> > }
> >
> > static void calc_avgs(unsigned long avg[3], int missed_periods,
> > @@ -413,7 +414,7 @@ static void psi_avgs_work(struct work_struct *work)
> > struct delayed_work *dwork;
> > struct psi_group *group;
> > u32 changed_states;
> > - bool nonidle;
> > + bool wake_clock;
> > u64 now;
> >
> > dwork = to_delayed_work(work);
> > @@ -424,7 +425,7 @@ static void psi_avgs_work(struct work_struct *work)
> > now = sched_clock();
> >
> > collect_percpu_times(group, PSI_AVGS, &changed_states);
> > - nonidle = changed_states & (1 << PSI_NONIDLE);
> > + wake_clock = changed_states & PSI_STATE_WAKE_CLOCK;
> > /*
> > * If there is task activity, periodically fold the per-cpu
> > * times and feed samples into the running averages. If things
> > @@ -435,7 +436,7 @@ static void psi_avgs_work(struct work_struct *work)
> > if (now >= group->avg_next_update)
> > group->avg_next_update = update_averages(group, now);
> >
> > - if (nonidle) {
> > + if (wake_clock) {
> > schedule_delayed_work(dwork, nsecs_to_jiffies(
> > group->avg_next_update - now) + 1);
> > }
I have added an else block here and it is never hit.
> > @@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group
> > *group, int cpu,
> > if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> > state_mask |= (1 << PSI_MEM_FULL);
> >
> > + if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
> > + /* psi_avgs_work was not the only task on the CPU */
> > + state_mask |= PSI_STATE_WAKE_CLOCK;
> > + }
> > +
>
> Thanks Suren for taking a look. I was not at work last week, so could not
> reply earlier.
>
> As Chengming pointed out already, when the kworker is woken up (some tasks ran
> in the last window, so we scheduled the work; now the kworker wakes up when the
> system is idle), the PSI_NONIDLE condition will be true here, and we end up
> setting the PSI_STATE_WAKE_CLOCK flag. Correct? Anyway, I am going to test this
> patch and report my observations.
>
I have tested this patch and it did not help, i.e. the PSI work is still
re-armed periodically.
Thanks,
Pavan
On Mon, Oct 10, 2022 at 04:30:36PM +0800, Chengming Zhou wrote:
> On 2022/10/10 14:57, Chengming Zhou wrote:
> > On 2022/10/10 14:43, Pavan Kondeti wrote:
> >> On Mon, Oct 10, 2022 at 11:48:49AM +0530, Pavan Kondeti wrote:
> >>> On Sun, Oct 09, 2022 at 09:17:34PM +0800, Chengming Zhou wrote:
> >>>> On 2022/10/9 20:41, Chengming Zhou wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I just saw these emails, sorry for late.
> >>>>>
> >>>>> On 2022/10/6 00:32, Suren Baghdasaryan wrote:
> >>>>>> On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> >>>>>>>>>> Hi
> >>>>>>>>>>
> >>>>>>>>>> The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> >>>>>>>>>> run from a kworker thread, PSI_NONIDLE condition would be observed as
> >>>>>>>>>> there is a RUNNING task. So we would always end up re-arming the work.
> >>>>>>>>>>
> >>>>>>>>>> If the work is re-armed from the psi_avgs_work() it self, the backing off
> >>>>>>>>>> logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> >>>>>>>>>> help. The work is already scheduled. so we don't do anything there.
> >>>>>>>>
> >>>>>>>> Hi Pavan,
> >>>>>>>> Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
> >>>>>>>> issue. At the time it was written I tested it and it seemed to work.
> >>>>>>>> Maybe I missed something or some other change introduced afterwards
> >>>>>>>> affected the shutoff logic. I'll take a closer look next week when I'm
> >>>>>>>> back at my computer and will consult with Johannes.
> >>>>>>>
> >>>>>>> Sorry for the delay. I had some time to look into this and test psi
> >>>>>>> shutoff on my device and I think you are right. The patch I mentioned
> >>>>>>> prevents new psi_avgs_work from being scheduled when the only non-idle
> >>>>>>> task is psi_avgs_work itself, however the regular 2sec averaging work
> >>>>>>> will still go on. I think we could record the fact that the only
> >>>>>>> active task is psi_avgs_work in record_times() using a new
> >>>>>>> psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
> >>>>>>> rescheduling itself if that flag is set for all non-idle cpus. I'll
> >>>>>>> test this approach and will post a patch for review if that works.
> >>>>>>
> >>>>>> Hi Pavan,
> >>>>>> Testing PSI shutoff on Android proved more difficult than I expected.
> >>>>>> Lots of tasks to silence and I keep encountering new ones.
> >>>>>> The approach I was thinking about is something like this:
> >>>>>>
> >>>>>> ---
> >>>>>> include/linux/psi_types.h | 3 +++
> >>>>>> kernel/sched/psi.c | 12 +++++++++---
> >>>>>> 2 files changed, 12 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> >>>>>> index c7fe7c089718..8d936f22cb5b 100644
> >>>>>> --- a/include/linux/psi_types.h
> >>>>>> +++ b/include/linux/psi_types.h
> >>>>>> @@ -68,6 +68,9 @@ enum psi_states {
> >>>>>> NR_PSI_STATES = 7,
> >>>>>> };
> >>>>>>
> >>>>>> +/* state_mask flag to keep re-arming averaging work */
> >>>>>> +#define PSI_STATE_WAKE_CLOCK (1 << NR_PSI_STATES)
> >>>>>> +
> >>>>>> enum psi_aggregators {
> >>>>>> PSI_AVGS = 0,
> >>>>>> PSI_POLL,
> >>>>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> >>>>>> index ecb4b4ff4ce0..dd62ad28bacd 100644
> >>>>>> --- a/kernel/sched/psi.c
> >>>>>> +++ b/kernel/sched/psi.c
> >>>>>> @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group
> >>>>>> *group, int cpu,
> >>>>>> if (delta)
> >>>>>> *pchanged_states |= (1 << s);
> >>>>>> }
> >>>>>> + *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
> >>>>>
> >>>>> If the avgs_work kworker is running on this CPU, it will still see
> >>>>> PSI_STATE_WAKE_CLOCK set in state_mask? So the work will be re-armed?
> >>>>>
> >>>>> Maybe I missed something... but I have another different idea which
> >>>>> I want to implement later only for discussion.
> >>>>
> >>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> >>>> index ee2ecc081422..f322e8fd8d41 100644
> >>>> --- a/kernel/sched/psi.c
> >>>> +++ b/kernel/sched/psi.c
> >>>> @@ -241,11 +241,13 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >>>> enum psi_aggregators aggregator, u32 *times,
> >>>> u32 *pchanged_states)
> >>>> {
> >>>> + int current_cpu = raw_smp_processor_id();
> >>>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> >>>> u64 now, state_start;
> >>>> enum psi_states s;
> >>>> unsigned int seq;
> >>>> u32 state_mask;
> >>>> + bool only_avgs_work = false;
> >>>>
> >>>> *pchanged_states = 0;
> >>>>
> >>>> @@ -256,6 +258,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >>>> memcpy(times, groupc->times, sizeof(groupc->times));
> >>>> state_mask = groupc->state_mask;
> >>>> state_start = groupc->state_start;
> >>>> + /*
> >>>> + * This CPU has only avgs_work kworker running, snapshot the
> >>>> + * newest times then don't need to re-arm work for this groupc.
> >>>> + * Normally this kworker will sleep soon and won't
> >>>> + * wake_clock in psi_group_change().
> >>>> + */
> >>>> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1)
> >>>> + only_avgs_work = true;
> >>>> } while (read_seqcount_retry(&groupc->seq, seq));
> >>>>
> >>>> /* Calculate state time deltas against the previous snapshot */
> >>>> @@ -280,6 +290,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >>>> if (delta)
> >>>> *pchanged_states |= (1 << s);
> >>>> }
> >>>> +
> >>>> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
> >>>> + if (only_avgs_work)
> >>>> + *pchanged_states &= ~(1 << PSI_NONIDLE);
> >>>> }
> >>>>
> >>> Thanks Chengming for the patch. I will test this patch and report my
> >>> observations. It makes sense to consider this CPU as idle if the PSI kworker
> >>> is the only task running. It could be running other works too, but that
> >>> decision is now deferred to the schedule-out path. Ideally, if this is the
> >>> only (or last) work running, we should not see the PSI work re-arming itself.
> >>>
> >>
> >> is condition groupc->tasks[NR_RUNNING] == 1 alone sufficient to clear NONIDLE?
> >> or should we also make sure that !NR_IOWAIT && !NR_MEMSTALL condition on this CPU?
> >
> > Yes, I think you're correct, we should check !NR_IOWAIT && !NR_MEMSTALL too.
> >
>
> Hello,
>
> Looks like this patch plus the !NR_IOWAIT && !NR_MEMSTALL condition can solve the
> avgs_work re-arm problem.
>
Yes. This is in line with my observations. Without this patch, I see the PSI
work getting scheduled ~15 times in a 30-second trace. That is expected since
the PSI period is 2 seconds by default. With your patch, I see fewer PSI work
invocations. More importantly, the work is not scheduled from the work itself
when there is no other activity.
Like I said in the other thread, there is a lot of noise I had to suppress to
isolate this problem. The moment I removed all the suppression, things did not
look bright: one or another deferrable timer expires because PSI wakes up the
CPU, other tasks then get scheduled, and PSI queues the work again. I guess
that varies from platform to platform.
This patch in principle makes sense to me. At least we now have a mechanism for
idle shutoff from the work itself.
Thanks,
Pavan
On 2022/10/10 17:09, Pavan Kondeti wrote:
> On Mon, Oct 10, 2022 at 04:30:36PM +0800, Chengming Zhou wrote:
>> On 2022/10/10 14:57, Chengming Zhou wrote:
>>> On 2022/10/10 14:43, Pavan Kondeti wrote:
>>>> On Mon, Oct 10, 2022 at 11:48:49AM +0530, Pavan Kondeti wrote:
>>>>> On Sun, Oct 09, 2022 at 09:17:34PM +0800, Chengming Zhou wrote:
>>>>>> On 2022/10/9 20:41, Chengming Zhou wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I just saw these emails, sorry for late.
>>>>>>>
>>>>>>> On 2022/10/6 00:32, Suren Baghdasaryan wrote:
>>>>>>>> On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
>>>>>>>>>>>> run from a kworker thread, PSI_NONIDLE condition would be observed as
>>>>>>>>>>>> there is a RUNNING task. So we would always end up re-arming the work.
>>>>>>>>>>>>
>>>>>>>>>>>> If the work is re-armed from the psi_avgs_work() it self, the backing off
>>>>>>>>>>>> logic in psi_task_change() (will be moved to psi_task_switch soon) can't
>>>>>>>>>>>> help. The work is already scheduled. so we don't do anything there.
>>>>>>>>>>
>>>>>>>>>> Hi Pavan,
>>>>>>>>>> Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
>>>>>>>>>> issue. At the time it was written I tested it and it seemed to work.
>>>>>>>>>> Maybe I missed something or some other change introduced afterwards
>>>>>>>>>> affected the shutoff logic. I'll take a closer look next week when I'm
>>>>>>>>>> back at my computer and will consult with Johannes.
>>>>>>>>>
>>>>>>>>> Sorry for the delay. I had some time to look into this and test psi
>>>>>>>>> shutoff on my device and I think you are right. The patch I mentioned
>>>>>>>>> prevents new psi_avgs_work from being scheduled when the only non-idle
>>>>>>>>> task is psi_avgs_work itself, however the regular 2sec averaging work
>>>>>>>>> will still go on. I think we could record the fact that the only
>>>>>>>>> active task is psi_avgs_work in record_times() using a new
>>>>>>>>> psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
>>>>>>>>> rescheduling itself if that flag is set for all non-idle cpus. I'll
>>>>>>>>> test this approach and will post a patch for review if that works.
>>>>>>>>
>>>>>>>> Hi Pavan,
>>>>>>>> Testing PSI shutoff on Android proved more difficult than I expected.
>>>>>>>> Lots of tasks to silence and I keep encountering new ones.
>>>>>>>> The approach I was thinking about is something like this:
>>>>>>>>
>>>>>>>> ---
>>>>>>>> include/linux/psi_types.h | 3 +++
>>>>>>>> kernel/sched/psi.c | 12 +++++++++---
>>>>>>>> 2 files changed, 12 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
>>>>>>>> index c7fe7c089718..8d936f22cb5b 100644
>>>>>>>> --- a/include/linux/psi_types.h
>>>>>>>> +++ b/include/linux/psi_types.h
>>>>>>>> @@ -68,6 +68,9 @@ enum psi_states {
>>>>>>>> NR_PSI_STATES = 7,
>>>>>>>> };
>>>>>>>>
>>>>>>>> +/* state_mask flag to keep re-arming averaging work */
>>>>>>>> +#define PSI_STATE_WAKE_CLOCK (1 << NR_PSI_STATES)
>>>>>>>> +
>>>>>>>> enum psi_aggregators {
>>>>>>>> PSI_AVGS = 0,
>>>>>>>> PSI_POLL,
>>>>>>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>>>>>>>> index ecb4b4ff4ce0..dd62ad28bacd 100644
>>>>>>>> --- a/kernel/sched/psi.c
>>>>>>>> +++ b/kernel/sched/psi.c
>>>>>>>> @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group
>>>>>>>> *group, int cpu,
>>>>>>>> if (delta)
>>>>>>>> *pchanged_states |= (1 << s);
>>>>>>>> }
>>>>>>>> + *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
>>>>>>>
>>>>>>> If the avgs_work kworker is running on this CPU, it will still see
>>>>>>> PSI_STATE_WAKE_CLOCK set in state_mask? So the work will be re-armed?
>>>>>>>
>>>>>>> Maybe I missed something... but I have another different idea which
>>>>>>> I want to implement later only for discussion.
>>>>>>
>>>>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>>>>>> index ee2ecc081422..f322e8fd8d41 100644
>>>>>> --- a/kernel/sched/psi.c
>>>>>> +++ b/kernel/sched/psi.c
>>>>>> @@ -241,11 +241,13 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> enum psi_aggregators aggregator, u32 *times,
>>>>>> u32 *pchanged_states)
>>>>>> {
>>>>>> + int current_cpu = raw_smp_processor_id();
>>>>>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>>>>>> u64 now, state_start;
>>>>>> enum psi_states s;
>>>>>> unsigned int seq;
>>>>>> u32 state_mask;
>>>>>> + bool only_avgs_work = false;
>>>>>>
>>>>>> *pchanged_states = 0;
>>>>>>
>>>>>> @@ -256,6 +258,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> memcpy(times, groupc->times, sizeof(groupc->times));
>>>>>> state_mask = groupc->state_mask;
>>>>>> state_start = groupc->state_start;
>>>>>> + /*
>>>>>> + * This CPU has only avgs_work kworker running, snapshot the
>>>>>> + * newest times then don't need to re-arm work for this groupc.
>>>>>> + * Normally this kworker will sleep soon and won't
>>>>>> + * wake_clock in psi_group_change().
>>>>>> + */
>>>>>> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1)
>>>>>> + only_avgs_work = true;
>>>>>> } while (read_seqcount_retry(&groupc->seq, seq));
>>>>>>
>>>>>> /* Calculate state time deltas against the previous snapshot */
>>>>>> @@ -280,6 +290,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> if (delta)
>>>>>> *pchanged_states |= (1 << s);
>>>>>> }
>>>>>> +
>>>>>> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
>>>>>> + if (only_avgs_work)
>>>>>> + *pchanged_states &= ~(1 << PSI_NONIDLE);
>>>>>> }
>>>>>>
> >>>>> Thanks Chengming for the patch. I will test this patch and report my
> >>>>> observations. It makes sense to consider this CPU as idle if the PSI kworker
> >>>>> is the only task running. It could be running other works too, but that
> >>>>> decision is now deferred to the schedule-out path. Ideally, if this is the
> >>>>> only (or last) work running, we should not see the PSI work re-arming itself.
>>>>>
>>>>
>>>> is condition groupc->tasks[NR_RUNNING] == 1 alone sufficient to clear NONIDLE?
>>>> or should we also make sure that !NR_IOWAIT && !NR_MEMSTALL condition on this CPU?
>>>
>>> Yes, I think you're correct, we should check !NR_IOWAIT && !NR_MEMSTALL too.
>>>
>>
>> Hello,
>>
>> Looks like this patch plus the !NR_IOWAIT && !NR_MEMSTALL condition can solve the
>> avgs_work re-arm problem.
>>
> Yes. This is in line with my observations. Without this patch, I see the PSI
> work getting scheduled ~15 times in a 30-second trace. That is expected since
> the PSI period is 2 seconds by default. With your patch, I see fewer PSI work
> invocations. More importantly, the work is not scheduled from the work itself
> when there is no other activity.
Good, thanks for your testing!
>
> Like I said in the other thread, there is a lot of noise I had to suppress to
> isolate this problem. The moment I removed all the suppression, things did not
> look bright: one or another deferrable timer expires because PSI wakes up the
> CPU, other tasks then get scheduled, and PSI queues the work again. I guess
> that varies from platform to platform.
Yes, it's also hard for me to reproduce :/
>
> This patch in principle makes sense to me. At least we now have a mechanism for
> idle shutoff from the work itself.
Ok, I will send a patch later.
Thanks!
Pavan reported a problem that PSI avgs_work idle shutoff is not
working at all, because the PSI_NONIDLE condition is observed in
psi_avgs_work()->collect_percpu_times()->get_recent_times() even
when the kworker running avgs_work is the only task on the CPU.

Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
avoided the ping-pong wake problem when the worker sleeps,
psi_avgs_work() will still always re-arm the avgs_work, so shutoff
is not working.

This patch changes get_recent_times() to consider the current CPU's
groupc as IDLE if the kworker running avgs_work is the only task
running and there are no IOWAIT or MEMSTALL sleeping tasks, in which
case avgs_work is shut off provided the other CPUs' groupc are also
IDLE.

One potential problem is that the brief period of non-idle time
incurred between the aggregation run and the kworker's dequeue will
be stranded in the per-cpu buckets until avgs_work runs next time.
The buckets can hold 4s worth of time, and future activity will wake
the avgs_work with a 2s delay, giving us 2s worth of data we can
leave behind when shutting off the avgs_work. If the kworker runs
other work after avgs_work is shut off and does not generate any
scheduler activity for 2s, this may be a problem.
Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
---
kernel/sched/psi.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ee2ecc081422..f4cdf6f184ba 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
u32 *pchanged_states)
{
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ int current_cpu = raw_smp_processor_id();
+ bool only_avgs_work = false;
u64 now, state_start;
enum psi_states s;
unsigned int seq;
@@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
memcpy(times, groupc->times, sizeof(groupc->times));
state_mask = groupc->state_mask;
state_start = groupc->state_start;
+ /*
+ * Only the avgs_work kworker is running on this CPU, so snapshot
+ * the newest times; then we don't need to re-arm the work for
+ * this groupc. Normally this kworker will sleep soon and won't
+ * wake avgs_work back up in psi_group_change().
+ */
+ if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
+ !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
+ only_avgs_work = true;
} while (read_seqcount_retry(&groupc->seq, seq));
/* Calculate state time deltas against the previous snapshot */
@@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
if (delta)
*pchanged_states |= (1 << s);
}
+
+ /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
+ if (only_avgs_work)
+ *pchanged_states &= ~(1 << PSI_NONIDLE);
}
static void calc_avgs(unsigned long avg[3], int missed_periods,
--
2.37.2
On Sun, Oct 9, 2022 at 5:41 AM Chengming Zhou
<[email protected]> wrote:
>
> Hello,
>
> I just saw these emails, sorry for late.
>
> On 2022/10/6 00:32, Suren Baghdasaryan wrote:
> > On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <[email protected]> wrote:
> >>
> >> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
> >>>
> >>> On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
> >>> <[email protected]> wrote:
> >>>>
> >>>> On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> >>>>> Hi
> >>>>>
> >>>>> The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> >>>>> run from a kworker thread, PSI_NONIDLE condition would be observed as
> >>>>> there is a RUNNING task. So we would always end up re-arming the work.
> >>>>>
> >>>>> If the work is re-armed from the psi_avgs_work() it self, the backing off
> >>>>> logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> >>>>> help. The work is already scheduled. so we don't do anything there.
> >>>
> >>> Hi Pavan,
> >>> Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
> >>> issue. At the time it was written I tested it and it seemed to work.
> >>> Maybe I missed something or some other change introduced afterwards
> >>> affected the shutoff logic. I'll take a closer look next week when I'm
> >>> back at my computer and will consult with Johannes.
> >>
> >> Sorry for the delay. I had some time to look into this and test psi
> >> shutoff on my device and I think you are right. The patch I mentioned
> >> prevents new psi_avgs_work from being scheduled when the only non-idle
> >> task is psi_avgs_work itself, however the regular 2sec averaging work
> >> will still go on. I think we could record the fact that the only
> >> active task is psi_avgs_work in record_times() using a new
> >> psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
> >> rescheduling itself if that flag is set for all non-idle cpus. I'll
> >> test this approach and will post a patch for review if that works.
> >
> > Hi Pavan,
> > Testing PSI shutoff on Android proved more difficult than I expected.
> > Lots of tasks to silence and I keep encountering new ones.
> > The approach I was thinking about is something like this:
> >
> > ---
> > include/linux/psi_types.h | 3 +++
> > kernel/sched/psi.c | 12 +++++++++---
> > 2 files changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> > index c7fe7c089718..8d936f22cb5b 100644
> > --- a/include/linux/psi_types.h
> > +++ b/include/linux/psi_types.h
> > @@ -68,6 +68,9 @@ enum psi_states {
> > NR_PSI_STATES = 7,
> > };
> >
> > +/* state_mask flag to keep re-arming averaging work */
> > +#define PSI_STATE_WAKE_CLOCK (1 << NR_PSI_STATES)
> > +
> > enum psi_aggregators {
> > PSI_AVGS = 0,
> > PSI_POLL,
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index ecb4b4ff4ce0..dd62ad28bacd 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group
> > *group, int cpu,
> > if (delta)
> > *pchanged_states |= (1 << s);
> > }
> > + *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
>
> If the avgs_work kworker is running on this CPU, it will still see
> PSI_STATE_WAKE_CLOCK set in state_mask? So the work will be re-armed?
Ah, you are right. psi_group_change() will set this flag again once
avgs_work starts running and we will see it here.
>
> Maybe I missed something... but I have another different idea which
> I want to implement later only for discussion.
>
> Thanks.
>
> > }
> >
> > static void calc_avgs(unsigned long avg[3], int missed_periods,
> > @@ -413,7 +414,7 @@ static void psi_avgs_work(struct work_struct *work)
> > struct delayed_work *dwork;
> > struct psi_group *group;
> > u32 changed_states;
> > - bool nonidle;
> > + bool wake_clock;
> > u64 now;
> >
> > dwork = to_delayed_work(work);
> > @@ -424,7 +425,7 @@ static void psi_avgs_work(struct work_struct *work)
> > now = sched_clock();
> >
> > collect_percpu_times(group, PSI_AVGS, &changed_states);
> > - nonidle = changed_states & (1 << PSI_NONIDLE);
> > + wake_clock = changed_states & PSI_STATE_WAKE_CLOCK;
> > /*
> > * If there is task activity, periodically fold the per-cpu
> > * times and feed samples into the running averages. If things
> > @@ -435,7 +436,7 @@ static void psi_avgs_work(struct work_struct *work)
> > if (now >= group->avg_next_update)
> > group->avg_next_update = update_averages(group, now);
> >
> > - if (nonidle) {
> > + if (wake_clock) {
> > schedule_delayed_work(dwork, nsecs_to_jiffies(
> > group->avg_next_update - now) + 1);
> > }
> > @@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group
> > *group, int cpu,
> > if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> > state_mask |= (1 << PSI_MEM_FULL);
> >
> > + if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
> > + /* psi_avgs_work was not the only task on the CPU */
> > + state_mask |= PSI_STATE_WAKE_CLOCK;
> > + }
> > +
> > groupc->state_mask = state_mask;
> >
> > write_seqcount_end(&groupc->seq);
On Sun, Oct 9, 2022 at 6:17 AM Chengming Zhou
<[email protected]> wrote:
>
> On 2022/10/9 20:41, Chengming Zhou wrote:
> > Hello,
> >
> > I just saw these emails, sorry for late.
> >
> > On 2022/10/6 00:32, Suren Baghdasaryan wrote:
> >> On Sun, Oct 2, 2022 at 11:11 PM Suren Baghdasaryan <[email protected]> wrote:
> >>>
> >>> On Fri, Sep 16, 2022 at 10:45 PM Suren Baghdasaryan <[email protected]> wrote:
> >>>>
> >>>> On Wed, Sep 14, 2022 at 11:20 PM Pavan Kondeti
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> >>>>>> Hi
> >>>>>>
> >>>>>> The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> >>>>>> run from a kworker thread, PSI_NONIDLE condition would be observed as
> >>>>>> there is a RUNNING task. So we would always end up re-arming the work.
> >>>>>>
> >>>>>> If the work is re-armed from the psi_avgs_work() it self, the backing off
> >>>>>> logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> >>>>>> help. The work is already scheduled. so we don't do anything there.
> >>>>
> >>>> Hi Pavan,
> >>>> Thanks for reporting the issue. IIRC [1] was meant to fix exactly this
> >>>> issue. At the time it was written I tested it and it seemed to work.
> >>>> Maybe I missed something or some other change introduced afterwards
> >>>> affected the shutoff logic. I'll take a closer look next week when I'm
> >>>> back at my computer and will consult with Johannes.
> >>>
> >>> Sorry for the delay. I had some time to look into this and test psi
> >>> shutoff on my device and I think you are right. The patch I mentioned
> >>> prevents new psi_avgs_work from being scheduled when the only non-idle
> >>> task is psi_avgs_work itself, however the regular 2sec averaging work
> >>> will still go on. I think we could record the fact that the only
> >>> active task is psi_avgs_work in record_times() using a new
> >>> psi_group_cpu.state_mask flag and then prevent psi_avgs_work() from
> >>> rescheduling itself if that flag is set for all non-idle cpus. I'll
> >>> test this approach and will post a patch for review if that works.
> >>
> >> Hi Pavan,
> >> Testing PSI shutoff on Android proved more difficult than I expected.
> >> Lots of tasks to silence and I keep encountering new ones.
> >> The approach I was thinking about is something like this:
> >>
> >> ---
> >> include/linux/psi_types.h | 3 +++
> >> kernel/sched/psi.c | 12 +++++++++---
> >> 2 files changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> >> index c7fe7c089718..8d936f22cb5b 100644
> >> --- a/include/linux/psi_types.h
> >> +++ b/include/linux/psi_types.h
> >> @@ -68,6 +68,9 @@ enum psi_states {
> >> NR_PSI_STATES = 7,
> >> };
> >>
> >> +/* state_mask flag to keep re-arming averaging work */
> >> +#define PSI_STATE_WAKE_CLOCK (1 << NR_PSI_STATES)
> >> +
> >> enum psi_aggregators {
> >> PSI_AVGS = 0,
> >> PSI_POLL,
> >> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> >> index ecb4b4ff4ce0..dd62ad28bacd 100644
> >> --- a/kernel/sched/psi.c
> >> +++ b/kernel/sched/psi.c
> >> @@ -278,6 +278,7 @@ static void get_recent_times(struct psi_group
> >> *group, int cpu,
> >> if (delta)
> >> *pchanged_states |= (1 << s);
> >> }
> >> + *pchanged_states |= (state_mask & PSI_STATE_WAKE_CLOCK);
> >
> > If the avgs_work kworker is running on this CPU, it will still see
> > PSI_STATE_WAKE_CLOCK set in state_mask? So the work will be re-armed?
> >
> > Maybe I missed something... but I have another different idea which
> > I want to implement later only for discussion.
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index ee2ecc081422..f322e8fd8d41 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -241,11 +241,13 @@ static void get_recent_times(struct psi_group *group, int cpu,
> enum psi_aggregators aggregator, u32 *times,
> u32 *pchanged_states)
> {
> + int current_cpu = raw_smp_processor_id();
> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> u64 now, state_start;
> enum psi_states s;
> unsigned int seq;
> u32 state_mask;
> + bool only_avgs_work = false;
>
> *pchanged_states = 0;
>
> @@ -256,6 +258,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
> memcpy(times, groupc->times, sizeof(groupc->times));
> state_mask = groupc->state_mask;
> state_start = groupc->state_start;
> + /*
> + * This CPU has only avgs_work kworker running, snapshot the
> + * newest times then don't need to re-arm work for this groupc.
> + * Normally this kworker will sleep soon and won't
> + * wake_clock in psi_group_change().
> + */
> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1)
> + only_avgs_work = true;
Why do you determine only_avgs_work while taking a snapshot? The
read_seqcount_retry() might fail and the loop gets retried, which
might lead to a wrong only_avgs_work value if the state changes
between retries. I think it's safer to do this after the snapshot was
taken and to use tasks[NR_RUNNING] instead of groupc->tasks.
> } while (read_seqcount_retry(&groupc->seq, seq));
>
> /* Calculate state time deltas against the previous snapshot */
> @@ -280,6 +290,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
> if (delta)
> *pchanged_states |= (1 << s);
> }
> +
> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
> + if (only_avgs_work)
> + *pchanged_states &= ~(1 << PSI_NONIDLE);
This seems to be safe because changed_states & (1 << PSI_NONIDLE) is used
only for re-arming psi_avgs_work; however, semantically this is
incorrect. The CPU was not idle when it was executing psi_avgs_work.
IMO a separate flag would avoid this confusion.
> }
>
> static void calc_avgs(unsigned long avg[3], int missed_periods,
>
>
> >
> > Thanks.
> >
> >> }
> >>
> >> static void calc_avgs(unsigned long avg[3], int missed_periods,
> >> @@ -413,7 +414,7 @@ static void psi_avgs_work(struct work_struct *work)
> >> struct delayed_work *dwork;
> >> struct psi_group *group;
> >> u32 changed_states;
> >> - bool nonidle;
> >> + bool wake_clock;
> >> u64 now;
> >>
> >> dwork = to_delayed_work(work);
> >> @@ -424,7 +425,7 @@ static void psi_avgs_work(struct work_struct *work)
> >> now = sched_clock();
> >>
> >> collect_percpu_times(group, PSI_AVGS, &changed_states);
> >> - nonidle = changed_states & (1 << PSI_NONIDLE);
> >> + wake_clock = changed_states & PSI_STATE_WAKE_CLOCK;
> >> /*
> >> * If there is task activity, periodically fold the per-cpu
> >> * times and feed samples into the running averages. If things
> >> @@ -435,7 +436,7 @@ static void psi_avgs_work(struct work_struct *work)
> >> if (now >= group->avg_next_update)
> >> group->avg_next_update = update_averages(group, now);
> >>
> >> - if (nonidle) {
> >> + if (wake_clock) {
> >> schedule_delayed_work(dwork, nsecs_to_jiffies(
> >> group->avg_next_update - now) + 1);
> >> }
> >> @@ -742,6 +743,11 @@ static void psi_group_change(struct psi_group
> >> *group, int cpu,
> >> if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> >> state_mask |= (1 << PSI_MEM_FULL);
> >>
> >> + if (wake_clock || test_state(groupc->tasks, PSI_NONIDLE)) {
> >> + /* psi_avgs_work was not the only task on the CPU */
> >> + state_mask |= PSI_STATE_WAKE_CLOCK;
> >> + }
> >> +
> >> groupc->state_mask = state_mask;
> >>
> >> write_seqcount_end(&groupc->seq);
On Mon, Oct 10, 2022 at 3:57 AM Hillf Danton <[email protected]> wrote:
>
> On 13 Sep 2022 19:38:17 +0530 Pavan Kondeti <[email protected]>
> > Hi
> >
> > The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> > run from a kworker thread, PSI_NONIDLE condition would be observed as
> > there is a RUNNING task. So we would always end up re-arming the work.
> >
> > If the work is re-armed from the psi_avgs_work() it self, the backing off
> > logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> > help. The work is already scheduled. so we don't do anything there.
> >
> > Probably I am missing some thing here. Can you please clarify how we
> > shut off re-arming the psi avg work?
>
> Instead of open coding schedule_delayed_work() in bid to check if timer
> hits the idle task (see delayed_work_timer_fn()), the idle task is tracked
> in psi_task_switch() and checked by kworker to see if it preempted the idle
> task.
>
> Only for thoughts now.
>
> Hillf
>
> +++ b/kernel/sched/psi.c
> @@ -412,6 +412,8 @@ static u64 update_averages(struct psi_gr
> return avg_next_update;
> }
>
> +static DEFINE_PER_CPU(int, prev_task_is_idle);
> +
> static void psi_avgs_work(struct work_struct *work)
> {
> struct delayed_work *dwork;
> @@ -439,7 +441,7 @@ static void psi_avgs_work(struct work_st
> if (now >= group->avg_next_update)
> group->avg_next_update = update_averages(group, now);
>
> - if (nonidle) {
> + if (nonidle && 0 == per_cpu(prev_task_is_idle, raw_smp_processor_id())) {
This condition would be incorrect if nonidle was set by a cpu other
than raw_smp_processor_id() and
prev_task_is_idle[raw_smp_processor_id()] == 1. IOW, if some activity
happens on a non-current cpu while the current cpu's previous task was
idle, we would fail to reschedule psi_avgs_work for it. This can be
fixed in collect_percpu_times() by considering prev_task_is_idle for
all other CPUs as well. However, Chengming's approach seems simpler to
me TBH and does not require an additional per-cpu variable.
> schedule_delayed_work(dwork, nsecs_to_jiffies(
> group->avg_next_update - now) + 1);
> }
> @@ -859,6 +861,7 @@ void psi_task_switch(struct task_struct
> if (prev->pid) {
> int clear = TSK_ONCPU, set = 0;
>
> + per_cpu(prev_task_is_idle, cpu) = 0;
> /*
> * When we're going to sleep, psi_dequeue() lets us
> * handle TSK_RUNNING, TSK_MEMSTALL_RUNNING and
> @@ -888,7 +891,8 @@ void psi_task_switch(struct task_struct
> for (; group; group = iterate_groups(prev, &iter))
> psi_group_change(group, cpu, clear, set, now, true);
> }
> - }
> + } else
> + per_cpu(prev_task_is_idle, cpu) = 1;
> }
>
> /**
>
On Mon, Oct 10, 2022 at 3:42 AM Chengming Zhou
<[email protected]> wrote:
>
> Pavan reported a problem that PSI avgs_work idle shutoff is not
> working at all. Because PSI_NONIDLE condition would be observed in
> psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
> only the kworker running avgs_work on the CPU.
>
> Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
> avoided the ping-pong wake problem when the worker sleep, psi_avgs_work()
> still will always re-arm the avgs_work, so shutoff is not working.
>
> This patch changes to consider current CPU groupc as IDLE if the
> kworker running avgs_work is the only task running and no IOWAIT
> or MEMSTALL sleep tasks, in which case we will shut off the avgs_work
> if other CPUs' groupc are also IDLE.
>
> One potential problem is that the brief period of non-idle time
> incurred between the aggregation run and the kworker's dequeue will
> be stranded in the per-cpu buckets until avgs_work run next time.
> The buckets can hold 4s worth of time, and future activity will wake
> the avgs_work with a 2s delay, giving us 2s worth of data we can leave
> behind when shut off the avgs_work. If the kworker run other works after
> avgs_work shut off and doesn't have any scheduler activities for 2s,
> this maybe a problem.
>
> Reported-by: Pavan Kondeti <[email protected]>
> Signed-off-by: Chengming Zhou <[email protected]>
Copying my comments from
https://lore.kernel.org/all/CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com/
in case you want to continue the discussion here...
> ---
> kernel/sched/psi.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index ee2ecc081422..f4cdf6f184ba 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
> u32 *pchanged_states)
> {
> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> + int current_cpu = raw_smp_processor_id();
> + bool only_avgs_work = false;
> u64 now, state_start;
> enum psi_states s;
> unsigned int seq;
> @@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
> memcpy(times, groupc->times, sizeof(groupc->times));
> state_mask = groupc->state_mask;
> state_start = groupc->state_start;
> + /*
> + * This CPU has only avgs_work kworker running, snapshot the
> + * newest times then don't need to re-arm for this groupc.
> + * Normally this kworker will sleep soon and won't wake
> + * avgs_work back up in psi_group_change().
> + */
> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
> + !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
> + only_avgs_work = true;
Why do you determine only_avgs_work while taking a snapshot? The
read_seqcount_retry() might fail and the loop gets retried, which
might lead to a wrong only_avgs_work value if the state changes
between retries. I think it's safer to do this after the snapshot was
taken and to use tasks[NR_RUNNING] instead of groupc->tasks.
> } while (read_seqcount_retry(&groupc->seq, seq));
>
> /* Calculate state time deltas against the previous snapshot */
> @@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
> if (delta)
> *pchanged_states |= (1 << s);
> }
> +
> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
> + if (only_avgs_work)
> + *pchanged_states &= ~(1 << PSI_NONIDLE);
This seems to be safe because changed_states & (1 << PSI_NONIDLE) is used
only for re-arming psi_avgs_work; however, semantically this is
incorrect. The CPU was not idle when it was executing psi_avgs_work.
IMO a separate flag would avoid this confusion.
> }
>
> static void calc_avgs(unsigned long avg[3], int missed_periods,
> --
> 2.37.2
>
Hello,
On 2022/10/11 05:21, Suren Baghdasaryan wrote:
> On Mon, Oct 10, 2022 at 3:42 AM Chengming Zhou
> <[email protected]> wrote:
>>
>> Pavan reported a problem that PSI avgs_work idle shutoff is not
>> working at all. Because PSI_NONIDLE condition would be observed in
>> psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
>> only the kworker running avgs_work on the CPU.
>>
>> Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
>> avoided the ping-pong wake problem when the worker sleep, psi_avgs_work()
>> still will always re-arm the avgs_work, so shutoff is not working.
>>
>> This patch changes to consider current CPU groupc as IDLE if the
>> kworker running avgs_work is the only task running and no IOWAIT
>> or MEMSTALL sleep tasks, in which case we will shut off the avgs_work
>> if other CPUs' groupc are also IDLE.
>>
>> One potential problem is that the brief period of non-idle time
>> incurred between the aggregation run and the kworker's dequeue will
>> be stranded in the per-cpu buckets until avgs_work run next time.
>> The buckets can hold 4s worth of time, and future activity will wake
>> the avgs_work with a 2s delay, giving us 2s worth of data we can leave
>> behind when shut off the avgs_work. If the kworker run other works after
>> avgs_work shut off and doesn't have any scheduler activities for 2s,
>> this maybe a problem.
>>
>> Reported-by: Pavan Kondeti <[email protected]>
>> Signed-off-by: Chengming Zhou <[email protected]>
>
> Copying my comments from
> https://lore.kernel.org/all/CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com/
> in case you want to continue the discussion here...
>
>> ---
>> kernel/sched/psi.c | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>> index ee2ecc081422..f4cdf6f184ba 100644
>> --- a/kernel/sched/psi.c
>> +++ b/kernel/sched/psi.c
>> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>> u32 *pchanged_states)
>> {
>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>> + int current_cpu = raw_smp_processor_id();
>> + bool only_avgs_work = false;
>> u64 now, state_start;
>> enum psi_states s;
>> unsigned int seq;
>> @@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
>> memcpy(times, groupc->times, sizeof(groupc->times));
>> state_mask = groupc->state_mask;
>> state_start = groupc->state_start;
>> + /*
>> + * This CPU has only avgs_work kworker running, snapshot the
>> + * newest times then don't need to re-arm for this groupc.
>> + * Normally this kworker will sleep soon and won't wake
>> + * avgs_work back up in psi_group_change().
>> + */
>> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>> + !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>> + only_avgs_work = true;
>
> Why do you determine only_avgs_work while taking a snapshot? The
> read_seqcount_retry() might fail and the loop gets retried, which
> might lead to a wrong only_avgs_work value if the state changes
> between retries. I think it's safer to do this after the snapshot was
> taken and to use tasks[NR_RUNNING] instead of roupc->tasks.
Ah, you are right, copying groupc->tasks[NR_RUNNING] and tasks[NR_IOWAIT], tasks[NR_MEMSTALL]
is ok for me. (Maybe we only need to copy groupc->tasks[NR_RUNNING]?)
Another way is to add an else branch:
if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
!groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
only_avgs_work = true;
else
only_avgs_work = false;
Both are ok for me.
>
>> } while (read_seqcount_retry(&groupc->seq, seq));
>>
>> /* Calculate state time deltas against the previous snapshot */
>> @@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>> if (delta)
>> *pchanged_states |= (1 << s);
>> }
>> +
>> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
>> + if (only_avgs_work)
>> + *pchanged_states &= ~(1 << PSI_NONIDLE);
>
> This seems to be safe because changed_states&(1<< PSI_NONIDLE) is used
> only for re-arming psi_avgs_work, however semantically this is
> incorrect. The CPU was not idle when it was executing psi_avgs_work.
> IMO a separate flag would avoid this confusion.
Yes, it's safe, but it has this confusion. Using a separate flag like PSI_ONLY_AVGS_WORK looks better.
But then in collect_percpu_times() we still have to clear PSI_NONIDLE for this CPU if PSI_ONLY_AVGS_WORK
has been set.
for_each_possible_cpu(cpu) {
u32 times[NR_PSI_STATES];
u32 nonidle;
u32 cpu_changed_states;
get_recent_times(group, cpu, aggregator, times,
&cpu_changed_states);
changed_states |= cpu_changed_states;
cpu_changed_states should have PSI_NONIDLE cleared if PSI_ONLY_AVGS_WORK is already set.
I'm not sure whether it's better to clear it here, or maybe we can add a comment in get_recent_times()
where the PSI_NONIDLE clearing is done?
Thanks!
On Mon, Oct 10, 2022 at 5:07 PM Chengming Zhou
<[email protected]> wrote:
>
> Hello,
>
> On 2022/10/11 05:21, Suren Baghdasaryan wrote:
> > On Mon, Oct 10, 2022 at 3:42 AM Chengming Zhou
> > <[email protected]> wrote:
> >>
> >> Pavan reported a problem that PSI avgs_work idle shutoff is not
> >> working at all. Because PSI_NONIDLE condition would be observed in
> >> psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
> >> only the kworker running avgs_work on the CPU.
> >>
> >> Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
> >> avoided the ping-pong wake problem when the worker sleep, psi_avgs_work()
> >> still will always re-arm the avgs_work, so shutoff is not working.
> >>
> >> This patch changes to consider current CPU groupc as IDLE if the
> >> kworker running avgs_work is the only task running and no IOWAIT
> >> or MEMSTALL sleep tasks, in which case we will shut off the avgs_work
> >> if other CPUs' groupc are also IDLE.
> >>
> >> One potential problem is that the brief period of non-idle time
> >> incurred between the aggregation run and the kworker's dequeue will
> >> be stranded in the per-cpu buckets until avgs_work run next time.
> >> The buckets can hold 4s worth of time, and future activity will wake
> >> the avgs_work with a 2s delay, giving us 2s worth of data we can leave
> >> behind when shut off the avgs_work. If the kworker run other works after
> >> avgs_work shut off and doesn't have any scheduler activities for 2s,
> >> this maybe a problem.
> >>
> >> Reported-by: Pavan Kondeti <[email protected]>
> >> Signed-off-by: Chengming Zhou <[email protected]>
> >
> > Copying my comments from
> > https://lore.kernel.org/all/CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com/
> > in case you want to continue the discussion here...
> >
> >> ---
> >> kernel/sched/psi.c | 15 +++++++++++++++
> >> 1 file changed, 15 insertions(+)
> >>
> >> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> >> index ee2ecc081422..f4cdf6f184ba 100644
> >> --- a/kernel/sched/psi.c
> >> +++ b/kernel/sched/psi.c
> >> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >> u32 *pchanged_states)
> >> {
> >> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> >> + int current_cpu = raw_smp_processor_id();
> >> + bool only_avgs_work = false;
> >> u64 now, state_start;
> >> enum psi_states s;
> >> unsigned int seq;
> >> @@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >> memcpy(times, groupc->times, sizeof(groupc->times));
> >> state_mask = groupc->state_mask;
> >> state_start = groupc->state_start;
> >> + /*
> >> + * This CPU has only avgs_work kworker running, snapshot the
> >> + * newest times then don't need to re-arm for this groupc.
> >> + * Normally this kworker will sleep soon and won't wake
> >> + * avgs_work back up in psi_group_change().
> >> + */
> >> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
> >> + !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
> >> + only_avgs_work = true;
> >
> > Why do you determine only_avgs_work while taking a snapshot? The
> > read_seqcount_retry() might fail and the loop gets retried, which
> > might lead to a wrong only_avgs_work value if the state changes
> > between retries. I think it's safer to do this after the snapshot was
> > taken and to use tasks[NR_RUNNING] instead of roupc->tasks.
>
> Ah, you are right, coping groupc->tasks[NR_RUNNING] and tasks[NR_IOWAIT], tasks[NR_MEMSTALL]
> is ok for me. (Maybe we only need to copy groupc->tasks[NR_RUNNING]?)
>
> Another way is to add an else branch:
>
> if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
> !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
> only_avgs_work = true;
> else
> only_avgs_work = false;
>
> Both are ok for me.
Let's use the simple way and use the tasks[] after the snapshot is taken.
>
> >
> >> } while (read_seqcount_retry(&groupc->seq, seq));
> >>
> >> /* Calculate state time deltas against the previous snapshot */
> >> @@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >> if (delta)
> >> *pchanged_states |= (1 << s);
> >> }
> >> +
> >> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
> >> + if (only_avgs_work)
> >> + *pchanged_states &= ~(1 << PSI_NONIDLE);
> >
> > This seems to be safe because changed_states&(1<< PSI_NONIDLE) is used
> > only for re-arming psi_avgs_work, however semantically this is
> > incorrect. The CPU was not idle when it was executing psi_avgs_work.
> > IMO a separate flag would avoid this confusion.
>
> Yes, it's safe, but has this confusion. Use a separate flag looks better, like PSI_ONLY_AVGS_WORK.
> But then in collect_percpu_times() we still have to clear PSI_NONIDLE of this CPU if PSI_ONLY_AVGS_WORK
> has been set.
>
> for_each_possible_cpu(cpu) {
> u32 times[NR_PSI_STATES];
> u32 nonidle;
> u32 cpu_changed_states;
>
> get_recent_times(group, cpu, aggregator, times,
> &cpu_changed_states);
> changed_states |= cpu_changed_states;
>
> cpu_changed_states should clear PSI_NONIDLE if PSI_ONLY_AVGS_WORK already set.
No, PSI_NONIDLE should not be affected by PSI_ONLY_AVGS_WORK. These
flags should be independent and aggregated into changed_states without
affecting each other. Something similar to what I suggested with
PSI_STATE_WAKE_CLOCK in
https://lore.kernel.org/all/CAJuCfpFr3JfwkWbDqkU=NUJbCYuCWGySwNusMCdmS3z95WD2AQ@mail.gmail.com/#t.
Thanks,
Suren.
>
> I'm not sure if it's better to clear here, maybe we can add a comment in get_recent_times()
> when do PSI_NONIDLE clear?
>
> Thanks!
>
On Tue, Oct 11, 2022 at 4:38 AM Hillf Danton <[email protected]> wrote:
>
> On 10 Oct 2022 14:16:26 -0700 Suren Baghdasaryan <[email protected]>
> > On Mon, Oct 10, 2022 at 3:57 AM Hillf Danton <[email protected]> wrote:
> > > On 13 Sep 2022 19:38:17 +0530 Pavan Kondeti <[email protected]>
> > > > Hi
> > > >
> > > > The fact that psi_avgs_work()->collect_percpu_times()->get_recent_times()
> > > > run from a kworker thread, PSI_NONIDLE condition would be observed as
> > > > there is a RUNNING task. So we would always end up re-arming the work.
> > > >
> > > > If the work is re-armed from the psi_avgs_work() it self, the backing off
> > > > logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> > > > help. The work is already scheduled. so we don't do anything there.
> > > >
> > > > Probably I am missing some thing here. Can you please clarify how we
> > > > shut off re-arming the psi avg work?
> > >
> > > Instead of open coding schedule_delayed_work() in bid to check if timer
> > > hits the idle task (see delayed_work_timer_fn()), the idle task is tracked
> > > in psi_task_switch() and checked by kworker to see if it preempted the idle
> > > task.
> > >
> > > Only for thoughts now.
> > >
> > > Hillf
> > >
> > > +++ b/kernel/sched/psi.c
> > > @@ -412,6 +412,8 @@ static u64 update_averages(struct psi_gr
> > > return avg_next_update;
> > > }
> > >
> > > +static DEFINE_PER_CPU(int, prev_task_is_idle);
> > > +
> > > static void psi_avgs_work(struct work_struct *work)
> > > {
> > > struct delayed_work *dwork;
> > > @@ -439,7 +441,7 @@ static void psi_avgs_work(struct work_st
> > > if (now >= group->avg_next_update)
> > > group->avg_next_update = update_averages(group, now);
> > >
> > > - if (nonidle) {
> > > + if (nonidle && 0 == per_cpu(prev_task_is_idle, raw_smp_processor_id())) {
> >
> > This condition would be incorrect if nonidle was set by a cpu other
> > than raw_smp_processor_id() and
> > prev_task_is_idle[raw_smp_processor_id()] == 0.
>
> Thanks for taking a look.
Thanks for the suggestion!
>
> > IOW, if some activity happens on a non-current cpu, we would fail to
> > reschedule psi_avgs_work for it.
>
> Given activities on remote CPUs, can you specify what prevents psi_avgs_work
> from being scheduled on remote CPUs if for example the local CPU has been
> idle for a second?
I'm not a scheduler expert but I can imagine some work that finished
running on a big core A and generated some activity since the last
time psi_avgs_work executed. With no other activity the next
psi_avgs_work could be scheduled on a small core B to conserve power.
There might be other cases involving cpuset limitation changes or cpu
offlining but I didn't think too hard about these. The bottom line, I
don't think we should be designing mechanisms which rely on
assumptions about how tasks will be scheduled. Even if these
assumptions are correct today they might change in the future and
things will break in unexpected places.
>
> > This can be fixed in collect_percpu_times() by
> > considering prev_task_is_idle for all other CPUs as well. However
> > Chengming's approach seems simpler to me TBH and does not require an
> > additional per-cpu variable.
>
> Good ideas are always welcome.
No question about that. Thanks!
On 2022/10/12 01:00, Suren Baghdasaryan wrote:
> On Mon, Oct 10, 2022 at 5:07 PM Chengming Zhou
> <[email protected]> wrote:
>>
>> Hello,
>>
>> On 2022/10/11 05:21, Suren Baghdasaryan wrote:
>>> On Mon, Oct 10, 2022 at 3:42 AM Chengming Zhou
>>> <[email protected]> wrote:
>>>>
>>>> Pavan reported a problem that PSI avgs_work idle shutoff is not
>>>> working at all, because the PSI_NONIDLE condition is observed in
>>>> psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
>>>> the kworker running avgs_work is the only task on the CPU.
>>>>
>>>> Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
>>>> avoided the ping-pong wake problem when the worker sleeps, psi_avgs_work()
>>>> will still always re-arm the avgs_work, so shutoff is not working.
>>>>
>>>> This patch instead considers the current CPU's groupc as IDLE if the
>>>> kworker running avgs_work is the only task running and there are no
>>>> IOWAIT or MEMSTALL sleeping tasks, in which case we shut off the
>>>> avgs_work if the other CPUs' groupc are also IDLE.
>>>>
>>>> One potential problem is that the brief period of non-idle time
>>>> incurred between the aggregation run and the kworker's dequeue will
>>>> be stranded in the per-cpu buckets until avgs_work runs next time.
>>>> The buckets can hold 4s worth of time, and future activity will wake
>>>> the avgs_work with a 2s delay, giving us 2s worth of data we can leave
>>>> behind when shutting off the avgs_work. If the kworker runs other work
>>>> after avgs_work is shut off and there is no scheduler activity for 2s,
>>>> this may be a problem.
>>>>
>>>> Reported-by: Pavan Kondeti <[email protected]>
>>>> Signed-off-by: Chengming Zhou <[email protected]>
>>>
>>> Copying my comments from
>>> https://lore.kernel.org/all/CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com/
>>> in case you want to continue the discussion here...
>>>
>>>> ---
>>>> kernel/sched/psi.c | 15 +++++++++++++++
>>>> 1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>>>> index ee2ecc081422..f4cdf6f184ba 100644
>>>> --- a/kernel/sched/psi.c
>>>> +++ b/kernel/sched/psi.c
>>>> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>> u32 *pchanged_states)
>>>> {
>>>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>>>> + int current_cpu = raw_smp_processor_id();
>>>> + bool only_avgs_work = false;
>>>> u64 now, state_start;
>>>> enum psi_states s;
>>>> unsigned int seq;
>>>> @@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>> memcpy(times, groupc->times, sizeof(groupc->times));
>>>> state_mask = groupc->state_mask;
>>>> state_start = groupc->state_start;
>>>> + /*
>>>> + * This CPU has only avgs_work kworker running, snapshot the
>>>> + * newest times then don't need to re-arm for this groupc.
>>>> + * Normally this kworker will sleep soon and won't wake
>>>> + * avgs_work back up in psi_group_change().
>>>> + */
>>>> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>>>> + !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>>>> + only_avgs_work = true;
>>>
>>> Why do you determine only_avgs_work while taking a snapshot? The
>>> read_seqcount_retry() might fail and the loop gets retried, which
>>> might lead to a wrong only_avgs_work value if the state changes
>>> between retries. I think it's safer to do this after the snapshot was
>>> taken and to use tasks[NR_RUNNING] instead of groupc->tasks.
>>
>> Ah, you are right, copying groupc->tasks[NR_RUNNING] and tasks[NR_IOWAIT], tasks[NR_MEMSTALL]
>> is ok for me. (Maybe we only need to copy groupc->tasks[NR_RUNNING]?)
>>
>> Another way is to add an else branch:
>>
>> if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>> !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>> only_avgs_work = true;
>> else
>> only_avgs_work = false;
>>
>> Both are ok for me.
>
> Let's use the simple way and use the tasks[] after the snapshot is taken.
Ok, I changed it like this:
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ int current_cpu = raw_smp_processor_id();
+ unsigned int tasks[NR_PSI_TASK_COUNTS];
u64 now, state_start;
enum psi_states s;
unsigned int seq;
@@ -256,6 +258,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
memcpy(times, groupc->times, sizeof(groupc->times));
state_mask = groupc->state_mask;
state_start = groupc->state_start;
+ if (cpu == current_cpu)
+ memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
} while (read_seqcount_retry(&groupc->seq, seq));
>
>>
>>>
>>>> } while (read_seqcount_retry(&groupc->seq, seq));
>>>>
>>>> /* Calculate state time deltas against the previous snapshot */
>>>> @@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>> if (delta)
>>>> *pchanged_states |= (1 << s);
>>>> }
>>>> +
>>>> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
>>>> + if (only_avgs_work)
>>>> + *pchanged_states &= ~(1 << PSI_NONIDLE);
>>>
>>> This seems to be safe because changed_states&(1<< PSI_NONIDLE) is used
>>> only for re-arming psi_avgs_work, however semantically this is
>>> incorrect. The CPU was not idle when it was executing psi_avgs_work.
>>> IMO a separate flag would avoid this confusion.
>>
>> Yes, it's safe, but it has this confusion. Using a separate flag looks better, like PSI_ONLY_AVGS_WORK.
>> But then in collect_percpu_times() we still have to clear PSI_NONIDLE of this CPU if PSI_ONLY_AVGS_WORK
>> has been set.
>>
>> for_each_possible_cpu(cpu) {
>> u32 times[NR_PSI_STATES];
>> u32 nonidle;
>> u32 cpu_changed_states;
>>
>> get_recent_times(group, cpu, aggregator, times,
>> &cpu_changed_states);
>> changed_states |= cpu_changed_states;
>>
>> cpu_changed_states should clear PSI_NONIDLE if PSI_ONLY_AVGS_WORK already set.
>
> No, PSI_NONIDLE should not be affected by PSI_ONLY_AVGS_WORK. These
> flags should be independent and aggregated into changed_states without
> affecting each other. Something similar to how I suggested with
> PSI_STATE_WAKE_CLOCK in
> https://lore.kernel.org/all/CAJuCfpFr3JfwkWbDqkU=NUJbCYuCWGySwNusMCdmS3z95WD2AQ@mail.gmail.com/#t.
If we don't clear PSI_NONIDLE for this CPU, then after changed_states |= cpu_changed_states;
changed_states has PSI_NONIDLE set, and we won't know in psi_avgs_work()
whether the other CPUs are IDLE or not.
Thanks.
On Tue, Oct 11, 2022 at 11:20 PM Hillf Danton <[email protected]> wrote:
>
> On 11 Oct 2022 10:11:58 -0700 Suren Baghdasaryan <[email protected]>
> >On Tue, Oct 11, 2022 at 4:38 AM Hillf Danton <[email protected]> wrote:
> >>
> >> Given activities on remote CPUs, can you specify what prevents psi_avgs_work
> >> from being scheduled on remote CPUs if for example the local CPU has been
> >> idle for a second?
> >
> > I'm not a scheduler expert but I can imagine some work that finished
> > running on a big core A and generated some activity since the last
> > time psi_avgs_work executed. With no other activity the next
> > psi_avgs_work could be scheduled on a small core B to conserve power.
>
> Given core A and B, nothing prevents.
>
> > There might be other cases involving cpuset limitation changes or cpu
> > offlining but I didn't think too hard about these. The bottom line, I
> > don't think we should be designing mechanisms which rely on
> > assumptions about how tasks will be scheduled. Even if these
>
> The tasks here makes me guess that we are on different pages - scheduling
> work has little to do with how tasks are scheduled, and is no more than
> queuing work on the system_wq in the case of psi_avgs_work,
I must have misunderstood your question then. My original concern was
that in the above example your suggested patch would not reschedule
psi_avgs_work to aggregate the activity recorded from core A. That is
easily fixable, but it looks like a simpler approach is possible.
>
> > assumptions are correct today they might change in the future and
> > things will break in unexpected places.
>
> with nothing assumed.
>
On Tue, Oct 11, 2022 at 7:11 PM Chengming Zhou
<[email protected]> wrote:
>
> On 2022/10/12 01:00, Suren Baghdasaryan wrote:
> > On Mon, Oct 10, 2022 at 5:07 PM Chengming Zhou
> > <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> On 2022/10/11 05:21, Suren Baghdasaryan wrote:
> >>> On Mon, Oct 10, 2022 at 3:42 AM Chengming Zhou
> >>> <[email protected]> wrote:
> >>>>
> >>>> Pavan reported a problem that PSI avgs_work idle shutoff is not
> >>>> working at all, because the PSI_NONIDLE condition is observed in
> >>>> psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
> >>>> the kworker running avgs_work is the only task on the CPU.
> >>>>
> >>>> Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
> >>>> avoided the ping-pong wake problem when the worker sleeps, psi_avgs_work()
> >>>> will still always re-arm the avgs_work, so shutoff is not working.
> >>>>
> >>>> This patch instead considers the current CPU's groupc as IDLE if the
> >>>> kworker running avgs_work is the only task running and there are no
> >>>> IOWAIT or MEMSTALL sleeping tasks, in which case we shut off the
> >>>> avgs_work if the other CPUs' groupc are also IDLE.
> >>>>
> >>>> One potential problem is that the brief period of non-idle time
> >>>> incurred between the aggregation run and the kworker's dequeue will
> >>>> be stranded in the per-cpu buckets until avgs_work runs next time.
> >>>> The buckets can hold 4s worth of time, and future activity will wake
> >>>> the avgs_work with a 2s delay, giving us 2s worth of data we can leave
> >>>> behind when shutting off the avgs_work. If the kworker runs other work
> >>>> after avgs_work is shut off and there is no scheduler activity for 2s,
> >>>> this may be a problem.
> >>>>
> >>>> Reported-by: Pavan Kondeti <[email protected]>
> >>>> Signed-off-by: Chengming Zhou <[email protected]>
> >>>
> >>> Copying my comments from
> >>> https://lore.kernel.org/all/CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com/
> >>> in case you want to continue the discussion here...
> >>>
> >>>> ---
> >>>> kernel/sched/psi.c | 15 +++++++++++++++
> >>>> 1 file changed, 15 insertions(+)
> >>>>
> >>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> >>>> index ee2ecc081422..f4cdf6f184ba 100644
> >>>> --- a/kernel/sched/psi.c
> >>>> +++ b/kernel/sched/psi.c
> >>>> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >>>> u32 *pchanged_states)
> >>>> {
> >>>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> >>>> + int current_cpu = raw_smp_processor_id();
> >>>> + bool only_avgs_work = false;
> >>>> u64 now, state_start;
> >>>> enum psi_states s;
> >>>> unsigned int seq;
> >>>> @@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >>>> memcpy(times, groupc->times, sizeof(groupc->times));
> >>>> state_mask = groupc->state_mask;
> >>>> state_start = groupc->state_start;
> >>>> + /*
> >>>> + * This CPU has only avgs_work kworker running, snapshot the
> >>>> + * newest times then don't need to re-arm for this groupc.
> >>>> + * Normally this kworker will sleep soon and won't wake
> >>>> + * avgs_work back up in psi_group_change().
> >>>> + */
> >>>> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
> >>>> + !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
> >>>> + only_avgs_work = true;
> >>>
> >>> Why do you determine only_avgs_work while taking a snapshot? The
> >>> read_seqcount_retry() might fail and the loop gets retried, which
> >>> might lead to a wrong only_avgs_work value if the state changes
> >>> between retries. I think it's safer to do this after the snapshot was
> >>> taken and to use tasks[NR_RUNNING] instead of groupc->tasks.
> >>
> >> Ah, you are right, copying groupc->tasks[NR_RUNNING] and tasks[NR_IOWAIT], tasks[NR_MEMSTALL]
> >> is ok for me. (Maybe we only need to copy groupc->tasks[NR_RUNNING]?)
> >>
> >> Another way is to add an else branch:
> >>
> >> if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
> >> !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
> >> only_avgs_work = true;
> >> else
> >> only_avgs_work = false;
> >>
> >> Both are ok for me.
> >
> > Let's use the simple way and use the tasks[] after the snapshot is taken.
>
> Ok, I changed it like this:
>
> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> + int current_cpu = raw_smp_processor_id();
> + unsigned int tasks[NR_PSI_TASK_COUNTS];
> u64 now, state_start;
> enum psi_states s;
> unsigned int seq;
> @@ -256,6 +258,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
> memcpy(times, groupc->times, sizeof(groupc->times));
> state_mask = groupc->state_mask;
> state_start = groupc->state_start;
> + if (cpu == current_cpu)
> + memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
> } while (read_seqcount_retry(&groupc->seq, seq));
>
> >
> >>
> >>>
> >>>> } while (read_seqcount_retry(&groupc->seq, seq));
> >>>>
> >>>> /* Calculate state time deltas against the previous snapshot */
> >>>> @@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
> >>>> if (delta)
> >>>> *pchanged_states |= (1 << s);
> >>>> }
> >>>> +
> >>>> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
> >>>> + if (only_avgs_work)
> >>>> + *pchanged_states &= ~(1 << PSI_NONIDLE);
> >>>
> >>> This seems to be safe because changed_states&(1<< PSI_NONIDLE) is used
> >>> only for re-arming psi_avgs_work, however semantically this is
> >>> incorrect. The CPU was not idle when it was executing psi_avgs_work.
> >>> IMO a separate flag would avoid this confusion.
> >>
> >> Yes, it's safe, but it has this confusion. Using a separate flag looks better, like PSI_ONLY_AVGS_WORK.
> >> But then in collect_percpu_times() we still have to clear PSI_NONIDLE of this CPU if PSI_ONLY_AVGS_WORK
> >> has been set.
> >>
> >> for_each_possible_cpu(cpu) {
> >> u32 times[NR_PSI_STATES];
> >> u32 nonidle;
> >> u32 cpu_changed_states;
> >>
> >> get_recent_times(group, cpu, aggregator, times,
> >> &cpu_changed_states);
> >> changed_states |= cpu_changed_states;
> >>
> >> cpu_changed_states should clear PSI_NONIDLE if PSI_ONLY_AVGS_WORK already set.
> >
> > No, PSI_NONIDLE should not be affected by PSI_ONLY_AVGS_WORK. These
> > flags should be independent and aggregated into changed_states without
> > affecting each other. Something similar to how I suggested with
> > PSI_STATE_WAKE_CLOCK in
> > https://lore.kernel.org/all/CAJuCfpFr3JfwkWbDqkU=NUJbCYuCWGySwNusMCdmS3z95WD2AQ@mail.gmail.com/#t.
>
> If we don't clear PSI_NONIDLE for this CPU, then after changed_states |= cpu_changed_states;
> changed_states has PSI_NONIDLE set, and we won't know in psi_avgs_work()
> whether the other CPUs are IDLE or not.
I was thinking something like this:
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -244,6 +244,7 @@ static void get_recent_times(struct psi_group
*group, int cpu,
enum psi_states s;
unsigned int seq;
u32 state_mask;
+ bool reschedule;
*pchanged_states = 0;
@@ -254,6 +255,14 @@ static void get_recent_times(struct psi_group
*group, int cpu,
memcpy(times, groupc->times, sizeof(groupc->times));
state_mask = groupc->state_mask;
state_start = groupc->state_start;
+ if (current_cpu == cpu)
+ reschedule = groupc->tasks[NR_RUNNING] +
+ groupc->tasks[NR_IOWAIT] +
+ groupc->tasks[NR_MEMSTALL] > 1;
+ else
+ reschedule = times[PSI_NONIDLE] >
+ groupc->times_prev[aggregator][PSI_NONIDLE];
+
} while (read_seqcount_retry(&groupc->seq, seq));
/* Calculate state time deltas against the previous snapshot */
@@ -278,6 +287,8 @@ static void get_recent_times(struct psi_group
*group, int cpu,
if (delta)
*pchanged_states |= (1 << s);
}
+ if (reschedule)
+ *pchanged_states |= (1 << PSI_STATE_RESCHEDULE);
}
static void calc_avgs(unsigned long avg[3], int missed_periods,
@@ -413,7 +424,7 @@ static void psi_avgs_work(struct work_struct *work)
struct delayed_work *dwork;
struct psi_group *group;
u32 changed_states;
- bool nonidle;
+ bool reschedule;
u64 now;
dwork = to_delayed_work(work);
@@ -424,7 +435,7 @@ static void psi_avgs_work(struct work_struct *work)
now = sched_clock();
collect_percpu_times(group, PSI_AVGS, &changed_states);
- nonidle = changed_states & (1 << PSI_NONIDLE);
+ reschedule = changed_states & (1 << PSI_STATE_RESCHEDULE);
/*
* If there is task activity, periodically fold the per-cpu
* times and feed samples into the running averages. If things
@@ -435,7 +446,7 @@ static void psi_avgs_work(struct work_struct *work)
if (now >= group->avg_next_update)
group->avg_next_update = update_averages(group, now);
- if (nonidle) {
+ if (reschedule) {
schedule_delayed_work(dwork, nsecs_to_jiffies(
group->avg_next_update - now) + 1);
}
Does that address your concern?
>
> Thanks.
On 2022/10/13 02:24, Suren Baghdasaryan wrote:
> On Tue, Oct 11, 2022 at 7:11 PM Chengming Zhou
> <[email protected]> wrote:
>>
>> On 2022/10/12 01:00, Suren Baghdasaryan wrote:
>>> On Mon, Oct 10, 2022 at 5:07 PM Chengming Zhou
>>> <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> On 2022/10/11 05:21, Suren Baghdasaryan wrote:
>>>>> On Mon, Oct 10, 2022 at 3:42 AM Chengming Zhou
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Pavan reported a problem that PSI avgs_work idle shutoff is not
>>>>>> working at all, because the PSI_NONIDLE condition is observed in
>>>>>> psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
>>>>>> the kworker running avgs_work is the only task on the CPU.
>>>>>>
>>>>>> Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
>>>>>> avoided the ping-pong wake problem when the worker sleeps, psi_avgs_work()
>>>>>> will still always re-arm the avgs_work, so shutoff is not working.
>>>>>>
>>>>>> This patch instead considers the current CPU's groupc as IDLE if the
>>>>>> kworker running avgs_work is the only task running and there are no
>>>>>> IOWAIT or MEMSTALL sleeping tasks, in which case we shut off the
>>>>>> avgs_work if the other CPUs' groupc are also IDLE.
>>>>>>
>>>>>> One potential problem is that the brief period of non-idle time
>>>>>> incurred between the aggregation run and the kworker's dequeue will
>>>>>> be stranded in the per-cpu buckets until avgs_work runs next time.
>>>>>> The buckets can hold 4s worth of time, and future activity will wake
>>>>>> the avgs_work with a 2s delay, giving us 2s worth of data we can leave
>>>>>> behind when shutting off the avgs_work. If the kworker runs other work
>>>>>> after avgs_work is shut off and there is no scheduler activity for 2s,
>>>>>> this may be a problem.
>>>>>>
>>>>>> Reported-by: Pavan Kondeti <[email protected]>
>>>>>> Signed-off-by: Chengming Zhou <[email protected]>
>>>>>
>>>>> Copying my comments from
>>>>> https://lore.kernel.org/all/CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com/
>>>>> in case you want to continue the discussion here...
>>>>>
>>>>>> ---
>>>>>> kernel/sched/psi.c | 15 +++++++++++++++
>>>>>> 1 file changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>>>>>> index ee2ecc081422..f4cdf6f184ba 100644
>>>>>> --- a/kernel/sched/psi.c
>>>>>> +++ b/kernel/sched/psi.c
>>>>>> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> u32 *pchanged_states)
>>>>>> {
>>>>>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>>>>>> + int current_cpu = raw_smp_processor_id();
>>>>>> + bool only_avgs_work = false;
>>>>>> u64 now, state_start;
>>>>>> enum psi_states s;
>>>>>> unsigned int seq;
>>>>>> @@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> memcpy(times, groupc->times, sizeof(groupc->times));
>>>>>> state_mask = groupc->state_mask;
>>>>>> state_start = groupc->state_start;
>>>>>> + /*
>>>>>> + * This CPU has only avgs_work kworker running, snapshot the
>>>>>> + * newest times then don't need to re-arm for this groupc.
>>>>>> + * Normally this kworker will sleep soon and won't wake
>>>>>> + * avgs_work back up in psi_group_change().
>>>>>> + */
>>>>>> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>>>>>> + !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>>>>>> + only_avgs_work = true;
>>>>>
>>>>> Why do you determine only_avgs_work while taking a snapshot? The
>>>>> read_seqcount_retry() might fail and the loop gets retried, which
>>>>> might lead to a wrong only_avgs_work value if the state changes
>>>>> between retries. I think it's safer to do this after the snapshot was
>>>>> taken and to use tasks[NR_RUNNING] instead of groupc->tasks.
>>>>
>>>> Ah, you are right, copying groupc->tasks[NR_RUNNING] and tasks[NR_IOWAIT], tasks[NR_MEMSTALL]
>>>> is ok for me. (Maybe we only need to copy groupc->tasks[NR_RUNNING]?)
>>>>
>>>> Another way is to add an else branch:
>>>>
>>>> if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>>>> !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>>>> only_avgs_work = true;
>>>> else
>>>> only_avgs_work = false;
>>>>
>>>> Both are ok for me.
>>>
>>> Let's use the simple way and use the tasks[] after the snapshot is taken.
>>
>> Ok, I changed it like this:
>>
>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>> + int current_cpu = raw_smp_processor_id();
>> + unsigned int tasks[NR_PSI_TASK_COUNTS];
>> u64 now, state_start;
>> enum psi_states s;
>> unsigned int seq;
>> @@ -256,6 +258,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>> memcpy(times, groupc->times, sizeof(groupc->times));
>> state_mask = groupc->state_mask;
>> state_start = groupc->state_start;
>> + if (cpu == current_cpu)
>> + memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
>> } while (read_seqcount_retry(&groupc->seq, seq));
>>
>>>
>>>>
>>>>>
>>>>>> } while (read_seqcount_retry(&groupc->seq, seq));
>>>>>>
>>>>>> /* Calculate state time deltas against the previous snapshot */
>>>>>> @@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> if (delta)
>>>>>> *pchanged_states |= (1 << s);
>>>>>> }
>>>>>> +
>>>>>> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
>>>>>> + if (only_avgs_work)
>>>>>> + *pchanged_states &= ~(1 << PSI_NONIDLE);
>>>>>
>>>>> This seems to be safe because changed_states&(1<< PSI_NONIDLE) is used
>>>>> only for re-arming psi_avgs_work, however semantically this is
>>>>> incorrect. The CPU was not idle when it was executing psi_avgs_work.
>>>>> IMO a separate flag would avoid this confusion.
>>>>
>>>> Yes, it's safe, but it has this confusion. Using a separate flag looks better, like PSI_ONLY_AVGS_WORK.
>>>> But then in collect_percpu_times() we still have to clear PSI_NONIDLE of this CPU if PSI_ONLY_AVGS_WORK
>>>> has been set.
>>>>
>>>> for_each_possible_cpu(cpu) {
>>>> u32 times[NR_PSI_STATES];
>>>> u32 nonidle;
>>>> u32 cpu_changed_states;
>>>>
>>>> get_recent_times(group, cpu, aggregator, times,
>>>> &cpu_changed_states);
>>>> changed_states |= cpu_changed_states;
>>>>
>>>> cpu_changed_states should clear PSI_NONIDLE if PSI_ONLY_AVGS_WORK already set.
>>>
>>> No, PSI_NONIDLE should not be affected by PSI_ONLY_AVGS_WORK. These
>>> flags should be independent and aggregated into changed_states without
>>> affecting each other. Something similar to how I suggested with
>>> PSI_STATE_WAKE_CLOCK in
>>> https://lore.kernel.org/all/CAJuCfpFr3JfwkWbDqkU=NUJbCYuCWGySwNusMCdmS3z95WD2AQ@mail.gmail.com/#t.
>>
>> If we don't clear PSI_NONIDLE for this CPU, then after changed_states |= cpu_changed_states;
>> changed_states has PSI_NONIDLE set, and we won't know in psi_avgs_work()
>> whether the other CPUs are IDLE or not.
>
> I was thinking something like this:
>
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -244,6 +244,7 @@ static void get_recent_times(struct psi_group
> *group, int cpu,
> enum psi_states s;
> unsigned int seq;
> u32 state_mask;
> + bool reschedule;
>
> *pchanged_states = 0;
>
> @@ -254,6 +255,14 @@ static void get_recent_times(struct psi_group
> *group, int cpu,
> memcpy(times, groupc->times, sizeof(groupc->times));
> state_mask = groupc->state_mask;
> state_start = groupc->state_start;
> + if (current_cpu == cpu)
> + reschedule = groupc->tasks[NR_RUNNING] +
> + groupc->tasks[NR_IOWAIT] +
> + groupc->tasks[NR_MEMSTALL] > 1;
> + else
> + reschedule = times[PSI_NONIDLE] >
> + groupc->times_prev[aggregator][PSI_NONIDLE];
> +
> } while (read_seqcount_retry(&groupc->seq, seq));
Ok, I get what you mean: for other CPUs, reschedule follows PSI_NONIDLE,
while for the current CPU, reschedule is calculated from groupc->tasks[].
Then psi_avgs_work() uses PSI_STATE_RESCHEDULE instead of PSI_NONIDLE,
so PSI_NONIDLE becomes unused in changed_states. Right?
Well, I'm not sure this is easier than just using PSI_NONIDLE. But anyway,
I will try this way in the next revision.
Thanks.
>
> /* Calculate state time deltas against the previous snapshot */
> @@ -278,6 +287,8 @@ static void get_recent_times(struct psi_group
> *group, int cpu,
> if (delta)
> *pchanged_states |= (1 << s);
> }
> + if (reschedule)
> + *pchanged_states |= (1 << PSI_STATE_RESCHEDULE);
> }
>
> static void calc_avgs(unsigned long avg[3], int missed_periods,
> @@ -413,7 +424,7 @@ static void psi_avgs_work(struct work_struct *work)
> struct delayed_work *dwork;
> struct psi_group *group;
> u32 changed_states;
> - bool nonidle;
> + bool reschedule;
> u64 now;
>
> dwork = to_delayed_work(work);
> @@ -424,7 +435,7 @@ static void psi_avgs_work(struct work_struct *work)
> now = sched_clock();
>
> collect_percpu_times(group, PSI_AVGS, &changed_states);
> - nonidle = changed_states & (1 << PSI_NONIDLE);
> + reschedule = changed_states & (1 << PSI_STATE_RESCHEDULE);
> /*
> * If there is task activity, periodically fold the per-cpu
> * times and feed samples into the running averages. If things
> @@ -435,7 +446,7 @@ static void psi_avgs_work(struct work_struct *work)
> if (now >= group->avg_next_update)
> group->avg_next_update = update_averages(group, now);
>
> - if (nonidle) {
> + if (reschedule) {
> schedule_delayed_work(dwork, nsecs_to_jiffies(
> group->avg_next_update - now) + 1);
> }
>
> Does that address your concern?
>
>>
>> Thanks.
On 2022/10/13 02:24, Suren Baghdasaryan wrote:
> On Tue, Oct 11, 2022 at 7:11 PM Chengming Zhou
> <[email protected]> wrote:
>>
>> On 2022/10/12 01:00, Suren Baghdasaryan wrote:
>>> On Mon, Oct 10, 2022 at 5:07 PM Chengming Zhou
>>> <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> On 2022/10/11 05:21, Suren Baghdasaryan wrote:
>>>>> On Mon, Oct 10, 2022 at 3:42 AM Chengming Zhou
>>>>> <[email protected]> wrote:
>>>>>>
> >>>>>> Pavan reported a problem that PSI avgs_work idle shutoff is not
> >>>>>> working at all, because the PSI_NONIDLE condition is observed in
> >>>>>> psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
> >>>>>> the kworker running avgs_work is the only task on the CPU.
> >>>>>>
> >>>>>> Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
> >>>>>> avoided the ping-pong wake problem when the worker sleeps, psi_avgs_work()
> >>>>>> will still always re-arm the avgs_work, so shutoff is not working.
> >>>>>>
> >>>>>> This patch instead considers the current CPU's groupc as IDLE if the
> >>>>>> kworker running avgs_work is the only task running and there are no
> >>>>>> IOWAIT or MEMSTALL sleeping tasks, in which case we shut off the
> >>>>>> avgs_work if the other CPUs' groupc are also IDLE.
> >>>>>>
> >>>>>> One potential problem is that the brief period of non-idle time
> >>>>>> incurred between the aggregation run and the kworker's dequeue will
> >>>>>> be stranded in the per-cpu buckets until avgs_work runs next time.
> >>>>>> The buckets can hold 4s worth of time, and future activity will wake
> >>>>>> the avgs_work with a 2s delay, giving us 2s worth of data we can leave
> >>>>>> behind when shutting off the avgs_work. If the kworker runs other work
> >>>>>> after avgs_work is shut off and there is no scheduler activity for 2s,
> >>>>>> this may be a problem.
>>>>>>
>>>>>> Reported-by: Pavan Kondeti <[email protected]>
>>>>>> Signed-off-by: Chengming Zhou <[email protected]>
>>>>>
>>>>> Copying my comments from
>>>>> https://lore.kernel.org/all/CAJuCfpHyYMak-mfVmtEN9Z-hGYQ6Wko57TLjukz9HaN26EDAuA@mail.gmail.com/
>>>>> in case you want to continue the discussion here...
>>>>>
>>>>>> ---
>>>>>> kernel/sched/psi.c | 15 +++++++++++++++
>>>>>> 1 file changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>>>>>> index ee2ecc081422..f4cdf6f184ba 100644
>>>>>> --- a/kernel/sched/psi.c
>>>>>> +++ b/kernel/sched/psi.c
>>>>>> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> u32 *pchanged_states)
>>>>>> {
>>>>>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>>>>>> + int current_cpu = raw_smp_processor_id();
>>>>>> + bool only_avgs_work = false;
>>>>>> u64 now, state_start;
>>>>>> enum psi_states s;
>>>>>> unsigned int seq;
>>>>>> @@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> memcpy(times, groupc->times, sizeof(groupc->times));
>>>>>> state_mask = groupc->state_mask;
>>>>>> state_start = groupc->state_start;
>>>>>> + /*
>>>>>> + * This CPU has only avgs_work kworker running, snapshot the
>>>>>> + * newest times then don't need to re-arm for this groupc.
>>>>>> + * Normally this kworker will sleep soon and won't wake
>>>>>> + * avgs_work back up in psi_group_change().
>>>>>> + */
>>>>>> + if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>>>>>> + !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>>>>>> + only_avgs_work = true;
>>>>>
>>>>> Why do you determine only_avgs_work while taking a snapshot? The
>>>>> read_seqcount_retry() might fail and the loop gets retried, which
>>>>> might lead to a wrong only_avgs_work value if the state changes
>>>>> between retries. I think it's safer to do this after the snapshot was
> >>>>> taken and to use tasks[NR_RUNNING] instead of groupc->tasks.
>>>>
> >>>> Ah, you are right, copying groupc->tasks[NR_RUNNING] and tasks[NR_IOWAIT], tasks[NR_MEMSTALL]
>>>> is ok for me. (Maybe we only need to copy groupc->tasks[NR_RUNNING]?)
>>>>
>>>> Another way is to add an else branch:
>>>>
>>>> if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
>>>> !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
>>>> only_avgs_work = true;
>>>> else
>>>> only_avgs_work = false;
>>>>
>>>> Both are ok for me.
>>>
>>> Let's use the simple way and use the tasks[] after the snapshot is taken.
>>
>> Ok, I changed it like this:
>>
>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>> + int current_cpu = raw_smp_processor_id();
>> + unsigned int tasks[NR_PSI_TASK_COUNTS];
>> u64 now, state_start;
>> enum psi_states s;
>> unsigned int seq;
>> @@ -256,6 +258,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>> memcpy(times, groupc->times, sizeof(groupc->times));
>> state_mask = groupc->state_mask;
>> state_start = groupc->state_start;
>> + if (cpu == current_cpu)
>> + memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
>> } while (read_seqcount_retry(&groupc->seq, seq));
>>
>>>
>>>>
>>>>>
>>>>>> } while (read_seqcount_retry(&groupc->seq, seq));
>>>>>>
>>>>>> /* Calculate state time deltas against the previous snapshot */
>>>>>> @@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>>>>> if (delta)
>>>>>> *pchanged_states |= (1 << s);
>>>>>> }
>>>>>> +
>>>>>> + /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
>>>>>> + if (only_avgs_work)
>>>>>> + *pchanged_states &= ~(1 << PSI_NONIDLE);
>>>>>
>>>>> This seems to be safe because changed_states&(1<< PSI_NONIDLE) is used
>>>>> only for re-arming psi_avgs_work, however semantically this is
>>>>> incorrect. The CPU was not idle when it was executing psi_avgs_work.
>>>>> IMO a separate flag would avoid this confusion.
>>>>
> >>>> Yes, it's safe, but it has this confusion. Using a separate flag looks better, like PSI_ONLY_AVGS_WORK.
>>>> But then in collect_percpu_times() we still have to clear PSI_NONIDLE of this CPU if PSI_ONLY_AVGS_WORK
>>>> has been set.
>>>>
>>>> for_each_possible_cpu(cpu) {
>>>> u32 times[NR_PSI_STATES];
>>>> u32 nonidle;
>>>> u32 cpu_changed_states;
>>>>
>>>> get_recent_times(group, cpu, aggregator, times,
>>>> &cpu_changed_states);
>>>> changed_states |= cpu_changed_states;
>>>>
>>>> cpu_changed_states should clear PSI_NONIDLE if PSI_ONLY_AVGS_WORK already set.
>>>
>>> No, PSI_NONIDLE should not be affected by PSI_ONLY_AVGS_WORK. These
>>> flags should be independent and aggregated into changed_states without
>>> affecting each other. Something similar to how I suggested with
>>> PSI_STATE_WAKE_CLOCK in
>>> https://lore.kernel.org/all/CAJuCfpFr3JfwkWbDqkU=NUJbCYuCWGySwNusMCdmS3z95WD2AQ@mail.gmail.com/#t.
>>
>> If we don't clear PSI_NONIDLE for this CPU, then after changed_states |= cpu_changed_states;
>> changed_states has PSI_NONIDLE set, and in psi_avgs_work() we can't tell whether
>> the other CPUs are idle or not.
>
> I was thinking something like this:
>
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -244,6 +244,7 @@ static void get_recent_times(struct psi_group *group, int cpu,
> enum psi_states s;
> unsigned int seq;
> u32 state_mask;
> + bool reschedule;
>
> *pchanged_states = 0;
>
> @@ -254,6 +255,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
> memcpy(times, groupc->times, sizeof(groupc->times));
> state_mask = groupc->state_mask;
> state_start = groupc->state_start;
> + if (current_cpu == cpu)
> + reschedule = groupc->tasks[NR_RUNNING] +
> + groupc->tasks[NR_IOWAIT] +
> + groupc->tasks[NR_MEMSTALL] > 1;
> + else
> + reschedule = times[PSI_NONIDLE] >
> + groupc->times_prev[aggregator][PSI_NONIDLE];
> +
> } while (read_seqcount_retry(&groupc->seq, seq));
We can't use times[] here because it will incorporate currently active states
on the CPU.
Should I still need to copy groupc->tasks[] out for the current_cpu as you
suggested before?
>
> /* Calculate state time deltas against the previous snapshot */
> @@ -278,6 +287,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
> if (delta)
> *pchanged_states |= (1 << s);
> }
> + if (reschedule)
> + *pchanged_states |= PSI_STATE_RESCHEDULE;
Is this ok?
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4bd3913dd176..0cbcbf89a934 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
u32 *pchanged_states)
{
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ int current_cpu = raw_smp_processor_id();
+ bool reschedule;
u64 now, state_start;
enum psi_states s;
unsigned int seq;
@@ -256,6 +258,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
memcpy(times, groupc->times, sizeof(groupc->times));
state_mask = groupc->state_mask;
state_start = groupc->state_start;
+ if (cpu == current_cpu)
+ reschedule = groupc->tasks[NR_RUNNING] +
+ groupc->tasks[NR_IOWAIT] +
+ groupc->tasks[NR_MEMSTALL] > 1;
} while (read_seqcount_retry(&groupc->seq, seq));
/* Calculate state time deltas against the previous snapshot */
@@ -280,6 +286,12 @@ static void get_recent_times(struct psi_group *group, int cpu,
if (delta)
*pchanged_states |= (1 << s);
}
+
+ if (cpu != current_cpu)
+ reschedule = *pchanged_states & (1 << PSI_NONIDLE);
+
+ if (reschedule)
+ *pchanged_states |= (1 << PSI_STATE_RESCHEDULE);
}
On Thu, Oct 13, 2022 at 07:06:55PM +0800, Chengming Zhou wrote:
> Should I still need to copy groupc->tasks[] out for the current_cpu as you
> suggested before?
It'd be my preference as well. This way the resched logic can be
consolidated into a single block of comment + code at the end of the
function.
> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
> u32 *pchanged_states)
> {
> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> + int current_cpu = raw_smp_processor_id();
> + bool reschedule;
> u64 now, state_start;
> enum psi_states s;
> unsigned int seq;
> @@ -256,6 +258,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
> memcpy(times, groupc->times, sizeof(groupc->times));
> state_mask = groupc->state_mask;
> state_start = groupc->state_start;
> + if (cpu == current_cpu)
> + reschedule = groupc->tasks[NR_RUNNING] +
> + groupc->tasks[NR_IOWAIT] +
> + groupc->tasks[NR_MEMSTALL] > 1;
> } while (read_seqcount_retry(&groupc->seq, seq));
This also matches psi_show() and the poll worker. They don't currently
use the flag, but it's somewhat fragile and confusing. Add a test for
current_work() == &group->avgs_work?
On Thu, Oct 13, 2022 at 8:52 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Oct 13, 2022 at 07:06:55PM +0800, Chengming Zhou wrote:
> > Should I still need to copy groupc->tasks[] out for the current_cpu as you
> > suggested before?
>
> It'd be my preference as well. This way the resched logic can be
> consolidated into a single block of comment + code at the end of the
> function.
Sounds good to me. If we are copying times in the retry loop then
let's move the `reschedule =` decision out of that loop completely. At
the end of get_recent_times we can do:
if (cpu == current_cpu)
reschedule = tasks[NR_RUNNING] +
tasks[NR_IOWAIT] +
tasks[NR_MEMSTALL] > 1;
else
reschedule = *pchanged_states & (1 << PSI_NONIDLE);
>
> > @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
> > u32 *pchanged_states)
> > {
> > struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> > + int current_cpu = raw_smp_processor_id();
> > + bool reschedule;
> > u64 now, state_start;
> > enum psi_states s;
> > unsigned int seq;
> > @@ -256,6 +258,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
> > memcpy(times, groupc->times, sizeof(groupc->times));
> > state_mask = groupc->state_mask;
> > state_start = groupc->state_start;
> > + if (cpu == current_cpu)
> > + reschedule = groupc->tasks[NR_RUNNING] +
> > + groupc->tasks[NR_IOWAIT] +
> > + groupc->tasks[NR_MEMSTALL] > 1;
> > } while (read_seqcount_retry(&groupc->seq, seq));
>
> This also matches psi_show() and the poll worker. They don't currently
> use the flag, but it's somewhat fragile and confusing. Add a test for
> current_work() == &group->avgs_work?
Good point. The (tasks[NR_RUNNING] + tasks[NR_IOWAIT] + tasks[NR_MEMSTALL] > 1)
condition should also contain this check.
On 2022/10/13 23:52, Johannes Weiner wrote:
> On Thu, Oct 13, 2022 at 07:06:55PM +0800, Chengming Zhou wrote:
>> Should I still need to copy groupc->tasks[] out for the current_cpu as you
>> suggested before?
>
> It'd be my preference as well. This way the resched logic can be
> consolidated into a single block of comment + code at the end of the
> function.
Ok, will move this resched logic to the end of get_recent_times().
>
>> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>> u32 *pchanged_states)
>> {
>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>> + int current_cpu = raw_smp_processor_id();
>> + bool reschedule;
>> u64 now, state_start;
>> enum psi_states s;
>> unsigned int seq;
>> @@ -256,6 +258,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>> memcpy(times, groupc->times, sizeof(groupc->times));
>> state_mask = groupc->state_mask;
>> state_start = groupc->state_start;
>> + if (cpu == current_cpu)
>> + reschedule = groupc->tasks[NR_RUNNING] +
>> + groupc->tasks[NR_IOWAIT] +
>> + groupc->tasks[NR_MEMSTALL] > 1;
>> } while (read_seqcount_retry(&groupc->seq, seq));
>
> This also matches psi_show() and the poll worker. They don't currently
> use the flag, but it's somewhat fragile and confusing. Add a test for
> current_work() == &group->avgs_work?
Yes, only psi_avgs_work() uses this to re-arm now; I will add this check in the next version.
Thanks.
On 2022/10/14 00:10, Suren Baghdasaryan wrote:
> On Thu, Oct 13, 2022 at 8:52 AM Johannes Weiner <[email protected]> wrote:
>>
>> On Thu, Oct 13, 2022 at 07:06:55PM +0800, Chengming Zhou wrote:
>>> Should I still need to copy groupc->tasks[] out for the current_cpu as you
>>> suggested before?
>>
>> It'd be my preference as well. This way the resched logic can be
>> consolidated into a single block of comment + code at the end of the
>> function.
>
> Sounds good to me. If we are copying times in the retry loop then
> let's move the `reschedule =` decision out of that loop completely. At
> the end of get_recent_times we can do:
>
> if (cpu == current_cpu)
> reschedule = tasks[NR_RUNNING] +
> tasks[NR_IOWAIT] +
> tasks[NR_MEMSTALL] > 1;
> else
> reschedule = *pchanged_states & (1 << PSI_NONIDLE);
>
Ok, I will send an updated patch later.
Thanks!
>
>>
>>> @@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>> u32 *pchanged_states)
>>> {
>>> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
>>> + int current_cpu = raw_smp_processor_id();
>>> + bool reschedule;
>>> u64 now, state_start;
>>> enum psi_states s;
>>> unsigned int seq;
>>> @@ -256,6 +258,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
>>> memcpy(times, groupc->times, sizeof(groupc->times));
>>> state_mask = groupc->state_mask;
>>> state_start = groupc->state_start;
>>> + if (cpu == current_cpu)
>>> + reschedule = groupc->tasks[NR_RUNNING] +
>>> + groupc->tasks[NR_IOWAIT] +
>>> + groupc->tasks[NR_MEMSTALL] > 1;
>>> } while (read_seqcount_retry(&groupc->seq, seq));
>>
>> This also matches psi_show() and the poll worker. They don't currently
>> use the flag, but it's somewhat fragile and confusing. Add a test for
>> current_work() == &group->avgs_work?
>
> Good point. The (tasks[NR_RUNNING] + tasks[NR_IOWAIT] + tasks[NR_MEMSTALL] > 1)
> condition should also contain this check.
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 7d89d7bb921c5ae5a428df282e64ee5692e26fe0
Gitweb: https://git.kernel.org/tip/7d89d7bb921c5ae5a428df282e64ee5692e26fe0
Author: Chengming Zhou <[email protected]>
AuthorDate: Mon, 10 Oct 2022 18:42:06 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:23 +02:00
sched/psi: Fix avgs_work re-arm in psi_avgs_work()
Pavan reported a problem that PSI avgs_work idle shutoff is not
working at all, because the PSI_NONIDLE condition would be observed in
psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
only the kworker running avgs_work is on the CPU.

Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
avoided the ping-pong wake problem when the worker sleeps,
psi_avgs_work() will still always re-arm the avgs_work, so shutoff is
not working.

This patch changes get_recent_times() to consider the current CPU's
groupc as IDLE if the kworker running avgs_work is the only task
running and there are no IOWAIT or MEMSTALL sleep tasks, in which case
the avgs_work is shut off if the other CPUs' groupc are also IDLE.

One potential problem is that the brief period of non-idle time
incurred between the aggregation run and the kworker's dequeue will be
stranded in the per-cpu buckets until avgs_work runs next time. The
buckets can hold 4s worth of time, and future activity will wake the
avgs_work with a 2s delay, giving us 2s worth of data we can leave
behind when shutting off the avgs_work. If the kworker runs other work
after avgs_work shuts off and there is no scheduler activity for 2s,
this may be a problem.
Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/psi.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ee2ecc0..f4cdf6f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
u32 *pchanged_states)
{
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ int current_cpu = raw_smp_processor_id();
+ bool only_avgs_work = false;
u64 now, state_start;
enum psi_states s;
unsigned int seq;
@@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
memcpy(times, groupc->times, sizeof(groupc->times));
state_mask = groupc->state_mask;
state_start = groupc->state_start;
+ /*
+ * This CPU has only avgs_work kworker running, snapshot the
+ * newest times then don't need to re-arm for this groupc.
+ * Normally this kworker will sleep soon and won't wake
+ * avgs_work back up in psi_group_change().
+ */
+ if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
+ !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
+ only_avgs_work = true;
} while (read_seqcount_retry(&groupc->seq, seq));
/* Calculate state time deltas against the previous snapshot */
@@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
if (delta)
*pchanged_states |= (1 << s);
}
+
+ /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
+ if (only_avgs_work)
+ *pchanged_states &= ~(1 << PSI_NONIDLE);
}
static void calc_avgs(unsigned long avg[3], int missed_periods,
Hello,
Thanks for picking this up. There is a newer version which has been acked:
https://lore.kernel.org/all/[email protected]/
As well another PSI patch that has been acked by Johannes:
https://lore.kernel.org/all/[email protected]/
Thanks!
On 2022/10/28 14:42, tip-bot2 for Chengming Zhou wrote:
> sched/psi: Fix avgs_work re-arm in psi_avgs_work()
>
> [...]
On Thu, Oct 27, 2022 at 11:50 PM Chengming Zhou
<[email protected]> wrote:
>
> Hello,
>
> Thanks for picking this up. There is a newer version which has been acked:
> https://lore.kernel.org/all/[email protected]/
Hmm. Indeed this seems to be an older version and not the one I asked
Peter to pick up in
https://lore.kernel.org/all/CAJuCfpHeJuZBbv-q+WXjgNHwt_caMomFPL3L9rxosXOrZz3fBw@mail.gmail.com/.
Not sure what went wrong. Peter, could you please replace this one
with https://lore.kernel.org/all/[email protected]/?
Chengming, please do not top-post next time. It would be better if you
posted your note under the "Link:" field in this email.
Thanks!
>
> As well another PSI patch that has been acked by Johannes:
> https://lore.kernel.org/all/[email protected]/
>
> Thanks!
>
>
> On 2022/10/28 14:42, tip-bot2 for Chengming Zhou wrote:
> > sched/psi: Fix avgs_work re-arm in psi_avgs_work()
> >
> > [...]
On 2022/10/28 23:58, Suren Baghdasaryan wrote:
> On Thu, Oct 27, 2022 at 11:50 PM Chengming Zhou
> <[email protected]> wrote:
>>
>> Hello,
>>
>> Thanks for picking this up. There is a newer version which has been acked:
>> https://lore.kernel.org/all/[email protected]/
>
> Hmm. Indeed this seems to be an older version and not the one I asked
> Peter to pick up in
> https://lore.kernel.org/all/CAJuCfpHeJuZBbv-q+WXjgNHwt_caMomFPL3L9rxosXOrZz3fBw@mail.gmail.com/.
> Not sure what went wrong. Peter, could you please replace this one
> with https://lore.kernel.org/all/[email protected]/?
Oh, I didn't notice that email.
>
> Chengming, please do not top-post next time. Would be better if you
> posted your note under the "Link:" field in this email.
Got it, I will do next time.
Thanks!
> Thanks!
>
>>
>> As well another PSI patch that has been acked by Johannes:
>> https://lore.kernel.org/all/[email protected]/
>>
>> Thanks!
>>
>>
>> On 2022/10/28 14:42, tip-bot2 for Chengming Zhou wrote:
>>> sched/psi: Fix avgs_work re-arm in psi_avgs_work()
>>>
>>> [...]
On Fri, Oct 28, 2022 at 08:58:03AM -0700, Suren Baghdasaryan wrote:
> Not sure what went wrong. Peter, could you please replace this one
Probably me being an idiot and searching on subject instead of msgid :/
I'll go fix up -- tomorrow though, it's late and I'm likely to mess it
up again.
On Fri, Oct 28, 2022 at 09:53:57PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2022 at 08:58:03AM -0700, Suren Baghdasaryan wrote:
>
> > Not sure what went wrong. Peter, could you please replace this one
>
> Probably me being an idiot and searching on subject instead of msgid :/
>
> I'll go fix up -- tomorrow though, it's late and I'm likely to mess it
> up again.
Can you please check queue.git/sched/core; did I get it right this
time?
On 2022/10/29 19:55, Peter Zijlstra wrote:
> On Fri, Oct 28, 2022 at 09:53:57PM +0200, Peter Zijlstra wrote:
>> On Fri, Oct 28, 2022 at 08:58:03AM -0700, Suren Baghdasaryan wrote:
>>
>>> Not sure what went wrong. Peter, could you please replace this one
>>
>> Probably me being an idiot and searching on subject instead of msgid :/
>>
>> I'll go fix up -- tomorrow though, it's late and I'm likely to mess it
>> up again.
>
> Can you please check queue.git/sched/core ; did I get it right this
> time?
I just checked those three patches, LGTM.
And would you mind picking this up, by the way?
https://lore.kernel.org/all/[email protected]/
Thanks!
On Sat, Oct 29, 2022 at 5:42 AM Chengming Zhou
<[email protected]> wrote:
>
> On 2022/10/29 19:55, Peter Zijlstra wrote:
> > On Fri, Oct 28, 2022 at 09:53:57PM +0200, Peter Zijlstra wrote:
> >> On Fri, Oct 28, 2022 at 08:58:03AM -0700, Suren Baghdasaryan wrote:
> >>
> >>> Not sure what went wrong. Peter, could you please replace this one
> >>
> >> Probably me being an idiot and searching on subject instead of msgid :/
> >>
> >> I'll go fix up -- tomorrow though, it's late and I'm likely to mess it
> >> up again.
> >
> > Can you please check queue.git/sched/core ; did I get it right this
> > time?
>
> I just checked those three patches, LGTM.
Yep, all three patches are correct. Thanks!
>
> And would you mind picking this up, by the way?
>
> https://lore.kernel.org/all/[email protected]/
>
> Thanks!