The PSI polling mechanism tries to minimize the number of wakeups needed to
run psi_poll_work and currently relies on timer_pending() to detect when
this work is already scheduled. This leaves a window of opportunity for
psi_group_change to schedule an immediate psi_poll_work after poll_timer_fn
has run but before psi_poll_work could reschedule itself.
Below is a depiction of this window:
poll_timer_fn
  wake_up_interruptible(&group->poll_wait);

psi_poll_worker
  wait_event_interruptible(group->poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&group->poll_timer)) return;
      ...
      mod_timer(&group->poll_timer, jiffies + delay);
Prior to 461daba06bdc we relied on the poll_scheduled atomic, which was
cleared and set again inside psi_poll_work, so this race window was much
smaller.
The larger window causes an increased number of wakeups, and our partners
report a visible power regression of ~10mA after applying 461daba06bdc.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach the polling
expiration time. This does not completely eliminate the possibility of
extra wakeups caused by a race with psi_group_change; however, it limits
them to at most one extra wakeup per tracking window (0.5s in the worst
case).
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of those attempts blocked because the psi
monitor was already active, we can assess the effects of this change:
Before the patch:
                                         Run#1    Run#2    Run#3
Immediate reschedules attempted:        684365  1385156  1261240
Immediate reschedules blocked:          682846  1381654  1258682
Immediate reschedules (delta):            1519     3502     2558
Immediate reschedules (% of attempted):  0.22%    0.25%    0.20%
After the patch:
                                         Run#1    Run#2    Run#3
Immediate reschedules attempted:        882244   770298   426218
Immediate reschedules blocked:          881996   769796   426074
Immediate reschedules (delta):             248      502      144
Immediate reschedules (% of attempted):  0.03%    0.07%    0.03%
The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the smaller race window and to
the fact that the race is now allowed only when the psi monitor reaches
the polling window expiration time.
Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang <[email protected]>
Reported-by: Wenju Xu <[email protected]>
Reported-by: Jonathan Chen <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Tested-by: SH Chen <[email protected]>
---
Changes in v2:
- Simplified atomic_cmpxchg() condition per PeterZ
- Added Tested-by:
include/linux/psi_types.h | 1 +
kernel/sched/psi.c | 41 ++++++++++++++++++++++++++++-----------
2 files changed, 31 insertions(+), 11 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 0a23300d49af..ef8bd89d065e 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -158,6 +158,7 @@ struct psi_group {
struct timer_list poll_timer;
wait_queue_head_t poll_wait;
atomic_t poll_wakeup;
+ atomic_t poll_scheduled;
/* Protects data used by the monitor */
struct mutex trigger_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 58b36d17a09a..82b664ca9ee9 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -195,6 +195,7 @@ static void group_init(struct psi_group *group)
INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
mutex_init(&group->avgs_lock);
/* Init trigger-related members */
+ atomic_set(&group->poll_scheduled, 0);
mutex_init(&group->trigger_lock);
INIT_LIST_HEAD(&group->triggers);
memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
@@ -555,18 +556,14 @@ static u64 update_triggers(struct psi_group *group, u64 now)
return now + group->poll_min_period;
}
-/* Schedule polling if it's not already scheduled. */
-static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
+/* Schedule polling if it's not already scheduled or forced. */
+static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay,
+ bool force)
{
struct task_struct *task;
- /*
- * Do not reschedule if already scheduled.
- * Possible race with a timer scheduled after this check but before
- * mod_timer below can be tolerated because group->polling_next_update
- * will keep updates on schedule.
- */
- if (timer_pending(&group->poll_timer))
+ /* cmpxchg should be called even when !force to set poll_scheduled */
+ if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
return;
rcu_read_lock();
@@ -578,12 +575,15 @@ static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
*/
if (likely(task))
mod_timer(&group->poll_timer, jiffies + delay);
+ else
+ atomic_set(&group->poll_scheduled, 0);
rcu_read_unlock();
}
static void psi_poll_work(struct psi_group *group)
{
+ bool force_reschedule = false;
u32 changed_states;
u64 now;
@@ -591,6 +591,23 @@ static void psi_poll_work(struct psi_group *group)
now = sched_clock();
+ if (now > group->polling_until) {
+ /*
+ * We are either about to start or might stop polling if no
+ * state change was recorded. Resetting poll_scheduled leaves
+ * a small window for psi_group_change to sneak in and schedule
+ * an immediate poll_work before we get to rescheduling. One
+ * potential extra wakeup at the end of the polling window
+ * should be negligible and polling_next_update still keeps
+ * updates correctly on schedule.
+ */
+ atomic_set(&group->poll_scheduled, 0);
+ } else {
+ /* Polling window is not over, keep rescheduling */
+ force_reschedule = true;
+ }
+
+
collect_percpu_times(group, PSI_POLL, &changed_states);
if (changed_states & group->poll_states) {
@@ -616,7 +633,8 @@ static void psi_poll_work(struct psi_group *group)
group->polling_next_update = update_triggers(group, now);
psi_schedule_poll_work(group,
- nsecs_to_jiffies(group->polling_next_update - now) + 1);
+ nsecs_to_jiffies(group->polling_next_update - now) + 1,
+ force_reschedule);
out:
mutex_unlock(&group->trigger_lock);
@@ -740,7 +758,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
write_seqcount_end(&groupc->seq);
if (state_mask & group->poll_states)
- psi_schedule_poll_work(group, 1);
+ psi_schedule_poll_work(group, 1, false);
if (wake_clock && !delayed_work_pending(&group->avgs_work))
schedule_delayed_work(&group->avgs_work, PSI_FREQ);
@@ -1235,6 +1253,7 @@ static void psi_trigger_destroy(struct kref *ref)
* can no longer be found through group->poll_task.
*/
kthread_stop(task_to_destroy);
+ atomic_set(&group->poll_scheduled, 0);
}
kfree(t);
}
--
2.32.0.93.g670b81a890-goog
On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> Psi polling mechanism is trying to minimize the number of wakeups to
> run psi_poll_work and is currently relying on timer_pending() to detect
> when this work is already scheduled. This provides a window of opportunity
> for psi_group_change to schedule an immediate psi_poll_work after
> poll_timer_fn got called but before psi_poll_work could reschedule itself.
> Below is the depiction of this entire window:
>
> poll_timer_fn
> wake_up_interruptible(&group->poll_wait);
>
> psi_poll_worker
> wait_event_interruptible(group->poll_wait, ...)
> psi_poll_work
> psi_schedule_poll_work
> if (timer_pending(&group->poll_timer)) return;
> ...
> mod_timer(&group->poll_timer, jiffies + delay);
>
> Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
> reset and set back inside psi_poll_work and therefore this race window
> was much smaller.
> The larger window causes increased number of wakeups and our partners
> report visible power regression of ~10mA after applying 461daba06bdc.
> Bring back the poll_scheduled atomic and make this race window even
> narrower by resetting poll_scheduled only when we reach polling expiration
> time. This does not completely eliminate the possibility of extra wakeups
> caused by a race with psi_group_change however it will limit it to the
> worst case scenario of one extra wakeup per every tracking window (0.5s
> in the worst case).
> By tracing the number of immediate rescheduling attempts performed by
> psi_group_change and the number of these attempts being blocked due to
> psi monitor being already active, we can assess the effects of this change:
>
> Before the patch:
> Run#1 Run#2 Run#3
> Immediate reschedules attempted: 684365 1385156 1261240
> Immediate reschedules blocked: 682846 1381654 1258682
> Immediate reschedules (delta): 1519 3502 2558
> Immediate reschedules (% of attempted): 0.22% 0.25% 0.20%
>
> After the patch:
> Run#1 Run#2 Run#3
> Immediate reschedules attempted: 882244 770298 426218
> Immediate reschedules blocked: 881996 769796 426074
> Immediate reschedules (delta): 248 502 144
> Immediate reschedules (% of attempted): 0.03% 0.07% 0.03%
>
> The number of non-blocked immediate reschedules dropped from 0.22-0.25%
> to 0.03-0.07%. The drop is attributed to the decrease in the race window
> size and the fact that we allow this race only when psi monitors reach
> polling window expiration time.
>
> Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
> Reported-by: Kathleen Chang <[email protected]>
> Reported-by: Wenju Xu <[email protected]>
> Reported-by: Jonathan Chen <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Tested-by: SH Chen <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> + /* cmpxchg should be called even when !force to set poll_scheduled */
> + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> return;
Why is that a cmpxchg() ?
On Thu, Jul 1, 2021 at 1:59 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> > + /* cmpxchg should be called even when !force to set poll_scheduled */
> > + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> > return;
>
> Why is that a cmpxchg() ?
We want to set poll_scheduled and proceed with rescheduling the timer
unless it's already scheduled, so cmpxchg helps us to make that
decision atomically. Or did I misunderstand your question?
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>
On Thu, Jul 01, 2021 at 09:09:25AM -0700, Suren Baghdasaryan wrote:
> On Thu, Jul 1, 2021 at 1:59 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> > > + /* cmpxchg should be called even when !force to set poll_scheduled */
> > > + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> > > return;
> >
> > Why is that a cmpxchg() ?
>
> We want to set poll_scheduled and proceed with rescheduling the timer
> unless it's already scheduled, so cmpxchg helps us to make that
> decision atomically. Or did I misunderstand your question?
What's wrong with: atomic_xchg(&group->poll_scheduled, 1) ?
On Thu, Jul 1, 2021 at 9:12 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jul 01, 2021 at 09:09:25AM -0700, Suren Baghdasaryan wrote:
> > On Thu, Jul 1, 2021 at 1:59 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> > > > + /* cmpxchg should be called even when !force to set poll_scheduled */
> > > > + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> > > > return;
> > >
> > > Why is that a cmpxchg() ?
> >
> > We want to set poll_scheduled and proceed with rescheduling the timer
> > unless it's already scheduled, so cmpxchg helps us to make that
> > decision atomically. Or did I misunderstand your question?
>
> What's wrong with: atomic_xchg(&group->poll_scheduled, 1) ?
Yes, since poll_scheduled can only be 0 or 1, atomic_xchg should work
fine here. Functionally equivalent, but I assume atomic_xchg() is more
efficient due to having no comparison.
Will respin v3 if the rest looks ok.
On Thu, Jul 01, 2021 at 10:58:24AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> > + /* cmpxchg should be called even when !force to set poll_scheduled */
> > + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> > return;
>
> Why is that a cmpxchg() ?
I now realize you had already pointed that out, but I dismissed it in
the context of poll_lock not being always taken after all.
But you're right, cmpxchg indeed seems inappropriate. xchg will do
just fine for this binary toggle.
When it comes to ordering, looking at it again, I think we actually
need ordering here that the seqcount doesn't provide. We have:
timer:
  scheduled = 0
  smp_rmb()
  x = state

scheduler:
  state = y
  smp_wmb()
  if xchg(scheduled, 1) == 0
    mod_timer()
Again, the requirement is that when the scheduler sees the timer as
already or still pending, the timer must observe its state updates -
otherwise we miss poll events.
The seqcount provides the wmb and rmb, but the scheduler-side read of
@scheduled mustn't be reordered before the write to @state. Likewise,
the timer-side read of @state also mustn't occur before the write to
@scheduled.
AFAICS this is broken, not just in the patch, but also in the current
code when timer_pending() on the scheduler side gets reordered. (Not
sure if timer reading state can be reordered before the detach_timer()
of its own expiration, but I don't see full ordering between them.)
So it seems to me we need the ordered atomic_xchg() on the scheduler
side, and on the timer side an smp_mb() after we set scheduled to 0.
On Thu, Jul 1, 2021 at 9:39 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Jul 01, 2021 at 10:58:24AM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> > > + /* cmpxchg should be called even when !force to set poll_scheduled */
> > > + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> > > return;
> >
> > Why is that a cmpxchg() ?
>
> I now realize you had already pointed that out, but I dismissed it in
> the context of poll_lock not being always taken after all.
>
> But you're right, cmpxchg indeed seems inappropriate. xchg will do
> just fine for this binary toggle.
>
> When it comes to ordering, looking at it again, I think we actually
> need ordering here that the seqcount doesn't provide. We have:
>
> timer:
> scheduled = 0
> smp_rmb()
> x = state
>
> scheduler:
> state = y
> smp_wmb()
> if xchg(scheduled, 1) == 0
> mod_timer()
>
> Again, the requirement is that when the scheduler sees the timer as
> already or still pending, the timer must observe its state updates -
> otherwise we miss poll events.
>
> The seqcount provides the wmb and rmb, but the scheduler-side read of
> @scheduled mustn't be reordered before the write to @state. Likewise,
> the timer-side read of @state also mustn't occur before the write to
> @scheduled.
>
> AFAICS this is broken, not just in the patch, but also in the current
> code when timer_pending() on the scheduler side gets reordered. (Not
> sure if timer reading state can be reordered before the detach_timer()
> of its own expiration, but I don't see full ordering between them.)
>
> So it seems to me we need the ordered atomic_xchg() on the scheduler
> side, and on the timer side an smp_mb() after we set scheduled to 0.
Thanks for the analysis Johannes. Let me dwell on it a bit.
On Thu, Jul 01, 2021 at 09:28:04AM -0700, Suren Baghdasaryan wrote:
> On Thu, Jul 1, 2021 at 9:12 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Jul 01, 2021 at 09:09:25AM -0700, Suren Baghdasaryan wrote:
> > > On Thu, Jul 1, 2021 at 1:59 AM Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> > > > > + /* cmpxchg should be called even when !force to set poll_scheduled */
> > > > > + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> > > > > return;
> > > >
> > > > Why is that a cmpxchg() ?
> > >
> > > We want to set poll_scheduled and proceed with rescheduling the timer
> > > unless it's already scheduled, so cmpxchg helps us to make that
> > > decision atomically. Or did I misunderstand your question?
> >
> > What's wrong with: atomic_xchg(&group->poll_scheduled, 1) ?
>
> Yes, since poll_scheduled can be only 0 or 1 atomic_xchg should work
> fine here. Functionally equivalent but I assume atomic_xchg() is more
> efficient due to no comparison.
Mostly conceptually simpler; the cmpxchg-on-0 means you have to check
whether there's ever any state outside of {0,1}. The xchg() thing is
the classical test-and-set pattern.
On top of all that, the cmpxchg() can fail, which brings ordering
issues.
Typically, I think, you want to ensure that everything that happens
before psi_schedule_poll_work() is visible to the work when it runs
(also see Johannes' email). In case poll_scheduled is already 1, the
cmpxchg will fail and *NOT* provide that ordering. Meaning the work
might not observe the latest changes. xchg() doesn't have this subtlety.
On Fri, Jul 2, 2021 at 2:28 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jul 01, 2021 at 09:28:04AM -0700, Suren Baghdasaryan wrote:
> > On Thu, Jul 1, 2021 at 9:12 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Thu, Jul 01, 2021 at 09:09:25AM -0700, Suren Baghdasaryan wrote:
> > > > On Thu, Jul 1, 2021 at 1:59 AM Peter Zijlstra <[email protected]> wrote:
> > > > >
> > > > > On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> > > > > > + /* cmpxchg should be called even when !force to set poll_scheduled */
> > > > > > + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> > > > > > return;
> > > > >
> > > > > Why is that a cmpxchg() ?
> > > >
> > > > We want to set poll_scheduled and proceed with rescheduling the timer
> > > > unless it's already scheduled, so cmpxchg helps us to make that
> > > > decision atomically. Or did I misunderstand your question?
> > >
> > > What's wrong with: atomic_xchg(&group->poll_scheduled, 1) ?
> >
> > Yes, since poll_scheduled can be only 0 or 1 atomic_xchg should work
> > fine here. Functionally equivalent but I assume atomic_xchg() is more
> > efficient due to no comparison.
>
> Mostly conceptually simpler; the cmpxchg-on-0 makes that you have to
> check if there's ever any state outside of {0,1}. The xchg() thing is
> the classical test-and-set pattern.
>
> On top of all that, the cmpxchg() can fail, which brings ordering
> issues.
Oh, I see. That was my mistake. I was wrongly assuming that all RMW
atomic operations are fully ordered, but indeed the documentation
states that:
```
- RMW operations that have a return value are fully ordered;
- RMW operations that are conditional are unordered on FAILURE,
otherwise the above rules apply.
```
So that's the actual functional difference here. Thanks for catching
this and educating me!
>
> Typically, I think, you want to ensure that everything that happens
> before psi_schedule_poll_work() is visible to the work when it runs
> (also see Johannes' email).
Correct and I think I understand now the concern Johannes expressed.
> In case poll_scheduled is already 1, the
> cmpxchg will fail and *NOT* provide that ordering. Meaning the work
> might not observe the latest changes. xchg() doesn't have this subtlety.
Got it.
So I think the modifications needed to this patch are:
1. replacing atomic_cmpxchg(&group->poll_scheduled, 0, 1) with
atomic_xchg(&group->poll_scheduled, 1)
2. adding an explicit smp_mb() barrier right after
atomic_set(&group->poll_scheduled, 0) in psi_poll_work().
I think that should ensure the correct ordering here.
If you folks agree I'll respin v3 with these changes (or maybe I
should respin and we continue discussion with that version?).
On Fri, Jul 2, 2021 at 8:49 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Fri, Jul 2, 2021 at 2:28 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Jul 01, 2021 at 09:28:04AM -0700, Suren Baghdasaryan wrote:
> > > On Thu, Jul 1, 2021 at 9:12 AM Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > On Thu, Jul 01, 2021 at 09:09:25AM -0700, Suren Baghdasaryan wrote:
> > > > > On Thu, Jul 1, 2021 at 1:59 AM Peter Zijlstra <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Jun 30, 2021 at 01:51:51PM -0700, Suren Baghdasaryan wrote:
> > > > > > > + /* cmpxchg should be called even when !force to set poll_scheduled */
> > > > > > > + if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) && !force)
> > > > > > > return;
> > > > > >
> > > > > > Why is that a cmpxchg() ?
> > > > >
> > > > > We want to set poll_scheduled and proceed with rescheduling the timer
> > > > > unless it's already scheduled, so cmpxchg helps us to make that
> > > > > decision atomically. Or did I misunderstand your question?
> > > >
> > > > What's wrong with: atomic_xchg(&group->poll_scheduled, 1) ?
> > >
> > > Yes, since poll_scheduled can be only 0 or 1 atomic_xchg should work
> > > fine here. Functionally equivalent but I assume atomic_xchg() is more
> > > efficient due to no comparison.
> >
> > Mostly conceptually simpler; the cmpxchg-on-0 makes that you have to
> > check if there's ever any state outside of {0,1}. The xchg() thing is
> > the classical test-and-set pattern.
> >
> > On top of all that, the cmpxchg() can fail, which brings ordering
> > issues.
>
> Oh, I see. That was my mistake. I was wrongly assuming that all RMW
> atomic operations are fully ordered but indeed, documentation states
> that:
> ```
> - RMW operations that have a return value are fully ordered;
> - RMW operations that are conditional are unordered on FAILURE,
> otherwise the above rules apply.
> ```
> So that's the actual functional difference here. Thanks for catching
> this and educating me!
>
> >
> > Typically, I think, you want to ensure that everything that happens
> > before psi_schedule_poll_work() is visible to the work when it runs
> > (also see Johannes' email).
>
> Correct and I think I understand now the concern Johannes expressed.
>
> > In case poll_scheduled is already 1, the
> > cmpxchg will fail and *NOT* provide that ordering. Meaning the work
> > might not observe the latest changes. xchg() doesn't have this subtlety.
>
> Got it.
> So I think the modifications needed to this patch is:
> 1. replacing atomic_cmpxchg(&group->poll_scheduled, 0, 1) with
> atomic_xchg(&group->poll_scheduled, 1)
> 2. an explicit smp_mb() barrier right after
> atomic_set(&group->poll_scheduled, 0) in psi_poll_work().
>
> I think that should ensure the correct ordering here.
> If you folks agree I'll respin v3 with these changes (or maybe I
> should respin and we continue discussion with that version?).
To keep things moving I posted v3
(https://lore.kernel.org/patchwork/patch/1454547) with the changes I
mentioned above. Let's keep discussing it there. Thanks!