psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly
thereafter, the kworker thread that executed the aggregation will go
idle and cause a scheduling change, during which the psi callback will
kick the !pending worker again. This will ping-pong forever, and is
equivalent to having no shut-off logic at all (but with more code!)
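
For illustration, the loop looks roughly like this (a simplified
sketch of the event sequence; psi_update_work() is the aggregation
work function):

	/*
	 * psi_update_work()        aggregates, sees no task activity,
	 *                          decides not to re-arm clock_work
	 * kworker goes to sleep    scheduler clears TSK_RUNNING and
	 *                          calls psi_task_change(), which sees
	 *                          !delayed_work_pending() and re-arms
	 *                          the clock via schedule_delayed_work()
	 * ~2s later                psi_update_work() runs again with
	 *                          nothing to aggregate; repeat forever
	 */
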
Fix this by exempting aggregation workers from psi's clock-waking
logic when the state change is the worker itself going to sleep. To
do this, tag workers with the last work function they executed, and
if psi sees a worker going to sleep right after aggregating psi data,
it will not reschedule the aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and
the aggregator will be kept alive to process this genuine new activity.
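
For example (a sketch; work_a and work_b stand for hypothetical other
work functions executed by the same worker):

	/*
	 * process_one_work(work_a)           worker->last_func = work_a
	 * process_one_work(psi_update_work)  worker->last_func = psi_update_work
	 * process_one_work(work_b)           worker->last_func = work_b
	 *
	 * When the worker then sleeps, wq_worker_last_func() returns
	 * work_b, not psi_update_work, so the clock is correctly kept
	 * alive for work_b's genuine activity.
	 */
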
If the aggregation work is the last thing the worker does, and we
decide to go idle, the brief period of non-idle time incurred between
the aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s
worth of data we can leave behind when disabling aggregation. If it
takes a worker more than two seconds to go idle after it finishes its
last work item, we likely have bigger problems in the system, and
won't notice one sample that was averaged with a bogus per-CPU weight.
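
As a back-of-the-envelope check (assuming the per-cpu buckets are u32
nanosecond counters and the aggregation period is ~2s, as in psi.c at
the time):

	/* bucket capacity:  U32_MAX ns = 4294967295 ns ~= 4.3s   */
	/* aggregation period (PSI_FREQ):  ~2s                     */
	/* => worst case ~2s of stall time stranded per bucket,    */
	/*    leaving ~2s of headroom before a bucket could wrap   */
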
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Reported-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
kernel/sched/psi.c | 21 +++++++++++++++++----
kernel/workqueue.c | 23 +++++++++++++++++++++++
kernel/workqueue_internal.h | 6 +++++-
3 files changed, 45 insertions(+), 5 deletions(-)
Sending this for 5.0. I don't think CC stable is warranted on this.
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index fe24de3fbc93..c3484785b179 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -124,6 +124,7 @@
* sampling of the aggregate task states would be.
*/
+#include "../workqueue_internal.h"
#include <linux/sched/loadavg.h>
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
@@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
groupc->tasks[t]++;
write_seqcount_end(&groupc->seq);
-
- if (!delayed_work_pending(&group->clock_work))
- schedule_delayed_work(&group->clock_work, PSI_FREQ);
}
static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
@@ -513,6 +511,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
+ bool wake_clock = true;
void *iter = NULL;
if (!task->pid)
@@ -530,8 +529,22 @@ void psi_task_change(struct task_struct *task, int clear, int set)
task->psi_flags &= ~clear;
task->psi_flags |= set;
- while ((group = iterate_groups(task, &iter)))
+ /*
+ * Periodic aggregation shuts off if there is a period of no
+ * task changes, so we wake it back up if necessary. However,
+ * don't do this if the task change is the aggregation worker
+ * itself going to sleep, or we'll ping-pong forever.
+ */
+ if (unlikely((clear & TSK_RUNNING) &&
+ (task->flags & PF_WQ_WORKER) &&
+ wq_worker_last_func(task) == psi_update_work))
+ wake_clock = false;
+
+ while ((group = iterate_groups(task, &iter))) {
psi_group_change(group, cpu, clear, set);
+ if (wake_clock && !delayed_work_pending(&group->clock_work))
+ schedule_delayed_work(&group->clock_work, PSI_FREQ);
+ }
}
void psi_memstall_tick(struct task_struct *task, int cpu)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 392be4b252f6..fc5d23d752a5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task)
return to_wakeup ? to_wakeup->task : NULL;
}
+/**
+ * wq_worker_last_func - retrieve worker's last work function
+ *
+ * Determine the last function a worker executed. This is called from
+ * the scheduler to get a worker's last known identity.
+ *
+ * CONTEXT:
+ * spin_lock_irq(rq->lock)
+ *
+ * Return:
+ * The last work function %current executed as a worker, NULL if it
+ * hasn't executed any work yet.
+ */
+work_func_t wq_worker_last_func(struct task_struct *task)
+{
+ struct worker *worker = kthread_data(task);
+
+ return worker->last_func;
+}
+
/**
* worker_set_flags - set worker flags and adjust nr_running accordingly
* @worker: self
@@ -2184,6 +2204,9 @@ __acquires(&pool->lock)
if (unlikely(cpu_intensive))
worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
+ /* tag the worker for identification in schedule() */
+ worker->last_func = worker->current_func;
+
/* we're done with it, release */
hash_del(&worker->hentry);
worker->current_work = NULL;
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 66fbb5a9e633..cb68b03ca89a 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -53,6 +53,9 @@ struct worker {
/* used only by rescuers to point to the target workqueue */
struct workqueue_struct *rescue_wq; /* I: the workqueue to rescue */
+
+ /* used by the scheduler to determine a worker's last known identity */
+ work_func_t last_func;
};
/**
@@ -67,9 +70,10 @@ static inline struct worker *current_wq_worker(void)
/*
* Scheduler hooks for concurrency managed workqueue. Only to be used from
- * sched/core.c and workqueue.c.
+ * sched/ and workqueue.c.
*/
void wq_worker_waking_up(struct task_struct *task, int cpu);
struct task_struct *wq_worker_sleeping(struct task_struct *task);
+work_func_t wq_worker_last_func(struct task_struct *task);
#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
--
2.20.1
Hi Johannes,
Thanks for fixing this. I'll test this approach and report back the
results within a day or two.
On Wed, Jan 16, 2019 at 11:35 AM Johannes Weiner <[email protected]> wrote:
> [...]
> Fix this by exempting aggregation workers from psi's clock waking
> logic when the state change is them going to sleep. To do this, tag
The above sentence seems to be misspelled somehow.
> workers with the last work function they executed, and if in psi we
> see a worker going to sleep after aggregating psi data, we will not
> reschedule the aggregation work item.
> [...]
On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner <[email protected]> wrote:
> [...]
Did we ever hear back from Suren about the testing?

Some words here about the new wq_worker_last_func() would be
appropriate.

It's an ugly-looking thing :( Tejun, did you check this change?
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -124,6 +124,7 @@
> * sampling of the aggregate task states would be.
> */
>
> +#include "../workqueue_internal.h"
"Only to be included by workqueue and core kernel subsystems"
I'm not sure that psi qualifies. Perhaps wq_worker_last_func() should
be declared in include/linux/workqueue.h.
And perhaps implemented there as well. It's similar to
current_wq_worker(), which is inlined and wq_worker_last_func() is
small enough to justify inlining.
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task)
> return to_wakeup ? to_wakeup->task : NULL;
> }
>
> +/**
> + * wq_worker_last_func - retrieve worker's last work function
> + *
> + * Determine the last function a worker executed. This is called from
> + * the scheduler to get a worker's last known identity.
> + *
> + * CONTEXT:
> + * spin_lock_irq(rq->lock)
> + *
> + * Return:
> + * The last work function %current executed as a worker, NULL if it
> + * hasn't executed any work yet.
> + */
> +work_func_t wq_worker_last_func(struct task_struct *task)
> +{
> + struct worker *worker = kthread_data(task);
> +
> + return worker->last_func;
> +}
The semantics are troublesome. What guarantees that worker->last_func
won't change under the caller's feet? The caller should hold some lock
(presumably worker->pool->lock) in order to stabilize the
wq_worker_last_func() return value?

Also, the comment isn't really true - this is called from PSI, which is
hardly "the scheduler"?
On Mon, Jan 28, 2019 at 2:03 PM Andrew Morton <[email protected]> wrote:
>
> On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner <[email protected]> wrote:
>
> > [...]
>
> Did we ever hear back from Suren about the testing?
>
> Some words here about the new wq_worker_last_func() would be
> appropriate.
>
> It's an ugly-looking thing :( Tejun, did you check this change?
Yes, sorry for the delay. I tried the patch as-is and, with some
tweaking of my local setup, was able to see the cases where this
mechanism prevents the system from unnecessary reactivation. I
encountered some weird things in my traces (some state transitions
were missing) and wanted to investigate further, but have not had a
chance to do that yet. Overall the patch looks like it's doing what
it's supposed to do; I just wanted to get the full picture before
reporting the results.
On Mon, Jan 28, 2019 at 02:03:36PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner <[email protected]> wrote:
> > +/**
> > + * wq_worker_last_func - retrieve worker's last work function
> > + *
> > + * Determine the last function a worker executed. This is called from
> > + * the scheduler to get a worker's last known identity.
> > + *
> > + * CONTEXT:
> > + * spin_lock_irq(rq->lock)
> > + *
> > + * Return:
> > + * The last work function %current executed as a worker, NULL if it
> > + * hasn't executed any work yet.
> > + */
> > +work_func_t wq_worker_last_func(struct task_struct *task)
> > +{
> > + struct worker *worker = kthread_data(task);
> > +
> > + return worker->last_func;
> > +}
>
> The semantics are troublesome. What guarantees that worker->last_func
> won't change under the caller's feet? The caller should hold some lock
> (presumably worker->pool->lock) in order to stabilize the
> wq_worker_last_func() return value?
>
> Also, the comment isn't really true - this is called from PSI, which is
> hardly "the scheduler"?
psi isn't scheduler core, but it only works from a scheduler context.
The psi task change hook already requires being called under the
rq->lock while the task in question cannot change its scheduling
state, to record it and start the clock on the new state.

That same context ensures wq_worker_last_func() is safe: we call it
while the worker is inside the scheduler and going to *sleep* (not
merely being preempted), so it cannot possibly be changing last_func.

So IMO it makes sense to keep this last_func thing a part of the
private API between scheduler context and the workqueue code.
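
Concretely, a sketch of the call context (the exact chain through
kernel/sched/ is simplified here for illustration):

	/*
	 * __schedule()                caller holds rq->lock
	 *   deactivate_task()         worker is going to sleep
	 *     dequeue_task()
	 *       psi_dequeue()         clear includes TSK_RUNNING
	 *         psi_task_change()
	 *           wq_worker_last_func()
	 *
	 * The task is pinned by rq->lock and cannot run, so
	 * worker->last_func cannot change under the caller.
	 */
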
Hello, Andrew.
On Mon, Jan 28, 2019 at 02:03:36PM -0800, Andrew Morton wrote:
> > +#include "../workqueue_internal.h"
>
> "Only to be included by workqueue and core kernel subsystems"
>
> I'm not sure that psi qualifies. Perhaps wq_worker_last_func() should
> be declared in include/linux/workqueue.h.
>
> And perhaps implemented there as well. It's similar to
> current_wq_worker(), which is inlined and wq_worker_last_func() is
> small enough to justify inlining.
workqueue_internal.h is used mostly to expose some of the internal
details and hooks to scheduler code for concurrency management. Most
of PSI is tracking state transitions on scheduling events (psi.c is
under kernel/sched/ for this reason), so it's mostly in line with how
the file was being used.

I think keeping it in workqueue_internal.h is better. Please see
below.
> > +work_func_t wq_worker_last_func(struct task_struct *task)
> > +{
> > + struct worker *worker = kthread_data(task);
> > +
> > + return worker->last_func;
> > +}
>
> The semantics are troublesome. What guarantees that worker->last_func
> won't change under the caller's feet? The caller should hold some lock
> (presumably worker->pool->lock) in order to stabilize the
> wq_worker_last_func() return value?
>
> Also, the comment isn't really true - this is called from PSI, which is
> hardly "the scheduler"?
Hmm... This is the same way wq_worker_waking_up() and
wq_worker_sleeping() are synchronized. The task being queried is
being scheduled in or out, so it can't change underneath us, which is
in line with the other stuff in workqueue_internal.h.
Thanks.
--
tejun
Hello,
On Wed, Jan 16, 2019 at 02:35:01PM -0500, Johannes Weiner wrote:
> +/**
> + * wq_worker_last_func - retrieve worker's last work function
> + *
> + * Determine the last function a worker executed. This is called from
> + * the scheduler to get a worker's last known identity.
Maybe the comment can describe better why this is safe in this use
case but not suitable as a generic interface. Other than that,
Acked-by: Tejun Heo <[email protected]>
Thanks.
--
tejun
On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner <[email protected]> wrote:
> [...]
>
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
> groupc->tasks[t]++;
>
> write_seqcount_end(&groupc->seq);
> -
> - if (!delayed_work_pending(&group->clock_work))
> - schedule_delayed_work(&group->clock_work, PSI_FREQ);
> }
>
> static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
This breaks Suren's "psi: introduce psi monitor":
--- kernel/sched/psi.c~psi-introduce-psi-monitor
+++ kernel/sched/psi.c
@@ -752,8 +1012,25 @@ static void psi_group_change(struct psi_
write_seqcount_end(&groupc->seq);
- if (!delayed_work_pending(&group->clock_work))
- schedule_delayed_work(&group->clock_work, PSI_FREQ);
+ /*
+ * Polling flag resets to 0 at the max rate of once per update window
+ * (at least 500ms interval). smp_wmb is required after group->polling
+ * 0-to-1 transition to order groupc->times and group->polling writes
+ * because stall detection logic in the slowpath relies on groupc->times
+ * changing before group->polling. Explicit smp_wmb is missing because
+ * cmpxchg() implies smp_mb.
+ */
+ if ((state_mask & group->trigger_mask) &&
+ atomic_cmpxchg(&group->polling, 0, 1) == 0) {
+ /*
+ * Start polling immediately even if the work is already
+ * scheduled
+ */
+ mod_delayed_work(system_wq, &group->clock_work, 1);
+ } else {
+ if (!delayed_work_pending(&group->clock_work))
+ schedule_delayed_work(&group->clock_work, PSI_FREQ);
+ }
}
and I'm too lazy to go in and figure out how to fix it.

If we're sure about "psi: fix aggregation idle shut-off" (and I am not)
then can I ask for a redo of "psi: introduce psi monitor"?
On Mon, Jan 28, 2019 at 3:06 PM Andrew Morton <[email protected]> wrote:
>
> [...]
>
> If we're sure about "psi: fix aggregation idle shut-off" (and I am not)
> then can I ask for a redo of "psi: introduce psi monitor"?
Yes, I'll redo it after applying "psi: fix aggregation idle shut-off".
I have to repost it anyway to address the comment from Johannes.
Hi Andrew,
On Mon, Jan 28, 2019 at 3:06 PM Andrew Morton <[email protected]> wrote:
>
> [...]
>
> If we're sure about "psi: fix aggregation idle shut-off" (and I am not)
> then can I ask for a redo of "psi: introduce psi monitor"?
I resolved the conflict with the "psi: introduce psi monitor" patch
and posted v4 at https://lore.kernel.org/lkml/[email protected];
however, please be advised that it also includes additional cleanup
changes that are yet to be reviewed.

The first 4 patches in this series are already in linux-next, so this
one should apply cleanly there. Please let me know if it creates any
other issues.

Thanks,
Suren.