When selecting the busiest scheduling group between two otherwise identical
groups of types asym_packing or fully_busy, IPC classes can be used to
break the tie.
Compute the IPC class performance score for a scheduling group. It is
defined as the sum of the IPC scores of the tasks at the back of each
runqueue in the group. Load balancing pulls tasks from the back of the
runqueue first, which makes this tiebreaker more useful.
Also, track the IPC class with the lowest score in the scheduling group. A
task of this class will be pulled when the destination CPU has lower
priority than the fully_busy busiest group.
These two metrics will be used during idle load balancing to compute the
current and the potential IPC class score of a scheduling group in a
subsequent changeset.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Do not compute the IPCC stats using the current tasks of runqueues.
Instead, use the tasks at the back of the queue. These are the tasks
that will be pulled first during load balance. (Vincent)
Changes since v2:
* Also excluded deadline and realtime tasks from IPCC stats. (Dietmar)
* Also excluded tasks that cannot run on the destination CPU from the
IPCC stats.
* Folded struct sg_lb_ipcc_stats into struct sg_lb_stats. (Dietmar)
 * Reworded the description of sg_lb_stats::min_ipcc. (Ionela)
* Handle errors of arch_get_ipcc_score(). (Ionela)
Changes since v1:
* Implemented cleanups and reworks from PeterZ. Thanks!
* Used the new interface names.
---
kernel/sched/fair.c | 79 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6189d1a45635..c0cab5e501b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9110,6 +9110,11 @@ struct sg_lb_stats {
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
#endif
+#ifdef CONFIG_IPC_CLASSES
+ unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
+ unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
+ unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
+#endif
};
/*
@@ -9387,6 +9392,77 @@ group_type group_classify(unsigned int imbalance_pct,
return group_has_spare;
}
+#ifdef CONFIG_IPC_CLASSES
+static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
+{
+ /* All IPCC stats have been set to zero in update_sg_lb_stats(). */
+ sgs->min_score = ULONG_MAX;
+}
+
+static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
+{
+ struct list_head *tasks = &rq->cfs_tasks;
+ struct task_struct *p;
+ struct rq_flags rf;
+ int ret = -EINVAL;
+
+ rq_lock_irqsave(rq, &rf);
+ if (list_empty(tasks))
+ goto out;
+
+ p = list_last_entry(tasks, struct task_struct, se.group_node);
+ if (p->flags & PF_EXITING || is_idle_task(p) ||
+ !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
+ goto out;
+
+ ret = 0;
+ *ipcc = p->ipcc;
+out:
+ rq_unlock(rq, &rf);
+ return ret;
+}
+
+/* Called only if cpu_of(@rq) is not idle and has tasks running. */
+static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
+ struct rq *rq)
+{
+ unsigned short ipcc;
+ unsigned long score;
+
+ if (!sched_ipcc_enabled())
+ return;
+
+ if (rq_last_task_ipcc(dst_cpu, rq, &ipcc))
+ return;
+
+ score = arch_get_ipcc_score(ipcc, cpu_of(rq));
+
+ /*
+ * Ignore tasks with invalid scores. When finding the busiest group, we
+ * prefer those with higher sum_score. This group will not be selected.
+ */
+ if (IS_ERR_VALUE(score))
+ return;
+
+ sgs->sum_score += score;
+
+ if (score < sgs->min_score) {
+ sgs->min_score = score;
+ sgs->min_ipcc = ipcc;
+ }
+}
+
+#else /* CONFIG_IPC_CLASSES */
+static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
+ struct rq *rq)
+{
+}
+
+static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
+{
+}
+#endif /* CONFIG_IPC_CLASSES */
+
/**
* sched_use_asym_prio - Check whether asym_packing priority must be used
* @sd: The scheduling domain of the load balancing
@@ -9477,6 +9553,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
int i, nr_running, local_group;
memset(sgs, 0, sizeof(*sgs));
+ init_rq_ipcc_stats(sgs);
local_group = group == sds->local;
@@ -9526,6 +9603,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (sgs->group_misfit_task_load < load)
sgs->group_misfit_task_load = load;
}
+
+ update_sg_lb_ipcc_stats(env->dst_cpu, sgs, rq);
}
sgs->group_capacity = group->sgc->capacity;
--
2.25.1
On Mon, Jun 12, 2023 at 09:24:04PM -0700, Ricardo Neri wrote:
> +static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
> +{
> + struct list_head *tasks = &rq->cfs_tasks;
> + struct task_struct *p;
> + struct rq_flags rf;
> + int ret = -EINVAL;
> +
> + rq_lock_irqsave(rq, &rf);
> + if (list_empty(tasks))
> + goto out;
> +
> + p = list_last_entry(tasks, struct task_struct, se.group_node);
> + if (p->flags & PF_EXITING || is_idle_task(p) ||
> + !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
> + goto out;
> +
> + ret = 0;
> + *ipcc = p->ipcc;
> +out:
> + rq_unlock(rq, &rf);
This should be rq_unlock_irqrestore(). I will correct it in the next version.
Hi Ricardo,
On Monday 12 Jun 2023 at 21:24:04 (-0700), Ricardo Neri wrote:
> When selecting the busiest scheduling group between two otherwise identical
> groups of types asym_packing or fully_busy, IPC classes can be used to
> break the tie.
>
> Compute the IPC class performance score for a scheduling group. It is
> defined as the sum of the IPC scores of the tasks at the back of each
> runqueue in the group. Load balancing starts by pulling tasks from the back
> of the runqueue first, making this tiebreaker more useful.
>
> Also, track the IPC class with the lowest score in the scheduling group. A
> task of this class will be pulled when the destination CPU has lower
> priority than the fully_busy busiest group.
>
> These two metrics will be used during idle load balancing to compute the
> current and the potential IPC class score of a scheduling group in a
> subsequent changeset.
>
> [...]
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6189d1a45635..c0cab5e501b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9110,6 +9110,11 @@ struct sg_lb_stats {
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> #endif
> +#ifdef CONFIG_IPC_CLASSES
> + unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
^^^^^^^^^^^^^^
> + unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> + unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
^^^^^^^^^^^^^^
These no longer apply. It might be easier to describe them all in a
comment just above their declaration. Something like:
"The sum, min and its class of the IPC scores of the tasks at the back of each
runqueue in the group."
> +#endif
> };
>
> /*
> @@ -9387,6 +9392,77 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
> }
>
> +#ifdef CONFIG_IPC_CLASSES
> +static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
> +{
> + /* All IPCC stats have been set to zero in update_sg_lb_stats(). */
> + sgs->min_score = ULONG_MAX;
> +}
> +
> +static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
> +{
> + struct list_head *tasks = &rq->cfs_tasks;
> + struct task_struct *p;
> + struct rq_flags rf;
> + int ret = -EINVAL;
> +
It's more typical for ret to be initialised to 0 and changed to an error
value when there's an error case.
> + rq_lock_irqsave(rq, &rf);
> + if (list_empty(tasks))
> + goto out;
> +
> + p = list_last_entry(tasks, struct task_struct, se.group_node);
> + if (p->flags & PF_EXITING || is_idle_task(p) ||
> + !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
> + goto out;
> +
> + ret = 0;
> + *ipcc = p->ipcc;
> +out:
> + rq_unlock(rq, &rf);
> + return ret;
> +}
> +
> +/* Called only if cpu_of(@rq) is not idle and has tasks running. */
> +static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> + struct rq *rq)
> +{
> + unsigned short ipcc;
> + unsigned long score;
> +
> + if (!sched_ipcc_enabled())
> + return;
> +
> + if (rq_last_task_ipcc(dst_cpu, rq, &ipcc))
> + return;
> +
> + score = arch_get_ipcc_score(ipcc, cpu_of(rq));
> +
> + /*
> + * Ignore tasks with invalid scores. When finding the busiest group, we
> + * prefer those with higher sum_score. This group will not be selected.
> + */
nit: the comment is unnecessary, and a bit misleading, IMO.
The comment says "This group will not be selected." but the only way to
guarantee that here is to reset the sum_score to 0 when you find an
invalid score, which I don't believe is your intention.
Also the use of sum_score is captured in later functions, so I don't
believe there's a need for additional comments here.
Hope it helps,
Ionela.
On Thu, Jun 22, 2023 at 10:01:48AM +0100, Ionela Voinescu wrote:
> Hi Ricardo,
Hi Ionela,
Thank you very much for reviewing the patches!
>
> On Monday 12 Jun 2023 at 21:24:04 (-0700), Ricardo Neri wrote:
> > [...]
> > @@ -9110,6 +9110,11 @@ struct sg_lb_stats {
> > unsigned int nr_numa_running;
> > unsigned int nr_preferred_running;
> > #endif
> > +#ifdef CONFIG_IPC_CLASSES
> > + unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> ^^^^^^^^^^^^^^
> > + unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> > + unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> ^^^^^^^^^^^^^^
> These no longer apply.
Indeed, I missed this. I will update them in the next version.
> It might be easier to describe them all in a
> comment just above their declaration. Something like:
>
> "The sum, min and its class of the IPC scores of the tasks at the back of each
> runqueue in the group."
I am hesitant to describe the three members in a single comment as it
would not comply with what Documentation/doc-guide/kernel-doc.rst
specifies.
Admittedly, the current format does not comply either. :) I would opt to
be consistent with the existing format.
>
> > [...]
> > +static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
> > +{
> > + struct list_head *tasks = &rq->cfs_tasks;
> > + struct task_struct *p;
> > + struct rq_flags rf;
> > + int ret = -EINVAL;
> > +
>
> It's more typical for ret to be initialised to 0 and changed to an error
> value when there's an error case.
My intention was to save lines of code. I would need to set the error
value and then goto out. I am ok with your suggestion if it improves
readability.
>
> > [...]
> > +/* Called only if cpu_of(@rq) is not idle and has tasks running. */
> > +static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > + struct rq *rq)
> > +{
> > + unsigned short ipcc;
> > + unsigned long score;
> > +
> > + if (!sched_ipcc_enabled())
> > + return;
> > +
> > + if (rq_last_task_ipcc(dst_cpu, rq, &ipcc))
> > + return;
> > +
> > + score = arch_get_ipcc_score(ipcc, cpu_of(rq));
> > +
> > + /*
> > + * Ignore tasks with invalid scores. When finding the busiest group, we
> > + * prefer those with higher sum_score. This group will not be selected.
> > + */
>
> nit: the comment is unnecessary, and a bit misleading, IMO.
>
> The comment says "This group will not be selected." but the only way to
> guarantee that here is to reset the sum_score to 0 when you find an
> invalid score, which I don't believe is your intention.
You raise an interesting point. We will call this function for each rq
in the sched group. Thus, if we encounter an error, the scores would be
incomplete. Therefore, I think that the scores should be discarded and
reset to 0 so that they are not used, IMO.
>
> Also the use of sum_score is captured in later functions, so I don't
> believe there's a need for additional comments here.
Sure, I can remove the comment. Or maybe rephrase it as per my previous
comment?
>
> Hope it helps,
It does! Thanks again for taking the time.
BR,
Ricardo
> >
> > nit: the comment is unnecessary, and a bit misleading, IMO.
> >
> > The comment says "This group will not be selected." but the only way to
> > guarantee that here is to reset the sum_score to 0 when you find an
> > invalid score, which I don't believe is your intention.
>
> You raise an interesting point. We will call this function for each rq
> in the sched group. Thus, if we encounter an error, the scores would be
> incomplete. Therefore, I think that the scores should be discarded and
> reset to 0 so that they are not used, IMO.
Since we still have other rqs to loop through, resetting to 0 here does
not guarantee that the sum will stay that way at the end of the loop,
as later runqueues could still add to it. We may need a special value
like -EINVAL, and to make the score a "long", to mark such a case.
Tim
On Mon, Jun 26, 2023 at 12:52:59PM -0700, Tim Chen wrote:
>
> > >
> > > nit: the comment is unnecessary, and a bit misleading, IMO.
> > >
> > > The comment says "This group will not be selected." but the only way to
> > > guarantee that here is to reset the sum_score to 0 when you find an
> > > invalid score, which I don't believe is your intention.
> >
> > You raise an interesting point. We will call this function for each rq
> > in the sched group. Thus, if we encounter an error, the scores would be
> > incomplete. Therefore, I think that the scores should be discarded and
> > reset to 0 so that they are not used, IMO.
>
> Since we still have other rqs to loop through, resetting to 0 here does
> not guarantee that the sum will stay that way at the end of the loop,
> as later runqueues could still add to it.
That is true. To discard the scores we would need an indication that we
should not keep iterating over the runqueues of this group.
> We may need a special value like -EINVAL, and to make the score a
> "long", to mark such a case.
IIUC, unsigned longs do allow handling negative error values if you use
IS_ERR_VALUE(); but yes, it looks odd.