2023-10-18 15:51:48

by Srikar Dronamraju

Subject: [PATCH] sched/fair: Enable group_asym_packing in find_idlest_group

The scheduler currently does not handle SD_ASYM_PACKING in the
find_idlest_cpu path. On a few architectures, such as PowerPC, the cache
is per core, so moving threads across cores can cause cache misses.

While asym_packing can be enabled above the SMT level, enabling it
across cores could hurt performance because of those cache misses.
However, if the initial task placement via find_idlest_cpu takes
asym_packing into account, the scheduler can avoid later asym_packing
migrations. This results in fewer migrations, better packing, and
better overall performance.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb225921bbca..7164f79a3d13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9931,11 +9931,13 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
* @group: sched_group whose statistics are to be updated.
* @sgs: variable to hold the statistics for this group.
* @p: The task for which we look for the idlest group/CPU.
+ * @this_cpu: current cpu
*/
static inline void update_sg_wakeup_stats(struct sched_domain *sd,
struct sched_group *group,
struct sg_lb_stats *sgs,
- struct task_struct *p)
+ struct task_struct *p,
+ int this_cpu)
{
int i, nr_running;

@@ -9972,6 +9974,11 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,

}

+ if (sd->flags & SD_ASYM_PACKING && sgs->sum_h_nr_running &&
+ sched_asym_prefer(group->asym_prefer_cpu, this_cpu)) {
+ sgs->group_asym_packing = 1;
+ }
+
sgs->group_capacity = group->sgc->capacity;

sgs->group_weight = group->group_weight;
@@ -10012,8 +10019,17 @@ static bool update_pick_idlest(struct sched_group *idlest,
return false;
break;

- case group_imbalanced:
case group_asym_packing:
+ if (sched_asym_prefer(group->asym_prefer_cpu, idlest->asym_prefer_cpu)) {
+ int busy_cpus = idlest_sgs->group_weight - idlest_sgs->idle_cpus;
+
+ busy_cpus -= (sgs->group_weight - sgs->idle_cpus);
+ if (busy_cpus >= 0)
+ return true;
+ }
+ return false;
+
+ case group_imbalanced:
case group_smt_balance:
/* Those types are not used in the slow wakeup path */
return false;
@@ -10080,7 +10096,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
sgs = &tmp_sgs;
}

- update_sg_wakeup_stats(sd, group, sgs, p);
+ update_sg_wakeup_stats(sd, group, sgs, p, this_cpu);

if (!local_group && update_pick_idlest(idlest, &idlest_sgs, group, sgs)) {
idlest = group;
@@ -10112,6 +10128,17 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
if (local_sgs.group_type > idlest_sgs.group_type)
return idlest;

+ if (idlest_sgs.group_type == group_asym_packing) {
+ if (sched_asym_prefer(idlest->asym_prefer_cpu, local->asym_prefer_cpu)) {
+ int busy_cpus = local_sgs.group_weight - local_sgs.idle_cpus;
+
+ busy_cpus -= (idlest_sgs.group_weight - idlest_sgs.idle_cpus);
+ if (busy_cpus >= 0)
+ return idlest;
+ }
+ return NULL;
+ }
+
switch (local_sgs.group_type) {
case group_overloaded:
case group_fully_busy:
--
2.31.1


2023-12-15 04:11:00

by Srikar Dronamraju

Subject: Re: [PATCH] sched/fair: Enable group_asym_packing in find_idlest_group

* Srikar Dronamraju <[email protected]> [2023-10-18 21:20:35]:

Hi Ingo, Peter,

> Current scheduler code doesn't handle SD_ASYM_PACKING in the
> find_idlest_cpu path. On few architectures, like Powerpc, cache is at a
> core. Moving threads across cores may end up in cache misses.
>
> While asym_packing can be enabled above SMT level, enabling Asym packing
> across cores could result in poorer performance due to cache misses.
> However if the initial task placement via find_idlest_cpu does take
> Asym_packing into consideration, then scheduler can avoid asym_packing
> migrations. This will result in lesser migrations and better packing and
> better overall performance.
>
> Signed-off-by: Srikar Dronamraju <[email protected]>

I haven't heard any comments or seen any reviews on this patch.
I have verified that it still applies cleanly on v6.7-rc5 based
tip/master.

Please let me know your thoughts on it.

Also let me know if you would like me to repost it.

--
Thanks and Regards
Srikar Dronamraju

2024-01-04 15:51:36

by Shrikanth Hegde

Subject: Re: [PATCH] sched/fair: Enable group_asym_packing in find_idlest_group


On 10/18/23 9:20 PM, Srikar Dronamraju wrote:

Hi Srikar,

> Current scheduler code doesn't handle SD_ASYM_PACKING in the
> find_idlest_cpu path. On few architectures, like Powerpc, cache is at a
> core. Moving threads across cores may end up in cache misses.
>
> While asym_packing can be enabled above SMT level, enabling Asym packing
> across cores could result in poorer performance due to cache misses.
> However if the initial task placement via find_idlest_cpu does take
> Asym_packing into consideration, then scheduler can avoid asym_packing
> migrations. This will result in lesser migrations and better packing and
> better overall performance.
>

This would handle the asym packing case when finding an idle CPU for a newly woken
up task, thereby reducing the number of migrations if the task is placed correctly in
the first place. I think that's helpful.

Currently, Intel cluster and PowerVM shared LPARs are the two cases where ASYM PACKING
is enabled at a domain higher than SMT. Is that correct, or is there any other topology?

+tim

> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cb225921bbca..7164f79a3d13 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9931,11 +9931,13 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
> * @group: sched_group whose statistics are to be updated.
> * @sgs: variable to hold the statistics for this group.
> * @p: The task for which we look for the idlest group/CPU.
> + * @this_cpu: current cpu
> */
> static inline void update_sg_wakeup_stats(struct sched_domain *sd,
> struct sched_group *group,
> struct sg_lb_stats *sgs,
> - struct task_struct *p)
> + struct task_struct *p,
> + int this_cpu)
> {
> int i, nr_running;
>
> @@ -9972,6 +9974,11 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
>
> }
>
> + if (sd->flags & SD_ASYM_PACKING && sgs->sum_h_nr_running &&
> + sched_asym_prefer(group->asym_prefer_cpu, this_cpu)) {
> + sgs->group_asym_packing = 1;
> + }
> +


I think there is a corner case here that could be taken care of. Please correct me if I
am wrong.

Assume there are four sched groups, sg1, sg2, sg3 and sg4, and asym packing is enabled at sd.
sg1 and sg3 have one task each, and a new task is being created, so find_idlest_cpu is
called for this new task.

Because of the sgs->sum_h_nr_running check, sg1 and sg3 will be group_asym_packing, while
sg2 and sg4 will be group_has_spare. update_pick_idlest will choose the lowest type,
i.e. group_has_spare, so the tie would be between sg2 and sg4. Because of asym packing (at least
in the powerpc shared LPAR case), sg4 will have lower utilization than sg2, and hence sg4
will be picked as the idlest. On the next load balance, sg2 will pull the task from sg4 due to
asym packing.

Could this additional migration be avoided if we omit the sum_h_nr_running check?
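
The corner case above can be modelled outside the kernel. The following is only a minimal sketch: every `model_*` name is hypothetical, SD_ASYM_PACKING is assumed to be set, and CPU priority is assumed to favour lower CPU numbers. It is not the kernel's actual classification code, just the condition from the hunk above with the sum_h_nr_running check made optional.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Only the two group types relevant to this scenario; lower value wins. */
enum model_group_type {
	MODEL_HAS_SPARE = 0,
	MODEL_ASYM_PACKING = 1,
};

/* Hypothetical stand-in for the per-group state the patch inspects. */
struct model_group {
	int asym_prefer_cpu;           /* most preferred CPU of the group */
	unsigned int sum_h_nr_running; /* tasks currently running in the group */
};

/* Assumption: a lower CPU number means a higher asym priority. */
static bool model_asym_prefer(int a, int b)
{
	return -a > -b;
}

/*
 * Mirrors the condition added to update_sg_wakeup_stats(), assuming
 * SD_ASYM_PACKING is set. With require_running == false the
 * sum_h_nr_running check is dropped.
 */
static enum model_group_type
model_classify(const struct model_group *g, int this_cpu, bool require_running)
{
	bool running_ok = !require_running || g->sum_h_nr_running > 0;

	if (running_ok && model_asym_prefer(g->asym_prefer_cpu, this_cpu))
		return MODEL_ASYM_PACKING;
	return MODEL_HAS_SPARE;
}

/* Count how many of the n groups end up with the given type. */
static size_t model_count(const struct model_group *groups, size_t n,
			  int this_cpu, bool require_running,
			  enum model_group_type type)
{
	size_t i, count = 0;

	assert(groups != NULL);
	for (i = 0; i < n; i++)
		if (model_classify(&groups[i], this_cpu, require_running) == type)
			count++;
	return count;
}
```

With four groups whose preferred CPUs are 0, 4, 8 and 12, one task each in the first and third, and this_cpu = 15, the model marks only the two groups that are running a task as asym packing when the check is kept. When the check is dropped, all four groups get that type, so the empty groups are no longer ranked as a "better" type than the preferred ones.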


> sgs->group_capacity = group->sgc->capacity;
>
> sgs->group_weight = group->group_weight;
> @@ -10012,8 +10019,17 @@ static bool update_pick_idlest(struct sched_group *idlest,
> return false;
> break;
>
> - case group_imbalanced:
> case group_asym_packing:
> + if (sched_asym_prefer(group->asym_prefer_cpu, idlest->asym_prefer_cpu)) {
> + int busy_cpus = idlest_sgs->group_weight - idlest_sgs->idle_cpus;
> +
> + busy_cpus -= (sgs->group_weight - sgs->idle_cpus);
> + if (busy_cpus >= 0)
> + return true;


Wouldn't using idle_cpus be simpler? Something like:

if (sgs->idle_cpus > idlest_sgs->idle_cpus)
	return true;
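
For comparison, here is a small standalone sketch of the two tests: the busy-CPU difference used in the patch and an idle-CPU comparison along the lines suggested here. The `model_*` names are hypothetical and the struct holds only the two counters involved. Under these assumptions the two tests agree when both groups have the same weight and different idle counts, but they can differ on ties and when the group weights differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the per-group statistics used by both tests. */
struct model_stats {
	unsigned int group_weight; /* number of CPUs in the group */
	unsigned int idle_cpus;    /* number of idle CPUs in the group */
};

/* Busy CPUs = weight - idle; converted to int so differences can go negative. */
static int model_busy_cpus(const struct model_stats *s)
{
	assert(s->idle_cpus <= s->group_weight);
	return (int)(s->group_weight - s->idle_cpus);
}

/* Test used in the patch: candidate has no more busy CPUs than idlest. */
static bool model_busy_test(const struct model_stats *idlest,
			    const struct model_stats *candidate)
{
	return model_busy_cpus(idlest) - model_busy_cpus(candidate) >= 0;
}

/* Suggested alternative: candidate has strictly more idle CPUs. */
static bool model_idle_test(const struct model_stats *idlest,
			    const struct model_stats *candidate)
{
	return candidate->idle_cpus > idlest->idle_cpus;
}
```

The idle-CPU form compares the counters directly instead of subtracting them, because subtracting two unsigned counters and testing the result against zero would not behave like a signed difference.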

> + }
> + return false;
> +
> + case group_imbalanced:
> case group_smt_balance:
> /* Those types are not used in the slow wakeup path */
> return false;
> @@ -10080,7 +10096,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> sgs = &tmp_sgs;
> }
>
> - update_sg_wakeup_stats(sd, group, sgs, p);
> + update_sg_wakeup_stats(sd, group, sgs, p, this_cpu);
>
> if (!local_group && update_pick_idlest(idlest, &idlest_sgs, group, sgs)) {
> idlest = group;
> @@ -10112,6 +10128,17 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> if (local_sgs.group_type > idlest_sgs.group_type)
> return idlest;
>
> + if (idlest_sgs.group_type == group_asym_packing) {
> + if (sched_asym_prefer(idlest->asym_prefer_cpu, local->asym_prefer_cpu)) {
> + int busy_cpus = local_sgs.group_weight - local_sgs.idle_cpus;
> +
> + busy_cpus -= (idlest_sgs.group_weight - idlest_sgs.idle_cpus);
> + if (busy_cpus >= 0)
> + return idlest;
> + }
> + return NULL;
> + }

Same comment here about using idle_cpus.

> +
> switch (local_sgs.group_type) {
> case group_overloaded:
> case group_fully_busy:

2024-01-10 00:58:51

by Tim Chen

Subject: Re: [PATCH] sched/fair: Enable group_asym_packing in find_idlest_group

On Thu, 2024-01-04 at 21:20 +0530, Shrikanth Hegde wrote:
> On 10/18/23 9:20 PM, Srikar Dronamraju wrote:
>
> Hi Srikar,
>
> > Current scheduler code doesn't handle SD_ASYM_PACKING in the
> > find_idlest_cpu path. On few architectures, like Powerpc, cache is at a
> > core. Moving threads across cores may end up in cache misses.
> >
> > While asym_packing can be enabled above SMT level, enabling Asym packing
> > across cores could result in poorer performance due to cache misses.
> > However if the initial task placement via find_idlest_cpu does take
> > Asym_packing into consideration, then scheduler can avoid asym_packing
> > migrations. This will result in lesser migrations and better packing and
> > better overall performance.
> >
>
> This would handle asym packing case when finding the idle CPU for newly woken
> up task and thereby reducing the number of migrations if it is placed correctly in
> the first place. I think thats helpful.
>
> Currently intel cluster and powerVM shared LPAR's are the two where ASYM PACKING
> is enabled at higher domain than SMT. Is that correct or is there any other topology?
>
> +tim
>
> > Signed-off-by: Srikar Dronamraju <[email protected]>
> > ---
> > kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++---
> > 1 file changed, 30 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index cb225921bbca..7164f79a3d13 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9931,11 +9931,13 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
> > * @group: sched_group whose statistics are to be updated.
> > * @sgs: variable to hold the statistics for this group.
> > * @p: The task for which we look for the idlest group/CPU.
> > + * @this_cpu: current cpu
> > */
> > static inline void update_sg_wakeup_stats(struct sched_domain *sd,
> > struct sched_group *group,
> > struct sg_lb_stats *sgs,
> > - struct task_struct *p)
> > + struct task_struct *p,
> > + int this_cpu)
> > {
> > int i, nr_running;
> >
> > @@ -9972,6 +9974,11 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
> >
> > }
> >
> > + if (sd->flags & SD_ASYM_PACKING && sgs->sum_h_nr_running &&
> > + sched_asym_prefer(group->asym_prefer_cpu, this_cpu)) {
> > + sgs->group_asym_packing = 1;

I disagree with the above criteria for doing asym_packing.

I think asym packing only makes sense if you have an idle CPU available
in the group that is preferred over this_cpu, and you have fewer
tasks than CPUs. Using group->asym_prefer_cpu
is inappropriate, as that most preferred CPU may be busy.
You should be migrating the task from this_cpu to the highest
priority idle CPU identified.

If the group is fully busy or overloaded, we should stick with the original
logic of picking the most lightly loaded group and not use asym_packing.

You may want to note down the highest priority idle CPU in the group
(the most preferred one, if the group has more than one idle CPU) and use it
when comparing two groups that have idle CPUs.

Tim
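
One possible reading of the criteria described above, as a standalone sketch rather than kernel code: the `model_*` names are hypothetical, CPU priority is assumed to favour lower CPU numbers, and the group is represented as a plain array of CPUs with an idle flag. The helper records the highest priority idle CPU of a group, and packing is considered only if such a CPU exists, the group has fewer tasks than CPUs, and that idle CPU is preferred over this_cpu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU view of a group: CPU number and whether it is idle. */
struct model_cpu {
	int id;
	bool idle;
};

/* Assumption: a lower CPU number means a higher asym priority. */
static bool model_asym_prefer(int a, int b)
{
	return -a > -b;
}

/* Return the highest priority idle CPU of the group, or -1 if none is idle. */
static int model_best_idle_cpu(const struct model_cpu *cpus, size_t n)
{
	int best = -1;
	size_t i;

	assert(cpus != NULL);
	for (i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (best < 0 || model_asym_prefer(cpus[i].id, best))
			best = cpus[i].id;
	}
	return best;
}

/*
 * Pack into the group only if it has an idle CPU, runs fewer tasks
 * than it has CPUs, and its best idle CPU is preferred over this_cpu.
 */
static bool model_should_pack(const struct model_cpu *cpus, size_t n,
			      unsigned int nr_tasks, int this_cpu)
{
	int best = model_best_idle_cpu(cpus, n);

	if (best < 0 || nr_tasks >= n)
		return false;
	return model_asym_prefer(best, this_cpu);
}
```

In this model a group whose most preferred CPU is busy is judged by its best idle CPU instead, and a group with no idle CPU is never selected for packing, which leaves the fully busy and overloaded cases to the existing load-based comparison.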

> > + }
> > +
>
>
> I think there is a corner case here which could be taken care. please correct me if i
> am wrong.
>
> Assume there are four sched groups, sg1, sg2, sg3 and sg4. asym packing is enabled at sd.
> sg1, and sg3 have one task each and a new task is being created. So find_idlest_cpu is
> called for this new task.
>
> Because of sgs->sum_h_nr_running check, sg1 and sg3 will have group_asym_packing, while
> sg2 and sg4 will have group_has_spare. update_pick_idlest will choose the lowest type.
> so group_has_spare. TIE would be between sg2 and sg4. Because of asym packing (atleast true
> for powerpc shared LPAR case) sg4 will have lower utilization compared to sg2, and hence sg4
> will be given as the idlest_cpu. On the next load balance sg2 will pull task from sg4 due to
> asym packing.
>
> Additional migration may be avoided if we omit the sum_h_nr_running check?
>
>
> > sgs->group_capacity = group->sgc->capacity;
> >
> > sgs->group_weight = group->group_weight;
> > @@ -10012,8 +10019,17 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > return false;
> > break;
> >
> > - case group_imbalanced:
> > case group_asym_packing:
> > + if (sched_asym_prefer(group->asym_prefer_cpu, idlest->asym_prefer_cpu)) {
> > + int busy_cpus = idlest_sgs->group_weight - idlest_sgs->idle_cpus;
> > +
> > + busy_cpus -= (sgs->group_weight - sgs->idle_cpus);
> > + if (busy_cpus >= 0)
> > + return true;
>
>
> wouldn't using idle_cpus would be simpler? something like,
>
> if (sgs->idle_cpus - idlest->idle_cpus > 0)
> return true
>
> > + }
> > + return false;
> > +
> > + case group_imbalanced:
> > case group_smt_balance:
> > /* Those types are not used in the slow wakeup path */
> > return false;
> > @@ -10080,7 +10096,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > sgs = &tmp_sgs;
> > }
> >
> > - update_sg_wakeup_stats(sd, group, sgs, p);
> > + update_sg_wakeup_stats(sd, group, sgs, p, this_cpu);
> >
> > if (!local_group && update_pick_idlest(idlest, &idlest_sgs, group, sgs)) {
> > idlest = group;
> > @@ -10112,6 +10128,17 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > if (local_sgs.group_type > idlest_sgs.group_type)
> > return idlest;
> >
> > + if (idlest_sgs.group_type == group_asym_packing) {
> > + if (sched_asym_prefer(idlest->asym_prefer_cpu, local->asym_prefer_cpu)) {
> > + int busy_cpus = local_sgs.group_weight - local_sgs.idle_cpus;
> > +
> > + busy_cpus -= (idlest_sgs.group_weight - idlest_sgs.idle_cpus);
> > + if (busy_cpus >= 0)
> > + return idlest;
> > + }
> > + return NULL;
> > + }
>
> same comment of using idle_cpus
>
> > +
> > switch (local_sgs.group_type) {
> > case group_overloaded:
> > case group_fully_busy:
>