2023-02-15 01:56:41

by Sun Shouxin

Subject: [RESEND PATCH] sched: Initialize sd_llc_id

In my tests, I used isolcpus to isolate specific CPUs, and then noticed
inconsistent behavior when binding tasks to cores.

For example, the NUMA topology is as follows,
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63

and the 'isolcpus' is as follows,
isolcpus=14,15,30,31,46,47,62,63
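
(On a running system the effective isolated set can be double-checked
via sysfs:

  cat /sys/devices/system/cpu/isolated

which here should report the CPUs from the isolcpus= line above.)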

A task initially running on a non-isolated core belonging to NUMA node 0
was bound to an isolated core on NUMA node 1. After changing its CPU
affinity back to all cores, I noticed the task was scheduled back to a
non-isolated core on NUMA node 0.

1. taskset -pc 0-13 3512 (task running on core 1)
2. taskset -pc 63 3512 (task running on isolated core 63)
3. taskset -pc 0-63 3512 (task running on core 1)

In another case, a task initially running on a non-isolated core
belonging to NUMA node 1 was bound to an isolated core on NUMA node 1.
After changing its CPU affinity back to all cores, the task was never
scheduled away and kept running on the isolated core.

1. taskset -pc 16-29 3512 (task running on core 17)
2. taskset -pc 63 3512 (task running on isolated core 63)
3. taskset -pc 0-63 3512 (task still running on core 63,
   never scheduled away)

The root cause is that isolcpus leaves sd_llc_id uninitialized on the
isolated CPUs, so it keeps its default value of 0, which breaks
cpus_share_cache() in the following path:
select_task_rq_fair()
select_idle_sibling()
cpus_share_cache()
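
For reference, cpus_share_cache() (kernel/sched/core.c) boils down to a
comparison of the two CPUs' sd_llc_id values:

bool cpus_share_cache(int this_cpu, int that_cpu)
{
	if (this_cpu == that_cpu)
		return true;

	/* stays at the static default if update_top_cache_domain() never ran */
	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}

Since update_top_cache_domain() is never run for CPUs attached to a NULL
sched_domain, an isolated CPU keeps sd_llc_id == 0 and spuriously appears
to share a cache with whichever LLC really got id 0.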

Suggested-by: Hu Yadi <[email protected]>
Signed-off-by: Sun Shouxin <[email protected]>
---
kernel/sched/topology.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..89e98d410a8f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -663,7 +663,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
*/
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_llc_id) = -1;
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
--
2.27.0



2023-02-15 18:11:54

by Valentin Schneider

Subject: Re: [RESEND PATCH] sched: Initialize sd_llc_id

On 14/02/23 17:54, Sun Shouxin wrote:
> In my tests, I used isolcpus to isolate specific CPUs, and then noticed
> inconsistent behavior when binding tasks to cores.
>
> For example, the NUMA topology is as follows,
> NUMA node0 CPU(s): 0-15,32-47
> NUMA node1 CPU(s): 16-31,48-63
>
> and the 'isolcpus' is as follows,
> isolcpus=14,15,30,31,46,47,62,63
>
> A task initially running on a non-isolated core belonging to NUMA node 0
> was bound to an isolated core on NUMA node 1. After changing its CPU
> affinity back to all cores, I noticed the task was scheduled back to a
> non-isolated core on NUMA node 0.
>
> 1. taskset -pc 0-13 3512 (task running on core 1)
> 2. taskset -pc 63 3512 (task running on isolated core 63)
> 3. taskset -pc 0-63 3512 (task running on core 1)
>

This is working as intended, no?

> In another case, a task initially running on a non-isolated core
> belonging to NUMA node 1 was bound to an isolated core on NUMA node 1.
> After changing its CPU affinity back to all cores, the task was never
> scheduled away and kept running on the isolated core.
>
> 1. taskset -pc 16-29 3512 (task running on core 17)
> 2. taskset -pc 63 3512 (task running on isolated core 63)
> 3. taskset -pc 0-63 3512 (task still running on core 63,
>    never scheduled away)
>

And this is also not wrong, since CPU63 is in the task's affinity mask.

That said, I can see that in this case we'd want the task to use other CPUs
if it makes sense wrt load balance.

However, since CPU63 is attached to a NULL sched_domain, AFAIA your
solution is at the mercy of the @prev and @target CPUs passed to
select_idle_sibling(). So this might only work if the waker is on a
non-isolated CPU.

I don't think your patch is wrong, but I don't think it entirely fixes the
issue either. Unfortunately, due to isolated CPUs being attached to NULL
sched_domains, there isn't a magic solution as the majority of scheduler
decisions are based on these.

A safe bet would be to exclude isolated CPUs from the affinity of your
non-critical tasks. Things like TuneD [1] and/or cpusets could help.
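
For instance, a minimal cgroup v2 cpuset sketch (the 'housekeeping'
group name and the CPU list are illustrative, derived from the isolcpus=
line above):

  # make the cpuset controller available to child cgroups
  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/housekeeping
  # every CPU except isolcpus=14,15,30,31,46,47,62,63
  echo 0-13,16-29,32-45,48-61 > /sys/fs/cgroup/housekeeping/cpuset.cpus
  # move the non-critical task there
  echo $PID > /sys/fs/cgroup/housekeeping/cgroup.procs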

[1]: https://github.com/redhat-performance/tuned