2023-09-13 22:55:23

by K Prateek Nayak

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hello Chenyu,

On 9/11/2023 3:49 PM, Chen Yu wrote:
> Hi Prateek,
>
> thanks for your review,
>
> On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>> [..snip..]
>>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
>>> kernel/sched/features.h | 1 +
>>> kernel/sched/sched.h | 1 +
>>> 3 files changed, 29 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> hrtick_update(rq);
>>> now = sched_clock_cpu(cpu_of(rq));
>>> p->se.prev_sleep_time = task_sleep ? now : 0;
>>> +#ifdef CONFIG_SMP
>>> + /*
>>> + * If this rq will become idle, and dequeued task is
>>> + * a short sleeping one, check if we can reserve
>>> + * this idle CPU for that task for a short while.
>>> + * During this reservation period, other wakees will
>>> + * skip this 'idle' CPU in select_idle_cpu(), and this
>>> + * short sleeping task can pick its previous CPU in
>>> + * select_idle_sibling(), which brings better cache
>>> + * locality.
>>> + */
>>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
>>> + rq->cache_hot_timeout = now + p->se.sleep_avg;
>>> +#endif
>>> }
>>>
>>> #ifdef CONFIG_SMP
>>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>>> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>>> {
>>> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>>> - sched_cpu_cookie_match(cpu_rq(cpu), p))
>>> + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
>>> + if (sched_feat(SIS_CACHE) &&
>>> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
>>> + return -1;
>>
>> Just wondering,
>>
>> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
>> finding one in its search for an idle core, would returning a "cache-hot
>> idle CPU" be better than returning previous CPU / current CPU if all
>> idle CPUs found during the search in select_idle_cpu() are marked
>> cache-hot?
>>
>
> This is a good point, we can optimize this further. For now I have only
> sent a simple version to demonstrate how we can leverage the task's
> sleep time.
>
>> Speaking of cache-hot idle CPU, is netperf actually more happy with
>> piling on current CPU?
>
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.
>
>> I ask this because the logic seems to be
>> reserving the previous CPU for a task that dislikes migration but I
>> do not see anything in the wake_affine_idle() path that would make the
>> short sleeper proactively choose the previous CPU when the wakeup is
>> marked with the WF_SYNC flag. Let me know if I'm missing something?
>>
>
> If I understand correctly, WF_SYNC is to let the wakee be woken up
> on the waker's CPU, rather than the wakee's previous CPU, because
> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> the wakee's previous CPU. We can only ensure that other wakees do not
> occupy the previous CPU, but we do not increase the possibility that
> wake_affine_idle() chooses the previous CPU.

Correct me if I'm wrong here,

Say a short sleeper is always woken up using the WF_SYNC flag. When the
task is dequeued, we mark the previous CPU where it ran as "cache-hot"
and keep other wakeups from selecting it until the "cache_hot_timeout"
is crossed. Let us assume a perfect world where the task wakes up before
the "cache_hot_timeout" expires. Logically this CPU was reserved all
this while for the short sleeper, but since the wakeup bears the WF_SYNC
flag, the whole reservation is ignored and the waker's LLC is explored.

Should the timeout be cleared if the wakeup decides to not target the
previous CPU? (The default "sysctl_sched_migration_cost" is probably
small enough to curb any side effect that could possibly show here, but
if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
a larger value, the wakeup path might be affected, with a lot of idle
targets overlooked because the CPUs stay marked cache-hot for a longer
duration.)
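
To make the question concrete, here is a minimal user-space sketch of the
idea, not kernel code. "struct rq_model" and the three helper names are
hypothetical stand-ins for illustration only. The first two helpers mirror
the dequeue-side and search-side checks from the patch (leaving out the
sched_feat(SIS_CACHE), task_sleep and nr_running conditions); the third is
the assumed addition that drops the reservation once the wakeup settles on
some other CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU runqueue; only the new field. */
struct rq_model {
	uint64_t cache_hot_timeout;	/* reservation deadline, in ns */
};

/*
 * Dequeue side, modelled on the patch: reserve the CPU for a short
 * sleeper. A zero sleep_avg, or one at or above migration_cost,
 * leaves the timeout untouched.
 */
static void reserve_for_sleeper(struct rq_model *rq, uint64_t now,
				uint64_t sleep_avg, uint64_t migration_cost)
{
	if (sleep_avg && sleep_avg < migration_cost)
		rq->cache_hot_timeout = now + sleep_avg;
}

/* Search side, modelled on the patch: skip the CPU while it is reserved. */
static bool cpu_is_cache_hot(const struct rq_model *rq, uint64_t now)
{
	return now < rq->cache_hot_timeout;
}

/*
 * Assumed addition (not part of the posted patch): once the wakeup has
 * picked a target other than the previous CPU, release the reservation
 * so that other wakees can use that idle CPU again.
 */
static void wakeup_selected(struct rq_model *rqs, int prev_cpu, int target_cpu)
{
	if (target_cpu != prev_cpu)
		rqs[prev_cpu].cache_hot_timeout = 0;
}
```

With this, a WF_SYNC wakeup that lands on the waker's CPU would call
wakeup_selected() with a target different from the previous CPU, and the
previous CPU would stop being skipped right away instead of staying
reserved until the timeout expires.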

Let me know what you think.

>
> Say there are two tasks, t1 and t2. t1's previous CPU is p1.
> We don't influence whether wake_affine_idle() chooses p1 when t1 is
> woken up, but we make sure t2 will not choose p1.
>
>> To confirm this can you look at the trend in migration count with and
>> without the series? Also the ratio of cache-hot idle CPUs to number
>> of CPUs searched can help estimate overheads of additional search - I
>> presume SIS_UTIL is efficient at curbing the additional search in
>> a busy system.
>
> OK, I'll collect these statistics.

Thank you :)

>
> [..snip..]
>

--
Thanks and Regards,
Prateek