2023-09-13 06:22:45

by Gautham R. Shenoy

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> Hi Aaron,
>
> thanks for the review,
>
> On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:

[..snip..]

> > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > >  {
> > >          if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > -            sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > +            sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > +                if (sched_feat(SIS_CACHE) &&
> > > +                    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > +                        return -1;
> > > +
> >
> > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > like available_idle_cpu_migrate() so that above and below logic can be
> > simplified a bit?
> >
>
> Yes, that would be simpler, I'll do in next version.
>
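[ For illustration, such a helper might look roughly like the sketch
below. Only the name comes from the suggestion above; the body is an
assumption pieced together from the hunk being discussed, not code
from this series:

/*
 * Like the available_idle_cpu()/sched_idle_cpu() pair, but also
 * require that the rq's cache-hot window (SIS_CACHE) has expired.
 */
static inline int available_idle_cpu_migrate(int cpu)
{
        if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
                return 0;

        if (sched_feat(SIS_CACHE) &&
            sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
                return 0;

        return 1;
}

__select_idle_cpu() could then reduce to checking
available_idle_cpu_migrate(cpu) && sched_cpu_cookie_match(cpu_rq(cpu), p). ]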
> > I was thinking to simply add that rq->cache_hot_timeout check to
> > available_idle_cpu() but then a long sleeping task could be forced to
> > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > cache_hot_timeout. I guess that's why you chose to only change the idle
> > semantic in select_idle_cpu() but not in select_idle_sibling()?
> >
>
> Yes, sort of. And the reason I did not put this cache-hot check in available_idle_cpu()
> or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> If the task falls asleep and wakes up quickly, its previous idle CPU will also be skipped,
> thus no one could use this CPU within the cache-hot period, including the cache-hot task
> itself.

This happens even with this patch, right? It is possible for a task p1
whose avg sleep time is "t" to go to sleep, which causes its CPU to go
idle. When it wakes up after a time t' < t, the logic above skips the
idle CPU because it is still cache-hot, despite the fact that it is
cache-hot for p1!

Have you considered recording p1's identity in the
rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
return the previous CPU if it is cache hot and the wakee is
rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
scan?
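[ A minimal sketch of the idea, reusing cache_hot_timeout from the
patch; the rq->cache_hot_sleeper field and the helper below are
hypothetical, not part of the series:

/* At dequeue time: remember which sleeping task made this rq cache-hot. */
rq->cache_hot_sleeper = p;

/* Early in select_task_rq_fair(): is this wakee the reserved sleeper? */
static inline bool cache_hot_sleeper_wakeup(struct task_struct *p, int prev_cpu)
{
        struct rq *rq = cpu_rq(prev_cpu);

        return sched_feat(SIS_CACHE) &&
               rq->cache_hot_sleeper == p &&
               available_idle_cpu(prev_cpu) &&
               sched_clock_cpu(prev_cpu) < rq->cache_hot_timeout;
}

If it returns true, select_task_rq_fair() could return prev_cpu
directly, skipping the select_idle_sibling() scan. ]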

>
> thanks,
> Chenyu

--
Thanks and Regards
gautham.


2023-09-13 21:04:06

by Chen Yu

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Gautham,

thanks for the review,

On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote:
> On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> > Hi Aaron,
> >
> > thanks for the review,
> >
> > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
>
> [..snip..]
>
> > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > > >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > > >  {
> > > >          if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > > -            sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > > +            sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > > +                if (sched_feat(SIS_CACHE) &&
> > > > +                    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > > +                        return -1;
> > > > +
> > >
> > > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > > like available_idle_cpu_migrate() so that above and below logic can be
> > > simplified a bit?
> > >
> >
> > Yes, that would be simpler, I'll do in next version.
> >
> > > I was thinking to simply add that rq->cache_hot_timeout check to
> > > available_idle_cpu() but then a long sleeping task could be forced to
> > > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > > cache_hot_timeout. I guess that's why you chose to only change the idle
> > > semantic in select_idle_cpu() but not in select_idle_sibling()?
> > >
> >
> > Yes, sort of. And the reason I did not put this cache-hot check in available_idle_cpu()
> > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > If the task falls asleep and wakes up quickly, its previous idle CPU will also be skipped,
> > thus no one could use this CPU within the cache-hot period, including the cache-hot task
> > itself.
>
> This happens even with this patch, right? It is possible for a task p1
> whose avg sleep time is "t" to go to sleep, which causes its CPU to go
> idle. When it wakes up after a time t' < t, the logic above skips the
> idle CPU because it is still cache-hot, despite the fact that it is
> cache-hot for p1!
>
I'm not sure I understand correctly: in select_idle_sibling(), p1's previous
CPU will be checked first, and that check does not involve cache-hot status.
So if p1's previous CPU is idle, it will be picked, no?

if (prev != target && cpus_share_cache(prev, target) &&
    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
    asym_fits_cpu(task_util, util_min, util_max, prev))
        return prev;

Or do you mean, in select_idle_cpu(), we will re-check p1's previous
CPU but it is skipped due to cache-hot?

> Have you considered recording p1's identity in the
> rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> return the previous CPU if it is cache hot and the wakee is
> rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> scan?
>

Yes, this seems to be doable. One problem would be: if two or more
dequeued tasks prefer the same (previous) CPU, which task should be
the rq->cache_hot_sleeper? And per Mathieu's feedback[1], we want to
deal with multiple dequeued tasks. If we record all of them, it might
be costly.

[1] https://lore.kernel.org/lkml/[email protected]/

thanks,
Chenyu

2023-09-14 15:49:29

by Gautham R. Shenoy

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hello Chen Yu,

On Wed, Sep 13, 2023 at 03:25:14PM +0800, Chen Yu wrote:
> Hi Gautham,
>
> thanks for the review,
>
> On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote:
> > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> > > Hi Aaron,
> > >
> > > thanks for the review,
> > >
> > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> >
> > [..snip..]
> >
> > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > > > >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > > > >  {
> > > > >          if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > > > -            sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > > > +            sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > > > +                if (sched_feat(SIS_CACHE) &&
> > > > > +                    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > > > +                        return -1;
> > > > > +
> > > >
> > > > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > > > like available_idle_cpu_migrate() so that above and below logic can be
> > > > simplified a bit?
> > > >
> > >
> > > Yes, that would be simpler, I'll do in next version.
> > >
> > > > I was thinking to simply add that rq->cache_hot_timeout check to
> > > > available_idle_cpu() but then a long sleeping task could be forced to
> > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > > > cache_hot_timeout. I guess that's why you chose to only change the idle
> > > > semantic in select_idle_cpu() but not in select_idle_sibling()?
> > > >
> > >
> > > Yes, sort of. And the reason I did not put this cache-hot check in available_idle_cpu()
> > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > > If the task falls asleep and wakes up quickly, its previous idle CPU will also be skipped,
> > > thus no one could use this CPU within the cache-hot period, including the cache-hot task
> > > itself.
> >
> > This happens even with this patch, right? It is possible for a task p1
> > whose avg sleep time is "t" to go to sleep, which causes its CPU to go
> > idle. When it wakes up after a time t' < t, the logic above skips the
> > idle CPU because it is still cache-hot, despite the fact that it is
> > cache-hot for p1!
> >
> I'm not sure I understand correctly: in select_idle_sibling(), p1's previous
> CPU will be checked first, and that check does not involve cache-hot status.
> So if p1's previous CPU is idle, it will be picked, no?
>
> if (prev != target && cpus_share_cache(prev, target) &&
>     (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
>     asym_fits_cpu(task_util, util_min, util_max, prev))
>         return prev;


You are right, but the "if" condition that allows p1 to pick the
previous CPU here is the one prior to "if (prev != target ...)".
However, note that it is not just p1 that can pick the previous CPU
as intended by your patch.

Consider the following case:

* p1' ran on CPUX, and went to sleep. We now update the
cache_hot_timeout for CPUX based on the sleep-avg of p1'.

* When the CPU goes idle, during CPU_NEWLY_IDLE load balancing (or
due to a shared_runq search), the CPU could pull p1 from another CPU
Y.

* p1 now runs on CPUX, and goes to sleep.

* We update the cache_hot_timeout for CPUX based on the sleep-avg of p1.

* p1' wakes up and picks prev as the target after the wake_affine()
check in select_task_rq_fair().

* In select_idle_sibling() it encounters the following check, which
succeeds, and thus target is returned.

if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
    asym_fits_cpu(task_util, util_min, util_max, target))
        return target;

* p1 now wakes up and picks the previous CPU as the target. However,
available_idle_cpu() is no longer true.

So despite "reserving" the CPU for p1, which is likely to have its
data still hot in the cache, we would have scheduled p1' there, thus
defeating the whole purpose of the reservation.

To be honest, this isn't so bad, because we have been able to avoid a
migration in this case.

>
> Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> CPU but it is skipped due to cache-hot?

I had originally thought about this, but then as you pointed out we
have an opportunity to pick the previous CPU in the early checks
inside select_idle_sibling().

>
> > Have you considered recording p1's identity in the
> > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > return the previous CPU if it is cache hot and the wakee is
> > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > scan?
> >
>
> Yes, this seems to be doable. One problem would be: if two or more
> dequeued tasks prefer the same (previous) CPU, which task should be
> the rq->cache_hot_sleeper? And per Mathieu's feedback[1], we want to
> deal with multiple dequeued tasks. If we record all of them, it might
> be costly.

If there are multiple dequeued tasks, then it doesn't make sense to
record the identity of the tasks. However, we need the bailout to
happen much earlier, in select_task_rq_fair(), perhaps even before
the want_affine() checks.

After all, if the previous CPU is idle, and its cache_hot_timeout
hasn't expired, and if the wakee's sleep duration is less than the
cache_hot_timeout, why don't we just pick it here and be done with it?
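[ A sketch of such an early bailout, assuming the rq->cache_hot_timeout
field from the patch; the exact placement and the WF_TTWU guard are
assumptions, not part of the series:

/* In select_task_rq_fair(), before the want_affine() checks: */
if (sched_feat(SIS_CACHE) && (wake_flags & WF_TTWU)) {
        struct rq *prev_rq = cpu_rq(prev_cpu);

        /* The reservation window is still open: stay on prev_cpu. */
        if (available_idle_cpu(prev_cpu) &&
            sched_clock_cpu(prev_cpu) < prev_rq->cache_hot_timeout)
                return prev_cpu;
}
]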

>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> thanks,
> Chenyu

--
Thanks and Regards
gautham.

2023-09-15 00:11:40

by Chen Yu

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Gautham,

On 2023-09-14 at 12:36:33 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
>
> On Wed, Sep 13, 2023 at 03:25:14PM +0800, Chen Yu wrote:
> > Hi Gautham,
> >
> > thanks for the review,
> >
> > On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote:
> > > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> > > > Hi Aaron,
> > > >
> > > > thanks for the review,
> > > >
> > > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> > >
> > > [..snip..]
> > >
> > > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > > > > >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > > > > >  {
> > > > > >          if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > > > > -            sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > > > > +            sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > > > > +                if (sched_feat(SIS_CACHE) &&
> > > > > > +                    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > > > > +                        return -1;
> > > > > > +
> > > > >
> > > > > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > > > > like available_idle_cpu_migrate() so that above and below logic can be
> > > > > simplified a bit?
> > > > >
> > > >
> > > > Yes, that would be simpler, I'll do in next version.
> > > >
> > > > > I was thinking to simply add that rq->cache_hot_timeout check to
> > > > > available_idle_cpu() but then a long sleeping task could be forced to
> > > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > > > > cache_hot_timeout. I guess that's why you chose to only change the idle
> > > > > semantic in select_idle_cpu() but not in select_idle_sibling()?
> > > > >
> > > >
> > > > Yes, sort of. And the reason I did not put this cache-hot check in available_idle_cpu()
> > > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > > > If the task falls asleep and wakes up quickly, its previous idle CPU will also be skipped,
> > > > thus no one could use this CPU within the cache-hot period, including the cache-hot task
> > > > itself.
> > >
> > > This happens even with this patch, right? It is possible for a task p1
> > > whose avg sleep time is "t" to go to sleep, which causes its CPU to go
> > > idle. When it wakes up after a time t' < t, the logic above skips the
> > > idle CPU because it is still cache-hot, despite the fact that it is
> > > cache-hot for p1!
> > >
> > I'm not sure I understand correctly: in select_idle_sibling(), p1's previous
> > CPU will be checked first, and that check does not involve cache-hot status.
> > So if p1's previous CPU is idle, it will be picked, no?
> >
> > if (prev != target && cpus_share_cache(prev, target) &&
> >     (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> >     asym_fits_cpu(task_util, util_min, util_max, prev))
> >         return prev;
>
>
> You are right, but the "if" condition that allows p1 to pick the
> previous CPU here is the one prior to "if (prev != target ...)".
> However, note that it is not just p1 that can pick the previous CPU
> as intended by your patch.
>
> Consider the following case:
>
> * p1' ran on CPUX, and went to sleep. We now update the
> cache_hot_timeout for CPUX based on the sleep-avg of p1'.
>
> * When the CPU goes idle, during CPU_NEWLY_IDLE load balancing (or
> due to a shared_runq search), the CPU could pull p1 from another CPU
> Y.

Agreed, newidle balance could clobber the reservation of the cache-hot CPU.

>
> * p1 now runs on CPUX, and goes to sleep.
>
> * We update the cache_hot_timeout for CPUX based on the sleep-avg of p1.
>
> * p1' wakes up and picks prev as the target after the wake_affine()
> check in select_task_rq_fair().
>
> * In select_idle_sibling() it encounters the following check, which
> succeeds, and thus target is returned.
>
> if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
>     asym_fits_cpu(task_util, util_min, util_max, target))
>         return target;
>
> * p1 now wakes up and picks the previous CPU as the target. However,
> available_idle_cpu() is no longer true.
>
> So despite "reserving" the CPU for p1, which is likely to have its
> data still hot in the cache, we would have scheduled p1' there, thus
> defeating the whole purpose of the reservation.
>

I see. So you mean, although we reserve the CPU for the wakee,
the wakee might not choose its previous CPU, which is against our
goal.

The reasons that prevent the wakee from choosing its previous CPU could be:
1. wake_affine() chooses the waker's CPU rather than the wakee's previous CPU, or
2. the wakee's previous CPU has already been taken by someone else, via newidle_balance().

For 1, I think Prateek has expressed the concern. One mitigation method could be
to penalize that wakee if it decides not to choose its previous CPU:

"
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
if (new_cpu != prev_cpu)
p->burst_sleep_avg >>= 1;
So the duration of reservation could be shrinked.
"

For 2, maybe inhibit the newidle balance; something like this in my mind:


--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         u64 t0, t1, curr_cost = 0;
         struct sched_domain *sd;
         int pulled_task = 0;
+        bool cache_hot = false;

         update_misfit_status(NULL, this_rq);

@@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         rcu_read_lock();
         sd = rcu_dereference_check_sched_domain(this_rq->sd);

+        if (sched_feat(SIS_CACHE)) {
+                s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
+
+                /*
+                 * If a short-sleeping task will be woken up on this idle
+                 * CPU soon, do not launch the newidle balance.
+                 */
+                if (delta > 0 && delta < this_rq->max_idle_balance_cost)
+                        cache_hot = true;
+        }
+
         if (!READ_ONCE(this_rq->rd->overload) ||
-            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
+            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {

                 if (sd)
                         update_next_balance(sd, &next_balance);


> To be honest, this isn't so bad, because we have been able to avoid a
> migration in this case.
>
> >
> > Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> > CPU but it is skipped due to cache-hot?
>
> I had originally thought about this, but then as you pointed out we
> have an opportunity to pick the previous CPU in the early checks
> inside select_idle_sibling().
>
> >
> > > Have you considered recording p1's identity in the
> > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > > return the previous CPU if it is cache hot and the wakee is
> > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > > scan?
> > >
> >
> > Yes, this seems to be doable. One problem would be: if two or more
> > dequeued tasks prefer the same (previous) CPU, which task should be
> > the rq->cache_hot_sleeper? And per Mathieu's feedback[1], we want to
> > deal with multiple dequeued tasks. If we record all of them, it might
> > be costly.
>
> If there are multiple dequeued tasks, then it doesn't make sense to
> record the identity of the tasks. However, we need the bailout to
> happen much earlier, in select_task_rq_fair(), perhaps even before
> the want_affine() checks.
>
> After all, if the previous CPU is idle, and its cache_hot_timeout
> hasn't expired, and if the wakee's sleep duration is less than the
> cache_hot_timeout, why don't we just pick it here and be done with it?
>

Yes, we can return the previous CPU earlier. One concern is whether we
should honor the WF_SYNC flag or not, because in wake_affine_idle(),
WF_SYNC seems to have a higher priority than available_idle_cpu(prev_cpu).
Say the current CPU has 1 running task and the previous CPU is idle:
wake_affine_idle() still prefers the current CPU.
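[ For reference, the ordering in wake_affine_idle(), abridged from
kernel/sched/fair.c of that time with the comments rewritten -- note
that the WF_SYNC shortcut sits above the idle-prev_cpu check:

static int wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
        /* this_cpu idle and cache-affine: prefer an idle prev_cpu. */
        if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
                return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

        /* WF_SYNC: the waker is about to sleep, so pick its CPU... */
        if (sync && cpu_rq(this_cpu)->nr_running == 1)
                return this_cpu;

        /* ...and only afterwards consider an idle prev_cpu. */
        if (available_idle_cpu(prev_cpu))
                return prev_cpu;

        return nr_cpumask_bits;
}
]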

thanks,
Chenyu

2023-09-16 00:26:57

by Gautham R. Shenoy

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hello Chen Yu,

On Thu, Sep 14, 2023 at 08:09:26PM +0800, Chen Yu wrote:
[..snip..]

> >
> > So despite "reserving" the CPU for p1, which is likely to have its
> > data still hot in the cache, we would have scheduled p1' there, thus
> > defeating the whole purpose of the reservation.
> >
>
> I see. So you mean, although we reserve the CPU for the wakee,
> the wakee might not choose its previous CPU, which is against our
> goal.


Yes, but only because some other task could have run on the previous
CPU. That other task could be something that was woken up on that CPU
due to:

1) wake-affine choosing that CPU
2) newidle-balance pulling the other task on that CPU
3) !wake-affine && that CPU was also the other task's previous CPU

It could also be that this wakee task itself gets woken up on the
waker CPU due to wake-affine.

>
> The reasons that prevent the wakee from choosing its previous CPU could be:
> 1. wake_affine() chooses the waker's CPU rather than the wakee's previous CPU, or
> 2. the wakee's previous CPU has already been taken by someone else, via newidle_balance().
>


> For 1, I think Prateek has expressed the concern. One mitigation method could be
> to penalize that wakee if it decides not to choose its previous CPU:

We would be penalizing the task for something that the scheduler
decides :-)

As you point out below, in the presence of the WF_SYNC flag,
wake_affine_idle() prefers the waker CPU over the previous CPU when
they are on different LLCs and when the waker is the only running task.

This strategy makes sense for two reasons:

1) The wakee may be consuming the data produced by the waker.
2) Since the wakeup will happen on the local CPU, there is no risk of
task-stacking, exactly what your SIS_CURRENT patchset was
attempting.

But this strategy would also result in increased task migration, which
both Mathieu and you have found is not so beneficial for workloads
such as hackbench. Is it only because the task's data is still hot in
the previous CPU's cache? Or is there more to it?


It would be good to confirm if this is why lower migration is better
for these kinds of workloads.

>
> "
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> if (new_cpu != prev_cpu)
> p->burst_sleep_avg >>= 1;
> So the duration of reservation could be shrinked.
> "
>
> For 2, maybe inhibit the newidle balance; something like this in my mind:
>
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>          u64 t0, t1, curr_cost = 0;
>          struct sched_domain *sd;
>          int pulled_task = 0;
> +        bool cache_hot = false;
>
>          update_misfit_status(NULL, this_rq);
>
> @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>          rcu_read_lock();
>          sd = rcu_dereference_check_sched_domain(this_rq->sd);
>
> +        if (sched_feat(SIS_CACHE)) {
> +                s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
> +
> +                /*
> +                 * If a short-sleeping task will be woken up on this idle
> +                 * CPU soon, do not launch the newidle balance.
> +                 */
> +                if (delta > 0 && delta < this_rq->max_idle_balance_cost)
> +                        cache_hot = true;
> +        }
> +
>          if (!READ_ONCE(this_rq->rd->overload) ||
> -            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> +            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {

>
>                  if (sd)
>                          update_next_balance(sd, &next_balance);

If the benefit that the workload obtains is really due to the data
being hot near its previous CPU, then this seems a sensible thing to
do.

It would be good to confirm this. Let me get some IBS data for
hackbench, which is the workload that likes a sticky wakeup.

--
Thanks and Regards
gautham.



>
>
> > To be honest, this isn't so bad, because we have been able to avoid a
> > migration in this case.
> >
> > >
> > > Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> > > CPU but it is skipped due to cache-hot?
> >
> > I had originally thought about this, but then as you pointed out we
> > have an opportunity to pick the previous CPU in the early checks
> > inside select_idle_sibling().
> >
> > >
> > > > Have you considered recording p1's identity in the
> > > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > > > return the previous CPU if it is cache hot and the wakee is
> > > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > > > scan?
> > > >
> > >
> > > Yes, this seems to be doable. One problem would be: if two or more
> > > dequeued tasks prefer the same (previous) CPU, which task should be
> > > the rq->cache_hot_sleeper? And per Mathieu's feedback[1], we want to
> > > deal with multiple dequeued tasks. If we record all of them, it might
> > > be costly.
> >
> > If there are multiple dequeued tasks, then it doesn't make sense to
> > record the identity of the tasks. However, we need the bailout to
> > happen much earlier, in select_task_rq_fair(), perhaps even before
> > the want_affine() checks.
> >
> > After all, if the previous CPU is idle, and its cache_hot_timeout
> > hasn't expired, and if the wakee's sleep duration is less than the
> > cache_hot_timeout, why don't we just pick it here and be done with it?
> >
>
> Yes, we can return the previous CPU earlier. One concern is whether we
> should honor the WF_SYNC flag or not, because in wake_affine_idle(),
> WF_SYNC seems to have a higher priority than available_idle_cpu(prev_cpu).
> Say the current CPU has 1 running task and the previous CPU is idle:
> wake_affine_idle() still prefers the current CPU.
>
> thanks,
> Chenyu

2023-09-19 11:04:38

by Chen Yu

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Gautham,

Sorry for the late reply,

On 2023-09-15 at 20:48:14 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
>
> On Thu, Sep 14, 2023 at 08:09:26PM +0800, Chen Yu wrote:
> [..snip..]
>
> > >
> > > So despite "reserving" the CPU for p1, which is likely to have its
> > > data still hot in the cache, we would have scheduled p1' there, thus
> > > defeating the whole purpose of the reservation.
> > >
> >
> > I see. So you mean, although we reserve the CPU for the wakee,
> > the wakee might not choose its previous CPU, which is against our
> > goal.
>
>
> Yes, but only because some other task could have run on the previous
> CPU. That other task could be something that was woken up on that CPU
> due to:
>
> 1) wake-affine choosing that CPU
> 2) newidle-balance pulling the other task on that CPU
> 3) !wake-affine && that CPU was also the other task's previous CPU
>
> It could also be that this wakee task itself gets woken up on the
> waker CPU due to wake-affine.
>
> >
> > The reasons that prevent the wakee from choosing its previous CPU could be:
> > 1. wake_affine() chooses the waker's CPU rather than the wakee's previous CPU, or
> > 2. the wakee's previous CPU has already been taken by someone else, via newidle_balance().
> >
>
>
> > For 1, I think Prateek has expressed the concern. One mitigation method could be
> > to penalize that wakee if it decides not to choose its previous CPU:
>
> We would be penalizing the task for something that the scheduler
> decides :-)
>
> As you point out below, in the presence of the WF_SYNC flag,
> wake_affine_idle() prefers the waker CPU over the previous CPU when
> they are on different LLCs and when the waker is the only running task.
>
> This strategy makes sense for two reasons:
>
> 1) The wakee may be consuming the data produced by the waker.
> 2) Since the wakeup will happen on the local CPU, there is no risk of
> task-stacking, exactly what your SIS_CURRENT patchset was
> attempting.
>
> But this strategy would also result in increased task migration, which
> both Mathieu and you have found is not so beneficial for workloads
> such as hackbench. Is it only because the task's data is still hot in
> the previous CPU's cache? Or is there more to it?
>
>
> It would be good to confirm if this is why lower migration is better
> for these kinds of workloads.
>

Right. According to the previous hackbench test for the shared runqueue,
higher migration brings a higher DSB miss rate [1]. I'll collect some
statistics with this patch applied to confirm.

[1] https://lore.kernel.org/lkml/ZO7e5YaS71cXVxQN@chenyu5-mobl2/
> >
> > "
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > if (new_cpu != prev_cpu)
> > p->burst_sleep_avg >>= 1;
> > So the duration of reservation could be shrinked.
> > "
> >
> > For 2, maybe inhibit the newidle balance; something like this in my mind:
> >
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >          u64 t0, t1, curr_cost = 0;
> >          struct sched_domain *sd;
> >          int pulled_task = 0;
> > +        bool cache_hot = false;
> >
> >          update_misfit_status(NULL, this_rq);
> >
> > @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >          rcu_read_lock();
> >          sd = rcu_dereference_check_sched_domain(this_rq->sd);
> >
> > +        if (sched_feat(SIS_CACHE)) {
> > +                s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
> > +
> > +                /*
> > +                 * If a short-sleeping task will be woken up on this idle
> > +                 * CPU soon, do not launch the newidle balance.
> > +                 */
> > +                if (delta > 0 && delta < this_rq->max_idle_balance_cost)
> > +                        cache_hot = true;
> > +        }
> > +
> >          if (!READ_ONCE(this_rq->rd->overload) ||
> > -            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > +            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {
>
> >
> >                  if (sd)
> >                          update_next_balance(sd, &next_balance);
>
> If the benefit that the workload obtains is really due to the data
> being hot near its previous CPU, then this seems a sensible thing to
> do.
>
> It would be good to confirm this. Let me get some IBS data for
> hackbench, which is the workload that likes a sticky wakeup.
>

I'll collect the statistics too. Thanks for your time.

thanks,
Chenyu