2023-09-29 18:28:48

by David Vernet

Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
> Hello David,
>
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >
> > Hi Prateek,
> >
> >> Even with the two patches, I still observe the following lock
> >> contention when profiling the tbench 128-clients run with IBS:
> >>
> >>   - 12.61% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >>      - 10.94% native_queued_spin_lock_slowpath
> >>         - 10.73% _raw_spin_lock
> >>            - 9.57% __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>            - 0.82% task_rq_lock
> >>                 newidle_balance
> >>                 pick_next_task_fair
> >>                 __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>
> >> Since David mentioned the rq->avg_idle check is probably not the right
> >> step towards the solution, this experiment introduces a per-shard
> >> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
> >> flag notifies of the possibility of one or more rqs covered by the
> >> shard's domain having a queued task. The shard's overload flag is set
> >> at the same time as "rq->rd->overload", and is cleared when the shard's
> >> list is found to be empty.
> >
> > I think this is an interesting idea, but I feel that it's still working
> > against the core proposition of SHARED_RUNQ, which is to enable work
> > conservation.
>
> I don't think so! Work conservation is possible if there is an
> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
> have 16 CPUs, 15 of which are running these 15 tasks, and one going

I'm not sure I'm fully following. Those 15 tasks would not be enqueued
in the shared runq if they were being run. They would be dequeued from
the shared_runq in __dequeue_entity(), which would be called from
set_next_entity() before they were run. In this case, the
shard->overload check should be equivalent to the
!list_empty(&shard->list) check.

Oh, or is the idea that we're not bothering to pull them from the
shared_runq if they're being woken up and enqueued on an idle core that
will immediately run them on the next resched path? If so, I wonder if
we would instead just want to not enqueue the task in the shared_runq at
all? Consider that if another task comes in on an rq with
rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
were being woken up on idle cores (nor take the overhead of inserting
and then immediately removing them from the shared_runq).
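
To make that concrete, a rough sketch of the idea (the struct, field and
helper names here only approximate the RFC patches, not the actual code):

static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
{
	/* Hypothetical helper to find the shard covering this rq. */
	struct shared_runq_shard *shard = rq_shared_runq_shard(rq);

	/*
	 * A task woken onto an rq that isn't overloaded will be picked on
	 * the next resched pass anyway; skipping the enqueue avoids
	 * inserting it into the shard only to immediately remove it.
	 */
	if (rq->nr_running < 2)
		return;

	raw_spin_lock(&shard->lock);
	list_add_tail(&p->shared_runq_node, &shard->list);
	raw_spin_unlock(&shard->lock);
}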

> idle. Work is conserved. What we need to worry about is tasks being in
> the shared_runq that are queued on their respective CPU. This can only
> happen if any one of the rq has nr_running >= 2, which is also the point
> we are setting "shard->overload".

Assuming this is about the "wakeup / enqueue to idle core" case, ok,
this makes sense. I still think it probably makes more sense to just not
enqueue in the shared_runq for this case though, which would allow us to
instead just rely on list_empty(&shard->list).

> Now the situation can change later and all tasks in the shared_runq might be
> running on their respective CPUs, but "shard->overload" is only cleared when
> the shared_runq becomes empty. If this is too late, maybe we can clear
> it if periodic load balancing finds no queuing (somewhere around the
> time we update nr_idle_scan).
>
> So the window where we do not go ahead with popping a task from the
> shared_runq_shard->list is between the list being empty and at least one
> of the CPU associated with the shard reporting nr_running >= 2, which is
> when work conservation is needed.

So, I misread your patch the first time I reviewed it, and for some
reason thought you were only setting shard->overload on the
load_balance(). That's obviously not the case, and I now understand it
better, modulo my points above being clarified.

> >
> >> With these changes, following are the results for tbench 128-clients:
> >
> > Just to make sure I understand, this is to address the contention we're
> > observing on tbench with 64 - 256 clients, right? That's my
> > understanding from Gautham's reply in [0].
> >
> > [0]: https://lore.kernel.org/all/[email protected]/
>
> I'm not sure if Gautham saw a contention with IBS but he did see an
> abnormal blowup in newidle_balance() counts which he suspected were the
> cause for the regression. I noticed the rq lock contention after doing a
> fair bit of surgery. Let me go check if that was the case with vanilla
> v3 too.
>
> >
> > If so, are we sure this change won't regress other workloads that would
> > have benefited from the work conservation?
>
> I don't think we'll regress any workloads as I explained above because
> the "overload" flag being 0 almost always (since update/access is not
> atomic) indicates a case where the tasks cannot be pulled. However, that
> needs to be tested since there is a small behavior change in
> shared_runq_pick_next_task(): where previously, if the task was running
> on a CPU, we would have popped it from the shared_runq, done some lock
> fiddling before finding out it was running, and some more lock fiddling
> before the function returned "-1", now with the changes here it'll
> simply return a "0". Although that is correct, we have seen some
> interesting cases in the past [1] where random lock contention actually
> helps certain benchmarks ¯\_(ツ)_/¯

I don't think we need to worry about less lock contention possibly
hurting benchmarks :-)

> [1] https://lore.kernel.org/all/[email protected]/
>
> >
> > Also, I assume that you don't see the improved contention without this,
> > even if you include your fix to the newidle_balance() that has us skip
> > over the <= LLC domain?
>
> No improvements! The lock is still very much contended for. I wonder if
> it could be because of the unlocking and locking of the rq again in
> shared_runq_pick_next_task() even when the task is on a CPU. Also, since
> it returns -1 for this case, pick_next_task_fair() will return
> RETRY_TASK, which can have further implications.

Yeah, I could see it being an issue if we're essentially thrashing tasks
in the shared_runq that are just temporarily enqueued in the shared_runq
between activate and doing a resched pass on an idle core.

Unfortunately, I don't think we have any choice but to drop and
reacquire the rq lock. It's not safe to look at task_cpu(p) without
pi_lock, and we can't safely acquire that without dropping the rq lock.
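
Concretely, the unavoidable dance is something along these lines (just a
fragment to illustrate the locking order; task_rq_lock() takes p->pi_lock
and then the task's rq lock, which is why our own rq lock has to go first):

	struct rq_flags src_rf;
	struct rq *src_rq;

	rq_unlock(this_rq, rf);			/* drop our rq lock first */

	src_rq = task_rq_lock(p, &src_rf);	/* p->pi_lock, then task_rq(p)->lock */
	if (!task_on_cpu(src_rq, p) && task_cpu(p) != cpu_of(this_rq)) {
		/* only now is it safe to decide to migrate p */
	}
	task_rq_unlock(src_rq, p, &src_rf);

	rq_lock(this_rq, rf);			/* reacquire before returning to the pick path */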

Thanks,
David


2023-10-04 04:22:01

by K Prateek Nayak

Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

Hello David,

Thank you for answering my queries, I'll leave some data below to
answer yours.

On 9/29/2023 10:31 PM, David Vernet wrote:
> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>
>>> Hi Prateek,
>>>
>>>> Even with the two patches, I still observe the following lock
>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>
>>>>   - 12.61% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>         - 10.73% _raw_spin_lock
>>>>            - 9.57% __schedule
>>>>                 schedule_idle
>>>>                 do_idle
>>>>               + cpu_startup_entry
>>>>            - 0.82% task_rq_lock
>>>>                 newidle_balance
>>>>                 pick_next_task_fair
>>>>                 __schedule
>>>>                 schedule_idle
>>>>                 do_idle
>>>>               + cpu_startup_entry
>>>>
>>>> Since David mentioned the rq->avg_idle check is probably not the right
>>>> step towards the solution, this experiment introduces a per-shard
>>>> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
>>>> flag notifies of the possibility of one or more rqs covered by the
>>>> shard's domain having a queued task. The shard's overload flag is set
>>>> at the same time as "rq->rd->overload", and is cleared when the shard's
>>>> list is found to be empty.
>>>
>>> I think this is an interesting idea, but I feel that it's still working
>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>> conservation.
>>
>> I don't think so! Work conservation is possible if there is an
>> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
>> have 16 CPUs, 15 of which are running these 15 tasks, and one going
>
> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
> in the shared runq if they were being run. They would be dequeued from
> the shared_runq in __dequeue_entity(), which would be called from
> set_next_entity() before they were run. In this case, the
> shard->overload check should be equivalent to the
> !list_empty(&shard->list) check.
>
> Oh, or is the idea that we're not bothering to pull them from the
> shared_runq if they're being woken up and enqueued on an idle core that
> will immediately run them on the next resched path? If so, I wonder if
> we would instead just want to not enqueue the task in the shared_runq at
> all? Consider that if another task comes in on an rq with
> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
> were being woken up on idle cores (nor take the overhead of inserting
> and then immediately removing them from the shared_runq).

So this is the breakdown of outcomes after peeking into the shared_runq
during newidle_balance:

                                     SHARED_RUNQ                  SHARED_RUNQ
                                     + correct cost accounting    + correct cost accounting
                                                                  + rq->avg_idle early bail

tbench throughput (normalized)     : 1.00                         2.47 (146.84%)

attempts                           : 6,560,413                    2,273,334 (-65.35%)
shared_runq was empty              : 2,276,307 [34.70%]           1,379,071 [60.66%] (-39.42%)
successful at pulling task         : 2,557,158 [38.98%]           342,839   [15.08%] (-86.59%)
unsuccessful despite fetching task : 1,726,948 [26.32%]           551,424   [24.26%] (-68.06%)

As you can see, there are more attempts and a greater chance of success
in the case without the rq->avg_idle check upfront. Where the problem
lies (at least what I believe) is that a task is waiting to be enqueued /
has been enqueued while we are trying to migrate a task fetched from the
shared_runq. Thus, instead of just being idle for a short duration and
then running that task, we are now making it wait till we fetch another
task onto the CPU.

I think the scenario changes as follows with shared_runq:

- Current


[Short Idling] [2 tasks] [1 task] [2 tasks]
+-------+ +-------+ +-------+ +-------+
| | | | wakeup | | | |
| CPU 0 | | CPU 1 | on CPU0 | CPU 0 | | CPU 1 |
| | | | --------> | | | |
+-------+ +-------+ +-------+ +-------+

- With shared_runq

[pull from CPU1] [2 tasks] [2 tasks] [1 task]
+-------+ +-------+ +-------+ +-------+
| | | | wakeup | | | |
| CPU 0 | | CPU 1 | on CPU0 | CPU 0 | | CPU 1 |
| | | | --------> | | | |
+-------+ +-------+ +-------+ +-------+

We reach a similar final state but with shared_runq we've paid a price
for task migration. Worst case, the following timeline can happen:

|
CPU0 | [T0 R, T1 Q] [ T0 R ] [newidle_balance] [T4 R ...
|
| pull T1 \ pull T4 /
|
CPU1 | [T3 R] [newidle_balance] [T1 R, T4 Q] [ T1 R ]
| [T4 TTWU]
|

With the rq->avg_idle bailout, it might end up looking like:

|
CPU0 | [ T0 R, T1 Q ] [T1 R ...
|
|
CPU1 | [T3 R] [ I ] [T4 R ...
|
|

If possible, can you check how long avg_idle is when running your
workload? Meanwhile, I believe there are a few workloads that
exhibit the same behavior as tbench (large-scale idling for short
durations). Let me go check if I can see a tbench-like issue there.
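
(For reference, the "rq->avg_idle early bail" in the second column of the
table above is essentially a check of the following shape ahead of the
shared_runq peek; this is illustrative rather than the exact diff:)

	/*
	 * Don't bother peeking into the shared_runq if this rq is expected
	 * to be idle for less time than a task migration costs.
	 */
	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return 0;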

>
>> idle. Work is conserved. What we need to worry about is tasks being in
>> the shared_runq that are queued on their respective CPU. This can only
>> happen if any one of the rq has nr_running >= 2, which is also the point
>> we are setting "shard->overload".
>
> Assuming this is about the "wakeup / enqueue to idle core" case, ok,
> this makes sense. I still think it probably makes more sense to just not
> enqueue in the shared_runq for this case though, which would allow us to
> instead just rely on list_empty(&shard->list).
>
>> Now the situation can change later and all tasks in the shared_runq might be
>> running on their respective CPUs, but "shard->overload" is only cleared when
>> the shared_runq becomes empty. If this is too late, maybe we can clear
>> it if periodic load balancing finds no queuing (somewhere around the
>> time we update nr_idle_scan).
>>
>> So the window where we do not go ahead with popping a task from the
>> shared_runq_shard->list is between the list being empty and at least one
>> of the CPU associated with the shard reporting nr_running >= 2, which is
>> when work conservation is needed.
>
> So, I misread your patch the first time I reviewed it, and for some
> reason thought you were only setting shard->overload on the
> load_balance(). That's obviously not the case, and I now understand it
> better, modulo my points above being clarified.
>
>>>
>>>> With these changes, following are the results for tbench 128-clients:
>>>
>>> Just to make sure I understand, this is to address the contention we're
>>> observing on tbench with 64 - 256 clients, right? That's my
>>> understanding from Gautham's reply in [0].
>>>
>>> [0]: https://lore.kernel.org/all/[email protected]/
>>
>> I'm not sure if Gautham saw a contention with IBS but he did see an
>> abnormal blowup in newidle_balance() counts which he suspected were the
>> cause for the regression. I noticed the rq lock contention after doing a
>> fair bit of surgery. Let me go check if that was the case with vanilla
>> v3 too.
>>
>>>
>>> If so, are we sure this change won't regress other workloads that would
>>> have benefited from the work conservation?
>>
>> I don't think we'll regress any workloads as I explained above because
>> the "overload" flag being 0 almost always (since update/access is not
>> atomic) indicates a case where the tasks cannot be pulled. However, that
>> needs to be tested since there is a small behavior change in
>> shared_runq_pick_next_task(): where previously, if the task was running
>> on a CPU, we would have popped it from the shared_runq, done some lock
>> fiddling before finding out it was running, and some more lock fiddling
>> before the function returned "-1", now with the changes here it'll
>> simply return a "0". Although that is correct, we have seen some
>> interesting cases in the past [1] where random lock contention actually
>> helps certain benchmarks ¯\_(ツ)_/¯
>
> I don't think we need to worry about less lock contention possibly
> hurting benchmarks :-)

Yup :)

>
>> [1] https://lore.kernel.org/all/[email protected]/
>>
>>>
>>> Also, I assume that you don't see the improved contention without this,
>>> even if you include your fix to the newidle_balance() that has us skip
>>> over the <= LLC domain?
>>
>> No improvements! The lock is still very much contended for. I wonder if
>> it could be because of the unlocking and locking of the rq again in
>> shared_runq_pick_next_task() even when the task is on a CPU. Also, since
>> it returns -1 for this case, pick_next_task_fair() will return
>> RETRY_TASK, which can have further implications.
>
> Yeah, I could see it being an issue if we're essentially thrashing tasks
> in the shared_runq that are just temporarily enqueued in the shared_runq
> between activate and doing a resched pass on an idle core.
>
> Unfortunately, I don't think we have any choice but to drop and
> reacquire the rq lock. It's not safe to look at task_cpu(p) without
> pi_lock, and we can't safely acquire that without dropping the rq lock.

True that. We wouldn't want to run into a deadlock scenario or cause
more lock contention with double locking :(

>
> Thanks,
> David

--
Thanks and Regards,
Prateek

2023-10-04 17:21:04

by David Vernet

Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

On Wed, Oct 04, 2023 at 09:51:18AM +0530, K Prateek Nayak wrote:
> Hello David,

Hello Prateek,

>
> Thank you for answering my queries, I'll leave some data below to
> answer yours.
>
> On 9/29/2023 10:31 PM, David Vernet wrote:
> > On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
> >> Hello David,
> >>
> >> On 9/1/2023 12:41 AM, David Vernet wrote:
> >>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >>>
> >>> Hi Prateek,
> >>>
> >>>> Even with the two patches, I still observe the following lock
> >>>> contention when profiling the tbench 128-clients run with IBS:
> >>>>
> >>>>   - 12.61% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >>>>      - 10.94% native_queued_spin_lock_slowpath
> >>>>         - 10.73% _raw_spin_lock
> >>>>            - 9.57% __schedule
> >>>>                 schedule_idle
> >>>>                 do_idle
> >>>>               + cpu_startup_entry
> >>>>            - 0.82% task_rq_lock
> >>>>                 newidle_balance
> >>>>                 pick_next_task_fair
> >>>>                 __schedule
> >>>>                 schedule_idle
> >>>>                 do_idle
> >>>>               + cpu_startup_entry
> >>>>
> >>>> Since David mentioned the rq->avg_idle check is probably not the right
> >>>> step towards the solution, this experiment introduces a per-shard
> >>>> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
> >>>> flag notifies of the possibility of one or more rqs covered by the
> >>>> shard's domain having a queued task. The shard's overload flag is set
> >>>> at the same time as "rq->rd->overload", and is cleared when the shard's
> >>>> list is found to be empty.
> >>>
> >>> I think this is an interesting idea, but I feel that it's still working
> >>> against the core proposition of SHARED_RUNQ, which is to enable work
> >>> conservation.
> >>
> >> I don't think so! Work conservation is possible if there is an
> >> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
> >> have 16 CPUs, 15 of which are running these 15 tasks, and one going
> >
> > I'm not sure I'm fully following. Those 15 tasks would not be enqueued
> > in the shared runq if they were being run. They would be dequeued from
> > the shared_runq in __dequeue_entity(), which would be called from
> > set_next_entity() before they were run. In this case, the
> > shard->overload check should be equivalent to the
> > !list_empty(&shard->list) check.
> >
> > Oh, or is the idea that we're not bothering to pull them from the
> > shared_runq if they're being woken up and enqueued on an idle core that
> > will immediately run them on the next resched path? If so, I wonder if
> > we would instead just want to not enqueue the task in the shared_runq at
> > all? Consider that if another task comes in on an rq with
> > rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
> > were being woken up on idle cores (nor take the overhead of inserting
> > and then immediately removing them from the shared_runq).

Friendly ping on this point. This is the only scenario where I could see
the overload check helping, so I want to make sure I'm understanding it
and am correct in that just avoiding enqueueing the task in the shard in
this scenario would give us the same benefit.

> So this is the breakdown of outcomes after peeking into the shared_runq
> during newidle_balance:
>
>                                      SHARED_RUNQ                  SHARED_RUNQ
>                                      + correct cost accounting    + correct cost accounting
>                                                                   + rq->avg_idle early bail
>
> tbench throughput (normalized)     : 1.00                         2.47 (146.84%)
>
> attempts                           : 6,560,413                    2,273,334 (-65.35%)
> shared_runq was empty              : 2,276,307 [34.70%]           1,379,071 [60.66%] (-39.42%)
> successful at pulling task         : 2,557,158 [38.98%]           342,839   [15.08%] (-86.59%)
> unsuccessful despite fetching task : 1,726,948 [26.32%]           551,424   [24.26%] (-68.06%)
>
> As you can see, there are more attempts and a greater chance of success
> in the case without the rq->avg_idle check upfront. Where the problem
> lies (at least what I believe) is that a task is waiting to be enqueued /
> has been enqueued while we are trying to migrate a task fetched from the
> shared_runq. Thus, instead of just being idle for a short duration and
> then running that task, we are now making it wait till we fetch another
> task onto the CPU.
>
> I think the scenario changes as follows with shared_runq:
>
> - Current
>
>
> [Short Idling] [2 tasks] [1 task] [2 tasks]
> +-------+ +-------+ +-------+ +-------+
> | | | | wakeup | | | |
> | CPU 0 | | CPU 1 | on CPU0 | CPU 0 | | CPU 1 |
> | | | | --------> | | | |
> +-------+ +-------+ +-------+ +-------+
>
> - With shared_runq
>
> [pull from CPU1] [2 tasks] [2 tasks] [1 task]
> +-------+ +-------+ +-------+ +-------+
> | | | | wakeup | | | |
> | CPU 0 | | CPU 1 | on CPU0 | CPU 0 | | CPU 1 |
> | | | | --------> | | | |
> +-------+ +-------+ +-------+ +-------+
>
> We reach a similar final state but with shared_runq we've paid a price
> for task migration. Worst case, the following timeline can happen:
>
> |
> CPU0 | [T0 R, T1 Q] [ T0 R ] [newidle_balance] [T4 R ...
> |
> | pull T1 \ pull T4 /
> |
> CPU1 | [T3 R] [newidle_balance] [T1 R, T4 Q] [ T1 R ]
> | [T4 TTWU]
> |
>
> With the rq->avg_idle bailout, it might end up looking like:
>
> |
> CPU0 | [ T0 R, T1 Q ] [T1 R ...
> |
> |
> CPU1 | [T3 R] [ I ] [T4 R ...
> |
> |

This certainly seems possible, and wouldn't be terribly surprising or
unexpected. Taking a step back here, I want to be clear that I do
understand the motivation for including the rq->avg_idle check for
SHARED_RUNQ; even just conceptually, and regardless of the numbers you
and others have observed for workloads that do these short sleeps. The
whole idea behind that check is that we want to avoid doing
newidle_balance() if the overhead of doing newidle_balance() would
exceed the amount of time that a task was blocked. Makes sense. Why
would you take the overhead of balancing if you have reason to believe
that a task is likely to be idle for less time than it takes to do a
migration?

There's certainly a reasonable argument for why that should also apply
to SHARED_RUNQ. If the overhead of doing a SHARED_RUNQ migration is
greater than the amount of time that an sd is expected to be idle, then
it's not worth bothering with SHARED_RUNQ either. On the other hand, the
claim of SHARED_RUNQ is that it's faster than doing a regular balance
pass, because we're doing an O(# shards) iteration to find tasks (before
sharding it was O(1)), rather than O(# CPUs). So if we also do the
rq->avg_idle check, that basically means that SHARED_RUNQ becomes a
cache for a full load_balance() call.

Maybe that makes sense and is ultimately the correct design /
implementation for the feature. I'm not fundamentally opposed to that,
but I think we should be cognizant of the tradeoff we're making. If we
don't include this rq->avg_idle check, then some workloads will regress
because we're doing excessive migrations, but if we do check it, then
others will also regress because we're doing insufficient migrations due
to incorrectly assuming that an rq won't be idle for long. On yet
another hand, maybe it's fine to allow users to work around that by
setting sysctl_sched_migration_cost_ns = 0? That only sort of works,
because we ignore that and set rq->max_idle_balance_cost = curr_cost in
newidle_balance() if we end up doing a balance pass. I also know that
Peter and others discourage the use of these debugfs knobs, so I'm not
sure it's even applicable to point that out as a workaround.

And so hopefully the problem starts to become clear. It doesn't take
long for us to get mired in heuristics that make it difficult to
reason about the expected behavior of the feature, and also difficult to
reason about future changes as these heuristics have now all crossed
streams. Maybe that's OK, and is preferable to the alternative. My
personal opinion, however, is that it's preferable to provide users with
knobs that do straightforward things that are independent from existing
heuristics and knobs which were added for other circumstances. I'd
rather have confidence that I understand how a feature is supposed to
work, and can easily reason about when it's stupid (or not) to use it,
vs. have an expectation for it to not regress workloads in any scenario.

Note that this doesn't mean we can't make my patches less dumb. I think
your suggestions to e.g. check the overload flag (or possibly even
better to just not enqueue in a shard if the rq isn't overloaded),
re-check ttwu->pending after failing to find a task in the shard, etc
make complete sense. There's no downside -- we're just avoiding
pointless work. It's the heuristics like checking rq->avg_idle that
really worry me.
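
For example, the cheap checks could look roughly like this in the newidle
path (a sketch only; shard->overload and shared_runq_pick_next_task()
follow the naming used in these patches, the rest is illustrative):

	/* Only peek into the shard if it can plausibly have queued tasks. */
	if (READ_ONCE(shard->overload) && !list_empty_careful(&shard->list))
		pulled = shared_runq_pick_next_task(this_rq, rf);

	if (pulled)
		return pulled;

	/*
	 * Nothing usable in the shard; if a wakeup is already on its way to
	 * this CPU, skip the (much more expensive) full balance pass too.
	 */
	if (READ_ONCE(this_rq->ttwu_pending))
		return 0;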

Peter -- I think it would be helpful if you could weigh in here just to
provide your thoughts on this more "philosophical" question.

> If possible, can you check how long avg_idle is when running your
> workload? Meanwhile, I believe there are a few workloads that
> exhibit the same behavior as tbench (large-scale idling for short
> durations). Let me go check if I can see a tbench-like issue there.

Sure thing, in the meantime I'll test this out on HHVM. I've actually
been working on getting a build + testbed ready for a few days, so
hopefully it won't take much longer to get some results. Even if it
turns out that this works great for HHVM, I'd ideally like to get
Peter's and others' thoughts on the above.

Thanks,
David

2023-10-05 13:56:53

by K Prateek Nayak

Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

Hello David,

On 10/4/2023 10:50 PM, David Vernet wrote:
> On Wed, Oct 04, 2023 at 09:51:18AM +0530, K Prateek Nayak wrote:
>> Hello David,
>
> Hello Prateek,
>
>>
>> Thank you for answering my queries, I'll leave some data below to
>> answer yours.
>>
>> On 9/29/2023 10:31 PM, David Vernet wrote:
>>> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>>>> Hello David,
>>>>
>>>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>>>
>>>>> Hi Prateek,
>>>>>
>>>>>> Even with the two patches, I still observe the following lock
>>>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>>>
>>>>>>   - 12.61% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>>>         - 10.73% _raw_spin_lock
>>>>>>            - 9.57% __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>            - 0.82% task_rq_lock
>>>>>>                 newidle_balance
>>>>>>                 pick_next_task_fair
>>>>>>                 __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>
>>>>>> Since David mentioned the rq->avg_idle check is probably not the right
>>>>>> step towards the solution, this experiment introduces a per-shard
>>>>>> "overload" flag. Similar to "rq->rd->overload", the per-shard overload
>>>>>> flag notifies of the possibility of one or more rqs covered by the
>>>>>> shard's domain having a queued task. The shard's overload flag is set
>>>>>> at the same time as "rq->rd->overload", and is cleared when the shard's
>>>>>> list is found to be empty.
>>>>>
>>>>> I think this is an interesting idea, but I feel that it's still working
>>>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>>>> conservation.
>>>>
>>>> I don't think so! Work conservation is possible if there is an
>>>> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
>>>> have 16 CPUs, 15 of which are running these 15 tasks, and one going
>>>
>>> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
>>> in the shared runq if they were being run. They would be dequeued from
>>> the shared_runq in __dequeue_entity(), which would be called from
>>> set_next_entity() before they were run. In this case, the
>>> shard->overload check should be equivalent to the
>>> !list_empty(&shard->list) check.
>>>
>>> Oh, or is the idea that we're not bothering to pull them from the
>>> shared_runq if they're being woken up and enqueued on an idle core that
>>> will immediately run them on the next resched path? If so, I wonder if
>>> we would instead just want to not enqueue the task in the shared_runq at
>>> all? Consider that if another task comes in on an rq with
>>> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
>>> were being woken up on idle cores (nor take the overhead of inserting
>>> and then immediately removing them from the shared_runq).
>
> Friendly ping on this point. This is the only scenario where I could see
> the overload check helping, so I want to make sure I'm understanding it
> and am correct in that just avoiding enqueueing the task in the shard in
> this scenario would give us the same benefit.

Whoops! I missed answering this. So the original motivation for
'shard->overload' was that there is rq lock contention, very likely as
a result of shared_runq_pick_next_task() trying to grab a remote rq's
lock. Looking at shared_runq_pick_next_task(), the
"!task_on_cpu(src_rq, p)" criterion led me to believe we might end up
enqueuing a task that is running on a CPU, but now that I take a look
at shared_runq_enqueue_task() being called from __enqueue_entity(),
this should be a very rare scenario.

However, if a running task were often enqueued into a shared_runq, a
cleared 'shard->overload' would be an indication that none of the
runqueues covered by the shard are overloaded and hence, peeking into
the shard can be skipped. Let me see if I can grab some more stats to
verify what exactly is happening.
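
For reference, based on the description earlier in the thread, the flag
handling would look roughly like the following (a sketch only; how the
shard is reached from the rq is hypothetical):

	/* Set alongside rq->rd->overload when an rq gains a waiting task: */
	if (prev_nr < 2 && rq->nr_running >= 2) {
		if (!READ_ONCE(rq->rd->overload))
			WRITE_ONCE(rq->rd->overload, 1);
		if (shard && !READ_ONCE(shard->overload))
			WRITE_ONCE(shard->overload, 1);
	}

	/* Cleared once the shard's list is found to be empty: */
	if (list_empty_careful(&shard->list))
		WRITE_ONCE(shard->overload, 0);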

>
>> So this is the breakdown of outcomes after peeking into the shared_runq
>> during newidle_balance:
>>
>>                                      SHARED_RUNQ                  SHARED_RUNQ
>>                                      + correct cost accounting    + correct cost accounting
>>                                                                   + rq->avg_idle early bail
>>
>> tbench throughput (normalized)     : 1.00                         2.47 (146.84%)
>>
>> attempts                           : 6,560,413                    2,273,334 (-65.35%)
>> shared_runq was empty              : 2,276,307 [34.70%]           1,379,071 [60.66%] (-39.42%)
>> successful at pulling task         : 2,557,158 [38.98%]           342,839   [15.08%] (-86.59%)
>> unsuccessful despite fetching task : 1,726,948 [26.32%]           551,424   [24.26%] (-68.06%)
>>
>> As you can see, there are more attempts and a greater chance of success
>> in the case without the rq->avg_idle check upfront. Where the problem
>> lies (at least what I believe) is that a task is waiting to be enqueued /
>> has been enqueued while we are trying to migrate a task fetched from the
>> shared_runq. Thus, instead of just being idle for a short duration and
>> then running that task, we are now making it wait till we fetch another
>> task onto the CPU.
>>
>> I think the scenario changes as follows with shared_runq:
>>
>> - Current
>>
>>
>> [Short Idling] [2 tasks] [1 task] [2 tasks]
>> +-------+ +-------+ +-------+ +-------+
>> | | | | wakeup | | | |
>> | CPU 0 | | CPU 1 | on CPU0 | CPU 0 | | CPU 1 |
>> | | | | --------> | | | |
>> +-------+ +-------+ +-------+ +-------+
>>
>> - With shared_runq
>>
>> [pull from CPU1] [2 tasks] [2 tasks] [1 task]
>> +-------+ +-------+ +-------+ +-------+
>> | | | | wakeup | | | |
>> | CPU 0 | | CPU 1 | on CPU0 | CPU 0 | | CPU 1 |
>> | | | | --------> | | | |
>> +-------+ +-------+ +-------+ +-------+
>>
>> We reach a similar final state but with shared_runq we've paid a price
>> for task migration. Worst case, the following timeline can happen:
>>
>> |
>> CPU0 | [T0 R, T1 Q] [ T0 R ] [newidle_balance] [T4 R ...
>> |
>> | pull T1 \ pull T4 /
>> |
>> CPU1 | [T3 R] [newidle_balance] [T1 R, T4 Q] [ T1 R ]
>> | [T4 TTWU]
>> |
>>
>> With the rq->avg_idle bailout, it might end up looking like:
>>
>> |
>> CPU0 | [ T0 R, T1 Q ] [T1 R ...
>> |
>> |
>> CPU1 | [T3 R] [ I ] [T4 R ...
>> |
>> |
>
> This certainly seems possible, and wouldn't be terribly surprising or
> unexpected. Taking a step back here, I want to be clear that I do
> understand the motivation for including the rq->avg_idle check for
> SHARED_RUNQ; even just conceptually, and regardless of the numbers you
> and others have observed for workloads that do these short sleeps. The
> whole idea behind that check is that we want to avoid doing
> newidle_balance() if the overhead of doing newidle_balance() would
> exceed the amount of time that a task was blocked. Makes sense. Why
> would you take the overhead of balancing if you have reason to believe
> that a task is likely to be idle for less time than it takes to do a
> migration?
>
> There's certainly a reasonable argument for why that should also apply
> to SHARED_RUNQ. If the overhead of doing a SHARED_RUNQ migration is
> greater than the amount of time that an sd is expected to be idle, then
> it's not worth bothering with SHARED_RUNQ either. On the other hand, the
> claim of SHARED_RUNQ is that it's faster than doing a regular balance
> pass, because we're doing an O(# shards) iteration to find tasks (before
> sharding it was O(1)), rather than O(# CPUs). So if we also do the
> rq->avg_idle check, that basically means that SHARED_RUNQ becomes a
> cache for a full load_balance() call.
>
> Maybe that makes sense and is ultimately the correct design /
> implementation for the feature. I'm not fundamentally opposed to that,
> but I think we should be cognizant of the tradeoff we're making. If we
> don't include this rq->avg_idle check, then some workloads will regress
> because we're doing excessive migrations, but if we do check it, then
> others will also regress because we're doing insufficient migrations due
> to incorrectly assuming that an rq won't be idle for long. On yet
> another hand, maybe it's fine to allow users to work around that by
> setting sysctl_sched_migration_cost_ns = 0? That only sort of works,
> because we ignore that and set rq->max_idle_balance_cost = curr_cost in
> newidle_balance() if we end up doing a balance pass. I also know that
> Peter and others discourage the use of these debugfs knobs, so I'm not
> sure it's even applicable to point that out as a workaround.
>
> And so hopefully the problem starts to become clear. It doesn't take
> long for us to get mired in heuristics that make it difficult to
> reason about the expected behavior of the feature, and also difficult to
> reason about future changes as these heuristics have now all crossed
> streams. Maybe that's OK, and is preferable to the alternative. My
> personal opinion, however, is that it's preferable to provide users with
> knobs that do straightforward things that are independent from existing
> heuristics and knobs which were added for other circumstances. I'd
> rather have confidence that I understand how a feature is supposed to
> work, and can easily reason about when it's stupid (or not) to use it,
> vs. have an expectation for it to not regress workloads in any scenario.
>
> Note that this doesn't mean we can't make my patches less dumb. I think
> your suggestions to e.g. check the overload flag (or possibly even
> better to just not enqueue in a shard if the rq isn't overloaded),
> re-check ttwu->pending after failing to find a task in the shard, etc
> make complete sense. There's no downside -- we're just avoiding
> pointless work. It's the heuristics like checking rq->avg_idle that
> really worry me.

I agree since avg_idle is merely a prediction that may or may not be
true.

>
> Peter -- I think it would be helpful if you could weigh in here just to
> provide your thoughts on this more "philosophical" question.
>
>> If possible, can you check how long avg_idle is when running your
>> workload? Meanwhile, I believe there are a few workloads that
>> exhibit the same behavior as tbench (large-scale idling for short
>> durations). Let me go check if I can see a tbench-like issue there.
>
> Sure thing, in the meantime I'll test this out on HHVM. I've actually
> been working on getting a build + testbed ready for a few days, so
> hopefully it won't take much longer to get some results. Even if it
> turns out that this works great for HHVM, I'd ideally like to get
> Peter's and others' thoughts on the above.

I'll gather some more data too in the meantime :)

>
> Thanks,
> David

--
Thanks and Regards,
Prateek