2023-09-27 12:13:53

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

Hi Prateek,

On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> Hello David,
>
> Some more test results (although this might be slightly irrelevant with
> next version around the corner)
>
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >
> -> With EEVDF
>
> o tl;dr
>
> - Same as what was observed without EEVDF but shared_runq shows
> serious regression with multiple more variants of tbench and
> netperf now.
>
> o Kernels
>
> eevdf : tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> shared_runq : eevdf + correct time accounting with v3 of the series without any other changes
> shared_runq_idle_check : shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> (the rd->overload check still remains below the shared_runq access)
>

I did not see any obvious regression on a Sapphire Rapids server and it seems that
the result on your platform suggests that C/S workload could be impacted
by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
(no shared resource between tasks if I understand correctly) could benefit from
shared_runq a lot. This makes me wonder if we can let shared_runq skip the C/S tasks.
The question would be how to define C/S tasks. At first thought:
A only wakes up B, and B only wakes up A, then they could be regarded as a pair
of C/S
(A->last_wakee == B && B->last_wakee == A &&
A->wakee_flips <= 1 && B->wakee_flips <= 1)
But for netperf/tbench, this does not apply, because netperf client leverages kernel
thread(workqueue) to wake up the netserver, that is A wakes up kthread T, then T
wakes up B. Unless we have a chain, we can not detect this wakeup behavior.

thanks,
Chenyu


2023-09-27 15:38:54

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

Hello Chenyu,

On 9/27/2023 12:29 PM, Chen Yu wrote:
> Hi Prateek,
>
> On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> Some more test results (although this might be slightly irrelevant with
>> next version around the corner)
>>
>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>
>> -> With EEVDF
>>
>> o tl;dr
>>
>> - Same as what was observed without EEVDF but shared_runq shows
>> serious regression with multiple more variants of tbench and
>> netperf now.
>>
>> o Kernels
>>
>> eevdf : tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
>> shared_runq : eevdf + correct time accounting with v3 of the series without any other changes
>> shared_runq_idle_check : shared_runq + move the rq->avg_idle check before peeking into the shared_runq
>> (the rd->overload check still remains below the shared_runq access)
>>
>
> I did not see any obvious regression on a Sapphire Rapids server and it seems that
> the result on your platform suggests that C/S workload could be impacted
> by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> (no shared resource between tasks if I understand correctly) could benefit from
> shared_runq a lot.

Yup that would be my guess too since HHVM seems to benefit purely from
more aggressive work conservation. (unless it leads to some second order
effect)

> This makes me wonder if we can let shared_runq skip the C/S tasks.
> The question would be how to define C/S tasks. At first thought:
> A only wakes up B, and B only wakes up A, then they could be regarded as a pair
> of C/S
> (A->last_wakee == B && B->last_wakee == A &&
> A->wakee_flips <= 1 && B->wakee_flips <= 1)
> But for netperf/tbench, this does not apply, because netperf client leverages kernel
> thread(workqueue) to wake up the netserver, that is A wakes up kthread T, then T
> wakes up B. Unless we have a chain, we can not detect this wakeup behavior.

Yup, unless we have a notion of chain/flow, or until we can somehow
account the wakeup of client using the kthread to the server, this will
be hard to detect.

I can give it a try with the SIS_PAIR condition you shared above. Let
me know.

>
> thanks,
> Chenyu

--
Thanks and Regards,
Prateek

2023-09-28 14:11:14

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

On 2023-09-27 at 14:06:41 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 9/27/2023 12:29 PM, Chen Yu wrote:
> > Hi Prateek,
> >
> > On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> >> Hello David,
> >>
> >> Some more test results (although this might be slightly irrelevant with
> >> next version around the corner)
> >>
> >> On 9/1/2023 12:41 AM, David Vernet wrote:
> >>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >>>

[snip]

> > This makes me wonder if we can let shared_runq skip the C/S tasks.
> > The question would be how to define C/S tasks. At first thought:
> > A only wakes up B, and B only wakes up A, then they could be regarded as a pair
> > of C/S
> > (A->last_wakee == B && B->last_wakee == A &&
> > A->wakee_flips <= 1 && B->wakee_flips <= 1)
> > But for netperf/tbench, this does not apply, because netperf client leverages kernel
> > thread(workqueue) to wake up the netserver, that is A wakes up kthread T, then T
> > wakes up B. Unless we have a chain, we can not detect this wakeup behavior.
>
> Yup, unless we have a notion of chain/flow, or until we can somehow
> account the wakeup of client using the kthread to the server, this will
> be hard to detect.
>
> I can give it a try with the SIS_PAIR condition you shared above. Let
> me know.

Thanks Krateek, but I don't think SIS_PAIR could bring benefit to the netperf/tbench
since SIS_PAIR can not detect the chain wakeup.

thanks,
Chenyu

2023-10-03 21:05:31

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

On Wed, Sep 27, 2023 at 02:59:29PM +0800, Chen Yu wrote:
> Hi Prateek,

Hi Chenyu,

> On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> > Hello David,
> >
> > Some more test results (although this might be slightly irrelevant with
> > next version around the corner)
> >
> > On 9/1/2023 12:41 AM, David Vernet wrote:
> > > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > >
> > -> With EEVDF
> >
> > o tl;dr
> >
> > - Same as what was observed without EEVDF but shared_runq shows
> > serious regression with multiple more variants of tbench and
> > netperf now.
> >
> > o Kernels
> >
> > eevdf : tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> > shared_runq : eevdf + correct time accounting with v3 of the series without any other changes
> > shared_runq_idle_check : shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> > (the rd->overload check still remains below the shared_runq access)
> >
>
> I did not see any obvious regression on a Sapphire Rapids server and it seems that
> the result on your platform suggests that C/S workload could be impacted
> by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> (no shared resource between tasks if I understand correctly) could benefit from

Correct, hhvmworkers are largely independent, though they do sometimes
synchronize, and they also sometimes rely on I/O happening in other
tasks.

> shared_runq a lot. This makes me wonder if we can let shared_runq skip the C/S tasks.

I'm also open to this possibility, but I worry that we'd be going down
the same rabbit hole as what fair.c does already, which is use
heuristics to determine when something should or shouldn't be migrated,
etc. I really do feel that there's value in SHARED_RUNQ providing
consistent and predictable work conservation behavior.

On the other hand, it's clear that there are things we can do to improve
performance for some of these client/server workloads that hammer the
runqueue on larger CCXs / sockets. If we can avoid those regressions
while still having reasonably high confidence that work conservation
won't disproportionately suffer, I'm open to us making some tradeoffs
and/or adding a bit of complexity to avoid some of this unnecessary
contention.

I think it's probably about time for v4 to be sent out. What do you
folks think about including:

1. A few various fixes / tweaks from v3, e.g. avoiding using the wrong
shard on the task_dead_fair() path if the feature is disabled before
a dying task is dequeued from a shard, fixing the build issues
pointed out by lkp, etc.
2. Fix the issue that Prateek pointed out in [0] where we're not
properly skipping the LLC domain due to using the for_each_domain()
macro (this is also addressed by (3)).
3. Apply Prateek's suggestions (in some form) in [1] and [2]. For [2],
I'm inclined to just avoid enqueuing a task on a shard if the rq it's on
has nr_running == 0. Or, we can just add his patch to the series
directly if it turns out that just looking at rq->nr_running is
insufficient.

[0]: https://lore.kernel.org/all/[email protected]/
[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/all/[email protected]/

Prateek -- what do you think about this? I want to make sure you get
credit for your contributions to this series, so let me know how you'd
like to apply these changes. [1] essentially just improves much of the
logic from [3], so I'm not sure it would make sense to include it as a
separate patch. I'm happy to include a Co-authored-by tag, or to just
explicitly credit your contributions in the commit summary if you'd
prefer that.

[3]: https://lore.kernel.org/all/[email protected]/

Thanks,
David

2023-10-07 02:11:07

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

Hi David,

On 2023-10-03 at 16:05:11 -0500, David Vernet wrote:
> On Wed, Sep 27, 2023 at 02:59:29PM +0800, Chen Yu wrote:
> > Hi Prateek,
>
> Hi Chenyu,
>
> > On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> > > Hello David,
> > >
> > > Some more test results (although this might be slightly irrelevant with
> > > next version around the corner)
> > >
> > > On 9/1/2023 12:41 AM, David Vernet wrote:
> > > > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > > >
> > > -> With EEVDF
> > >
> > > o tl;dr
> > >
> > > - Same as what was observed without EEVDF but shared_runq shows
> > > serious regression with multiple more variants of tbench and
> > > netperf now.
> > >
> > > o Kernels
> > >
> > > eevdf : tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> > > shared_runq : eevdf + correct time accounting with v3 of the series without any other changes
> > > shared_runq_idle_check : shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> > > (the rd->overload check still remains below the shared_runq access)
> > >
> >
> > I did not see any obvious regression on a Sapphire Rapids server and it seems that
> > the result on your platform suggests that C/S workload could be impacted
> > by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> > (no shared resource between tasks if I understand correctly) could benefit from
>
> Correct, hhvmworkers are largely independent, though they do sometimes
> synchronize, and they also sometimes rely on I/O happening in other
> tasks.
>
> > shared_runq a lot. This makes me wonder if we can let shared_runq skip the C/S tasks.
>
> I'm also open to this possibility, but I worry that we'd be going down
> the same rabbit hole as what fair.c does already, which is use
> heuristics to determine when something should or shouldn't be migrated,
> etc. I really do feel that there's value in SHARED_RUNQ providing
> consistent and predictable work conservation behavior.
>
> On the other hand, it's clear that there are things we can do to improve
> performance for some of these client/server workloads that hammer the
> runqueue on larger CCXs / sockets. If we can avoid those regressions
> while still having reasonably high confidence that work conservation
> won't disproportionately suffer, I'm open to us making some tradeoffs
> and/or adding a bit of complexity to avoid some of this unnecessary
> contention.
>

Since I did not observe any regression(although I did not test hackbench
yet) on the latest version you sent to me, I'm OK with postponing the
client/server optimization to make the patchset simple, and Prateek
has other proposal to deal with the regression.

> I think it's probably about time for v4 to be sent out. What do you
> folks think about including:
>

It's OK for me and I can launch the test once the latest version is released.

thanks,
Chenyu