LinuxLists.cc - [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

2023-02-14 15:24:57

Subject: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

CPU cfs bandwidth controller uses hrtimer called period timer. Quota is
refilled upon the timer expiry and re-started when there are running tasks
within the cgroup. Each cgroup has a separate period timer which manages
the period and quota for that cgroup.

start_cfs_bandwidth calls hrtimer_forward_now which set the expiry value
based on the below logic. expiry = $initial_value + $N * $period

However, start_cfs_bandwidth doesn't set any initial value. Hence
multiple such timers would align on expiry if their period value is
same. This happens when there are multiple cgroups and each has runnable
task. Upon expiry each timer will unthrottle respective rq's and all the
rq would start at same time, competing for CPU time and use all
the SMT threads likely.

There is performance gain that can be achieved here if the timers are
interleaved when the utilization of each CPU cgroup is low and total
utilization of all the CPU cgroup's is less than 50%. This is likely
true when using containers. If the timers are interleaved, then the
unthrottled cgroup can run freely without many context switches and can
also benefit from SMT Folding[1]. This effect will be further amplified in
SPLPAR environment[2] as this would cause less hypervisor preemptions.
There can be benefit due to less IPI storm as well. Docker provides a
config option of period timer value, whereas the kubernetes only
provides millicore option. Hence with typical deployment period values
will be set to 100ms as kubernetes millicore will set the quota
accordingly without altering period values.

[1] SMT folding is a mechanism where processor core is reconfigured to
lower SMT mode to improve performance when some sibling threads are
idle. In a SMT8 core, when only one or two threads are running on a
core, we get the best throughput compared to running all 8 threads.

[2] SPLPAR is an Shared Processor Logical PARtition. There can be many
SPLPARs running on the same physical machine sharing the CPU resources.
One SPLPAR can consume all CPU resource it can, if the other SPLPARs are
idle. Processors within the SPLPAR are called vCPU. vCPU can be higher
than CPU. Hence at an instance of time if there are more requested vCPU
than CPU, then vCPU can be preempted. When the timers align, there will
be spike in requested vCPU when the timers expire. This can lead to
preemption when the other SPLPARs are not idle.

Since we are trading off between the performance vs power here,
benchmarked both the numbers. Frequency is set to 3.00Ghz and
socket power has been measured. Ran the stress-ng with two
cgroups. The numbers are with patch and without patch on a Power
system with SMT=8. Below table shows time taken by each group to
complete. Here each cgroup is assigned 25% runtime. period value is
set to 100ms.

workload: stress-ng --cpu=4 --cpu-ops=50000
data shows time it took to complete in seconds for each run.
Tried to interleave by best effort with the patch.
1CG - time to finish when only 1 cgroup is running.
2CG - time to finish when 2 cgroups are running together.
power - power consumed in Watts for the socket running the workload.
Performance gain is indicated in +ve percentage numbers and power
increase is indicated in -ve numbers. 1CG numbers are same as expected.
We are looking at improvement in 2CG Mainly.

6.2.rc5 with patch
1CG power 2CG power | 1CG power 2CG power
1Core 218 44 315 46 | 219 45 277(+12%) 47(-2%)
219 43 315 45 | 219 44 244(+22%) 48(-6%)
|
2Core 108 48 158 52 | 109 50 114(+26%) 59(-13%)
109 49 157 52 | 109 49 136(+13%) 56(-7%)
|
4Core 60 59 89 65 | 62 58 72(+19%) 68(-5%)
61 61 90 65 | 62 60 68(+24%) 73(-12%)
|
8Core 33 77 48 83 | 33 77 37(+23%) 91(-10%)
33 77 48 84 | 33 77 38(+21%) 90(-7%)

There is no benefit at higher utilization of 50% or more. There is no
degradation also.

This is RFC PATCH V2, where the code has been shifted from hrtimer to
sched. This patch sets an initial value as multiple of period/10.
Here timers can still align if the time started the cgroup is within the
period/10 interval. On a real life workload, time gives sufficient
randomness. There can be a better interleaving by being more
deterministic. For example, when there are 2 cgroups, they should
have initial value of 0/50ms or 10/60ms so on. When there are 3 cgroups,
0/3/6ms or 1/4/7ms etc. That is more complicated as it has to account
for cgroup addition/deletion and accuracy w.r.t to period/quota.
If that approach is better here, then will come up with that patch.

Signed-off-by: Shrikanth Hegde<[email protected]>
---
kernel/sched/fair.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff4dbbae3b10..7b69c329e05d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5939,14 +5939,25 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
- lockdep_assert_held(&cfs_b->lock);
+ struct hrtimer *period_timer = &cfs_b->period_timer;
+ s64 incr = ktime_to_ns(cfs_b->period) / 10;
+ ktime_t delta;
+ u64 orun = 1;

+ lockdep_assert_held(&cfs_b->lock);
if (cfs_b->period_active)
return;

cfs_b->period_active = 1;
- hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+ delta = ktime_sub(period_timer->base->get_time(),
+ hrtimer_get_expires(period_timer));
+ if (unlikely(delta >= cfs_b->period)) {
+ orun = ktime_divns(delta, incr);
+ hrtimer_add_expires_ns(period_timer, incr * orun);
+ }
+
+ hrtimer_forward_now(period_timer, cfs_b->period);
+ hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
}

static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
--
2.31.1

2023-02-14 21:37:34

by Benjamin Segall

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

Shrikanth Hegde <[email protected]> writes:

> CPU cfs bandwidth controller uses hrtimer called period timer. Quota is
> refilled upon the timer expiry and re-started when there are running tasks
> within the cgroup. Each cgroup has a separate period timer which manages
> the period and quota for that cgroup.
>
> start_cfs_bandwidth calls hrtimer_forward_now which set the expiry value
> based on the below logic. expiry = $initial_value + $N * $period
>
> However, start_cfs_bandwidth doesn't set any initial value. Hence
> multiple such timers would align on expiry if their period value is
> same. This happens when there are multiple cgroups and each has runnable
> task. Upon expiry each timer will unthrottle respective rq's and all the
> rq would start at same time, competing for CPU time and use all
> the SMT threads likely.
>
> There is performance gain that can be achieved here if the timers are
> interleaved when the utilization of each CPU cgroup is low and total
> utilization of all the CPU cgroup's is less than 50%. This is likely
> true when using containers. If the timers are interleaved, then the
> unthrottled cgroup can run freely without many context switches and can
> also benefit from SMT Folding[1]. This effect will be further amplified in
> SPLPAR environment[2] as this would cause less hypervisor preemptions.
> There can be benefit due to less IPI storm as well. Docker provides a
> config option of period timer value, whereas the kubernetes only
> provides millicore option. Hence with typical deployment period values
> will be set to 100ms as kubernetes millicore will set the quota
> accordingly without altering period values.
>
> [1] SMT folding is a mechanism where processor core is reconfigured to
> lower SMT mode to improve performance when some sibling threads are
> idle. In a SMT8 core, when only one or two threads are running on a
> core, we get the best throughput compared to running all 8 threads.
>
> [2] SPLPAR is an Shared Processor Logical PARtition. There can be many
> SPLPARs running on the same physical machine sharing the CPU resources.
> One SPLPAR can consume all CPU resource it can, if the other SPLPARs are
> idle. Processors within the SPLPAR are called vCPU. vCPU can be higher
> than CPU. Hence at an instance of time if there are more requested vCPU
> than CPU, then vCPU can be preempted. When the timers align, there will
> be spike in requested vCPU when the timers expire. This can lead to
> preemption when the other SPLPARs are not idle.
>
> Since we are trading off between the performance vs power here,
> benchmarked both the numbers. Frequency is set to 3.00Ghz and
> socket power has been measured. Ran the stress-ng with two
> cgroups. The numbers are with patch and without patch on a Power
> system with SMT=8. Below table shows time taken by each group to
> complete. Here each cgroup is assigned 25% runtime. period value is
> set to 100ms.
>
> workload: stress-ng --cpu=4 --cpu-ops=50000
> data shows time it took to complete in seconds for each run.
> Tried to interleave by best effort with the patch.
> 1CG - time to finish when only 1 cgroup is running.
> 2CG - time to finish when 2 cgroups are running together.
> power - power consumed in Watts for the socket running the workload.
> Performance gain is indicated in +ve percentage numbers and power
> increase is indicated in -ve numbers. 1CG numbers are same as expected.
> We are looking at improvement in 2CG Mainly.
>
> 6.2.rc5 with patch
> 1CG power 2CG power | 1CG power 2CG power
> 1Core 218 44 315 46 | 219 45 277(+12%) 47(-2%)
> 219 43 315 45 | 219 44 244(+22%) 48(-6%)
> |
> 2Core 108 48 158 52 | 109 50 114(+26%) 59(-13%)
> 109 49 157 52 | 109 49 136(+13%) 56(-7%)
> |
> 4Core 60 59 89 65 | 62 58 72(+19%) 68(-5%)
> 61 61 90 65 | 62 60 68(+24%) 73(-12%)
> |
> 8Core 33 77 48 83 | 33 77 37(+23%) 91(-10%)
> 33 77 48 84 | 33 77 38(+21%) 90(-7%)
>
> There is no benefit at higher utilization of 50% or more. There is no
> degradation also.
>
> This is RFC PATCH V2, where the code has been shifted from hrtimer to
> sched. This patch sets an initial value as multiple of period/10.
> Here timers can still align if the time started the cgroup is within the
> period/10 interval. On a real life workload, time gives sufficient
> randomness. There can be a better interleaving by being more
> deterministic. For example, when there are 2 cgroups, they should
> have initial value of 0/50ms or 10/60ms so on. When there are 3 cgroups,
> 0/3/6ms or 1/4/7ms etc. That is more complicated as it has to account
> for cgroup addition/deletion and accuracy w.r.t to period/quota.
> If that approach is better here, then will come up with that patch.

This does seem vaguely reasonable, though the power argument of
consolidating wakeups and such is something that we intentionally do in
other situations.

How reasonable do you think it is to just say (and what do the
equivalent numbers look like on your particular benchmark) "put some
variance on your period config if you want variance"?

>
> Signed-off-by: Shrikanth Hegde<[email protected]>
> ---
> kernel/sched/fair.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff4dbbae3b10..7b69c329e05d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5939,14 +5939,25 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>
> void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> {
> - lockdep_assert_held(&cfs_b->lock);
> + struct hrtimer *period_timer = &cfs_b->period_timer;
> + s64 incr = ktime_to_ns(cfs_b->period) / 10;
> + ktime_t delta;
> + u64 orun = 1;
>
> + lockdep_assert_held(&cfs_b->lock);
> if (cfs_b->period_active)
> return;
>
> cfs_b->period_active = 1;
> - hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> - hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> + delta = ktime_sub(period_timer->base->get_time(),
> + hrtimer_get_expires(period_timer));
> + if (unlikely(delta >= cfs_b->period)) {

Probably could have a short comment here that's something like "forward
the hrtimer by period / 10 to reduce synchronized wakeups"

> + orun = ktime_divns(delta, incr);
> + hrtimer_add_expires_ns(period_timer, incr * orun);
> + }
> +
> + hrtimer_forward_now(period_timer, cfs_b->period);
> + hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
> }
>
> static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> --
> 2.31.1

2023-02-15 11:02:25

by Shrikanth Hegde

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

>>
>> 6.2.rc5 with patch
>> 1CG power 2CG power | 1CG power 2CG power
>> 1Core 218 44 315 46 | 219 45 277(+12%) 47(-2%)
>> 219 43 315 45 | 219 44 244(+22%) 48(-6%)
>> |
>> 2Core 108 48 158 52 | 109 50 114(+26%) 59(-13%)
>> 109 49 157 52 | 109 49 136(+13%) 56(-7%)
>> |
>> 4Core 60 59 89 65 | 62 58 72(+19%) 68(-5%)
>> 61 61 90 65 | 62 60 68(+24%) 73(-12%)
>> |
>> 8Core 33 77 48 83 | 33 77 37(+23%) 91(-10%)
>> 33 77 48 84 | 33 77 38(+21%) 90(-7%)
>>
>> There is no benefit at higher utilization of 50% or more. There is no
>> degradation also.
>>
>> This is RFC PATCH V2, where the code has been shifted from hrtimer to
>> sched. This patch sets an initial value as multiple of period/10.
>> Here timers can still align if the time started the cgroup is within the
>> period/10 interval. On a real life workload, time gives sufficient
>> randomness. There can be a better interleaving by being more
>> deterministic. For example, when there are 2 cgroups, they should
>> have initial value of 0/50ms or 10/60ms so on. When there are 3 cgroups,
>> 0/3/6ms or 1/4/7ms etc. That is more complicated as it has to account
>> for cgroup addition/deletion and accuracy w.r.t to period/quota.
>> If that approach is better here, then will come up with that patch.
>
> This does seem vaguely reasonable, though the power argument of
> consolidating wakeups and such is something that we intentionally do in
> other situations.
>
Thank you Benjamin for taking a look and spending time in reviewing this.
> How reasonable do you think it is to just say (and what do the
> equivalent numbers look like on your particular benchmark) "put some
> variance on your period config if you want variance"?
>Run to run variance is expected with this patch as the patch depends
on time upto last period/10 as the basis for interleaving.
What i could infer from this comment about variance. Please correct if not.

>>
>> Signed-off-by: Shrikanth Hegde<[email protected]>
>> ---
>> kernel/sched/fair.c | 17 ++++++++++++++---
>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ff4dbbae3b10..7b69c329e05d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5939,14 +5939,25 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>
>> void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> {
>> - lockdep_assert_held(&cfs_b->lock);
>> + struct hrtimer *period_timer = &cfs_b->period_timer;
>> + s64 incr = ktime_to_ns(cfs_b->period) / 10;
>> + ktime_t delta;
>> + u64 orun = 1;
>>
>> + lockdep_assert_held(&cfs_b->lock);
>> if (cfs_b->period_active)
>> return;
>>
>> cfs_b->period_active = 1;
>> - hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> - hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>> + delta = ktime_sub(period_timer->base->get_time(),
>> + hrtimer_get_expires(period_timer));
>> + if (unlikely(delta >= cfs_b->period)) {
>
> Probably could have a short comment here that's something like "forward
> the hrtimer by period / 10 to reduce synchronized wakeups"
>
Sure. Will do in the next version of this patch.

>> + orun = ktime_divns(delta, incr);
>> + hrtimer_add_expires_ns(period_timer, incr * orun);
>> + }
>> +
>> + hrtimer_forward_now(period_timer, cfs_b->period);
>> + hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
>> }
>>
>> static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> --
>> 2.31.1

2023-02-15 21:32:22

by Benjamin Segall

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

shrikanth hegde <[email protected]> writes:

>>>
>>> 6.2.rc5 with patch
>>> 1CG power 2CG power | 1CG power 2CG power
>>> 1Core 218 44 315 46 | 219 45 277(+12%) 47(-2%)
>>> 219 43 315 45 | 219 44 244(+22%) 48(-6%)
>>> |
>>> 2Core 108 48 158 52 | 109 50 114(+26%) 59(-13%)
>>> 109 49 157 52 | 109 49 136(+13%) 56(-7%)
>>> |
>>> 4Core 60 59 89 65 | 62 58 72(+19%) 68(-5%)
>>> 61 61 90 65 | 62 60 68(+24%) 73(-12%)
>>> |
>>> 8Core 33 77 48 83 | 33 77 37(+23%) 91(-10%)
>>> 33 77 48 84 | 33 77 38(+21%) 90(-7%)
>>>
>>> There is no benefit at higher utilization of 50% or more. There is no
>>> degradation also.
>>>
>>> This is RFC PATCH V2, where the code has been shifted from hrtimer to
>>> sched. This patch sets an initial value as multiple of period/10.
>>> Here timers can still align if the time started the cgroup is within the
>>> period/10 interval. On a real life workload, time gives sufficient
>>> randomness. There can be a better interleaving by being more
>>> deterministic. For example, when there are 2 cgroups, they should
>>> have initial value of 0/50ms or 10/60ms so on. When there are 3 cgroups,
>>> 0/3/6ms or 1/4/7ms etc. That is more complicated as it has to account
>>> for cgroup addition/deletion and accuracy w.r.t to period/quota.
>>> If that approach is better here, then will come up with that patch.
>>
>> This does seem vaguely reasonable, though the power argument of
>> consolidating wakeups and such is something that we intentionally do in
>> other situations.
>>
> Thank you Benjamin for taking a look and spending time in reviewing this.
>> How reasonable do you think it is to just say (and what do the
>> equivalent numbers look like on your particular benchmark) "put some
>> variance on your period config if you want variance"?
>>Run to run variance is expected with this patch as the patch depends
> on time upto last period/10 as the basis for interleaving.
> What i could infer from this comment about variance. Please correct if not.

My question is what the numbers look like if you instead prepare the
cgroups with periods that are something like 97 ms and 103ms instead of
both 100ms (keeping the quota as the same proportion as the original).

2023-02-16 19:58:29

by Shrikanth Hegde

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

On 2/16/23 3:02 AM, Benjamin Segall wrote:
> shrikanth hegde <[email protected]> writes:
>
>>>>
>>>> 6.2.rc5 with patch
>>>> 1CG power 2CG power | 1CG power 2CG power
>>>> 1Core 218 44 315 46 | 219 45 277(+12%) 47(-2%)
>>>> 219 43 315 45 | 219 44 244(+22%) 48(-6%)
>>>> |
>>>> 2Core 108 48 158 52 | 109 50 114(+26%) 59(-13%)
>>>> 109 49 157 52 | 109 49 136(+13%) 56(-7%)
>>>> |
>>>> 4Core 60 59 89 65 | 62 58 72(+19%) 68(-5%)
>>>> 61 61 90 65 | 62 60 68(+24%) 73(-12%)
>>>> |
>>>> 8Core 33 77 48 83 | 33 77 37(+23%) 91(-10%)
>>>> 33 77 48 84 | 33 77 38(+21%) 90(-7%)
>>>>
>>>> There is no benefit at higher utilization of 50% or more. There is no
>>>> degradation also.
>>>>
>>>> This is RFC PATCH V2, where the code has been shifted from hrtimer to
>>>> sched. This patch sets an initial value as multiple of period/10.
>>>> Here timers can still align if the time started the cgroup is within the
>>>> period/10 interval. On a real life workload, time gives sufficient
>>>> randomness. There can be a better interleaving by being more
>>>> deterministic. For example, when there are 2 cgroups, they should
>>>> have initial value of 0/50ms or 10/60ms so on. When there are 3 cgroups,
>>>> 0/3/6ms or 1/4/7ms etc. That is more complicated as it has to account
>>>> for cgroup addition/deletion and accuracy w.r.t to period/quota.
>>>> If that approach is better here, then will come up with that patch.
>>>
>>> This does seem vaguely reasonable, though the power argument of
>>> consolidating wakeups and such is something that we intentionally do in
>>> other situations.
>>>
>> Thank you Benjamin for taking a look and spending time in reviewing this.
>>> How reasonable do you think it is to just say (and what do the
>>> equivalent numbers look like on your particular benchmark) "put some
>>> variance on your period config if you want variance"?
>>> Run to run variance is expected with this patch as the patch depends
>> on time upto last period/10 as the basis for interleaving.
>> What i could infer from this comment about variance. Please correct if not.
>
> My question is what the numbers look like if you instead prepare the
> cgroups with periods that are something like 97 ms and 103ms instead of
> both 100ms (keeping the quota as the same proportion as the original).

oh ok. If the cgroups are prepared with slightly different timer values, then
timers does interleave. That is expected as the difference would be small at
the beginning, goes to max at some point, then again would align later. Like
below

| /\
| / \
timer | / \
delta | / \
|/________\____

time -->

Did a set of experiments with the these three timer values. Here in all the
cases, each cgroup is allocated 25% of the runtime. There are 8 Core with SMT=8
(64 CPU). Values of 100ms/100ms not same as before, since this is run on
different machine as the previous one was not available. Hence kept 100/100
numbers as well.

6.2.rc6 6.2.rc6 + with patch
Period 1CG power 2CG power | 1CG power 2CG power
97/103 27.8 78 32.9 98 | 27.5 75 33.4 102
97/103 27.3 78 33 101 | 27.9 71 32.8 97

100/100 27.5 82 40.2 93 | 27.5 80 34.2 105
100/100 28 86 40.1 94 | 27.7 78 30.1 110

75/125 27.3 89 32.7 102 | 27.3 84 33 106
75/125 27.1 87 33 105 | 27.1 90 33.1 100

Few observations.
1. We get improved performance when the timers are slightly different from
100ms.
2. If the timers have slight variance, there is no difference with patch.
3. power numbers vary bit more, when the timers have variance. This maybe
because the idle/exit aren't aligning.
4. The best interleaving is still not possible if the timers have variance.
that can happen only with deterministic interleaving. patch can hope to
achieve that. But not always.

2023-02-20 17:38:22

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

On Tue, Feb 14, 2023 at 08:54:09PM +0530, shrikanth hegde wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff4dbbae3b10..7b69c329e05d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5939,14 +5939,25 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>
> void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> {
> - lockdep_assert_held(&cfs_b->lock);
> + struct hrtimer *period_timer = &cfs_b->period_timer;
> + s64 incr = ktime_to_ns(cfs_b->period) / 10;
> + ktime_t delta;
> + u64 orun = 1;
>
> + lockdep_assert_held(&cfs_b->lock);
> if (cfs_b->period_active)
> return;
>
> cfs_b->period_active = 1;
> - hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> - hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> + delta = ktime_sub(period_timer->base->get_time(),
> + hrtimer_get_expires(period_timer));
> + if (unlikely(delta >= cfs_b->period)) {
> + orun = ktime_divns(delta, incr);
> + hrtimer_add_expires_ns(period_timer, incr * orun);
> + }
> +
> + hrtimer_forward_now(period_timer, cfs_b->period);
> + hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
> }

What kind of mad hackery is this? Why can't you do the sane thing and
initialize the timer at !0 in init_cfs_bandwidth(), then any of the
forwards will stay in period -- as they should.

Please, go re-read Thomas's email.

*completely* untested...

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c46485d65d7..4d6ea76096dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5915,6 +5915,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+ cfs_b->period_timer.node.expires = get_random_u32_below(cfs_b->period);
cfs_b->period_timer.function = sched_cfs_period_timer;
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;

2023-02-21 18:53:41

by Shrikanth Hegde

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

On 2/20/23 11:08 PM, Peter Zijlstra wrote:
> On Tue, Feb 14, 2023 at 08:54:09PM +0530, shrikanth hegde wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ff4dbbae3b10..7b69c329e05d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5939,14 +5939,25 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>
>> void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> {
>> - lockdep_assert_held(&cfs_b->lock);
>> + struct hrtimer *period_timer = &cfs_b->period_timer;
>> + s64 incr = ktime_to_ns(cfs_b->period) / 10;
>> + ktime_t delta;
>> + u64 orun = 1;
>>
>> + lockdep_assert_held(&cfs_b->lock);
>> if (cfs_b->period_active)
>> return;
>>
>> cfs_b->period_active = 1;
>> - hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> - hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>> + delta = ktime_sub(period_timer->base->get_time(),
>> + hrtimer_get_expires(period_timer));
>> + if (unlikely(delta >= cfs_b->period)) {
>> + orun = ktime_divns(delta, incr);
>> + hrtimer_add_expires_ns(period_timer, incr * orun);
>> + }
>> +
>> + hrtimer_forward_now(period_timer, cfs_b->period);
>> + hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
>> }
>
> What kind of mad hackery is this? Why can't you do the sane thing and
> initialize the timer at !0 in init_cfs_bandwidth(), then any of the
> forwards will stay in period -- as they should.
>
> Please, go re-read Thomas's email.

Thank you Peter for taking a look and review.
we can scrap this implementation and switch to the one you suggested.
All we need is to initialize the offset.

Only reason was the way i had implemented. This implementation couldn't be
fit into init_cfs_bandwidth as timers would align if the cgroups are
created together and continue to align forever.

>
> *completely* untested...
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7c46485d65d7..4d6ea76096dc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5915,6 +5915,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>
> INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> + cfs_b->period_timer.node.expires = get_random_u32_below(cfs_b->period);

This approach/implementation is better as the random function provides uniform
distribution. Had to modify this a bit to make it work. Along with setting
setting node.expires, we need to set _softexpires as well. Which is what
hrtimer_set_expires does.

Here are the similar numbers again.
8 Core system with SMT=8. Total of 64 CPU
Workload: stress-ng --cpu=32 --cpu-ops=50000

6.2-rc6 | with patch
8Core 1CG power 2CG power | 1CG power 2CG power
27.5 80.6 40 90 | 27.3 82 32.3 104
27.5 81 40.2 91 | 27.5 81 38.7 96
27.7 80 40.1 89 | 27.6 80 29.7 115
27.7 80.1 40.3 94 | 27.6 80 31.5 105

Will collect some more benchmarks numbers w.r.t to performance.

> cfs_b->period_timer.function = sched_cfs_period_timer;
> hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> cfs_b->slack_timer.function = sched_cfs_slack_timer;

This below patch worked.
Does the below patch look okay? shall i send the [PATCH V1] with this change?

Question.
Should we skip this adding the offset for root_task_group?

---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff4dbbae3b10..6448533178af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5923,6 +5923,9 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
cfs_b->period_timer.function = sched_cfs_period_timer;
+ /* Add a random offset so that timers interleave */
+ hrtimer_set_expires(&cfs_b->period_timer, get_random_u32_below(cfs_b->period));
+
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;
cfs_b->slack_started = false;
--
2.31.1

2023-02-21 21:43:42

by Benjamin Segall

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

shrikanth hegde <[email protected]> writes:

> On 2/20/23 11:08 PM, Peter Zijlstra wrote:
>> On Tue, Feb 14, 2023 at 08:54:09PM +0530, shrikanth hegde wrote:
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index ff4dbbae3b10..7b69c329e05d 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5939,14 +5939,25 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>>
>>> void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>> {
>>> - lockdep_assert_held(&cfs_b->lock);
>>> + struct hrtimer *period_timer = &cfs_b->period_timer;
>>> + s64 incr = ktime_to_ns(cfs_b->period) / 10;
>>> + ktime_t delta;
>>> + u64 orun = 1;
>>>
>>> + lockdep_assert_held(&cfs_b->lock);
>>> if (cfs_b->period_active)
>>> return;
>>>
>>> cfs_b->period_active = 1;
>>> - hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>> - hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>> + delta = ktime_sub(period_timer->base->get_time(),
>>> + hrtimer_get_expires(period_timer));
>>> + if (unlikely(delta >= cfs_b->period)) {
>>> + orun = ktime_divns(delta, incr);
>>> + hrtimer_add_expires_ns(period_timer, incr * orun);
>>> + }
>>> +
>>> + hrtimer_forward_now(period_timer, cfs_b->period);
>>> + hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
>>> }
>>
>> What kind of mad hackery is this? Why can't you do the sane thing and
>> initialize the timer at !0 in init_cfs_bandwidth(), then any of the
>> forwards will stay in period -- as they should.
>>
>> Please, go re-read Thomas's email.
>
> Thank you Peter for taking a look and review.
> we can scrap this implementation and switch to the one you suggested.
> All we need is to initialize the offset.
>
> Only reason was the way i had implemented. This implementation couldn't be
> fit into init_cfs_bandwidth as timers would align if the cgroups are
> created together and continue to align forever.
>
>>
>> *completely* untested...
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7c46485d65d7..4d6ea76096dc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5915,6 +5915,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>
>> INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>> hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
>> + cfs_b->period_timer.node.expires = get_random_u32_below(cfs_b->period);
>
> This approach/implementation is better as the random function provides uniform
> distribution. Had to modify this a bit to make it work. Along with setting
> setting node.expires, we need to set _softexpires as well. Which is what
> hrtimer_set_expires does.
>
> Here are the similar numbers again.
> 8 Core system with SMT=8. Total of 64 CPU
> Workload: stress-ng --cpu=32 --cpu-ops=50000
>
> 6.2-rc6 | with patch
> 8Core 1CG power 2CG power | 1CG power 2CG power
> 27.5 80.6 40 90 | 27.3 82 32.3 104
> 27.5 81 40.2 91 | 27.5 81 38.7 96
> 27.7 80 40.1 89 | 27.6 80 29.7 115
> 27.7 80.1 40.3 94 | 27.6 80 31.5 105
>
> Will collect some more benchmarks numbers w.r.t to performance.
>
>
>> cfs_b->period_timer.function = sched_cfs_period_timer;
>> hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> cfs_b->slack_timer.function = sched_cfs_slack_timer;
>
> This below patch worked.
> Does the below patch look okay? shall i send the [PATCH V1] with this
> change?

Yeah, this design makes way more sense.

>
> Question.
> Should we skip this adding the offset for root_task_group?

The value should never come up, so it's just a question of if it's fine
to call get_random_* in early contexts, which I don't know offhand.

>
>
> ---
> kernel/sched/fair.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff4dbbae3b10..6448533178af 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5923,6 +5923,9 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> cfs_b->period_timer.function = sched_cfs_period_timer;
> + /* Add a random offset so that timers interleave */
> + hrtimer_set_expires(&cfs_b->period_timer, get_random_u32_below(cfs_b->period));
> +
> hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> cfs_b->slack_timer.function = sched_cfs_slack_timer;
> cfs_b->slack_started = false;

2023-02-22 09:41:21

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

On Tue, Feb 21, 2023 at 01:43:27PM -0800, Benjamin Segall wrote:
> The value should never come up, so it's just a question of if it's fine
> to call get_random_* in early contexts, which I don't know offhand.

Should be, scheduler init is quite late as things go and people have
been pushing the random init earlier and earlier.