2023-01-31 05:49:22

by Shrikanth Hegde

Subject: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization

As per the current design of hrtimer, the _softexpires value is used to
trigger the timer function. _softexpires is set as a multiple of the
period/interval value. This benefits power saving through fewer wakeups.
Due to this, different timers with the same period/interval value align
and their callback functions are called at the same time.
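
For reference, the usual periodic-timer pattern looks roughly like this (a
minimal sketch; my_timer, my_period and my_timer_fn are hypothetical names):

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer my_timer;		/* hypothetical periodic timer */
static ktime_t my_period;		/* e.g. ms_to_ktime(100) */

static enum hrtimer_restart my_timer_fn(struct hrtimer *timer)
{
	/*
	 * Forward the expiry past 'now' in whole multiples of the period.
	 * Timers that share both the period and the initial expiry value
	 * keep landing on the same multiples, i.e. they fire together.
	 */
	hrtimer_forward_now(timer, my_period);
	return HRTIMER_RESTART;
}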

The CPU bandwidth controller (CPU cgroup) uses these hrtimers to implement
period and quota. The period timer refills the quota and allows the throttled
cgroups to start running again. When there are multiple such cgroups with the
same period value, their period timers will be aligned. Hence multiple
cgroups' timers fire at the same time and end up unthrottling each cgroup's
runqueues. Since all cgroups start at once, they compete for CPU and are
likely to use all SMT threads.
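
For concreteness, such a period/quota pair is what the cgroup v2 cpu.max
file configures; a minimal user-space sketch (the cgroup path "cg1" is a
hypothetical example):

#include <stdio.h>

int main(void)
{
	/*
	 * cpu.max takes "$QUOTA $PERIOD" in microseconds: here 200ms of
	 * runtime per 100ms period, i.e. up to two CPUs of bandwidth.
	 */
	FILE *f = fopen("/sys/fs/cgroup/cg1/cpu.max", "w");

	if (!f)
		return 1;
	fprintf(f, "200000 100000\n");
	return fclose(f) ? 1 : 0;
}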

There is a performance gain to be had if the timers are interleaved when the
utilization of each CPU cgroup is low and the total utilization of all CPU
cgroups is less than 50%. This is likely the case when using containers. If
the timers aren't rounded off, the unthrottled cgroup can run freely without
many context switches and can also benefit from SMT folding[1]. This effect
is further amplified in an SPLPAR environment[2], since it causes fewer
hypervisor preemptions. There can be a benefit from fewer IPI storms as well.
Docker provides a config option for the period timer value, whereas
Kubernetes only provides a millicore option. Hence, in a typical deployment,
the period value will be set to 100ms, as Kubernetes millicores set the quota
accordingly without altering the period.

[1] SMT folding is a mechanism whereby a processor core is reconfigured to a
lower SMT mode to improve performance when some sibling threads are idle. In
an SMT8 core, when only one or two threads are running on a core, we get the
best throughput compared to running all 8 threads.

[2] SPLPAR stands for Shared Processor Logical PARtition. Many SPLPARs can
run on the same physical machine, sharing the CPU resources. One SPLPAR can
consume all the CPU resources it can if the other SPLPARs are idle.
Processors within an SPLPAR are called vCPUs, and there can be more vCPUs
than physical CPUs. Hence, if at any instant more vCPUs are requested than
there are CPUs, vCPUs can be preempted. When the timers align, there is a
spike in requested vCPUs when the timers expire. This can lead to preemption
when the other SPLPARs are not idle.

Came up with a naive patch, more of a hack. The other alternative is to use a
slightly modified API for cgroups, so that all other timers stay aligned and
wakeups remain reduced. A new hrtimer API is likely better; I can send out
that patch quickly. Here I am trying to misalign the timers by setting the
softexpire at a multiple of interval/10 instead of interval. Ran stress-ng
with two cgroups. The numbers are with and without the patch on a Power10
machine with SMT=8. The table below shows the time taken by each cgroup to
complete. In the last column, both cgroups are run together and the data
shows the average time taken by the cgroups to complete. Here each cgroup is
assigned 25% runtime.

Workload: stress-ng --cpu=4 --cpu-ops=100000. The data shows the time it
took to complete, in seconds, for each run.

Without patch:
period/quota   cgroup1 runs   cgroup2 runs   cgroup1 & cgroup2
               alone          alone          run together
100ms/200ms    120s           120s           155s
               120s           120s           155s
               120s           120s           155s

With patch:
period/quota   cgroup1 runs   cgroup2 runs   cgroup1 & cgroup2
               alone          alone          run together
100ms/200ms    120s           120s           131s
               120s           120s           155s
               120s           120s           121s

There is no benefit at a higher utilization of 50% or more, but there is no
degradation either.

Signed-off-by: Shrikanth Hegde <[email protected]>
---
kernel/time/hrtimer.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3ae661ab6260..d160f49f0cce 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1055,6 +1055,17 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)

orun = ktime_divns(delta, incr);
hrtimer_add_expires_ns(timer, incr * orun);
+ /*
+ * Avoid timer round-off, so that all cfs bandwidth timers
+ * don't start at the same time
+ */
+ if (incr >= 100000000ULL) {
+ s64 interleave = 0;
+ interleave = ktime_sub_ns(delta, incr * orun);
+ interleave = interleave - (ktime_to_ns(delta) % (incr/10));
+ if (interleave > 0)
+ hrtimer_add_expires_ns(timer, interleave);
+ }
if (hrtimer_get_expires_tv64(timer) > now)
return orun;
/*
--
2.35.3


2023-01-31 10:37:22

by Ingo Molnar

Subject: Re: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization


* shrikanth hegde <[email protected]> wrote:

> As per the current design of hrtimer, the _softexpires value is used to
> trigger the timer function. _softexpires is set as a multiple of the
> period/interval value. This benefits power saving through fewer wakeups.
> Due to this, different timers with the same period/interval value align
> and their callback functions are called at the same time.
>
> The CPU bandwidth controller (CPU cgroup) uses these hrtimers to implement
> period and quota. The period timer refills the quota and allows the
> throttled cgroups to start running again. When there are multiple such
> cgroups with the same period value, their period timers will be aligned.
> Hence multiple cgroups' timers fire at the same time and end up
> unthrottling each cgroup's runqueues. Since all cgroups start at once,
> they compete for CPU and are likely to use all SMT threads.
>
> There is a performance gain to be had if the timers are interleaved when
> the utilization of each CPU cgroup is low and the total utilization of all
> CPU cgroups is less than 50%. This is likely the case when using
> containers. If the timers aren't rounded off, the unthrottled cgroup can
> run freely without many context switches and can also benefit from SMT
> folding[1]. This effect is further amplified in an SPLPAR environment[2],
> since it causes fewer hypervisor preemptions. There can be a benefit from
> fewer IPI storms as well. Docker provides a config option for the period
> timer value, whereas Kubernetes only provides a millicore option. Hence,
> in a typical deployment, the period value will be set to 100ms, as
> Kubernetes millicores set the quota accordingly without altering the
> period.
>
> [1] SMT folding is a mechanism whereby a processor core is reconfigured to
> a lower SMT mode to improve performance when some sibling threads are
> idle. In an SMT8 core, when only one or two threads are running on a core,
> we get the best throughput compared to running all 8 threads.
>
> [2] SPLPAR stands for Shared Processor Logical PARtition. Many SPLPARs can
> run on the same physical machine, sharing the CPU resources. One SPLPAR
> can consume all the CPU resources it can if the other SPLPARs are idle.
> Processors within an SPLPAR are called vCPUs, and there can be more vCPUs
> than physical CPUs. Hence, if at any instant more vCPUs are requested than
> there are CPUs, vCPUs can be preempted. When the timers align, there is a
> spike in requested vCPUs when the timers expire. This can lead to
> preemption when the other SPLPARs are not idle.
>
> Came up with a naive patch, more of a hack. The other alternative is to
> use a slightly modified API for cgroups, so that all other timers stay
> aligned and wakeups remain reduced. A new hrtimer API is likely better; I
> can send out that patch quickly. Here I am trying to misalign the timers
> by setting the softexpire at a multiple of interval/10 instead of
> interval. Ran stress-ng with two cgroups. The numbers are with and without
> the patch on a Power10 machine with SMT=8. The table below shows the time
> taken by each cgroup to complete. In the last column, both cgroups are run
> together and the data shows the average time taken by the cgroups to
> complete. Here each cgroup is assigned 25% runtime.
>
> Workload: stress-ng --cpu=4 --cpu-ops=100000. The data shows the time it
> took to complete, in seconds, for each run.
>
> Without patch:
> period/quota   cgroup1 runs   cgroup2 runs   cgroup1 & cgroup2
>                alone          alone          run together
> 100ms/200ms    120s           120s           155s
>                120s           120s           155s
>                120s           120s           155s
>
> With patch:
> period/quota   cgroup1 runs   cgroup2 runs   cgroup1 & cgroup2
>                alone          alone          run together
> 100ms/200ms    120s           120s           131s
>                120s           120s           155s
>                120s           120s           121s
>
> There is no benefit at a higher utilization of 50% or more, but there is
> no degradation either.
>
> Signed-off-by: Shrikanth Hegde <[email protected]>
> ---
> kernel/time/hrtimer.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 3ae661ab6260..d160f49f0cce 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1055,6 +1055,17 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
>
> orun = ktime_divns(delta, incr);
> hrtimer_add_expires_ns(timer, incr * orun);
> + /*
> + * Avoid timer round-off, so that all cfs bandwidth timers
> + * don't start at the same time
> + */
> + if (incr >= 100000000ULL) {
> + s64 interleave = 0;
> + interleave = ktime_sub_ns(delta, incr * orun);
> + interleave = interleave - (ktime_to_ns(delta) % (incr/10));
> + if (interleave > 0)
> + hrtimer_add_expires_ns(timer, interleave);
> + }

Any reason why you did this in the hrtimer code, instead of the
(sched_cfs_period_timer?) hrtimer handler itself?

Thanks,

Ingo

2023-01-31 11:08:25

by Thomas Gleixner

Subject: Re: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization

On Tue, Jan 31 2023 at 11:18, shrikanth hegde wrote:
> As per the current design of hrtimer, the _softexpires value is used to
> trigger the timer function. _softexpires is set as a multiple of the
> period/interval value.

Wrong. _softexpires is _hardexpires + slack. The slack allows for
batching which:

> This benefits power saving through fewer wakeups.

But that has absolutely nothing to do with your problem:

> Due to this, different timers with the same period/interval value align
> and their callback functions are called at the same time.

The whole point of hrtimer_forward_now() is to forward the expiry time
of a timer with the given period so that it expires after 'now'.

That's functionality which is used by a lot of callers to implement
proper periodic timers.

> Came up with a naive patch, more of a hack.

A broken hack, to be precise, because any existing user of
hrtimer_forward() will be broken by this hack.

> The other alternative is to use a slightly modified API for cgroups, so
> that all other timers stay aligned and wakeups remain reduced.

I'm not seeing why you need a new API for that. The problem is _NOT_ in
the hrtimer code at all.

Lets look at the math:

expiry = $INITIAL_EXPIRYVALUE + $N * $PERIOD

If $INITIAL_EXPIRYVALUE is the same for all instances, then obviously the
expiry values of all instances will be aligned on multiples of $PERIOD,
right?

So why the heck do you need a new hrtimer API? There is an obvious
solution, right?
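
Spelled out with hypothetical numbers: two timers with $PERIOD = 100ms and
initial expiries of 0ms and 50ms expire at {100ms, 200ms, 300ms, ...} and
{150ms, 250ms, 350ms, ...} respectively; they never fire together.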

Thanks,

tglx

2023-01-31 12:09:45

by Shrikanth Hegde

Subject: Re: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization



On 1/31/23 4:07 PM, Ingo Molnar wrote:
>
> * shrikanth hegde <[email protected]> wrote:
>
>> ---
>> kernel/time/hrtimer.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
>> index 3ae661ab6260..d160f49f0cce 100644
>> --- a/kernel/time/hrtimer.c
>> +++ b/kernel/time/hrtimer.c
>> @@ -1055,6 +1055,17 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
>>
>> orun = ktime_divns(delta, incr);
>> hrtimer_add_expires_ns(timer, incr * orun);
>> + /*
>> + * Avoid timer round-off, so that all cfs bandwidth timers
>> + * don't start at the same time
>> + */
>> + if (incr >= 100000000ULL) {
>> + s64 interleave = 0;
>> + interleave = ktime_sub_ns(delta, incr * orun);
>> + interleave = interleave - (ktime_to_ns(delta) % (incr/10));
>> + if (interleave > 0)
>> + hrtimer_add_expires_ns(timer, interleave);
>> + }
>
> Any reason why you did this in the hrtimer code, instead of the
> (sched_cfs_period_timer?) hrtimer handler itself?
>
> Thanks,
>
> Ingo


Yes, thanks for making me think in that direction.
This can be done in start_cfs_bandwidth by setting an initial expiry value.
Tried that change; it works in interleaving the timers. A rough sketch of
the idea follows below.

Will do a bit more testing and send out the patch.
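
For illustration only (a sketch of the idea, not the actual patch; deriving
a per-instance offset from the cfs_b pointer is purely an assumption made
for this example), the change in kernel/sched/fair.c could look something
like:

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
	lockdep_assert_held(&cfs_b->lock);

	if (cfs_b->period_active)
		return;

	cfs_b->period_active = 1;

	/*
	 * First arming only: seed a non-zero initial expiry so that
	 * instances with the same period stay out of phase. The offset
	 * derivation below is hypothetical, for illustration.
	 */
	if (!hrtimer_get_expires_tv64(&cfs_b->period_timer)) {
		u64 offset = ((u64)(unsigned long)cfs_b >> 4) %
			     (u64)ktime_to_ns(cfs_b->period);

		hrtimer_set_expires(&cfs_b->period_timer, ns_to_ktime(offset));
	}

	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}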

2023-01-31 12:27:51

by Shrikanth Hegde

Subject: Re: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization



On 1/31/23 4:38 PM, Thomas Gleixner wrote:
> On Tue, Jan 31 2023 at 11:18, shrikanth hegde wrote:
>> As per the current design of hrtimer, the _softexpires value is used to
>> trigger the timer function. _softexpires is set as a multiple of the
>> period/interval value.
>
> Wrong. _softexpires is _hardexpires + slack. The slack allows for
> batching which:
>
Ok, understood.
>> This benefits power saving through fewer wakeups.
>
This is what I thought was the rationale for _softexpires.
> But that has absolutely nothing to do with your problem:
>
Yes.
>> Due to this, different timers with the same period/interval value align
>> and their callback functions are called at the same time.
>
> The whole point of hrtimer_forward_now() is to forward the expiry time
> of a timer with the given period so that it expires after 'now'.
>
> That's functionality which is used by a lot of callers to implement
> proper periodic timers.
>
>> Came up with a naive patch, more of a hack.
>
> A broken hack to be precise because any existing user of
> hrtimer_forward() will be broken by this hack.
>
Agree. The aim was to describe the problem.
>> The other alternative is to use a slightly modified API for cgroups, so
>> that all other timers stay aligned and wakeups remain reduced.
>
> I'm not seeing why you need a new API for that. The problem is _NOT_ in
> the hrtimer code at all.
>
> Lets look at the math:
>
> expiry = $INITIAL_EXPIRYVALUE + $N * $PERIOD
>
> If $INITIAL_EXPIRYVALUE is the same for all instances, then obviously the
> expiry values of all instances will be aligned on multiples of $PERIOD,
> right?
>
> So why the heck do you need a new hrtimer API? There is an obvious
> solution, right?
>
> Thanks,
>
> tglx

Thanks Thomas for taking a look and correcting the wrong bits here. The
problem was that $INITIAL_EXPIRYVALUE was never set, hence _softexpires was
always set as a multiple of $PERIOD.

As Ingo suggested, this can be done in the sched code: set
$INITIAL_EXPIRYVALUE in start_cfs_bandwidth, and the timers would interleave
by the same logic described in the patch.

No change in the hrtimer code would be needed. Will send out the patch after
some more testing.

2023-01-31 14:56:10

by Arjan van de Ven

Subject: Re: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization


>  kernel/time/hrtimer.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 3ae661ab6260..d160f49f0cce 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1055,6 +1055,17 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
>
>          orun = ktime_divns(delta, incr);
>          hrtimer_add_expires_ns(timer, incr * orun);
> +        /*
> +         * Avoid timer round-off, so that all cfs bandwidth timers
> +         * don't start at the same time

so while I applaud the final objective, I am sort of wondering if hrtimer.c is the right place in the kernel to fix a CFS/cgroup issue...
wouldn't it be better to solve such issues at the place we want this to happen, rather than for all timers in the whole system?

(also while for performance it might be better to spread out a bit, for power consumption it's obviously the other way around)



2023-01-31 15:51:35

by Shrikanth Hegde

Subject: Re: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization



On 1/31/23 8:25 PM, Arjan van de Ven wrote:
>
>>   kernel/time/hrtimer.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
>> index 3ae661ab6260..d160f49f0cce 100644
>> --- a/kernel/time/hrtimer.c
>> +++ b/kernel/time/hrtimer.c
>> @@ -1055,6 +1055,17 @@ u64 hrtimer_forward(struct hrtimer *timer,
>> ktime_t now, ktime_t interval)
>>
>>           orun = ktime_divns(delta, incr);
>>           hrtimer_add_expires_ns(timer, incr * orun);
>> +        /*
>> +         * Avoid timer round-off, so that all cfs bandwidth timers
>> +         * don't start at the same time
>
> so while I applaud the final objective, I am sort of wondering if
> hrtimer.c is the right place in the kernel to fix a CFS/cgroup issue...
> wouldn't it be better to solve such issues at the place we want this to
> happen, rather than for all timers in the whole system?
>
Agree. It was an initial approach to the problem. This can be fixed in the
scheduler code itself, as suggested by Ingo. Will send out that patch soon.
> (also while for performance it might be better to spread out a bit, for
> power consumption it's obviously the other way around)
>
This would be a performance vs. power comparison. Will get both sets of
numbers next time for a fairer comparison.