Message-ID: <5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com>
Date:   Tue, 31 Jan 2023 11:18:44 +0530
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.6.1
Content-Language: en-US
From:   shrikanth hegde <sshegde@linux.vnet.ibm.com>
To:     tglx@linutronix.de
Cc:     peterz@infradead.org, arjan@linux.intel.com, mingo@kernel.org,
        Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
        svaidy@linux.ibm.com, linux-kernel@vger.kernel.org,
        bigeasy@linutronix.de
Subject: [RFC PATCH] hrtimer: interleave timers for improved single thread
 performance at low utilization
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: bulk

In the current hrtimer design, _softexpires determines when the timer
function is triggered.  _softexpires is set to a multiple of the
period/interval value.  This saves power by reducing wakeups.  As a result,
different timers with the same period/interval value align, and their
callback functions are called at the same time.
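
The alignment can be illustrated with a small userspace model of the expiry
arithmetic in hrtimer_forward().  This is only a sketch: model_forward() is
a hypothetical name, all values are plain nanosecond integers, and it
assumes both timers are forwarded from the same base expiry (0 here).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the expiry arithmetic in hrtimer_forward().
 * Not the kernel function: no struct hrtimer, no overrun count, and
 * all arguments are plain nanosecond values.
 */
static int64_t model_forward(int64_t expires, int64_t now, int64_t interval)
{
	int64_t delta = now - expires;

	assert(interval > 0);

	if (delta < 0)
		return expires;	/* not expired yet, nothing to forward */

	if (delta >= interval) {
		/* Skip every fully elapsed period in one step. */
		expires += interval * (delta / interval);
		if (expires > now)
			return expires;
	}

	/* Move to the next period boundary after 'now'. */
	return expires + interval;
}
```

With a 100ms interval, a timer forwarded at now = 230ms and another
forwarded at now = 270ms both end up with an expiry of 300ms in this model,
so their callbacks run together.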

The CPU bandwidth controller (CPU cgroup) uses these hrtimers to implement
period and quota.  The period timer refills the quota and allows throttled
cgroups to start running again.  When multiple such cgroups have the same
period value, their period timers are aligned.  Hence the timers of multiple
cgroups fire at the same time and unthrottle each cgroup's runqueues.  Since
all cgroups start together, they compete for CPU and are likely to use all
SMT threads.

A performance gain can be achieved here if the timers are interleaved when
the utilization of each CPU cgroup is low and the total utilization of all
the CPU cgroups is less than 50%.  This is likely true when using
containers.  If the timers aren't rounded off, then the unthrottled cgroup
can run freely without many context switches and can also benefit from SMT
Folding[1].  This effect is further amplified in an SPLPAR environment[2],
as it causes fewer hypervisor preemptions.  There can also be a benefit from
fewer IPI storms.  Docker provides a config option for the period timer
value, whereas Kubernetes only provides a millicore option.  Hence in a
typical deployment the period will be set to 100ms, as the Kubernetes
millicore setting adjusts the quota without altering the period.

[1] SMT folding is a mechanism where the processor core is reconfigured to a
lower SMT mode to improve performance when some sibling threads are idle.  On
an SMT8 core, we get the best throughput when only one or two threads are
running on the core, compared to running all 8 threads.

[2] SPLPAR is a Shared Processor Logical PARtition.  There can be many
SPLPARs running on the same physical machine sharing the CPU resources.  One
SPLPAR can consume all the CPU resources it can if the other SPLPARs are
idle.  Processors within an SPLPAR are called vCPUs.  There can be more
vCPUs than CPUs.  Hence, at any instant, if more vCPUs are requested than
there are CPUs, vCPUs can be preempted.  When the timers align, there is a
spike in requested vCPUs when the timers expire.  This can lead to
preemption when the other SPLPARs are not idle.

I came up with a naive patch, more of a hack.  The alternative is to use a
slightly modified API for cgroups, so that all other timers stay aligned and
wakeups remain reduced.  A new hrtimer API is likely better; I can send out
that patch quickly.  Here I am trying to misalign the timers by setting
softexpires at a multiple of interval/10 instead of interval.  I ran
stress-ng with two cgroups.  The numbers are with and without the patch on a
Power10 machine with SMT=8.  The table below shows the time taken by each
cgroup to complete.  In the last column, both cgroups are run together and
the data shows the average time taken by the cgroups to complete.  Here each
cgroup is assigned 25% runtime.
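
The interval/10 adjustment described above can be modelled in userspace as
follows.  This is only a sketch of the arithmetic in the hunk below:
model_forward_interleaved() is a hypothetical name, the values are plain
nanosecond integers, and the timers are assumed to be forwarded from a base
expiry of 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the patched expiry arithmetic.  Not kernel code:
 * no struct hrtimer, no overrun count, plain nanosecond arguments.
 */
static int64_t model_forward_interleaved(int64_t expires, int64_t now,
					 int64_t interval)
{
	int64_t delta = now - expires;

	assert(interval > 0);

	if (delta < 0)
		return expires;	/* not expired yet, nothing to forward */

	if (delta >= interval) {
		int64_t orun = delta / interval;

		expires += interval * orun;

		/* Only intervals of 100ms or more are interleaved. */
		if (interval >= 100000000LL) {
			/* Remainder of delta within the current period ... */
			int64_t interleave = delta - interval * orun;

			/* ... rounded down to a multiple of interval/10. */
			interleave -= delta % (interval / 10);
			if (interleave > 0)
				expires += interleave;
		}
		if (expires > now)
			return expires;
	}

	return expires + interval;
}
```

With a 100ms interval, a timer forwarded at now = 230ms gets an expiry of
330ms and one forwarded at now = 270ms gets 370ms in this model, instead of
both landing on 300ms.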

workload: stress-ng --cpu=4 --cpu-ops=100000
Data shows the time taken to complete, in seconds, for each run.

Without Patch:
period/quota    cgroup1 runs    cgroup2 runs    cgroup1 & cgroup2
                    alone           alone         run together
100ms/200ms         120s            120s            155s
                    120s            120s            155s
                    120s            120s            155s

With Patch:
period/quota    cgroup1 runs    cgroup2 runs    cgroup1 & cgroup2
                    alone           alone         run together
100ms/200ms         120s            120s            131s
                    120s            120s            155s
                    120s            120s            121s

There is no benefit at higher utilization of 50% or more, and no
degradation either.

Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
---
  kernel/time/hrtimer.c | 11 +++++++++++
  1 file changed, 11 insertions(+)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3ae661ab6260..d160f49f0cce 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1055,6 +1055,17 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)

  		orun = ktime_divns(delta, incr);
  		hrtimer_add_expires_ns(timer, incr * orun);
+		/*
+		 * Avoid timer round-off, so that all cfs bandwidth timers
+		 * don't start at the same time
+		 */
+		if (incr >= 100000000ULL) {
+			s64 interleave = 0;
+			interleave = ktime_sub_ns(delta, incr * orun);
+			interleave = interleave - (ktime_to_ns(delta) % (incr / 10));
+			if (interleave > 0)
+				hrtimer_add_expires_ns(timer, interleave);
+		}
  		if (hrtimer_get_expires_tv64(timer) > now)
  			return orun;
  		/*
--
2.35.3