Date:   Wed, 5 Apr 2023 16:25:48 +0200
From:   Peter Zijlstra <peterz@infradead.org>
To:     Ma Xing <maxing.lan@bytedance.com>
Cc:     mingo@redhat.com, juri.lelli@redhat.com,
        vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
        rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
        bristot@redhat.com, vschneid@redhat.com,
        linux-kernel@vger.kernel.org, yuanzhu@bytedance.com
Subject: Re: [PATCH] sched/cputime: Make cputime_adjust() more accurate
Message-ID: <20230405142548.GE351571@hirez.programming.kicks-ass.net>
References: <20230328024827.12187-1-maxing.lan@bytedance.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20230328024827.12187-1-maxing.lan@bytedance.com>
Precedence: bulk

On Tue, Mar 28, 2023 at 10:48:27AM +0800, Ma Xing wrote:
> In the current algorithm of cputime_adjust(), the accumulated stime and
> utime are used to divide the accumulated rtime. When the value is very
> large, it is easy for the stime or utime not to be updated. It can cause
> sys or user utilization to be zero for long time.
> 
> A better and intuitive way is to save the last stime and utime, and
> divide the rtime increment proportionally according to the tick
> increment.

<snip>

> In addition, this patch has been running stably for 2 months and no problems have been found.
> 
> Signed-off-by: Ma Xing <maxing.lan@bytedance.com>
> ---
>  include/linux/sched.h         |  2 ++
>  include/linux/sched/cputime.h |  1 +
>  kernel/sched/cputime.c        | 38 +++++++++++++++++++++++++----------
>  3 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6d654eb4cabd..e1bac4ee48ba 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -326,6 +326,8 @@ struct prev_cputime {
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>  	u64				utime;
>  	u64				stime;
> +	u64				utick;
> +	u64				stick;

Not a fan of the naming; cputime_adjust() isn't tick bound, you can call
it as often as you like through /proc and various syscalls.

>  	raw_spinlock_t			lock;
>  #endif
>  };
> diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
> index 5f8fd5b24a2e..855503bbd067 100644
> --- a/include/linux/sched/cputime.h
> +++ b/include/linux/sched/cputime.h
> @@ -173,6 +173,7 @@ static inline void prev_cputime_init(struct prev_cputime *prev)
>  {
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>  	prev->utime = prev->stime = 0;
> +	prev->utick = prev->stick = 0;
>  	raw_spin_lock_init(&prev->lock);
>  #endif
>  }
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index af7952f12e6c..ee8084957578 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -559,6 +559,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
>  		    u64 *ut, u64 *st)
>  {
>  	u64 rtime, stime, utime;
> +	s64 delta_rtime, delta_stime, delta_utime;
>  	unsigned long flags;
>  
>  	/* Serialize concurrent callers such that we can honour our guarantees */
> @@ -579,22 +580,36 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
>  	stime = curr->stime;
>  	utime = curr->utime;
>  
> +

Superfluous blank line.

> +	delta_rtime = rtime - prev->stime - prev->utime;
> +	delta_stime = stime - prev->stick;
> +	delta_utime = utime - prev->utick;
> +
> +	prev->stick = stime;
> +	prev->utick = utime;
> +
>  	/*
>  	 * If either stime or utime are 0, assume all runtime is userspace.
>  	 * Once a task gets some ticks, the monotonicity code at 'update:'
>  	 * will ensure things converge to the observed ratio.
>  	 */
>  	if (stime == 0) {
> -		utime = rtime;
> +		delta_utime = delta_rtime;
>  		goto update;
>  	}
>  
>  	if (utime == 0) {
> -		stime = rtime;
> +		delta_stime = delta_rtime;
>  		goto update;
>  	}
>  
> -	stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
> +	if (delta_stime <= 0)
> +		goto update;
> +
> +	if (delta_utime <= 0)
> +		goto update;

Those things really should not be negative; see the initial goto out.

> +
> +	delta_stime = mul_u64_u64_div_u64(delta_stime, delta_rtime, delta_stime + delta_utime);

So the primary difference is that while the previous code maintained the
cumulative stime:utime ratio, this only scales each new rtime increment by
the ratio of the latest deltas. I'm not sure that actually matters, but it
isn't called out and no argument is made either way.

In fact the Changelog is very sparse on actual details.
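
To make the ratio point concrete, here is a rough userspace sketch of the
two schemes, not the kernel code. struct sketch_prev, mul_div(),
adjust_cumulative() and adjust_incremental() are made-up names for
illustration; mul_div() stands in for mul_u64_u64_div_u64() and assumes the
GCC/Clang __int128 extension; locking is left out.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for struct prev_cputime (lock omitted). */
struct sketch_prev {
	uint64_t utime, stime;	/* last reported split */
	uint64_t utick, stick;	/* last raw samples, as added by the patch */
};

/* Stand-in for mul_u64_u64_div_u64(); relies on the __int128 extension. */
static uint64_t mul_div(uint64_t a, uint64_t b, uint64_t c)
{
	return (uint64_t)(((unsigned __int128)a * b) / c);
}

/* Existing scheme: split all of rtime by the cumulative stime:utime ratio. */
static void adjust_cumulative(struct sketch_prev *p, uint64_t rtime,
			      uint64_t stime, uint64_t utime)
{
	if (p->stime + p->utime >= rtime)
		return;

	if (stime == 0)
		utime = rtime;
	else if (utime == 0)
		stime = rtime;
	else
		stime = mul_div(stime, rtime, stime + utime);

	/* Monotonicity clamps, as in the existing 'update:' section. */
	if (stime < p->stime)
		stime = p->stime;
	utime = rtime - stime;
	if (utime < p->utime) {
		utime = p->utime;
		stime = rtime - utime;
	}
	p->stime = stime;
	p->utime = utime;
}

/* Patched scheme: split only the new rtime by the ratio of the deltas. */
static void adjust_incremental(struct sketch_prev *p, uint64_t rtime,
			       uint64_t stime, uint64_t utime)
{
	int64_t d_rtime, d_stime, d_utime;

	if (p->stime + p->utime >= rtime)
		return;

	d_rtime = (int64_t)(rtime - p->stime - p->utime);
	d_stime = (int64_t)(stime - p->stick);
	d_utime = (int64_t)(utime - p->utick);
	p->stick = stime;
	p->utick = utime;

	if (stime == 0)
		d_utime = d_rtime;
	else if (utime == 0)
		d_stime = d_rtime;
	else if (d_stime > 0 && d_utime > 0)
		d_stime = (int64_t)mul_div(d_stime, d_rtime, d_stime + d_utime);

	if (d_stime <= 0)
		d_stime = 0;
	d_utime = d_rtime - d_stime;
	if (d_utime <= 0) {
		d_utime = 0;
		d_stime = d_rtime;
	}
	p->stime += d_stime;
	p->utime += d_utime;
}
```

Feeding both the samples (rtime, stime, utime) = (1000, 100, 900) and then
(3000, 1000, 1000), the cumulative variant ends at stime:utime = 1500:1500,
matching the 1:1 tick ratio, while the incremental variant ends at
1900:1100. Both still sum to rtime; only the split differs.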

>  
>  update:
>  	/*
> @@ -606,21 +621,22 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
>  	 *            = (rtime_i+1 - rtime_i) + utime_i
>  	 *            >= utime_i
>  	 */
> -	if (stime < prev->stime)
> -		stime = prev->stime;
> -	utime = rtime - stime;
> +	if (delta_stime <= 0)
> +		delta_stime = 0;

Is this really still valid? The previous case was because we retained
the stime:utime ratio and this enforced monotonicity, but this should be
covered in the above condition if at all.
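
For the proportional path at least, a quick userspace check suggests the
clamps cannot change anything: with all three deltas strictly positive,
the scaled stime delta lands in [0, delta_rtime). This is only a sketch;
clamp_changes_split() and count_clamp_hits() are made-up helpers, the
__int128 cast assumes GCC/Clang, and it says nothing about the early
'goto update' paths, which skip the scaling.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given strictly positive deltas (the precondition of
 * the proportional path), return 1 if either clamp after 'update:' would
 * alter the computed split, 0 otherwise.
 */
static int clamp_changes_split(int64_t d_stime, int64_t d_utime,
			       int64_t d_rtime)
{
	/* Same scaling as the patch: d_stime * d_rtime / (d_stime + d_utime) */
	int64_t scaled = (int64_t)(((unsigned __int128)d_stime * d_rtime) /
				   (d_stime + d_utime));
	int64_t rest = d_rtime - scaled;

	/* The first clamp alters the value only if scaled is negative. */
	if (scaled < 0)
		return 1;
	/* The second clamp alters the split only if rest is negative. */
	if (rest < 0)
		return 1;
	return 0;
}

/* Brute-force every delta triple in [1, limit] and count clamp hits. */
static unsigned int count_clamp_hits(int64_t limit)
{
	unsigned int hits = 0;
	int64_t s, u, r;

	for (s = 1; s <= limit; s++)
		for (u = 1; u <= limit; u++)
			for (r = 1; r <= limit; r++)
				hits += clamp_changes_split(s, u, r);
	return hits;
}
```

count_clamp_hits(40) walks 64000 triples and finds no case where a clamp
changes the result, which is consistent with the clamps being dead code
whenever the scaling actually runs.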

> +	delta_utime = delta_rtime - delta_stime;
> +
>  
>  	/*
>  	 * Make sure utime doesn't go backwards; this still preserves
>  	 * monotonicity for stime, analogous argument to above.
>  	 */
> -	if (utime < prev->utime) {
> -		utime = prev->utime;
> -		stime = rtime - utime;
> +	if (delta_utime <= 0) {
> +		delta_utime = 0;
> +		delta_stime = delta_rtime;
>  	}

idem.

>  
> -	prev->stime = stime;
> -	prev->utime = utime;
> +	prev->stime += delta_stime;
> +	prev->utime += delta_utime;
>  out:
>  	*ut = prev->utime;
>  	*st = prev->stime;
> -- 
> 2.20.1
>