From: Xunlei Pang <xlpang@linux.alibaba.com>
To: Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Tejun Heo
Cc: linux-kernel@vger.kernel.org
Subject: [PATCH] sched/cputime: Ensure correct utime and stime proportion
Date: Fri, 22 Jun 2018 15:15:42 +0800
Message-Id: <20180622071542.61569-1-xlpang@linux.alibaba.com>
X-Mailer: git-send-email 2.14.1.40.g8e62ba1

We use per-cgroup CPU usage statistics similar to "cgroup rstat", and have
encountered a problem:
user and sys usages are sometimes split incorrectly. Run tasks with a random
run-sleep pattern for a long time, so that the tick-based time and the
scheduler's sum_exec_runtime drift far apart (sum_exec_runtime ends up less
than the tick-based time). After switching to a workload with heavy sys
usage, the current implementation of cputime_adjust() then reports less sys
usage than is actually consumed. This is because the total tick-based utime
and stime are used to split the total sum_exec_runtime, so the accumulated
history dominates the split. The same problem affects the utime and stime
read from "/proc/<pid>/stat".

[Example]
Run some random run-sleep patterns for minutes, then switch to a high-sys
pattern and watch:

1) standard "top" (which is the correct one):
   4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

2) our tool parsing utime and stime from "/proc/<pid>/stat":
   20.5 usr, 78.4 sys

The "20.5 usr" displayed in 2) is incorrect; it recovers only gradually
over time:
   9.7 usr, 89.5 sys
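To make the failure mode concrete, here is a minimal user-space sketch of
the pre-patch cputime_adjust() math (this is not kernel code: the helpers
are simplified, the monotonicity clamps are omitted, and the numbers are
hypothetical, chosen only to mimic the drift described above):

#include <stdio.h>
#include <stdint.h>

/* Same split rule as the kernel's scale_stime(): stime * rtime / total
 * (the kernel version is overflow-safe; plain math suffices here). */
static uint64_t scale_stime(uint64_t stime, uint64_t rtime, uint64_t total)
{
        return stime * rtime / total;
}

/* Pre-patch behaviour: split the *total* rtime by the *total* tick ratio */
static void adjust(uint64_t utime, uint64_t stime, uint64_t rtime,
                   uint64_t *ut, uint64_t *st)
{
        *st = scale_stime(stime, rtime, stime + utime);
        *ut = rtime - *st;
}

int main(void)
{
        uint64_t ut0, st0, ut1, st1, du, ds;

        /* History: ~95% user ticks, rtime drifted 20s below tick time */
        adjust(95, 5, 80, &ut0, &st0);
        /* Then 10 seconds of pure-sys load: stime and rtime grow by 10 */
        adjust(95, 15, 90, &ut1, &st1);

        /* What a top-like tool computes from the /proc deltas */
        du = ut1 - ut0;
        ds = st1 - st0;
        printf("usr: %llu%%  sys: %llu%%\n",
               (unsigned long long)(du * 100 / (du + ds)),
               (unsigned long long)(ds * 100 / (du + ds)));
        return 0;
}

With this history the sketch prints "usr: 20%  sys: 80%" for an interval
that was 100% sys: the same shape of error as the 20.5 usr / 78.4 sys
reading above.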
This patch fixes the issue by making cputime_adjust() work on the time
deltas elapsed since its last invocation, and accumulating the scaled
results into prev_cputime. A new field of type task_cputime is added to
struct prev_cputime to record the previous task_cputime, so that the
elapsed-time deltas can be computed.

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 include/linux/sched.h         | 33 +++++++++++------------
 include/linux/sched/cputime.h | 12 ++++++++-
 kernel/sched/cputime.c        | 61 ++++++++++++++++---------------------------
 3 files changed, 51 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..5108ac8414e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,6 +223,22 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime:              time spent in user mode, in nanoseconds
+ * @stime:              time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:   total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups. Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+        u64                     utime;
+        u64                     stime;
+        unsigned long long      sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
@@ -236,26 +252,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
         u64                     utime;
         u64                     stime;
+        struct task_cputime     cputime;
         raw_spinlock_t          lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime:              time spent in user mode, in nanoseconds
- * @stime:              time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:   total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups. Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-        u64                     utime;
-        u64                     stime;
-        unsigned long long      sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp                utime
 #define prof_exp                stime

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
         atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
         prev->utime = prev->stime = 0;
+        prev->cputime.utime = 0;
+        prev->cputime.stime = 0;
+        prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+        prev_cputime_clear(prev);
         raw_spin_lock_init(&prev->lock);
 #endif
 }

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -590,69 +590,54 @@ static u64 scale_stime(u64 stime, u64 rtime, u64 total)
 void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
                     u64 *ut, u64 *st)
 {
-        u64 rtime, stime, utime;
+        u64 rtime_delta, stime_delta, utime_delta;
         unsigned long flags;
 
         /* Serialize concurrent callers such that we can honour our guarantees */
         raw_spin_lock_irqsave(&prev->lock, flags);
-        rtime = curr->sum_exec_runtime;
 
         /*
          * This is possible under two circumstances:
-         *   - rtime isn't monotonic after all (a bug);
+         *   - task_cputime isn't monotonic after all (a bug);
          *   - we got reordered by the lock.
          *
          * In both cases this acts as a filter such that the rest of the code
          * can assume it is monotonic regardless of anything else.
          */
-        if (prev->stime + prev->utime >= rtime)
+        if (prev->cputime.utime > curr->utime ||
+            prev->cputime.stime > curr->stime ||
+            prev->cputime.sum_exec_runtime >= curr->sum_exec_runtime)
                 goto out;
 
-        stime = curr->stime;
-        utime = curr->utime;
+        stime_delta = curr->stime - prev->cputime.stime;
+        utime_delta = curr->utime - prev->cputime.utime;
+        rtime_delta = curr->sum_exec_runtime - prev->cputime.sum_exec_runtime;
 
         /*
-         * If either stime or utime are 0, assume all runtime is userspace.
-         * Once a task gets some ticks, the monotonicy code at 'update:'
-         * will ensure things converge to the observed ratio.
+         * If either the stime or utime increase is 0, assume all runtime
+         * is userspace. Once a task gets some ticks, the monotonicity code
+         * at 'update:' will ensure things converge to the observed ratio.
          */
-        if (stime == 0) {
-                utime = rtime;
+        if (stime_delta == 0) {
+                utime_delta = rtime_delta;
                 goto update;
         }
 
-        if (utime == 0) {
-                stime = rtime;
+        if (utime_delta == 0) {
+                stime_delta = rtime_delta;
                 goto update;
         }
 
-        stime = scale_stime(stime, rtime, stime + utime);
+        stime_delta = scale_stime(stime_delta, rtime_delta,
+                                  stime_delta + utime_delta);
+        if (stime_delta > rtime_delta)
+                stime_delta = rtime_delta;
+        utime_delta = rtime_delta - stime_delta;
 
 update:
-        /*
-         * Make sure stime doesn't go backwards; this preserves monotonicity
-         * for utime because rtime is monotonic.
-         *
-         *  utime_i+1 = rtime_i+1 - stime_i
-         *            = rtime_i+1 - (rtime_i - utime_i)
-         *            = (rtime_i+1 - rtime_i) + utime_i
-         *            >= utime_i
-         */
-        if (stime < prev->stime)
-                stime = prev->stime;
-        utime = rtime - stime;
-
-        /*
-         * Make sure utime doesn't go backwards; this still preserves
-         * monotonicity for stime, analogous argument to above.
-         */
-        if (utime < prev->utime) {
-                utime = prev->utime;
-                stime = rtime - utime;
-        }
-
-        prev->stime = stime;
-        prev->utime = utime;
+        prev->cputime = *curr;
+        prev->utime += utime_delta;
+        prev->stime += stime_delta;
 out:
         *ut = prev->utime;
         *st = prev->stime;
-- 
2.14.1.40.g8e62ba1
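As a sanity check of the delta-based approach, the same hypothetical
scenario can be run through a user-space model of the logic above (again
only a sketch under the same simplifications as the earlier one:
prev_cputime is modelled as plain static variables and the non-monotonic
filter is omitted):

#include <stdio.h>
#include <stdint.h>

static uint64_t scale_stime(uint64_t stime, uint64_t rtime, uint64_t total)
{
        return stime * rtime / total;   /* kernel version is overflow-safe */
}

/* prev_cputime modelled as plain static state */
static uint64_t prev_ut, prev_st;               /* adjusted outputs       */
static uint64_t prev_u, prev_s, prev_r;         /* last-seen task_cputime */

/* Patched behaviour: split only the time elapsed since the last call */
static void adjust(uint64_t utime, uint64_t stime, uint64_t rtime,
                   uint64_t *ut, uint64_t *st)
{
        uint64_t ud = utime - prev_u;
        uint64_t sd = stime - prev_s;
        uint64_t rd = rtime - prev_r;

        if (sd == 0) {                  /* no sys ticks: all user */
                ud = rd;
        } else if (ud == 0) {           /* no user ticks: all sys */
                sd = rd;
        } else {
                sd = scale_stime(sd, rd, sd + ud);
                if (sd > rd)
                        sd = rd;
                ud = rd - sd;
        }

        prev_u = utime;                 /* remember current task_cputime */
        prev_s = stime;
        prev_r = rtime;
        prev_ut += ud;                  /* accumulate the scaled deltas */
        prev_st += sd;
        *ut = prev_ut;
        *st = prev_st;
}

int main(void)
{
        uint64_t ut0, st0, ut1, st1;

        adjust(95, 5, 80, &ut0, &st0);  /* drifted history, ~95% user */
        adjust(95, 15, 90, &ut1, &st1); /* then 10s of pure-sys load  */

        printf("usr: %llu%%  sys: %llu%%\n",
               (unsigned long long)((ut1 - ut0) * 100 / (ut1 - ut0 + st1 - st0)),
               (unsigned long long)((st1 - st0) * 100 / (ut1 - ut0 + st1 - st0)));
        return 0;
}

This prints "usr: 0%  sys: 100%" for the pure-sys interval, i.e. the
reported split now follows the most recent workload instead of the
accumulated history.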