Received: by 2002:ac0:a581:0:0:0:0:0 with SMTP id m1-v6csp6282844imm; Wed, 27 Jun 2018 05:23:54 -0700 (PDT) X-Google-Smtp-Source: ADUXVKISIr61a5vGdSNLrh8jGvBDuAh/zPGm+uaTCSNsMr1zJrUAMB/VJjg9ZfO/Sol4QcP9FhXr X-Received: by 2002:a63:8c10:: with SMTP id m16-v6mr5068645pgd.120.1530102234004; Wed, 27 Jun 2018 05:23:54 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1530102233; cv=none; d=google.com; s=arc-20160816; b=njTIjtDXqAAl5pxf980hfi3msP2L4QfaAvSkhylpRCUBhFxPOgo8/QOA0lwiE0IUrr gVT/GomrD9Kxwa7dcibW5v8pqCgnmcnpx2eCGsmuGefIQpG1XpipRrZuAZcwsZtbyVDH hlp1SeXDP9j9pDMS7BLDFgwYSb4RuFPgxHiU2GL8ZK1RI9fkKAbKtjQ4WKh/RlwRsqXq stHjLFM1AP7s7RYuUU6ca/tnsK/4aalveCg8SQGtS25OV1ddxIrYU7UrINeTXK5hFsTJ hQDF66ro2ywPKdjIjLx1o/46faqCszBdSNF07PIRsYmcp5ABWO8Fs5YH/1f/x7TTR9yU yHQA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:in-reply-to :mime-version:user-agent:date:message-id:from:references:cc:to :subject:reply-to:arc-authentication-results; bh=L9JxJu7ADdgb+E7k+l72zEe/Y9rN3rPUdmFAT3RlUN0=; b=Vag7fBN8B063XWK2W4yhUfjq9bVr7pBHF3+Thh/h+jGxGj3mQR2RkKLjIft1PA61lj YxNgt2xuVfRHFjqwUKPpwfyyMdCN6/UOmKmy1Mg6qNEEWyZkx1XvQ/iADyIDSDqff/QY rScPeDJLbW6gfMplFhxsTecb6swDoEZkhisAxQDzr//aRQ3f8Ov2iZz9Tj93d5O+jhH4 DuHXyKJJLsASMByl9Y/kuyk/Amr9yoWkKJaYCgLisl7NkEJh7Doaj5SuOlhVCfqiDLOy XNf9oYVuAZ25nm1gM28iy0b+IDloXUaQvLk/s2BRO7LRtA/EjppwG0GEWgrQWcUsjVF4 QDRw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id f83-v6si3898409pfa.164.2018.06.27.05.23.39; Wed, 27 Jun 2018 05:23:53 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964998AbeF0MXB (ORCPT + 99 others); Wed, 27 Jun 2018 08:23:01 -0400 Received: from out30-131.freemail.mail.aliyun.com ([115.124.30.131]:45970 "EHLO out30-131.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933679AbeF0MXA (ORCPT ); Wed, 27 Jun 2018 08:23:00 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01355;MF=xlpang@linux.alibaba.com;NM=1;PH=DS;RN=5;SR=0;TI=SMTPD_---0T3QpJQk_1530102163; Received: from xunleideMacBook-Pro.local(mailfrom:xlpang@linux.alibaba.com fp:SMTPD_---0T3QpJQk_1530102163) by smtp.aliyun-inc.com(127.0.0.1); Wed, 27 Jun 2018 20:22:43 +0800 Reply-To: xlpang@linux.alibaba.com Subject: Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion To: Peter Zijlstra Cc: Ingo Molnar , Frederic Weisbecker , Tejun Heo , linux-kernel@vger.kernel.org References: <20180622071542.61569-1-xlpang@linux.alibaba.com> <20180626154908.GE2458@hirez.programming.kicks-ass.net> From: Xunlei Pang Message-ID: Date: Wed, 27 Jun 2018 20:22:42 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.8.0 MIME-Version: 1.0 In-Reply-To: <20180626154908.GE2458@hirez.programming.kicks-ass.net> Content-Type: text/plain; charset=gbk Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 6/26/18 11:49 PM, Peter Zijlstra wrote: > On Tue, Jun 26, 2018 at 08:19:49PM +0800, Xunlei Pang wrote: >> On 6/22/18 3:15 PM, Xunlei Pang wrote: >>> We use per-cgroup cpu usage statistics similar to "cgroup rstat", >>> and encountered a problem that user and sys usages are wrongly >>> split sometimes. >>> >>> Run tasks with some random run-sleep pattern for a long time, and >>> when tick-based time and scheduler sum_exec_runtime hugely drifts >>> apart(scheduler sum_exec_runtime is less than tick-based time), >>> the current implementation of cputime_adjust() will produce less >>> sys usage than the actual use after changing to run a different >>> workload pattern with high sys. This is because total tick-based >>> utime and stime are used to split the total sum_exec_runtime. >>> >>> Same problem exists on utime and stime from "/proc//stat". >>> >>> [Example] >>> Run some random run-sleep patterns for minutes, then change to run >>> high sys pattern, and watch. >>> 1) standard "top"(which is the correct one): >>> 4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st >>> 2) our tool parsing utime and stime from "/proc//stat": >>> 20.5 usr, 78.4 sys >>> We can see "20.5 usr" displayed in 2) was incorrect, it recovers >>> gradually with time: 9.7 usr, 89.5 sys >>> >> High sys probably means there's something abnormal on the kernel >> path, it may hide issues, so we should make it fairly reliable. >> It can easily hit this problem with our per-cgroup statistics. >> >> Hi Peter, any comment on this patch? > > Well, no, because the Changelog is incomprehensible and the patch > doesn't really have useful comments, so I'll have to reverse engineer > the entire thing, and I've just not had time for that. > Let me try the best to describe it again. There are two types of run time for a process: 1) task_struct::utime, task_struct::stime in ticks. 2) scheduler task_struct::se.sum_exec_runtime(rtime) in ns. In case of no vtime accounting, the utime/stime fileds of /proc/pid/stat are calculated by cputime_adjust(), which splits the precise rtime in the proportion of tick-based utime and stime. However cputime_adjust() always does the split using the total utime/stime of the process, this may cause wrong splitting in some cases, e.g. A typical statistic collector accesses "/proc/pid/stat". 1) moment t0 After accessed /proc/pid/stat in t0: tick-based whole utime is utime_0, tick-based whole stime is stime_0, scheduler time is rtime_0. The ajusted utime caculated by cputime_adjust() is autime_0, ajusted stime is astime_0, so astime_0=rtime_0*stime_0/(utime_0+stime_0). For a long time, the process runs mainly in userspace with run-sleep patterns, and because two different clocks, it is possible to have the following condition: rtime_0 < utime_0 (as with little stime_0) 2) moment t1(after dt, i.e. t0+dt) Then the process suddenly runs 100% sys workload afterwards lasting "dt", when accessing /proc/pid/stat at t1="t0+dt", both rtime_0 and stime_0 increase "dt", thus cputime_ajust() does the calculation for new adjusted astime_1 as follows: (rtime_0+dt)*(stime_0+dt)/(utime_0+stime_0+dt) = (rtime_0*stime_0+rtime_0*dt+stime_0*dt+dt*dt)/(utime_0+stime_0+dt) = (rtime_0*stime_0+rtime_0*dt-utime_0*dt)/(utime_0+stime_0+dt) + dt < rtime_0*stime_0/(utime_0+stime_0+dt) + dt (for rtime_0 < utime_0) < rtime_0*stime_0/(utime_0+stime_0) + dt < astime_0+dt The actual astime_1 should be "astime_0+dt"(as it runs 100% sys during dt), but the caculated figure by cputime_adjust() becomes much smaller, as a result the statistics collector shows less cpu sys usage than the actual one. That's why we occasionally observed the incorrect cpu usage described in the changelog: [Example] Run some random run-sleep patterns for minutes, then change to run high sys pattern, and watch. 1) standard "top"(which is the correct one): 4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st 2) Parsing utime and stime from "/proc//stat": 20.5 usr, 78.4 sys We can see "20.5 usr" displayed in 2) was incorrect, it recovers gradually with time: 9.7 usr, 89.5 sys Thanks, Xunlei