From: Vincent Guittot
Date: Tue, 5 Jun 2018 15:18:38 +0200
Subject: Re: [PATCH v5 00/10] track CPU utilization
To: Quentin Perret
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, "Rafael J. Wysocki",
	Juri Lelli, Dietmar Eggemann, Morten Rasmussen, viresh kumar,
	Valentin Schneider
In-Reply-To: <20180605131224.GC12193@e108498-lin.cambridge.arm.com>
References: <1527253951-22709-1-git-send-email-vincent.guittot@linaro.org>
	<20180605105721.GA12193@e108498-lin.cambridge.arm.com>
	<20180605131224.GC12193@e108498-lin.cambridge.arm.com>

On 5 June 2018 at 15:12, Quentin Perret wrote:
> On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:
>> On 5 June 2018 at 12:57, Quentin Perret wrote:
>> > Hi Vincent,
>> >
>> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:
>> >> Hi Quentin,
>> >>
>> >> On 25 May 2018 at 15:12, Vincent Guittot wrote:
>> >> > This patchset initially tracked only the utilization of RT rq.
>> >> > During the OSPM summit, we discussed extending it in order to get
>> >> > an estimate of the utilization of the CPU.
>> >> >
>> >> > - Patches 1-3 correspond to the content of patchset v4 and add
>> >> >   utilization tracking for rt_rq.
>> >> >
>> >> > When both cfs and rt tasks compete to run on a CPU, we can see some
>> >> > frequency drops with the schedutil governor. In such a case, the
>> >> > cfs_rq's utilization no longer reflects the utilization of cfs tasks
>> >> > but only the remaining part that is not used by rt tasks. We should
>> >> > monitor the stolen utilization and take it into account when
>> >> > selecting the OPP. This patchset doesn't change the OPP selection
>> >> > policy for RT tasks but only for CFS tasks.
>> >> >
>> >> > An rt-app use case which creates an always-running cfs thread and an
>> >> > rt thread that wakes up periodically, with both threads pinned on the
>> >> > same CPU, shows a lot of frequency switches of the CPU whereas the
>> >> > CPU never goes idle during the test. I can share the json file that
>> >> > I used for the test if someone is interested.
>> >> >
>> >> > For a 15-second-long test on a hikey 6220 (octo-core Cortex-A53
>> >> > platform), the cpufreq statistics output (stats are reset just before
>> >> > the test):
>> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> >> > without patchset : 1230
>> >> > with patchset : 14
>> >>
>> >> I have attached the rt-app json file that I use for this test
>> >
>> > Thank you very much ! I did a quick test with a much simpler fix to this
>> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
>> > I get the following results on Hikey960:
>> >
>> > Without patch:
>> >    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> >    12
>> >    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
>> >    640
>> > With patch:
>> >    cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> >    8
>> >    cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
>> >    12
>> >
>> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
>> > this is an actual issue for realistic use-cases ?
>>
>> Yes, I think that it's worth syncing and consolidating things on the
>> same metric. The result will be saner and more robust as we will have
>> the same behavior.
>
> TBH I'm not disagreeing with that, the PELT-everywhere approach feels
> cleaner in a way, but do you have a use-case in mind where this will
> definitely help ?
>
> I mean, yes the rt_avg is a slow response to the RT pressure, but is
> this always a problem ? Ramping down slower might actually help in some
> cases, no ?

I would say no, because when one signal decreases the other one will not
increase at the same pace, and we will end up with some wrong behavior or
decisions.

>
>>
>> >
>> > What about the diff below (just a quick hack to show the idea) applied
>> > on tip/sched/core ?
>> >
>> > ---8<---
>> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> > index a8ba6d1f262a..23a4fb1c2c25 100644
>> > --- a/kernel/sched/cpufreq_schedutil.c
>> > +++ b/kernel/sched/cpufreq_schedutil.c
>> > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>> >  	sg_cpu->util_dl = cpu_util_dl(rq);
>> >  }
>> >
>> > +unsigned long scale_rt_capacity(int cpu);
>> >  static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>> >  {
>> >  	struct rq *rq = cpu_rq(sg_cpu->cpu);
>> > +	int cpu = sg_cpu->cpu;
>> > +	unsigned long util, dl_bw;
>> >
>> >  	if (rq->rt.rt_nr_running)
>> >  		return sg_cpu->max;
>> > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>> >  	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
>> >  	 * ready for such an interface. So, we only do the latter for now.
>> >  	 */
>> > -	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
>> > +	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
>> > +	util >>= SCHED_CAPACITY_SHIFT;
>> > +	util = arch_scale_cpu_capacity(NULL, cpu) - util;
>> > +	util += sg_cpu->util_cfs;
>> > +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> > +
>> > +	/* Make sure to always provide the reserved freq to DL. */
>> > +	return max(util, dl_bw);
>> >  }
>> >
>> >  static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index f01f0f395f9a..0e87cbe47c8b 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
>> >  	return load_idx;
>> >  }
>> >
>> > -static unsigned long scale_rt_capacity(int cpu)
>> > +unsigned long scale_rt_capacity(int cpu)
>> >  {
>> >  	struct rq *rq = cpu_rq(cpu);
>> >  	u64 total, used, age_stamp, avg;
>> > --->8---
>> >
>> >
>> >>
>> >> >
>> >> > If we replace the cfs thread of rt-app by a sysbench cpu test, we can
>> >> > see performance improvements:
>> >> >
>> >> > - Without patchset :
>> >> > Test execution summary:
>> >> >     total time:                          15.0009s
>> >> >     total number of events:              4903
>> >> >     total time taken by event execution: 14.9972
>> >> >     per-request statistics:
>> >> >          min:                     1.23ms
>> >> >          avg:                     3.06ms
>> >> >          max:                    13.16ms
>> >> >          approx. 95 percentile:  12.73ms
>> >> >
>> >> > Threads fairness:
>> >> >     events (avg/stddev):         4903.0000/0.00
>> >> >     execution time (avg/stddev): 14.9972/0.00
>> >> >
>> >> > - With patchset:
>> >> > Test execution summary:
>> >> >     total time:                          15.0014s
>> >> >     total number of events:              7694
>> >> >     total time taken by event execution: 14.9979
>> >> >     per-request statistics:
>> >> >          min:                     1.23ms
>> >> >          avg:                     1.95ms
>> >> >          max:                    10.49ms
>> >> >          approx. 95 percentile:  10.39ms
>> >> >
>> >> > Threads fairness:
>> >> >     events (avg/stddev):         7694.0000/0.00
>> >> >     execution time (avg/stddev): 14.9979/0.00
>> >> >
>> >> > The performance improvement is 56% for this use case.
>> >> >
>> >> > - Patches 4-5 add utilization tracking for dl_rq in order to solve a
>> >> >   similar problem as with rt_rq
>> >> >
>> >> > - Patch 6 uses dl and rt utilization in scale_rt_capacity() and
>> >> >   removes dl and rt from sched_rt_avg_update
>> >> >
>> >> > - Patches 7-8 add utilization tracking for interrupts and use it to
>> >> >   select the OPP
>> >> >   A test with iperf on hikey 6220 gives:
>> >> >        w/o patchset      w/ patchset
>> >> >     Tx 276 Mbits/sec     304 Mbits/sec  +10%
>> >> >     Rx 299 Mbits/sec     328 Mbits/sec  +09%
>> >> >
>> >> >     8 iterations of iperf -c server_address -r -t 5
>> >> >     stdev is lower than 1%
>> >> >     Only the WFI idle state is enabled (shallowest arm idle state)
>> >> >
>> >> > - Patch 9 removes the unused sched_avg_update code
>> >> >
>> >> > - Patch 10 removes the unused sched_time_avg_ms
>> >> >
>> >> > Changes since v3:
>> >> > - add support of periodic update of blocked utilization
>> >> > - rebase on latest tip/sched/core
>> >> >
>> >> > Changes since v2:
>> >> > - move pelt code into a dedicated pelt.c file
>> >> > - rebase on load tracking changes
>> >> >
>> >> > Changes since v1:
>> >> > - Only a rebase. I have addressed the comments on the previous
>> >> >   version in patch 1/2
>> >> >
>> >> > Vincent Guittot (10):
>> >> >   sched/pelt: Move pelt related code in a dedicated file
>> >> >   sched/rt: add rt_rq utilization tracking
>> >> >   cpufreq/schedutil: add rt utilization tracking
>> >> >   sched/dl: add dl_rq utilization tracking
>> >> >   cpufreq/schedutil: get max utilization
>> >> >   sched: remove rt and dl from sched_avg
>> >> >   sched/irq: add irq utilization tracking
>> >> >   cpufreq/schedutil: take into account interrupt
>> >> >   sched: remove rt_avg code
>> >> >   proc/sched: remove unused sched_time_avg_ms
>> >> >
>> >> >  include/linux/sched/sysctl.h     |   1 -
>> >> >  kernel/sched/Makefile            |   2 +-
>> >> >  kernel/sched/core.c              |  38 +---
>> >> >  kernel/sched/cpufreq_schedutil.c |  24 ++-
>> >> >  kernel/sched/deadline.c          |   7 +-
>> >> >  kernel/sched/fair.c              | 381 +++----------------------------
>> >> >  kernel/sched/pelt.c              | 395 ++++++++++++++++++++++++++++++
>> >> >  kernel/sched/pelt.h              |  63 +++++
>> >> >  kernel/sched/rt.c                |  10 +-
>> >> >  kernel/sched/sched.h             |  57 ++++--
>> >> >  kernel/sysctl.c                  |   8 -
>> >> >  11 files changed, 563 insertions(+), 423 deletions(-)
>> >> >  create mode 100644 kernel/sched/pelt.c
>> >> >  create mode 100644 kernel/sched/pelt.h
>> >> >
>> >> > --
>> >> > 2.7.4
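
For readers following the thread, here is a minimal, self-contained sketch
of the aggregation idea being discussed: summing the per-class utilizations
(cfs, rt, dl, irq) instead of relying on the CFS signal alone when picking a
frequency. The function and field names below are made up for illustration,
and the numbers mimic the rt-app scenario above; this is not the actual
patchset or kernel code.

/*
 * Illustrative sketch only: shows why the CFS utilization alone
 * under-estimates the needed frequency when RT steals CPU time, and how
 * summing the per-class signals (clamped to the CPU capacity) fixes it.
 * All names and values are hypothetical.
 */
#include <stdio.h>

#define CPU_CAPACITY 1024UL	/* SCHED_CAPACITY_SCALE-style fixed point */

static unsigned long aggregate_util(unsigned long util_cfs,
				    unsigned long util_rt,
				    unsigned long util_dl,
				    unsigned long util_irq)
{
	unsigned long util = util_cfs + util_rt + util_dl + util_irq;

	/* Never request more than the CPU can deliver. */
	return util < CPU_CAPACITY ? util : CPU_CAPACITY;
}

int main(void)
{
	/*
	 * rt-app-like scenario: an always-running CFS task sees only ~70%
	 * of the CPU because a periodic RT task consumes the other ~30%.
	 */
	unsigned long util_cfs = 716;	/* ~70% of capacity */
	unsigned long util_rt  = 307;	/* ~30% of capacity */

	printf("cfs only : %lu/%lu\n", util_cfs, CPU_CAPACITY);
	printf("aggregate: %lu/%lu\n",
	       aggregate_util(util_cfs, util_rt, 0, 0), CPU_CAPACITY);
	return 0;
}

With the CFS signal alone, schedutil would request only ~70% of the maximum
frequency even though the CPU never goes idle; the aggregated value requests
essentially full capacity, which is what avoids the frequency drops and the
repeated frequency switches reported earlier in the thread.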