Received: by 2002:ac0:a5b6:0:0:0:0:0 with SMTP id m51-v6csp847072imm; Wed, 6 Jun 2018 06:54:16 -0700 (PDT) X-Google-Smtp-Source: ADUXVKKbMXI3zMVTt9ISGOzJsiTExK/CF6L5qWly0URXQhdXiOEeTfYhnnvnbwE//QtCo6iHCs3y X-Received: by 2002:a17:902:2c83:: with SMTP id n3-v6mr3280710plb.211.1528293256226; Wed, 06 Jun 2018 06:54:16 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1528293256; cv=none; d=google.com; s=arc-20160816; b=pG1TXxMBc4lBGfxn4WSdMSLdKEcpoVPw2CcvZwTDYbRo3tB97EOn7XD2Sf3VMbE5Rn oMNwtdXjdeJjovWNIdiFnlkaOkWiJY+EPvNUpl1tzgpg+H/9tDOyLueU5m0LDnHVg3pT BOu57cqBfEuAs1nj9z255qXjzIl7zp0KrteEZNr251mlINakgec2EchiIP64YruFZ0cs 4vbFn2giZXJkbA6noe2XT7V1tBoNwTHCr+NPcb0sxQCguKrwvkiP7kT864eUI6+M6Kax gDTWvlYwZ58C/SyFV/AdCk3BDrXxNgxId6Gfk609XXksTBU6FunniLH14yu21RSrcwJl sKMw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :references:in-reply-to:mime-version:dkim-signature :arc-authentication-results; bh=UCzyNAOOTDyTPD3qxvgqKj1eFBNy7k3Nrgq5FAe5FgE=; b=x37mQCWh4QoP8hf/8qUle5pvqqAzY9eQankqfJUL+E5xw1P+e+HmGccy42hSl3QXjS /HN/GsLkSaPQx+KRPu62JFllQe+KaE0IBHnbWPU1ghiSfAhpLIaCd0qdsb5+NXAD7uwr tFAr+CR3zLL6YMelRWw7Hid7TiXfgcTZUCYPVlgUQDNhhWX9ED3u4mTPxSgY66CLScai pFp/aJtP8zhXWQS8rEEi0wZuEKZlfNj0pwWh7w90np1CNlkFhQ9k25N5lfpF3M+HpsSe 7+MY28zQyoa530lY7TwH213cex7ukdB5sE1qNJkgmMpuNWy27+lV+ZjPN5xzik8TRQkT OrTA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@evidence-eu-com.20150623.gappssmtp.com header.s=20150623 header.b=i6WLah0U; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id o10-v6si13916188pgr.175.2018.06.06.06.54.01; Wed, 06 Jun 2018 06:54:16 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@evidence-eu-com.20150623.gappssmtp.com header.s=20150623 header.b=i6WLah0U; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752047AbeFFNxa (ORCPT + 99 others); Wed, 6 Jun 2018 09:53:30 -0400 Received: from mail-ua0-f193.google.com ([209.85.217.193]:43777 "EHLO mail-ua0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751533AbeFFNx3 (ORCPT ); Wed, 6 Jun 2018 09:53:29 -0400 Received: by mail-ua0-f193.google.com with SMTP id z16-v6so4056369uaz.10 for ; Wed, 06 Jun 2018 06:53:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=evidence-eu-com.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=UCzyNAOOTDyTPD3qxvgqKj1eFBNy7k3Nrgq5FAe5FgE=; b=i6WLah0UFKzxcHme7AXDGPBZmN+FQVJb9WG4tFja9BXlk1cE8qwsUHjMkhlojgL+Sr BvmKWhr08ECrbbS5QyYl/RfIyEAQ1caRhzPuu0U8VqBSZ4mLKxfqfu0qmlURAmfwStv6 aqSvqEyZMj397VD0ds3ilvgB0wWyVCgxEITf64I0CBUxIK/OSDR71TJERWu2/Gl37dRF m4Edq7oOgAsPISNQqn+SfWEuVhgWuLgC+H0BTZQrvqXvKkBgufeHPkabKuWIPGsfi06g PzpQGycotbjz8wz1fw3JFvbOvP01Q/sfKAzIrlDgoCBWRy2xcs9rmXxBlJbo0gO3RUQL 0spw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=UCzyNAOOTDyTPD3qxvgqKj1eFBNy7k3Nrgq5FAe5FgE=; b=VgwFJ9biSB4/bIOWLqfJsPRzBKXnqungne7qC9TB3k+og0Vj3AJYmJt314/Te0QOPW s9F8ZECCZSbQqlvvCjbG7Ck07FbqY3IP2J27m8LTvcMQd/uYZZBVrcWbfQnTn1HX7IAs LWrSliQTQXPcPsYaSBKZRdUGkXvlA+mjcwR4q7Yocpl5Hoq58/2Ry+dsoPTbV3kOdacj iLUDyiy7bxX9HUuqhQGHid6RMoLdyttSIsGFkn0iFVuQqZbpOovT/2VcjozN/iuB7mMJ KkL3nW+Xh4rJPrI6EULFMC2ucQnkZMcl+wk2YKCAoTu9SqucAN/wCKVJx/zIA/tyo2GW JX2w== X-Gm-Message-State: APt69E06z/lLaieJ280DQZ33VpLykTMJ9Jc06jEspLA895gHIpEjGadL S7fgK4biR3g5KzXrzLKbTZetKqSchvf2McWxuEjwpw== X-Received: by 2002:ab0:265:: with SMTP id 92-v6mr2255889uas.26.1528293208002; Wed, 06 Jun 2018 06:53:28 -0700 (PDT) MIME-Version: 1.0 Received: by 2002:a67:75d7:0:0:0:0:0 with HTTP; Wed, 6 Jun 2018 06:53:27 -0700 (PDT) In-Reply-To: <20180606132046.GC10870@e108498-lin.cambridge.arm.com> References: <1527253951-22709-1-git-send-email-vincent.guittot@linaro.org> <20180605105721.GA12193@e108498-lin.cambridge.arm.com> <20180605121153.GD16081@localhost.localdomain> <20180605130548.GB12193@e108498-lin.cambridge.arm.com> <20180605131518.GG16081@localhost.localdomain> <20180605140101.GE12193@e108498-lin.cambridge.arm.com> <20180605141317.GJ16081@localhost.localdomain> <6c2dc1aa-3e19-be14-0ed8-b29003c72e61@evidence.eu.com> <20180606132046.GC10870@e108498-lin.cambridge.arm.com> From: Claudio Scordino Date: Wed, 6 Jun 2018 15:53:27 +0200 Message-ID: Subject: Re: [PATCH v5 00/10] track CPU utilization To: Quentin Perret Cc: Juri Lelli , Vincent Guittot , Peter Zijlstra , Ingo Molnar , linux-kernel , "Rafael J. Wysocki" , Dietmar Eggemann , Morten Rasmussen , viresh kumar , Valentin Schneider , Luca Abeni Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Quentin, 2018-06-06 15:20 GMT+02:00 Quentin Perret : > > Hi Claudio, > > On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote: > > Hi Quentin, > > > > Il 05/06/2018 16:13, Juri Lelli ha scritto: > > > On 05/06/18 15:01, Quentin Perret wrote: > > > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote: > > > > > On 05/06/18 14:05, Quentin Perret wrote: > > > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote: > > > > > > > Hi Quentin, > > > > > > > > > > > > > > On 05/06/18 11:57, Quentin Perret wrote: > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > What about the diff below (just a quick hack to show the idea) applied > > > > > > > > on tip/sched/core ? > > > > > > > > > > > > > > > > ---8<--- > > > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644 > > > > > > > > --- a/kernel/sched/cpufreq_schedutil.c > > > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > > > > > > > > sg_cpu->util_dl = cpu_util_dl(rq); > > > > > > > > } > > > > > > > > +unsigned long scale_rt_capacity(int cpu); > > > > > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > > > > { > > > > > > > > struct rq *rq = cpu_rq(sg_cpu->cpu); > > > > > > > > + int cpu = sg_cpu->cpu; > > > > > > > > + unsigned long util, dl_bw; > > > > > > > > if (rq->rt.rt_nr_running) > > > > > > > > return sg_cpu->max; > > > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > > > > > > > > * ready for such an interface. So, we only do the latter for now. > > > > > > > > */ > > > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > > > > > > > > > > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so, > > > > > > > since we use max below, we will probably have the same problem that we > > > > > > > discussed on Vincent's approach (overestimation of DL contribution while > > > > > > > we could use running_bw). > > > > > > > > > > > > Ah no, you're right, this isn't great for long running deadline tasks. > > > > > > We should definitely account for the running_bw here, not the dl avg... > > > > > > > > > > > > I was trying to address the issue of RT stealing time from CFS here, but > > > > > > the DL integration isn't quite right which this patch as-is, I agree ... > > > > > > > > > > > > > > > > > > > > > + util >>= SCHED_CAPACITY_SHIFT; > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > > > > > > > > + util += sg_cpu->util_cfs; > > > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > > > > > > > > > > > > > Why this_bw instead of running_bw? > > > > > > > > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the > > > > > > sum of runtime/deadline ratios of all DL tasks on that rq). > > > > > > > > > > Yep. > > > > > > > > > > > The reason I added this max is because I'm still not sure to understand > > > > > > how we can safely drop the freq below that point ? If we don't guarantee > > > > > > to always stay at least at the freq required by DL, aren't we risking to > > > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In > > > > > > this case, if that tasks uses all of its runtime then you might start > > > > > > missing deadlines ... > > > > > > > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b > > > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE"). > > > > > > > > Right, I spotted that one, but yeah you could also be limited by HW ... > > > > > > > > > > > > > > > My feeling is that the only safe thing to do is to guarantee to never go > > > > > > below the freq required by DL, and to optimistically add CFS tasks > > > > > > without raising the OPP if we have good reasons to think that DL is > > > > > > using less than it required (which is what we should get by using > > > > > > running_bw above I suppose). Does that make any sense ? > > > > > > > > > > Then we can't still avoid the hardware limits, so using running_bw is a > > > > > trade off between safety (especially considering soft real-time > > > > > scenarios) and energy consumption (which seems to be working in > > > > > practice). > > > > > > > > Ok, I see ... Have you guys already tried something like my patch above > > > > (keeping the freq >= this_bw) in real world use cases ? Is this costing > > > > that much energy in practice ? If we fill the gaps left by DL (when it > > > > > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so > > > he might add some numbers to my words above. I didn't (yet). But, please > > > consider that I might be reserving (for example) 50% of bandwidth for my > > > heavy and time sensitive task and then have that task wake up only once > > > in a while (but I'll be keeping clock speed up for the whole time). :/ > > > > As far as I can remember, we never tested energy consumption of running_bw > > vs this_bw, as at OSPM'17 we had already decided to use running_bw > > implementing GRUB-PA. > > The rationale is that, as Juri pointed out, the amount of spare (i.e. > > reclaimable) bandwidth in this_bw is very user-dependent. For example, > > the user can let this_bw be much higher than the measured bandwidth, just > > to be sure that the deadlines are met even in corner cases. > > Ok I see the issue. Trusting userspace isn't necessarily the right thing > to do, I totally agree with that. > > > In practice, this means that the task executes for quite a short time and > > then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced, > > at the 0lag time). > > Using this_bw rather than running_bw, the CPU frequency would remain at > > the same fixed value even when the task is blocked. > > I understand that on some cases it could even be better (i.e. no waste > > of energy in frequency switch). > > +1, I'm pretty sure using this_bw is pretty much always worst than > using running_bw from an energy standpoint,. The waste of energy in > frequency changes should be less than the energy wasted by staying at a > too high frequency for a long time, otherwise DVFS isn't a good idea to > begin with :-) > > > However, IMHO, these are corner cases and in the average case it is better > > to rely on running_bw and reduce the CPU frequency accordingly. > > My point was that accepting to go at a lower frequency than required by > this_bw is fundamentally unsafe. If you're at a low frequency when a DL > task starts, there are real situations where you won't be able to > increase the frequency immediately, which can eventually lead to missing > deadlines. I see. Unfortunately, I'm having quite crazy days so I couldn't follow the original discussion on LKML properly. My fault. Anyway, to answer your question (if this time I have understood it correctly). You're right: the tests have shown that whenever the DL task period gets comparable with the time for switching frequency, the amount of missed deadlines becomes not negligible. To give you a rough idea, this already happens with periods of 10msec on a Odroid XU4. The reason is that the task instance starts at a too low frequency, and the system can't switch frequency in time for meeting the deadline. This is a known issue, partially discussed during the RT Summit'17. However, the community has been more in favour of reducing the energy consumption than meeting firm deadlines. If you need a safe system, in fact, you'd better thinking about disabling DVFS completely and relying on a fixed CPU frequency. A possible trade-off could be a further entry in sys to let system designers switching from (default) running_bw to (more pessimistic) this_bw. However, I'm not sure the community wants a further knob on sysfs just to make RT people happier :) Best, Claudio