Subject: Re: [BUG] schedutil governor produces regular max freq spikes because of lockup detector watchdog threads
From: Lucas Stach
To: Leonard Crestez, "Rafael J. Wysocki"
Cc: Patrick Bellasi, Viresh Kumar, Linux PM, Anson Huang, "linux-kernel@vger.kernel.org", Juri Lelli, Peter Zijlstra, Vincent Guittot
Date: Tue, 09 Jan 2018 16:16:15 +0100

On Tuesday, 2018-01-09 at 16:43 +0200, Leonard Crestez wrote:
> On Tue, 2018-01-09 at 02:17 +0100, Rafael J. Wysocki wrote:
> > On Mon, Jan 8, 2018 at 4:51 PM, Leonard Crestez wrote:
> > > On Mon, 2018-01-08 at 15:14 +0000, Patrick Bellasi wrote:
> > > > On 08-Jan 15:20, Leonard Crestez wrote:
> > > > > On Mon, 2018-01-08 at 09:31 +0530, Viresh Kumar wrote:
> > > > > > On 05-01-18, 23:18, Rafael J. Wysocki wrote:
> > > > > > > On Fri, Jan 5, 2018 at 9:37 PM, Leonard Crestez wrote:
> > > > > > > > When using the schedutil governor together with the softlockup
> > > > > > > > detector, all CPUs go to their maximum frequency on a regular
> > > > > > > > basis. This seems to be because the watchdog creates an RT
> > > > > > > > thread on each CPU, which causes regular kicks with:
> > > > > > > >
> > > > > > > >     cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
> > > > > > > >
> > > > > > > > The schedutil governor responds to this by immediately setting
> > > > > > > > the maximum CPU frequency, which is very undesirable.
> > > > > > > >
> > > > > > > > The issue can be fixed by this patch from android:
> > > > > > > >
> > > > > > > > The patch stalled in a long discussion about how difficult it
> > > > > > > > is for cpufreq to deal with RT and how some RT users might just
> > > > > > > > disable cpufreq. It is indeed hard, but if the system
> > > > > > > > experiences regular power kicks from a common debug feature,
> > > > > > > > users will end up disabling schedutil instead.
> > > > > > >
> > > > > > > Patrick has a series of patches dealing with this problem area
> > > > > > > AFAICS, but we are currently integrating material from Juri
> > > > > > > related to deadline tasks.
> > > > > >
> > > > > > I am not sure if Patrick's patches would solve this problem at
> > > > > > all, as we still go to max for RT and the RT task is created from
> > > > > > the softlockup detector somehow.
> > > > > I assume you're talking about the series starting with
> > > > > "[PATCH v3 0/6] cpufreq: schedutil: fixes for flags updates".
> > > > >
> > > > > I checked and they have no effect on this particular issue (not
> > > > > surprising).
> > > >
> > > > Yeah, that series was addressing the same issue, but for one specific
> > > > RT thread: the one used by schedutil to change the frequency.
> > > > For all other RT threads the intended behavior was still to go to
> > > > max... moreover, those patches have been superseded by a different
> > > > solution which has recently been proposed by Peter:
> > > >
> > > >    20171220155625.lopjlsbvss3qgb4d@hirez.programming.kicks-ass.net
> > > >
> > > > As Viresh and Rafael suggested, we should eventually consider a
> > > > different scheduling class and/or execution context for the watchdog.
> > > > Maybe a generalization of Juri's proposed SCHED_FLAG_SUGOV flag for
> > > > DL tasks could be useful:
> > > >
> > > >    20171204102325.5110-4-juri.lelli@redhat.com
> > > >
> > > > Although that solution is already considered "gross", so perhaps it
> > > > does not make sense to keep adding special DL tasks.
> > > >
> > > > Another possible alternative to tagging an RT task as special is to
> > > > use an API similar to the one proposed by the util_clamp RFC:
> > > >
> > > >    20170824180857.32103-1-patrick.bellasi@arm.com
> > > >
> > > > which would allow defining the maximum utilization that can be
> > > > required by a properly configured RT task.
> > >
> > > Marking the watchdog as somehow "not important for performance" would
> > > probably work; I guess it will take a while to get a stable solution.
> > >
> > > BTW, in the current version it seems the kick happens *after* the RT
> > > task executes. It seems very likely that cpufreq will go back down
> > > before an RT task executes again, so how does it help? Unless most of
> > > the workload is RT. But even in that case, aren't you better off with
> > > regular scaling, since schedutil will notice the utilization is high
> > > anyway?
> > >
> > > Scaling the frequency up first would make more sense, except such
> > > operations can have very high latencies anyway.
> >
> > I guess what happens is that it takes time to switch the frequency and
> > the RT task gives the CPU away before the frequency actually changes.
>
> What I am saying is that, as far as I can tell, cpufreq_update_util is
> called when the task has already executed and is being switched out.
> My tests are not very elaborate, but based on some ftracing it seems to
> me that the current behavior is for cpufreq spikes to always trail RT
> activity. Like this:

On i.MX, switching the CPU frequency involves both a regulator and a PLL
reconfiguration. Both actions have really long latencies (giving the CPU
away to other processes while waiting for them to finish), so the
frequency switch only happens after the short-lived watchdog RT process
has already completed its work.

This behavior is probably less bad for regular RT tasks that actually use
a bit more CPU when running, but it's completely nonsensical for the
lightweight watchdog thread.

Regards,
Lucas
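
For reference, the schedutil decision being discussed boils down to one
check in the governor's per-CPU update hook. The sketch below is
simplified from kernels of roughly that era (v4.14/v4.15); the names
(sugov_update_single(), SCHED_CPUFREQ_RT_DL, sugov_get_util(),
get_next_freq(), sugov_update_commit()) follow the upstream code of the
time, but the body is trimmed for illustration and is not a verbatim
excerpt:

/*
 * Simplified sketch of schedutil's per-CPU update hook: an update
 * carrying the RT or DL flag bypasses the utilization estimate and
 * requests the maximum frequency directly, which is why even the
 * short-lived per-CPU watchdog/N threads pin the CPU to max.
 */
static void sugov_update_single(struct update_util_data *hook, u64 time,
				unsigned int flags)
{
	struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu,
						update_util);
	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
	struct cpufreq_policy *policy = sg_policy->policy;
	unsigned long util, max;
	unsigned int next_f;

	if (!sugov_should_update_freq(sg_policy, time))
		return;

	if (flags & SCHED_CPUFREQ_RT_DL) {
		/* RT/DL kick: jump straight to the maximum frequency. */
		next_f = policy->cpuinfo.max_freq;
	} else {
		/* Normal path: scale with the CFS utilization estimate. */
		sugov_get_util(&util, &max, sg_cpu->cpu);
		next_f = get_next_freq(sg_policy, util, max);
	}

	sugov_update_commit(sg_policy, time, next_f);
}

Since the watchdog threads run SCHED_FIFO, every periodic watchdog wakeup
takes the first branch, so the max-frequency request is issued regardless
of how little CPU the thread actually consumes.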