Received: by 2002:a05:6a10:206:0:0:0:0 with SMTP id 6csp3940460pxj; Tue, 8 Jun 2021 02:29:01 -0700 (PDT) X-Google-Smtp-Source: ABdhPJy4ATPnDH6JhPwCnoomNSdjEcEWKdNgAHOw6FEfvq3Altu80ihoONAoxTEdX054eZNxZHmU X-Received: by 2002:a17:906:eb17:: with SMTP id mb23mr12310373ejb.239.1623144540782; Tue, 08 Jun 2021 02:29:00 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1623144540; cv=none; d=google.com; s=arc-20160816; b=g4ir+xvuRmS8CUUOMw8Rfrn99vHJOKBHflhb3hyb38bAu2TQo/G/T18BSkC3Iy6hOJ zlf+Ubfnh0OwT50qt83Jn2ljeU6fgNhEQ0Dexg4SRTmCwoiYLEi4RaXkuqXgj1YcefsL B5ZwmihLgB32HMAcO5wjlTdjd67K5WQkSFMGVJU/2I2CjzDIO1RC/C2hoGWzrLGZl81k UtWA93XC7dyPRm6ActYcamgqPmhEUJkV8EGA8h1Cm/NKa+WGnYG8EFVrRTnGpp77+cmN 5K/WYcxH+5jVn0deHYbctpZk8DF9MKVWLD1TyPro7L0XeX2qTMpK6EGD5zX+RHb9lBXO embg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date; bh=GFtSFLk2t1fqGcRmJnmuFLXti+rl5l4p8I44dOIIP1k=; b=uKl1wipaoZ0wfGZYUeDMl8+1sNPr2hwdIB1AUadxi+1DpFJKy/oztJqsUzJgzMIqFG t98HSo//EQYyjK6KrrA0dMcSjZtfmDqhf0GiPHBRb0CLgfojg6dpz9F78TNVoT5s6N8R E6XMzxOBfu/FcmM/+m9gK7lZEcilsidxMijZ2hzfHNNXvRVFWQFyeggMnKqe19lYW2TM B1VVJ8fEdbYm6UT/ZNpInBRC2lXJROx/HWM7hXyzBYmB3iHDShkUA0j0s4NfvXpXyibg FOtbfwEpN01HELfTnziKbc3+K2Vfuzs00sKyh9sJUb62miHTv2tNTIVM6Gid95rJqD+q sFzA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id 19si14783791ejx.529.2021.06.08.02.28.37; Tue, 08 Jun 2021 02:29:00 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230406AbhFHJ2C (ORCPT + 99 others); Tue, 8 Jun 2021 05:28:02 -0400 Received: from foss.arm.com ([217.140.110.172]:53752 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229507AbhFHJ2B (ORCPT ); Tue, 8 Jun 2021 05:28:01 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id C1BAE1396; Tue, 8 Jun 2021 02:26:08 -0700 (PDT) Received: from e107158-lin.cambridge.arm.com (e107158-lin.cambridge.arm.com [10.1.195.57]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 708803F719; Tue, 8 Jun 2021 02:26:07 -0700 (PDT) Date: Tue, 8 Jun 2021 10:26:05 +0100 From: Qais Yousef To: Beata Michalska Cc: Joel Fernandes , Valentin Schneider , Quentin Perret , Peter Zijlstra , LKML , Vincent Guittot , Viresh Kumar , Dietmar Eggemann , "Rafael J. Wysocki" Subject: Re: iowait boost is broken Message-ID: <20210608092605.bhp5yyqmzrojqxjm@e107158-lin.cambridge.arm.com> References: <20210607191031.GA12489@e120325.cambridge.arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20210607191031.GA12489@e120325.cambridge.arm.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org +CC Dietmar, Rafael On 06/07/21 20:10, Beata Michalska wrote: > Hi Joel, > > Thanks for sending this out. > > On Mon, Jun 07, 2021 at 12:19:01PM -0400, Joel Fernandes wrote: > > Hi all, > > Looks like iowait boost is completely broken upstream. Just > > documenting my findings of iowait boost issues: > > > I wouldn't go as far to state that it is completely broken. Rather that > the current sugov implementation for iowait boosting is not meeting > the expectations and I believe this should be clarified first. More on those > expectations below. > > 1. If a CPU requests iowait boost in a cluster, another CPU can go > > ahead and reset very quickly it since it thinks there's no new request > > for the iowait boosting CPU > So the 'boosting' value is being tracked per CPU, so each core in a cluster > will have it's own variant of that. When calculating the shared freq for > the cluster, sugov will use max utilization reported on each core, including > I/O boost. Now, if there is no pending request for boosting on a given core > at the time of calling sugov_iowait_apply, the current 'boost' will be > reduced, but only this one and that will not affect boost values on remaining > CPUs. It means that there was no task waking up on that particular CPU after > waiting on I/O request. So I would say it's fine. Unless I am misunderstanding > your case ? > > 2. If the iowait is longer than a tick, then successive iowait boost > > doubling does not happen. So heavy I/O waiting code never gets a > > boost. > This might indeed be an issue. What's more: the fact that boosting is applied > per core, any migration will disregard the boosting applied so far, so > it might start increasing the freq on 'new' CPU from scratch. > Might, as sugov (and the I/O boosting) has no notion of the source of boosting > request so it might end up on a core that has already lifted I/O boost. > This also means, that having different small tasks, waking up from > I/O within the mentioned TICK_NSEC time window might drive the frequency to max > even though those would be sporadic wakeups. Things get slightly > cumbersome as increasing the boost does not necessarily result in the freq > change, so the above mentioned case would need proper timing but it is possible. > Also the boost value will not get doubled unless previous one has been applied. > This might result in misalignment between task wake-ups/placement and sugov's > freq changes. > > 3. update_load_avg() is triggered right after the the iowait boost > > request which makes another cpufreq update request, this request is a > > non-iowait boost one so it ends up resetting the iowait boost request > > (in the same path!). > Not necessarily - this is guarded by the TICK_NSEC you have mentioned: > in > sugov_iowait_reset { > ... > if (delta_ns <= TICK_NSEC) > return; > ... > } > So for the particular call sequence the boost should not get reset. > Another problem is when the non-I/O bound tasks triggers the update > after the TICK_NSEC has elapsed and before the frequency change is done. > (see sugov_should_update_freq ) > > 4. Same as #3 but due the update_blocked_averages from new idle balance path. > > > > Here is a patch that tries to address these problems and I see better > > cpufreq boosting happening, however it is just a test broken patch to > > highlight the issues: > > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/5.4/iowait-boost-debug-1&id=3627d896d499d168fef9a388e5d6b3359acc3423 > > > > I think we ought to rewrite the whole mess instead of fixing it since > > a lot has changed in scheduler code over time it feels. Beata is > > working on rewriting the whole iowait boost infra, I am glad she has > > started the work on this and looking forward to helping with the > > patches. > > > So back to the expectations. > The main problem, as I see it, is what do we actually want to achieve with > the I/O boosting? Is it supposed to compensate the time lost while waiting > for the I/O request to be completed or is is supposed to optimize the rate > at which I/O requests are being made. Do we want to boost I/O bound tasks by > default, no limits applied or should we care about balancing performance > vs power ? And unless those expectations are clearly stated, we might not > get too far with any changes, really. > > Smth that I do agree with is what has been suggested few times by Quentin, > to make the I/O boosting being a per-task feature, not per-core, and that is > smth I have been experimenting with. This would help with maintaining the boost > level upon migration and could potentially improve task placement. It would also > eliminate sugov's misfires: currently the boost might be applied when the task > on behalf of which the boost has been triggered, has been migrated or is not > runnable at that point. Also sugov's boosting is completely unaware of > uclamp restrictions,and I guess it should be. This could also be fixed with > per-task boosting. > > There are few tricky bits though: > - current implementation (sugov's) makes sure the boost will not get doubled > unless the current level of boosting has been applied -> there is no notion > for that when moving the boosting as a per-task feature (considered as an > optimization for too frequent freq changes) > - guarding ramping up the frequency with TICK_NSEC & freq_update_delay_ns > is also out of the equation (same as above) > > The above two means we might be boosting faster that this is currently expected. > On the other hand we might lose the boosting as currently sugov does not care > whether that is a single or multiple tasks waking up from I/O. > > I've been trying to come up with some heuristics that would help to decide > whether boosting makes sense or not (like don't boost too much if most of the > time the task is being blocked or when despite boosting the sleep time keeps > getting longer). But that seems slightly counter intuitive, to place that sort > of decision making in schedulers generic code (without having all the info at > hand) and I would expect this to be handled by other, more suited mechanisms. > On the other hand, boosting upon each wake-up (from I/O) seems slightly > excessive. But again: it all comes down to the expectations on the actually > desired behaviour. We could potentially add another PELT-like signal to track > the time spent waiting on I/O request alongside the task utilization, but that > means more messing with signals on the rq level and losing some performance > points due to needed calculations. Alternative of blindly increasing boost > for each relevant wake-up means that small tasks could drive the freq to its > max values - which I am not sure is so desired for power-aware scenarios. > So all in all - still in progress. > > Any comments are more than welcomed. > > --- > BR > B. > > > > thanks, > > - Joel