Date: Thu, 20 Jan 2022 21:12:02 +0000
From: Vincent Donnefort
To: Vincent Guittot
Cc: peterz@infradead.org, mingo@redhat.com, linux-kernel@vger.kernel.org,
    dietmar.eggemann@arm.com, Valentin.Schneider@arm.com,
    Morten.Rasmussen@arm.com, Chris.Redpath@arm.com, qperret@google.com,
    Lukasz.Luba@arm.com
Subject: Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
References: <20220112161230.836326-1-vincent.donnefort@arm.com>
 <20220112161230.836326-3-vincent.donnefort@arm.com>

[...]
> > > And if cpu is not idle, you can't apply the diff between clk_pelt and
> > > clock_task
> > >
> > > > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > > > accounting and you don't apply any scale invariance if the cpu is not
> > > > > idle
> > > >
> > > > The missing irq and paravirt time is the reason why it is called "estimator".
> > > > But maybe there's a chance of improving this part with a lockless version of
> > > > rq->prev_irq_time and rq->prev_steal_time_rq?
> > > >
> > > > > - IIUC your explanation in the commit message above, the (A) period
> > > > > seems to be a problem only when idle but you apply it unconditionally.
> > > >
> > > > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > > > worth something:
> > > >
> > > >   A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> > > >                       A                            B
> > > >
> > > > > If cpu is idle you can assume that clock_pelt should be equal to
> > > > > clock_task but you can't if cpu is not idle otherwise your sync will
> > > > > be inaccurate and defeat the primary goal of this patch. If your
> > > > > problem with clock_pelt is that the pending idle time is not accounted
> > > > > for when entering idle but only at the next update (update blocked
> > > > > load or wakeup of a thread). This patch below should fix this and
> > > > > remove your A.
> > > >
> > > > That would help slightly the current situation, but this part is already
> > > > covered by the estimator.
> > >
> > > But the estimator, as you name it, is wrong because the A part can't be
> > > applied unconditionally
> >
> > Hum, it is used only in the !active migration. So we know the task was sleeping
> > before that migration. As a consequence, the time we need to account is "sleeping"
> > time from the task point of view, which is clock_pelt == clock_task (for
> > __update_load_avg_blocked_se()).
> > Otherwise, we would only decay with the
> > "wallclock" idle time instead of the "scaled" one, wouldn't we?
>
> clock_pelt == clock_task only when cpu is idle and after updating
> lost_idle_time but you have no idea of the state of the cpu when
> migrating the task

I was just applying the time scaling at the task level. Why should it
depend on the CPU state? The situation would be as follows:

              <--X--><--Y-->
       +------+------+------+
 CPUX __|  B  |  A   |  B   |__
                             ^
                         migrate A

In such a scenario, CPUX's PELT clock is indeed scaled. Task A's running
time (X) has already been accounted, so what's left is to get an accurate
contribution for the idle time (Y). We would usually rely on the CPU being
idle for the catch-up, and that time would be Y + (X - scaled(X)). Without
the catch-up, we would account at the migration, for the sleeping time Y,
only scaled(Y).

Applied to the same graph as update_rq_clock_pelt()'s:

  clock_task | 1| 2| 3| 4| 5| 6| 7| 8|
  clock_pelt | 1   | 2   | 3   | 4   |  (CPU's running, clock_pelt is scaled)
  expected   | 1   | 2   | 5| 6| 7| 8|
              <---- X ---><--- Y ---->
  Task A      ************------------
                                     ^
                                 migrate A

The contribution for Task A's idle time at the migration (as we know we
won't have another chance to catch up clock_task later) should be 6, not 2,
regardless of the CPU state. _But_ indeed, there would be a risk of hitting
the lost_idle_time threshold and decaying too much... (which is absolutely
not handled in the current version).

So now, if we don't want to bother too much, we could simplify the problem
and say (which is true with NOHZ_IDLE) that if the CPU is running, the
clock must not be that old anyway. So we should only care about the idle
case, which is mitigated by your proposed snippet and would allow me to get
rid of the [A] part (clock_task - clock_pelt).

As for sched_clock_cpu() usage, I haven't measured anything yet, but I
notice it's already used in the wakeup path in ttwu_queue_wakelist().