Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp3953927imu; Mon, 28 Jan 2019 14:04:11 -0800 (PST) X-Google-Smtp-Source: ALg8bN5n2aZae94wVTIi83DE2Ic6xFM0hTrV+JWaCnjJMzW03rrmTFWnAQnC/jXNxhDubDVZ08SA X-Received: by 2002:a63:484c:: with SMTP id x12mr21256987pgk.375.1548713051912; Mon, 28 Jan 2019 14:04:11 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1548713051; cv=none; d=google.com; s=arc-20160816; b=J7JU+L3nm2lOFR76AXb44tbG6I678WP1DzxZqiqSWQqPBPTHra1xUc5NYjYbnHBymh dPRm8vEBzymNiWjN0ijCQ1Rxk55yG0gqdJdNywRhzgtXMxFk9Tg/ZeNBEmf7OKADWWFB kQNHoePnQHfB+xLGdxZt9lC3RZhDqarxLB+QnQSnopZAPQheR0VcyJ1ByijA6RIXwF11 tJSsvo4Su1efRGgJJPfbBK3HFCY7mGeL14bwVSTWjrTQh85qsZ9fu1yBN2YM2pNuw/u2 IJFmapVeTkFtluqdVrui0j/RMTA0OyvWlo6sKQ8Zkykn0g8+FKlXF5WeZyOATbHpqS+y 0rjQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:subject:cc:to:from:date; bh=8YRYL3mGhKWS19l6Pia5HcbAJLA80I6AG9JPKR60ej8=; b=BKyvAVYhV/AwA2gFWRCK1DAzZYG5V7Yd9bf4gY1DX9/2BJDAHI6IiNQhtKGF56itQX u4QTU7BMbcv8Tk54wAmYy1p2jjcrT26EX23zIFciH34kBM90PJDWGj97SspMO+XVfb4a 78e8HvuMHvDMY7YAiM7Yhvj+t/JFiyiRT+a+NGYqXYWLHXc8H8iVRehAX9Sb3HOcmzGU avAkGOCUdzvfNWsYrEbOXrYwfCLj/GhwdfeF/ohJOtWDlt/eZHJNiw01gDcHWmWknj4V z05EGaPEL237iLQtt2/DTv65NRwiL7SXhfWDfqWeDURu6UykL61ftvzaiVtcCdC3d6Gn y1eA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l23si32957322pgh.533.2019.01.28.14.03.55; Mon, 28 Jan 2019 14:04:11 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727882AbfA1WDj (ORCPT + 99 others); Mon, 28 Jan 2019 17:03:39 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:52546 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726766AbfA1WDj (ORCPT ); Mon, 28 Jan 2019 17:03:39 -0500 Received: from akpm3.svl.corp.google.com (unknown [104.133.8.65]) by mail.linuxfoundation.org (Postfix) with ESMTPSA id A5F5F2A17; Mon, 28 Jan 2019 22:03:37 +0000 (UTC) Date: Mon, 28 Jan 2019 14:03:36 -0800 From: Andrew Morton To: Johannes Weiner Cc: Peter Zijlstra , Tejun Heo , Suren Baghdasaryan , Lai Jiangshan , linux-kernel@vger.kernel.org Subject: Re: [PATCH] psi: fix aggregation idle shut-off Message-Id: <20190128140336.25854cb97c25924f4bfc26c7@linux-foundation.org> In-Reply-To: <20190116193501.1910-1-hannes@cmpxchg.org> References: <20190116193501.1910-1-hannes@cmpxchg.org> X-Mailer: Sylpheed 3.6.0 (GTK+ 2.24.31; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner wrote: > psi has provisions to shut off the periodic aggregation worker when > there is a period of no task activity - and thus no data that needs > aggregating. However, while developing psi monitoring, Suren noticed > that the aggregation clock currently won't stay shut off for good. > > Debugging this revealed a flaw in the idle design: an aggregation run > will see no task activity and decide to go to sleep; shortly > thereafter, the kworker thread that executed the aggregation will go > idle and cause a scheduling change, during which the psi callback will > kick the !pending worker again. This will ping-pong forever, and is > equivalent to having no shut-off logic at all (but with more code!) > > Fix this by exempting aggregation workers from psi's clock waking > logic when the state change is them going to sleep. To do this, tag > workers with the last work function they executed, and if in psi we > see a worker going to sleep after aggregating psi data, we will not > reschedule the aggregation work item. > > What if the worker is also executing other items before or after? > > Any psi state times that were incurred by work items preceding the > aggregation work will have been collected from the per-cpu buckets > during the aggregation itself. If there are work items following the > aggregation work, the worker's last_func tag will be overwritten and > the aggregator will be kept alive to process this genuine new activity. > > If the aggregation work is the last thing the worker does, and we > decide to go idle, the brief period of non-idle time incurred between > the aggregation run and the kworker's dequeue will be stranded in the > per-cpu buckets until the clock is woken by later activity. But that > should not be a problem. The buckets can hold 4s worth of time, and > future activity will wake the clock with a 2s delay, giving us 2s > worth of data we can leave behind when disabling aggregation. If it > takes a worker more than two seconds to go idle after it finishes its > last work item, we likely have bigger problems in the system, and > won't notice one sample that was averaged with a bogus per-CPU weight. Did we ever hear back from Suren about the testing? Some words here about the new wq_worker_last_func() would be appropriate. It's an ugly-looking thing :( Tejun, did you check this change? > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -124,6 +124,7 @@ > * sampling of the aggregate task states would be. > */ > > +#include "../workqueue_internal.h" "Only to be included by workqueue and core kernel subsystems" I'm not sure that psi qualifies. Perhaps wq_worker_last_func() should be declared in include/linux/workqueue.h. And perhaps implemented there as well. It's similar to current_wq_worker(), which is inlined and wq_worker_last_func() is small enough to justify inlining. > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task) > return to_wakeup ? to_wakeup->task : NULL; > } > > +/** > + * wq_worker_last_func - retrieve worker's last work function > + * > + * Determine the last function a worker executed. This is called from > + * the scheduler to get a worker's last known identity. > + * > + * CONTEXT: > + * spin_lock_irq(rq->lock) > + * > + * Return: > + * The last work function %current executed as a worker, NULL if it > + * hasn't executed any work yet. > + */ > +work_func_t wq_worker_last_func(struct task_struct *task) > +{ > + struct worker *worker = kthread_data(task); > + > + return worker->last_func; > +} The semantics are troublesome. What guarantees that worker->last_func won't change under the caller's feet? The caller should hold some lock (presumably worker->pool->lock) in order to stabilize the wq_worker_last_func() return value? Also, the comment isn't really true - this is called from PSI, which is hardly "the scheduler"?