Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp3967601imu; Mon, 28 Jan 2019 14:21:02 -0800 (PST) X-Google-Smtp-Source: ALg8bN6/cn3o1AIo683hJtum75iPPQYfsntrfuqxzADn5N8NexJgHeW+pOirFyL1AjAwHAmzn37Y X-Received: by 2002:a63:6b05:: with SMTP id g5mr20865319pgc.15.1548714062314; Mon, 28 Jan 2019 14:21:02 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1548714062; cv=none; d=google.com; s=arc-20160816; b=qp5jJOR8DFIESGXWJ9D57kXV5r6CzLWGtkCfLsg01Bvx/bQifMOnJ/dzsSyzU6iAH+ rDcj3ikiJqgdt32fjrnbElLY6Skg0r0iauDH/XycT4LQEe5BwgSUjqHPMUaIv0/FIvhR IRiJcKYIy6w1ZcJLzL5WQfZyczPDJzSrbc09Oj6ATLNwMK5bpvKMiRaf8AUNyRxcAyT1 HM9iAoac42vvybrpJWw//ENNKuL2Q6Clc4SsZ+zJd6Fsh9/iriYo8DWz4euljhiyTxsX sMdEDHiC6me0/vdBGTtQfpUVEpdHwiPwZqUjhRaekuj6eFxakxGpSs5O0K5/BToRemyZ X7Tg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=4KL00Sa2jpsdtnOzo+vdaTApEMKPIHpzH6AXyKLdmUs=; b=MoD4pEOoIszTSKhg3KkHg97UkOPaoEmPKgE5qODjU2h1oEQQa1BlNMJa58aU1eL4Hm NBhb6OANFz28zAj6zU8RtCIr/+AZUOCOVVybhKHCx2VRe9HrCedj2GSXXjieJ1Gnss1g YEnAc+ZhjHe9wzNcxxZI2AJozI+bzn6pY+iXYBwKg1YU6O4nJUARQKNW94bzZzfVKrHl wbJ67N3oWfXAnPeF5bEJk+2q8q76DyCUKBx8j04cDy4fJ7JkeidpNHmE+Wxynr18ic/Y 7m7WwvcAcdRXgA6juM0leAwsLpSQivb5/SVDN99Vz4M6cZY6kP1JXavgwu1Yzb9NNKMU yUGQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b="dX/3otaT"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 64si73852pfe.74.2019.01.28.14.20.45; Mon, 28 Jan 2019 14:21:02 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b="dX/3otaT"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726752AbfA1WUk (ORCPT + 99 others); Mon, 28 Jan 2019 17:20:40 -0500 Received: from mail-wm1-f67.google.com ([209.85.128.67]:38486 "EHLO mail-wm1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726668AbfA1WUk (ORCPT ); Mon, 28 Jan 2019 17:20:40 -0500 Received: by mail-wm1-f67.google.com with SMTP id m22so15741231wml.3 for ; Mon, 28 Jan 2019 14:20:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=4KL00Sa2jpsdtnOzo+vdaTApEMKPIHpzH6AXyKLdmUs=; b=dX/3otaTF6JVUyYit9Cinh1tJMT8xRUc95pXev0rJVPoAnHUmiof8C1f6w1202apHA gVFM6gVL+w70Nb1hWpEPUJCPyC25BPInQuRCwojKYrrtu+mf1bknYpSnIhaZniDORHI5 6VUZq0WN2LQ/O+oB+yGySavLM0sXKE6VzOxX04qpeS3C/WICKAKdYUuuuOpcENTX+Ajl yE+JAstjC/5e2ncSmlioeHh6xitn6Ae6YWgxsMEt4ltGYF2TJAUqcEctw3S/TwVxweqz +2qgvZUUFHFJiIsypLGufvT/FDu4cWcr3O/48hUxs2XUody1guDzaKlQBBvacQ9Kab8u OAAw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=4KL00Sa2jpsdtnOzo+vdaTApEMKPIHpzH6AXyKLdmUs=; b=fz9sYhPhX26CFi+cCguCFsI+GiEtKcdahwyCIHq6Vnf/8ujkCIFT5e/5mTWvNKsUXd pqq7GncFj3mmgWtlOnEtCDrZ+JNOyS4HCosmmWcwEy/e9LDF1RbaFZAT1i84XWnrMBbo BL/d989WnKerbWs+otO9Z3LzAsiohdpcmZX0aQ3Xw+g5sU1OHKYVeZ1HS96I2Kxkl8O6 u4I4+yjF0zBUtkaa+q41G0i2reSmC5Ki4vfOPu4aiu7z79CY2Fb975BoiBhKwGV9daDh HpvRQCTEL6Se3r9UxgO+7+FP+RJ/BXs0Q+lNOEGDySuzESr2wUtoVQPVIpinXwmkvK/p IfDw== X-Gm-Message-State: AJcUukcMrBDNpWFVoD2c+SLHOMTLI0mienumM1zsl6ZIhrPDd8uoFNtC 5zKpkwrGvUqbSmc8ekfgNm4QrQahFghheglOGbqq2g== X-Received: by 2002:a7b:c04e:: with SMTP id u14mr17354530wmc.133.1548714036793; Mon, 28 Jan 2019 14:20:36 -0800 (PST) MIME-Version: 1.0 References: <20190116193501.1910-1-hannes@cmpxchg.org> <20190128140336.25854cb97c25924f4bfc26c7@linux-foundation.org> In-Reply-To: <20190128140336.25854cb97c25924f4bfc26c7@linux-foundation.org> From: Suren Baghdasaryan Date: Mon, 28 Jan 2019 14:20:25 -0800 Message-ID: Subject: Re: [PATCH] psi: fix aggregation idle shut-off To: Andrew Morton Cc: Johannes Weiner , Peter Zijlstra , Tejun Heo , Lai Jiangshan , LKML Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 28, 2019 at 2:03 PM Andrew Morton wrote: > > On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner wrote: > > > psi has provisions to shut off the periodic aggregation worker when > > there is a period of no task activity - and thus no data that needs > > aggregating. However, while developing psi monitoring, Suren noticed > > that the aggregation clock currently won't stay shut off for good. > > > > Debugging this revealed a flaw in the idle design: an aggregation run > > will see no task activity and decide to go to sleep; shortly > > thereafter, the kworker thread that executed the aggregation will go > > idle and cause a scheduling change, during which the psi callback will > > kick the !pending worker again. This will ping-pong forever, and is > > equivalent to having no shut-off logic at all (but with more code!) > > > > Fix this by exempting aggregation workers from psi's clock waking > > logic when the state change is them going to sleep. To do this, tag > > workers with the last work function they executed, and if in psi we > > see a worker going to sleep after aggregating psi data, we will not > > reschedule the aggregation work item. > > > > What if the worker is also executing other items before or after? > > > > Any psi state times that were incurred by work items preceding the > > aggregation work will have been collected from the per-cpu buckets > > during the aggregation itself. If there are work items following the > > aggregation work, the worker's last_func tag will be overwritten and > > the aggregator will be kept alive to process this genuine new activity. > > > > If the aggregation work is the last thing the worker does, and we > > decide to go idle, the brief period of non-idle time incurred between > > the aggregation run and the kworker's dequeue will be stranded in the > > per-cpu buckets until the clock is woken by later activity. But that > > should not be a problem. The buckets can hold 4s worth of time, and > > future activity will wake the clock with a 2s delay, giving us 2s > > worth of data we can leave behind when disabling aggregation. If it > > takes a worker more than two seconds to go idle after it finishes its > > last work item, we likely have bigger problems in the system, and > > won't notice one sample that was averaged with a bogus per-CPU weight. > > Did we ever hear back from Suren about the testing? > > Some words here about the new wq_worker_last_func() would be > appropriate. > > It's an ugly-looking thing :( Tejun, did you check this change? Yes, sorry for the delay. I tried the patch as is and with some tweaking of my local setup was able to see the cases when this mechanism prevents the system from unnecessary reactivation. I encountered some weird things in my traces (some state transitions were missing in the trace) and wanted to investigate further, however did not get a chance to do that yet. Overall looks like the patch is doing what it's supposed to do. Just wanted to get the full picture before reporting the results. > > --- a/kernel/sched/psi.c > > +++ b/kernel/sched/psi.c > > @@ -124,6 +124,7 @@ > > * sampling of the aggregate task states would be. > > */ > > > > +#include "../workqueue_internal.h" > > "Only to be included by workqueue and core kernel subsystems" > > I'm not sure that psi qualifies. Perhaps wq_worker_last_func() should > be declared in include/linux/workqueue.h. > > And perhaps implemented there as well. It's similar to > current_wq_worker(), which is inlined and wq_worker_last_func() is > small enough to justify inlining. > > > --- a/kernel/workqueue.c > > +++ b/kernel/workqueue.c > > @@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task) > > return to_wakeup ? to_wakeup->task : NULL; > > } > > > > +/** > > + * wq_worker_last_func - retrieve worker's last work function > > + * > > + * Determine the last function a worker executed. This is called from > > + * the scheduler to get a worker's last known identity. > > + * > > + * CONTEXT: > > + * spin_lock_irq(rq->lock) > > + * > > + * Return: > > + * The last work function %current executed as a worker, NULL if it > > + * hasn't executed any work yet. > > + */ > > +work_func_t wq_worker_last_func(struct task_struct *task) > > +{ > > + struct worker *worker = kthread_data(task); > > + > > + return worker->last_func; > > +} > > The semantics are troublesome. What guarantees that worker->last_func > won't change under the caller's feet? The caller should hold some lock > (presumably worker->pool->lock) in order to stabilize the > wq_worker_last_func() return value? > > Also, the comment isn't really true - this is called from PSI, which is > hardly "the scheduler"? > >