Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp4001177imu; Mon, 28 Jan 2019 15:07:14 -0800 (PST) X-Google-Smtp-Source: ALg8bN7kevAN9yOgnuWMQUBNOUdT/elsJdaGJFZogy9Lmz2OcLoQuD5hCZC3HU/0djWWwoBnsG6z X-Received: by 2002:a63:4342:: with SMTP id q63mr21627518pga.63.1548716834755; Mon, 28 Jan 2019 15:07:14 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1548716834; cv=none; d=google.com; s=arc-20160816; b=a6/BZv2LyZ/3iJvyp8Lz2pQzgbczrLrFGXPm53rK2sRDz9a6fvYxK6StYRlLzGXZNN 3DZjigJ9i0j/5gHYQ6u+zjxP3x/hrM+JDEyxW+qxY1NaRU8qxYxT2WD1slZjIqB6F56G xiBf1NrKsr/MxrEzTuBSFzfw+7NF8czKU7BTzZLwmKTtYQo31M2Fkm+mNGaEjabl6ECp YNqnb36LuPB8B9rJaPPNCQuFh8V83v8wNL8TlgbpoHF2qDKYVSv/vArkrSw4skKkl5jl y3VQSGDKUuBmTI5Poij7zT3wqdsirt0YrfwYUi0I0xj9uwVnN97M7OuhaKG0lD3o1U51 B9bQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:subject:cc:to:from:date; bh=YUPHglPt9qcK5TfyqrLQwSBNYNCLwY3YYzYKlyAMgas=; b=JdsOR2xTG1e+dmn7DVP8uPW8fUDW3esb41qooCaPMaJ+fw4USxHJB8nQCHGqqRGB3K GRfZsSioWdWaY+G7ksvg8H+LIp65TPbA1IQyXtafVOwdxnvRtpOz5PalhczlZkaLsby2 smW5r8+11knjrWLQIIiUPWEZ1m44I9aObFMQ3g2zqPLC45mLhdAh+IiuhXa6Y0rIgfpL fiwzhQabWebcOVMrvufNRrh2OAlwZ9SWJaTfTGKeVND5eBQtjjxjp3ZdBKbvO9oKT8f/ QW1V6imly76a2OO4RS+l6PT/VWBD8treuEwSuEjbWO3EOeic2Tbs2uI/ozh+k9cSfImB tm4g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id f10si36676377pln.289.2019.01.28.15.06.59; Mon, 28 Jan 2019 15:07:14 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726942AbfA1XGj (ORCPT + 99 others); Mon, 28 Jan 2019 18:06:39 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:54998 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726678AbfA1XGj (ORCPT ); Mon, 28 Jan 2019 18:06:39 -0500 Received: from akpm3.svl.corp.google.com (unknown [104.133.8.65]) by mail.linuxfoundation.org (Postfix) with ESMTPSA id 3E827192D; Mon, 28 Jan 2019 23:06:37 +0000 (UTC) Date: Mon, 28 Jan 2019 15:06:35 -0800 From: Andrew Morton To: Johannes Weiner Cc: Peter Zijlstra , Tejun Heo , Suren Baghdasaryan , Lai Jiangshan , linux-kernel@vger.kernel.org Subject: Re: [PATCH] psi: fix aggregation idle shut-off Message-Id: <20190128150635.c22842034ab7e271c6416d2f@linux-foundation.org> In-Reply-To: <20190116193501.1910-1-hannes@cmpxchg.org> References: <20190116193501.1910-1-hannes@cmpxchg.org> X-Mailer: Sylpheed 3.6.0 (GTK+ 2.24.31; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner wrote: > psi has provisions to shut off the periodic aggregation worker when > there is a period of no task activity - and thus no data that needs > aggregating. However, while developing psi monitoring, Suren noticed > that the aggregation clock currently won't stay shut off for good. > > Debugging this revealed a flaw in the idle design: an aggregation run > will see no task activity and decide to go to sleep; shortly > thereafter, the kworker thread that executed the aggregation will go > idle and cause a scheduling change, during which the psi callback will > kick the !pending worker again. This will ping-pong forever, and is > equivalent to having no shut-off logic at all (but with more code!) > > Fix this by exempting aggregation workers from psi's clock waking > logic when the state change is them going to sleep. To do this, tag > workers with the last work function they executed, and if in psi we > see a worker going to sleep after aggregating psi data, we will not > reschedule the aggregation work item. > > What if the worker is also executing other items before or after? > > Any psi state times that were incurred by work items preceding the > aggregation work will have been collected from the per-cpu buckets > during the aggregation itself. If there are work items following the > aggregation work, the worker's last_func tag will be overwritten and > the aggregator will be kept alive to process this genuine new activity. > > If the aggregation work is the last thing the worker does, and we > decide to go idle, the brief period of non-idle time incurred between > the aggregation run and the kworker's dequeue will be stranded in the > per-cpu buckets until the clock is woken by later activity. But that > should not be a problem. The buckets can hold 4s worth of time, and > future activity will wake the clock with a 2s delay, giving us 2s > worth of data we can leave behind when disabling aggregation. If it > takes a worker more than two seconds to go idle after it finishes its > last work item, we likely have bigger problems in the system, and > won't notice one sample that was averaged with a bogus per-CPU weight. > > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu, > groupc->tasks[t]++; > > write_seqcount_end(&groupc->seq); > - > - if (!delayed_work_pending(&group->clock_work)) > - schedule_delayed_work(&group->clock_work, PSI_FREQ); > } > > static struct psi_group *iterate_groups(struct task_struct *task, void **iter) This breaks Suren's "psi: introduce psi monitor": --- kernel/sched/psi.c~psi-introduce-psi-monitor +++ kernel/sched/psi.c @@ -752,8 +1012,25 @@ static void psi_group_change(struct psi_ write_seqcount_end(&groupc->seq); - if (!delayed_work_pending(&group->clock_work)) - schedule_delayed_work(&group->clock_work, PSI_FREQ); + /* + * Polling flag resets to 0 at the max rate of once per update window + * (at least 500ms interval). smp_wmb is required after group->polling + * 0-to-1 transition to order groupc->times and group->polling writes + * because stall detection logic in the slowpath relies on groupc->times + * changing before group->polling. Explicit smp_wmb is missing because + * cmpxchg() implies smp_mb. + */ + if ((state_mask & group->trigger_mask) && + atomic_cmpxchg(&group->polling, 0, 1) == 0) { + /* + * Start polling immediately even if the work is already + * scheduled + */ + mod_delayed_work(system_wq, &group->clock_work, 1); + } else { + if (!delayed_work_pending(&group->clock_work)) + schedule_delayed_work(&group->clock_work, PSI_FREQ); + } } and I'm too lazy to go in and figure out how to fix it. If we're sure about "psi: fix aggregation idle shut-off" (and I am not) then can I ask for a redo of "psi: introduce psi monitor"?