Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp4005506imu; Mon, 28 Jan 2019 15:13:06 -0800 (PST) X-Google-Smtp-Source: ALg8bN7/ilwTy8lYrcX5Ynq+ihvPa61XiTRtjr4fz4vJ037LsMnFDGhOpbNhTeF3ZR1LogSQJXjJ X-Received: by 2002:a17:902:15a8:: with SMTP id m37mr23892379pla.129.1548717186185; Mon, 28 Jan 2019 15:13:06 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1548717186; cv=none; d=google.com; s=arc-20160816; b=AY7bj119Q2KX7RoM2EsfbRykPY3KW3OOlFoxDOqEtEwj1kDDXnkRJl6nYvN1rHf1SO gsSgH9Mc+uW2ZAClFbEOoj5x5A6Yd5uGclNpnapVIy0Elp6hhFTyFauLdSyFZMYbxFTL 0Wpr2tFOl7icUZdOPQQSJ0mb+lSynbixj6V/UHm+T48uvCY3etpRyy5cozREsWq7Z+dJ S3TZT5E96CyvnBoabAzDSsy7ga0fCU98H4x3Po6ZYRxSAkGS72hNSFxRcBh0yT+SL4Pc gO3JKsczftk8dyXJ9t9EuL7KAxNfx7EvkqOH/xf6+dw5XQy0IG9s8DOwReTOdOG2jk7U 7NuA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=aLzsVhMnW180bMhupK9gGFeM1Lcu8y9rNNOTmdWbJ8U=; b=R9DvYnzBy/GIjQnw+wwJd0cPeAg0zKnqZbr5jpxuZh4EazTklXd/SJC8aB/w8YVIRM toN3klPlwmqmavOokQErR0/z+rvCUsrT9uDzJaooRPCgK9w/vmr0Ltb7EAVfOkpcu9sI XO1c/MVbGfeh6Yo0y5Hbc0tZIzCX6oU58MQJeoRD568Xd60rPaVmTCVeXIEL4fmDPo7h cU8DpYDD+O9F78yPcjWNIvrAiIwScZBLEyDDOWvlRWvwBdG25/J9Y99SOlT6Gb+k8ek/ 5KnJvSeUbq6BqDzcuZVK2D2Gd1V7AUeca94SjJLYku7EUVHl+93bev8Wn/5GPSgNAs7k sGEg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=v4Wmij+7; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e188si33207339pfa.16.2019.01.28.15.12.49; Mon, 28 Jan 2019 15:13:06 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=v4Wmij+7; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727095AbfA1XLJ (ORCPT + 99 others); Mon, 28 Jan 2019 18:11:09 -0500 Received: from mail-wr1-f68.google.com ([209.85.221.68]:32986 "EHLO mail-wr1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726678AbfA1XLJ (ORCPT ); Mon, 28 Jan 2019 18:11:09 -0500 Received: by mail-wr1-f68.google.com with SMTP id p7so20074301wru.0 for ; Mon, 28 Jan 2019 15:11:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=aLzsVhMnW180bMhupK9gGFeM1Lcu8y9rNNOTmdWbJ8U=; b=v4Wmij+7KbgBG9U7bWPnCDkA7fd4rQk9Qc/nvmM/cgtbmouHXI/PAB37Zbf5xZxNV+ OOBW7PLK7oyT/8M7T+6TCslS+Tqx4bNddguefM80gwX13NN/bpEHSkROcyk2uVmitNrS dSIw6HteEqpv0ULPoasUJaDN1yC8ST5v0P3c4TTfHoiYBuCEifDRK7k3RHiZX2RNUBl2 ZVfYV6+RKhbw83PeoG81iUOt6fKS15QaX3Dh26k45xzns40EzqgdJ39oBD/Tlvgw/nYU wctzNyw+kjIoAApReMdpnjty6+z2aV8IUJMOBlL8bX9Y6zTB8A3JwZgulQU/YVYqYLvV h9mg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=aLzsVhMnW180bMhupK9gGFeM1Lcu8y9rNNOTmdWbJ8U=; b=W9S0V/Q9eW6lHU7yb9LqsGazt6l/AbL+V9Pd+2YCV0PVK4NcNif/dFRA0RMDRAvO7r Glos2cjDJkXNKMeCENQaBQT2QHfFqd9boI2GbCxSsuOBsYUaO6yG2ZeG5qacF4Bwu9gO BCkMKCEMMD9CA/qQZR8IPipLg0S98LFj7nv+GWo4zGAiQb5rb4menuHfjL4xkd8EL8+K 7s9Ad7ainfJVHSn3oLP3upVGgEpfxMOE8b9/1jxBh0g7EZt+WRkV98SEg840fnKIcfdA jB7NxRxaSqjGqpPzQsrJpY461E1lkLrFCQZTCvb6ysQehYMCHpJ6ezKcYoXnTuaNreaU q3qg== X-Gm-Message-State: AJcUuke8G7rffGLM/E9z+BxIxAqJ/JvLmIMSshtOOSceUk0ARf073/za 9TsOy9s8/aFLoAoGng0RFGdEh+sRq6+A7ReqQyYHNlfaPR1TFA== X-Received: by 2002:a5d:4e82:: with SMTP id e2mr22991452wru.291.1548717066355; Mon, 28 Jan 2019 15:11:06 -0800 (PST) MIME-Version: 1.0 References: <20190116193501.1910-1-hannes@cmpxchg.org> <20190128150635.c22842034ab7e271c6416d2f@linux-foundation.org> In-Reply-To: <20190128150635.c22842034ab7e271c6416d2f@linux-foundation.org> From: Suren Baghdasaryan Date: Mon, 28 Jan 2019 15:10:55 -0800 Message-ID: Subject: Re: [PATCH] psi: fix aggregation idle shut-off To: Andrew Morton Cc: Johannes Weiner , Peter Zijlstra , Tejun Heo , Lai Jiangshan , LKML Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 28, 2019 at 3:06 PM Andrew Morton wrote: > > On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner wrote: > > > psi has provisions to shut off the periodic aggregation worker when > > there is a period of no task activity - and thus no data that needs > > aggregating. However, while developing psi monitoring, Suren noticed > > that the aggregation clock currently won't stay shut off for good. > > > > Debugging this revealed a flaw in the idle design: an aggregation run > > will see no task activity and decide to go to sleep; shortly > > thereafter, the kworker thread that executed the aggregation will go > > idle and cause a scheduling change, during which the psi callback will > > kick the !pending worker again. This will ping-pong forever, and is > > equivalent to having no shut-off logic at all (but with more code!) > > > > Fix this by exempting aggregation workers from psi's clock waking > > logic when the state change is them going to sleep. To do this, tag > > workers with the last work function they executed, and if in psi we > > see a worker going to sleep after aggregating psi data, we will not > > reschedule the aggregation work item. > > > > What if the worker is also executing other items before or after? > > > > Any psi state times that were incurred by work items preceding the > > aggregation work will have been collected from the per-cpu buckets > > during the aggregation itself. If there are work items following the > > aggregation work, the worker's last_func tag will be overwritten and > > the aggregator will be kept alive to process this genuine new activity. > > > > If the aggregation work is the last thing the worker does, and we > > decide to go idle, the brief period of non-idle time incurred between > > the aggregation run and the kworker's dequeue will be stranded in the > > per-cpu buckets until the clock is woken by later activity. But that > > should not be a problem. The buckets can hold 4s worth of time, and > > future activity will wake the clock with a 2s delay, giving us 2s > > worth of data we can leave behind when disabling aggregation. If it > > takes a worker more than two seconds to go idle after it finishes its > > last work item, we likely have bigger problems in the system, and > > won't notice one sample that was averaged with a bogus per-CPU weight. > > > > --- a/kernel/sched/psi.c > > +++ b/kernel/sched/psi.c > > @@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu, > > groupc->tasks[t]++; > > > > write_seqcount_end(&groupc->seq); > > - > > - if (!delayed_work_pending(&group->clock_work)) > > - schedule_delayed_work(&group->clock_work, PSI_FREQ); > > } > > > > static struct psi_group *iterate_groups(struct task_struct *task, void **iter) > > This breaks Suren's "psi: introduce psi monitor": > > --- kernel/sched/psi.c~psi-introduce-psi-monitor > +++ kernel/sched/psi.c > @@ -752,8 +1012,25 @@ static void psi_group_change(struct psi_ > > write_seqcount_end(&groupc->seq); > > - if (!delayed_work_pending(&group->clock_work)) > - schedule_delayed_work(&group->clock_work, PSI_FREQ); > + /* > + * Polling flag resets to 0 at the max rate of once per update window > + * (at least 500ms interval). smp_wmb is required after group->polling > + * 0-to-1 transition to order groupc->times and group->polling writes > + * because stall detection logic in the slowpath relies on groupc->times > + * changing before group->polling. Explicit smp_wmb is missing because > + * cmpxchg() implies smp_mb. > + */ > + if ((state_mask & group->trigger_mask) && > + atomic_cmpxchg(&group->polling, 0, 1) == 0) { > + /* > + * Start polling immediately even if the work is already > + * scheduled > + */ > + mod_delayed_work(system_wq, &group->clock_work, 1); > + } else { > + if (!delayed_work_pending(&group->clock_work)) > + schedule_delayed_work(&group->clock_work, PSI_FREQ); > + } > } > > and I'm too lazy to go in and figure out how to fix it. > > If we're sure about "psi: fix aggregation idle shut-off" (and I am not) > then can I ask for a redo of "psi: introduce psi monitor"? Yes, I'll redo it after applying "psi: fix aggregation idle shut-off". I have to repost it anyway to address the comment from Johannes.