Date: Thu, 10 May 2018 09:41:32 -0400
From: Johannes Weiner
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-block@vger.kernel.org, cgroups@vger.kernel.org, Ingo Molnar,
 Andrew Morton, Tejun Heo, Balbir Singh, Mike Galbraith, Oliver Yang,
 Shakeel Butt, xxx xxx, Taras Kondratiuk, Daniel Walker, Vinayak Menon,
 Ruslan Ruslichenko,
 kernel-team@fb.com
Subject: Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO
Message-ID: <20180510134132.GA19348@cmpxchg.org>
References: <20180507210135.1823-1-hannes@cmpxchg.org>
 <20180507210135.1823-7-hannes@cmpxchg.org>
 <20180509104618.GP12217@hirez.programming.kicks-ass.net>
 <20180509113849.GJ12235@hirez.programming.kicks-ass.net>
In-Reply-To: <20180509113849.GJ12235@hirez.programming.kicks-ass.net>

On Wed, May 09, 2018 at 01:38:49PM +0200, Peter Zijlstra wrote:
> On Wed, May 09, 2018 at 12:46:18PM +0200, Peter Zijlstra wrote:
> > On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > > @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > >  	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> > >  	if (task_cpu(p) != cpu) {
> > >  		wake_flags |= WF_MIGRATED;
> > > +		psi_ttwu_dequeue(p);
> > >  		set_task_cpu(p, cpu);
> > >  	}
> >
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > +	/*
> > > +	 * Is the task being migrated during a wakeup? Make sure to
> > > +	 * deregister its sleep-persistent psi states from the old
> > > +	 * queue, and let psi_enqueue() know it has to requeue.
> > > +	 */
> > > +	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > +		struct rq_flags rf;
> > > +		struct rq *rq;
> > > +		int clear = 0;
> > > +
> > > +		if (p->in_iowait)
> > > +			clear |= TSK_IOWAIT;
> > > +		if (p->flags & PF_MEMSTALL)
> > > +			clear |= TSK_MEMSTALL;
> > > +
> > > +		rq = __task_rq_lock(p, &rf);
> > > +		update_rq_clock(rq);
> > > +		psi_task_change(p, rq_clock(rq), clear, 0);
> > > +		p->sched_psi_wake_requeue = 1;
> > > +		__task_rq_unlock(rq, &rf);
> > > +	}
> > > +}
> >
> > Yeah, no... not happening.
> >
> > We spend a lot of time to never touch the old rq->lock on wakeups. Mason
> > was the one pushing for that, so he should very well know this.
> >
> > The one cross-cpu atomic (iowait) is already a problem (the whole iowait
> > accounting being useless makes it even worse), adding significant remote
> > prodding is just really bad.
>
> Also, since all you need is the global number, I don't think you
> actually need any of this. See what we do for nr_uninterruptible.
>
> In general I think you want to (re)read loadavg.c some more, and maybe
> reuse a bit more of that.

So there is a reason I'm tracking productivity states per-cpu and not
globally. Consider the following example periods on two CPUs:

    CPU 0
    Task 1: | EXECUTING  | memstalled |
    Task 2: | runqueued  | EXECUTING  |

    CPU 1
    Task 3: | memstalled | EXECUTING  |

If we tracked only the global number of stalled tasks, similarly to
nr_uninterruptible, the number would be elevated throughout the whole
sampling period, giving a pressure value of 100% for "some stalled".
And, since there is always something executing, a "full stall" of 0%.

Now consider what happens when the Task 3 sequence is the other way
around:

    CPU 0
    Task 1: | EXECUTING  | memstalled |
    Task 2: | runqueued  | EXECUTING  |

    CPU 1
    Task 3: | EXECUTING  | memstalled |

Here the number of stalled tasks is elevated only during half of the
sampling period, this time giving a pressure reading of 50% for "some"
(and again 0% for "full").

That's a different measurement, but in terms of workload progress, the
sequences are functionally equivalent. In both scenarios the same
amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
and the same amount of potentially productive CPU time is lost due to
the contention of memory. We really ought to read the same pressure.

So what I'm doing is calculating the productivity loss on each CPU in
a sampling period as if they were independent time slices.
It doesn't matter how you slice and dice the sequences within each one
- if used CPU time and lost CPU time have the same proportion, we have
the same pressure.

In both scenarios above, this method will give a pressure reading of
some=50% and full=25%* of "normalized walltime", which is the time
loss the work would experience on a single CPU executing it serially.

To illustrate:

    CPU X    1            2            3            4
    Task 1: | EXECUTING  | memstalled | sleeping   | sleeping   |
    Task 2: | runqueued  | EXECUTING  | sleeping   | sleeping   |
    Task 3: | sleeping   | sleeping   | EXECUTING  | memstalled |

You can clearly see the 50% of walltime in which *somebody* isn't
advancing (2 and 4), and the 25% of walltime in which *no* tasks are
(4). Same amount of work, same memory stalls, same pressure numbers.

Globalized state tracking would produce those numbers on the single
CPU (obviously), but once concurrency gets into the mix, it's
questionable what its results mean. It certainly isn't able to
reliably detect equivalent slowdowns of individual tasks ("some" is
all over the place), and in this example wasn't able to capture the
impact of contention on overall work completion ("full" is 0%).

*
 CPU 0: some = 50%, full =  0%
 CPU 1: some = 50%, full = 50%
   avg: some = 50%, full = 25%