Date: Mon, 31 Jul 2017 14:41:42 -0400
From: Johannes Weiner
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for
    systems and workloads
Message-ID: <20170731184142.GA30943@cmpxchg.org>
In-Reply-To: <20170731083111.tgjgkwge5dgt5m2e@hirez.programming.kicks-ass.net>

On Mon, Jul 31, 2017 at 10:31:11AM +0200, Peter Zijlstra wrote:
> On Sun, Jul 30, 2017 at 11:28:13AM -0400, Johannes Weiner wrote:
> > On Sat, Jul 29, 2017 at 11:10:55AM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 27, 2017 at 11:30:10AM -0400, Johannes Weiner wrote:
> > > > +static void domain_cpu_update(struct memdelay_domain *md, int cpu,
> > > > +			      int old, int new)
> > > > +{
> > > > +	enum memdelay_domain_state state;
> > > > +	struct memdelay_domain_cpu *mdc;
> > > > +	unsigned long now, delta;
> > > > +	unsigned long flags;
> > > > +
> > > > +	mdc = per_cpu_ptr(md->mdcs, cpu);
> > > > +	spin_lock_irqsave(&mdc->lock, flags);
> > >
> > > Afaict this is inside scheduler locks, this cannot be a
> > > spinlock. Also, do we really want to add more atomics there?
> >
> > I think we should be able to get away without an additional lock and
> > rely on the rq lock instead. schedule, enqueue, dequeue already hold
> > it, memdelay_enter/leave could be added. I need to think about what to
> > do with try_to_wake_up in order to get the cpu move accounting inside
> > the locked section of ttwu_queue(), but that should be doable too.
>
> So could you start by describing what actual statistics we need? Because
> as is the scheduler already does a gazillion stats and why can't we
> repurpose some of those?

If that's possible, that would be great of course.

We want to be able to tell how many tasks in a domain (the system or a
memory cgroup) are inside a memdelay section as opposed to how many
are in a "productive" state such as runnable or iowait. Then derive
from that whether the domain as a whole is unproductive (all non-idle
tasks memdelayed), or partially unproductive (some delayed, but CPUs
are productive or there are iowait tasks). Then derive the percentages
of walltime the domain spends partially or fully unproductive.

For that we need per-domain counters for

	1) nr of tasks in memdelay sections
	2) nr of iowait or runnable/queued tasks that are NOT inside
	   memdelay sections

The memdelay and runnable counts need to be per-cpu as well. (The idea
is this: if you have one CPU and some tasks are delayed while others
are runnable, you're 100% partially productive, as the CPU is fully
used. But if you have two CPUs, and the tasks on one CPU are all
runnable while the tasks on the other are all delayed, the domain is
50% of the time fully unproductive (and not 100% partially productive)
as half the available CPU time is being squandered by delays).

On the system-level, we already count runnable/queued per cpu through
rq->nr_running. However, we need to distinguish between productive
runnables and tasks that are runnable while in a memdelay section
(doing reclaim).
The current counters don't do that.

Lastly, and somewhat obscurely, the presence of runnable tasks means
that usually the domain is at least partially productive. But if the
CPU is used by a task in a memdelay section (direct reclaim), the
domain is fully unproductive (unless there are iowait tasks in the
domain, since they make "progress" without CPU). So we need to track
task_current() && task_memdelayed() per-domain per-cpu as well.

Now, thinking only about the system-level, we could split
rq->nr_running into separate delayed and non-delayed counters
(presenting them as a sum on all current read sides).

Adding an rq counter for tasks inside memdelay sections should be
straight-forward as well (except for maybe the migration cost of that
state between CPUs in ttwu that Mike pointed out).

That leaves the question of how to track these numbers per cgroup at
an acceptable cost. The idea for a tree of cgroups is that the walltime
impact of delays at each level is reported for all tasks at or below
that level. E.g. a leaf group aggregates the state of its own tasks,
the root/system aggregates the state of all tasks in the system; hence
the propagation of the task state counters up the hierarchy.
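
To make the state derivation described above a bit more concrete, here
is a rough userspace sketch. It is only an illustration under stated
assumptions, not the code from the patch: struct md_cpu, enum md_state,
md_cpu_state(), md_cpu_account() and md_domain_percent() are all
made-up names, locking is left out entirely (in the kernel this would
presumably sit under the rq lock), and time is in arbitrary units.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this is taken from the patch. */
enum md_state { MD_NONE, MD_PART, MD_FULL, NR_MD_STATES };

/* Per-CPU task counts and state times for one domain. */
struct md_cpu {
	unsigned int delayed;	/* tasks inside a memdelay section */
	unsigned int runnable;	/* runnable/queued, NOT in a memdelay section */
	unsigned int iowait;	/* iowait, NOT in a memdelay section */
	int curr_delayed;	/* task using the CPU is in a memdelay section */
	unsigned long time[NR_MD_STATES];	/* time spent in each state */
};

/* Derive the productivity state of one CPU from its counters. */
static enum md_state md_cpu_state(const struct md_cpu *c)
{
	/* The task on the CPU is itself counted in ->delayed. */
	assert(!c->curr_delayed || c->delayed > 0);

	if (!c->delayed)
		return MD_NONE;
	/* CPU is burned by reclaim: only iowait tasks still make progress. */
	if (c->curr_delayed && !c->iowait)
		return MD_FULL;
	return (c->runnable || c->iowait) ? MD_PART : MD_FULL;
}

/* Charge delta time units to the state the CPU is currently in. */
static void md_cpu_account(struct md_cpu *c, unsigned long delta)
{
	c->time[md_cpu_state(c)] += delta;
}

/* Percentage of the available CPU walltime the domain spent in state. */
static unsigned int md_domain_percent(const struct md_cpu *cpus,
				      size_t nr_cpus, enum md_state state,
				      unsigned long wall)
{
	unsigned long sum = 0;
	size_t i;

	assert(nr_cpus > 0 && wall > 0);
	for (i = 0; i < nr_cpus; i++)
		sum += cpus[i].time[state];
	return (unsigned int)(sum * 100 / (wall * nr_cpus));
}
```

With two CPUs, one holding only runnable tasks and the other only
delayed tasks, accounting the same interval on both makes
md_domain_percent() report the domain as 50% MD_FULL and 0% MD_PART,
which matches the two-CPU case in the parenthetical above. Averaging
per-CPU state times is one possible way to aggregate; the patch may
well do it differently.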