Date: Tue, 1 Aug 2017 08:26:34 -0400
From: Johannes Weiner
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for
	systems and workloads
Message-ID: <20170801122634.GA7237@cmpxchg.org>
References: <20170727153010.23347-1-hannes@cmpxchg.org>
	<20170727153010.23347-4-hannes@cmpxchg.org>
	<20170729091055.GA6524@worktop.programming.kicks-ass.net>
	<20170730152813.GA26672@cmpxchg.org>
	<20170731083111.tgjgkwge5dgt5m2e@hirez.programming.kicks-ass.net>
	<20170731184142.GA30943@cmpxchg.org>
	<20170801075728.GE6524@worktop.programming.kicks-ass.net>
In-Reply-To: <20170801075728.GE6524@worktop.programming.kicks-ass.net>

On Tue, Aug 01, 2017 at 09:57:28AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 31, 2017 at 02:41:42PM -0400, Johannes Weiner wrote:
> > On Mon, Jul 31, 2017 at 10:31:11AM +0200, Peter Zijlstra wrote:
> > > So could you start by describing what actual statistics we need?
> > > Because as is the scheduler already does a gazillion stats and why
> > > can't we repurpose some of those?
> >
> > If that's possible, that would be great of course.
> >
> > We want to be able to tell how many tasks in a domain (the system or a
> > memory cgroup) are inside a memdelay section as opposed to how many
>
> And you haven't even defined wth a memdelay section is yet..

It's what a task is in after it calls memdelay_enter() and before it
calls memdelay_leave(). Tasks mark themselves as being in a memdelay
section when they perform work that is only necessary due to a lack
of memory, such as waiting for a refault or a direct reclaim
invocation.

From the patch:

+/**
+ * memdelay_enter - mark the beginning of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as being delayed due to a lack of memory,
+ * such as waiting for a workingset refault or performing reclaim.
+ */

+/**
+ * memdelay_leave - mark the end of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as no longer delayed due to memory.
+ */

where a reclaim callsite looks like this (decluttered):

	memdelay_enter()
	nr_reclaimed = do_try_to_free_pages()
	memdelay_leave()

That's what defines the "unproductive due to lack of memory" state of
a task. Time spent in that state, weighed against time spent while
the task is productive - runnable or in iowait while not in a
memdelay section - gives the memory health of the task.

And the system and cgroup states/health can be derived from the task
states as described:

> > are in a "productive" state such as runnable or iowait. Then derive
> > from that whether the domain as a whole is unproductive (all non-idle
> > tasks memdelayed), or partially unproductive (some delayed, but CPUs
> > are productive or there are iowait tasks).
> > Then derive the percentages of walltime the domain spends partially
> > or fully unproductive.
> >
> > For that we need per-domain counters for
> >
> > 1) nr of tasks in memdelay sections
> > 2) nr of iowait or runnable/queued tasks that are NOT inside
> >    memdelay sections
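
For illustration, here is a rough sketch of what such per-domain
bookkeeping could look like. None of the names below (memdelay_domain,
nr_delayed, nr_productive, etc.) are from the patch; they are made up
for this example:

/* Illustrative sketch only, not the patch code. */

#include <linux/types.h>

enum memdelay_domain_state {
	MEMDELAY_NONE,	/* no non-idle task is delayed */
	MEMDELAY_PART,	/* some tasks delayed, others productive */
	MEMDELAY_FULL,	/* all non-idle tasks are delayed */
};

struct memdelay_domain {
	int nr_delayed;		/* 1) tasks inside memdelay sections */
	int nr_productive;	/* 2) runnable/iowait tasks outside them */
	enum memdelay_domain_state state;
	u64 state_start;	/* when the current state was entered */
	u64 time_part;		/* walltime spent partially unproductive */
	u64 time_full;		/* walltime spent fully unproductive */
};

/* Derive the domain state from the two task counters. */
static enum memdelay_domain_state domain_state(struct memdelay_domain *md)
{
	if (!md->nr_delayed)
		return MEMDELAY_NONE;
	if (md->nr_productive)
		return MEMDELAY_PART;
	return MEMDELAY_FULL;
}

/*
 * Called with the counters already updated, whenever a task enters or
 * leaves a memdelay section, or changes between runnable/iowait and
 * any other state.
 */
static void domain_state_update(struct memdelay_domain *md, u64 now)
{
	if (md->state == MEMDELAY_PART)
		md->time_part += now - md->state_start;
	else if (md->state == MEMDELAY_FULL)
		md->time_full += now - md->state_start;

	md->state = domain_state(md);
	md->state_start = now;
}

Sampling time_part and time_full over a known walltime interval then
gives the percentages of time the domain was partially or fully
unproductive.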