Date: Fri, 1 Nov 2019 09:13:48 +0000
From: Mel Gorman
To: ??????
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/numa: advanced per-cgroup numa statistic
Message-ID: <20191101091348.GM28938@suse.de>
References: <46b0fd25-7b73-aa80-372a-9fcd025154cb@linux.alibaba.com>
 <20191030095505.GF28938@suse.de>
 <6f5e43db-24f1-5283-0881-f264b0d5f835@linux.alibaba.com>
 <20191031131731.GJ28938@suse.de>
 <5d69ff1b-a477-31b5-8600-9233a38445c7@linux.alibaba.com>
In-Reply-To: <5d69ff1b-a477-31b5-8600-9233a38445c7@linux.alibaba.com>

On Fri, Nov 01, 2019 at 09:49:20AM +0800, ?????? wrote:
>
> On 2019/10/31 9:17 PM, Mel Gorman wrote:
> > On Thu, Oct 31, 2019 at 11:31:24AM +0800, ?????? wrote:
> [snip]
> >> For example, tasks bound to the CPUs of node_0 could have their memory
> >> on node_1 when node_0 has almost run out of memory. NUMA balancing may
> >> not be able to help in this case, while by reading the locality we could
> >> know how critical the problem is, and may take action to rebind the CPUs
> >> to node_1 or reclaim the memory of node_0.
> >>
> >
> > You can already do this by walking each cgroup, identifying what tasks
> > are in it and looking at /proc/PID/numa_maps and /proc/PID/status to see
> > what CPU bindings, if any, exist. This would use the actual memory
> > placements and not those sampled by NUMA balancing, would not require
> > NUMA balancing and would work on older kernels. It would be costly to
> > access so I would not suggest doing it at high frequency, but it makes
> > sense for the tool that cares to pay the cost instead of spreading tiny
> > bits of cost to every task whether there is an interested tool or not.
>
> I see the biggest concern now is the necessity of having the kernel
> provide this data. IMHO there are actually good reasons here:
>   * there are too many tasks to gather data from; reading proc costs a lot
>   * tasks are living and dying; data is lost between each sample window
>
> For example, in our case we could have hundreds of cgroups, each containing
> hundreds of tasks. These worker threads could live and die at any moment;
> to gather the data we need to cat the list of tasks and then read these
> proc files one by one, which drops into the kernel rapidly and may even
> need to hold some locks. This introduces a big latency impact and gives no
> accurate output, since some tasks may have already died before we read
> their data.
>
> Then, before the next sample window, info on tasks that died during the
> window can't be acquired anymore.
>
> We need the kernel's help to preserve the data, since the tool can't catch
> it in time before it is lost; we also have to avoid rapid proc reading,
> which really costs a lot and, furthermore, introduces big latency on each
> sample window.
>

There is somewhat of a disconnect here. You say that the information must
be accurate and historical, yet you are relying on NUMA hinting faults to
build the picture, which may not be accurate at all given that faults are
not guaranteed to happen. For short-lived tasks, it is also potentially
skewed information if short-lived tasks dominated remote accesses for
whatever reason, even though it does not matter -- the tasks were
short-lived and their performance is probably irrelevant. Short-lived tasks
may not even show up if they do not run longer than
sysctl_numa_balancing_scan_delay, so the data gathered already has holes
in it.
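For reference, a minimal sketch of the userspace approach quoted above
(walking the members of a cgroup and reading /proc/PID/numa_maps and
/proc/PID/status for each task) might look like the following; the cgroup
path, the script name and the parsing details are assumptions made for
illustration, not part of the original suggestion:

    #!/usr/bin/env python3
    # Minimal sketch, not the patch under discussion: walk the tasks of one
    # cgroup and summarise per-node memory placement plus CPU/memory bindings
    # from procfs. The default cgroup path below is a placeholder.
    import os
    import re
    import sys
    from collections import Counter

    def cgroup_pids(cgroup_dir):
        """Read member PIDs of a cgroup (cgroup.procs exists in v1 and v2)."""
        with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
            return [int(line) for line in f if line.strip()]

    def numa_pages(pid):
        """Count pages per NUMA node from /proc/PID/numa_maps (N<node>=<pages>)."""
        pages = Counter()
        with open(f"/proc/{pid}/numa_maps") as f:
            for line in f:
                for node, count in re.findall(r"N(\d+)=(\d+)", line):
                    pages[int(node)] += int(count)
        return pages

    def bindings(pid):
        """Pull Cpus_allowed_list and Mems_allowed_list from /proc/PID/status."""
        fields = {}
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith(("Cpus_allowed_list", "Mems_allowed_list")):
                    key, value = line.split(":", 1)
                    fields[key] = value.strip()
        return fields

    if __name__ == "__main__":
        cgroup = sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/cpu/example"
        total = Counter()
        for pid in cgroup_pids(cgroup):
            try:
                total.update(numa_pages(pid))
                print(pid, bindings(pid))
            except (FileNotFoundError, ProcessLookupError):
                pass  # task exited between listing and reading, the race noted above
        print("pages per node:", dict(total))

Invoked as, say, "python3 numa_locality.py /sys/fs/cgroup/cpu/mygroup"
(both names hypothetical), it reports actual placements rather than
fault-sampled ones, and simply skips tasks that exit mid-scan.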
While it's a bit more of a stretch, even this could still be done from
userspace if the NUMA hinting fault path was probed and the event handled
(eBPF, systemtap etc) to build the picture, or a tracepoint could be added.
That would give a much higher degree of flexibility on what information is
tracked.

So, overall I think this can be done outside the kernel but recognise that
it may not be suitable in all cases. If you feel it must be done inside the
kernel, split out the patch that adds information on failed page migrations
as it stands apart. Put it behind its own kconfig entry that is disabled by
default -- do not tie it directly to NUMA balancing because of the data
structure changes. When enabled, it should still be disabled by default at
runtime and only activated via a kernel command line parameter, so that the
only people who pay the cost are those that take deliberate action to
enable it.

-- 
Mel Gorman
SUSE Labs
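A rough sketch of the probing alternative mentioned above, assuming the bcc
toolkit is installed and that task_numa_fault() can be kprobed on the
running kernel; the per-pid aggregation and the 10 second sampling window
are illustrative choices, not an existing tool:

    #!/usr/bin/env python3
    # Sketch: count NUMA hinting fault pages per pid from userspace by
    # kprobing task_numa_fault(), instead of keeping per-cgroup counters
    # in the kernel. Assumes a bcc recent enough for map.increment(key, n).
    from time import sleep
    from bcc import BPF

    prog = r"""
    #include <uapi/linux/ptrace.h>

    BPF_HASH(fault_pages, u32, u64);

    /* task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) */
    int kprobe__task_numa_fault(struct pt_regs *ctx, int last_cpupid,
                                int mem_node, int pages, int flags)
    {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        fault_pages.increment(pid, pages);
        return 0;
    }
    """

    b = BPF(text=prog)
    print("Sampling NUMA hinting faults for 10 seconds...")
    sleep(10)
    for pid, pages in sorted(b["fault_pages"].items(), key=lambda kv: -kv[1].value):
        print(f"pid {pid.value}: {pages.value} hinting-fault pages")

On kernels that provide the bpf_get_current_cgroup_id() helper, keying the
map on the cgroup id rather than the pid would give the per-cgroup view
directly, without any new kernel counters.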