Date: Mon, 17 Feb 2020 14:16:16 +0000
From: Mel Gorman
To: 王贇
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
    Michal Koutný, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-doc@vger.kernel.org, "Paul E. McKenney", Randy Dunlap, Jonathan Corbet
McKenney" , Randy Dunlap , Jonathan Corbet Subject: Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info Message-ID: <20200217141616.GB3420@suse.de> References: <20200214151048.GL14914@hirez.programming.kicks-ass.net> <20200217115810.GA3420@suse.de> <881deb50-163e-442a-41ec-b375cc445e4d@linux.alibaba.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <881deb50-163e-442a-41ec-b375cc445e4d@linux.alibaba.com> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 17, 2020 at 09:23:52PM +0800, ?????? wrote: > > > On 2020/2/17 ??????7:58, Mel Gorman wrote: > [snip] > >> Mel, I suspect you still feel that way, right? > >> > > > > Yes, I still think it would be a struggle to interpret the data > > meaningfully without very specific knowledge of the implementation. If > > the scan rate was constant, it would be easier but that would make NUMA > > balancing worse overall. Similarly, the stat might get very difficult to > > interpret when NUMA balancing is failing because of a load imbalance, > > pages are shared and being interleaved or NUMA groups span multiple > > active nodes. > > Hi, Mel, appreciated to have you back on the table :-) > > IMHO the scan period changing should not be a problem now, since the > maximum period is defined by user, so monitoring at maximum period > on the accumulated page accessing counters is always meaningful, correct? > It has meaning but the scan rate drives the fault rate which is the basis for the stats you accumulate. If the scan rate is high when accesses are local, the stats can be skewed making it appear the task is much more local than it may really is at a later point in time. The scan rate affects the accuracy of the information. The counters have meaning but they needs careful interpretation. > FYI, by monitoring locality, we found that the kvm vcpu thread is not > covered by NUMA Balancing, whatever how many maximum period passed, the > counters are not increasing, or very slowly, although inside guest we are > copying memory. > > Later we found such task rarely exit to user space to trigger task > work callbacks, and NUMA Balancing scan depends on that, which help us > realize the importance to enable NUMA Balancing inside guest, with the > correct NUMA topo, a big performance risk I'll say :-P > Which is a very interesting corner case in itself but also one that could have potentially have been inferred from monitoring /proc/vmstat numa_pte_updates or on a per-task basis by monitoring /proc/PID/sched and watching numa_scan_seq and total_numa_faults. Accumulating the information on a per-cgroup basis would require a bit more legwork. > Maybe not a good example, but we just try to highlight that NUMA Balancing > could have issue in some cases, and we want them to be exposed, somehow, > maybe by the locality. > Again, I'm somewhat neutral on the patch simply because I would not use the information for debugging problems with NUMA balancing. I would try using tracepoints and if the tracepoints were not good enough, I'd add or fix them -- similar to what I had to do with sched_stick_numa recently. The caveat is that I mostly look at this sort of problem as a developer. Sysadmins have very different requirements, especially simplicity even if the simplicity in this case is an illusion. -- Mel Gorman SUSE Labs