Subject: Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
From: 王贇 <yun.wang@linux.alibaba.com>
To: Mel Gorman
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
    Kees Cook, Iurii Zaikin, Michal Koutný, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
    "Paul E. McKenney"
Date: Thu, 28 Nov 2019 10:09:13 +0800
Message-ID: <3ff78d18-fa29-13f3-81e5-a05537a2e344@linux.alibaba.com>
In-Reply-To: <20191127101932.GN28938@suse.de>
References: <743eecad-9556-a241-546b-c8a66339840e@linux.alibaba.com>
 <207ef46c-672c-27c8-2012-735bd692a6de@linux.alibaba.com>
 <9354ffe8-81ba-9e76-e0b3-222bc942b3fc@linux.alibaba.com>
 <20191127101932.GN28938@suse.de>

On 2019/11/27 at 6:19 PM, Mel Gorman wrote:
> On Wed, Nov 27, 2019 at 09:49:34AM +0800, 王贇 wrote:
>> Currently there is no good approach to monitoring per-cgroup
>> NUMA efficiency. This can be troublesome, especially when groups
>> are sharing CPUs: reading a hardware counter cannot tell which
>> group caused the remote-memory accesses, since multiple workloads
>> may be sharing the same CPU, which makes it painful to find the
>> root cause and fix the issue.
>
> It's already possible to identify specific tasks triggering PMU events
> so this is not exactly true.

I should fix the description regarding this...

I think you mean tools like numatop, which show per-task local/remote
access info from the PMU, correct?

That is a good one for debugging, but when we are talking about
monitoring a cluster shared by multiple users, it is still not very
practical compared to classified historical data per workload.

I'm also not sure about the overhead and limitations of the PMU
approach, or whether there are platforms where it is not yet supported;
that is worth a survey.
>
>> In order to address this, we introduced new per-cgroup statistics
>> for NUMA:
>>   * the numa locality to imply the numa balancing efficiency
>>   * the numa execution time on each node
>>
>> [snip]
>>
>> +#ifdef CONFIG_PROC_SYSCTL
>> +int sysctl_cg_numa_stat(struct ctl_table *table, int write,
>> +		void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	struct ctl_table t;
>> +	int err;
>> +	int state = static_branch_likely(&sched_cg_numa_stat);
>> +
>> +	if (write && !capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	t = *table;
>> +	t.data = &state;
>> +	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
>> +	if (err < 0 || !write)
>> +		return err;
>> +
>> +	if (state)
>> +		static_branch_enable(&sched_cg_numa_stat);
>> +	else
>> +		static_branch_disable(&sched_cg_numa_stat);
>> +
>> +	return err;
>> +}
>> +#endif
>> +
>
> Why is this implemented as a toggle? I'm finding it hard to make sense
> of this. The numa_stat should not even exist if the feature is disabled.

The numa_stat entry will not exist if the CONFIG option is not enabled;
do you mean it should also disappear when the feature is dynamically
turned off?

> Assuming that is fixed, then the runtime overhead is fine, but the same
> issues with the quality of the information relying on NUMA balancing
> limit the usefulness of this. Disabling NUMA balancing, or the scan
> rate dropping to a very low frequency, would lead to misleading
> conclusions, as well as false positives if the CPU and memory policies
> force remote memory usage. Similarly, the timing of the information
> available is variable due to how numa_faults_locality gets reset, so
> sometimes the information is fine-grained and sometimes it's
> coarse-grained. It will also pretend to display useful information
> even if NUMA balancing is disabled.

The data just represent what we traced from NUMA balancing page faults,
so yes, folks need some understanding of NUMA balancing to figure out
the real meaning behind locality.
We want it to tell the real story: if NUMA balancing is disabled, or
the scan rate has dropped very low, the locality increments should be
very small; when balancing keeps failing for memory-policy or
CPU-binding reasons, we want it to tell how bad the situation is.
Locality just shows us how NUMA balancing is performing, and the data
can carry a lot of information, since how the OS deals with NUMA can be
complicated...

> I find it hard to believe it would be useful in practice, and I think
> users would have real trouble interpreting the data given how much it
> relies on internal implementation details of NUMA balancing. I cannot
> be certain, as clearly something motivated the creation of this patch,
> although it's unclear if it has ever been used to debug and fix an
> actual problem in the field. Hence, I'm neutral on the patch and will
> neither ack nor nack it, and will defer to the scheduler maintainers;
> but if I was pushed on it, I would be disinclined to merge the patch
> due to the potential confusion caused by users who believe it provides
> accurate information when at best it gives a rough approximation with
> variable granularity.

We already have this feature enabled on our cluster (an old version,
but still helpful); when we want to debug NUMA issues, it gives good
hints.

Consider it like load_1/5/15: not accurate, but it tells the trend of
system behavior. Locality gives the trend of NUMA balancing as long as
balancing is working; when it increases only slowly, that means
locality is already good enough, or there is no more memory to adjust,
and that's fine. Those who disable NUMA balancing do their own NUMA
optimization and find their own ways to estimate the results.

Anyway, thanks for all the good input from your side :-)

Regards,
Michael Wang