Subject: Re: [PATCH] sched/numa: advanced per-cgroup numa statistic
To: Mel Gorman
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall,
 linux-kernel@vger.kernel.org
References: <46b0fd25-7b73-aa80-372a-9fcd025154cb@linux.alibaba.com>
 <20191030095505.GF28938@suse.de>
 <6f5e43db-24f1-5283-0881-f264b0d5f835@linux.alibaba.com>
 <20191031131731.GJ28938@suse.de>
 <5d69ff1b-a477-31b5-8600-9233a38445c7@linux.alibaba.com>
 <20191101091348.GM28938@suse.de>
From: 王贇
Message-ID: <2573b108-7885-5c4f-a0ae-2b245d663250@linux.alibaba.com>
Date: Fri, 1 Nov 2019 19:52:15 +0800
User-Agent: Mozilla/5.0
 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.9.0
MIME-Version: 1.0
In-Reply-To: <20191101091348.GM28938@suse.de>
Content-Type: text/plain; charset=UTF-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On 2019/11/1 5:13 PM, Mel Gorman wrote:
[snip]
>> For example, in our case we could have hundreds of cgroups, each containing
>> hundreds of tasks, and these worker threads can live and die at any moment.
>> To gather the data we need to cat the list of tasks and then read these proc
>> files one by one, which traps into the kernel frequently and may even need
>> to hold some locks. This has a big latency impact and gives inaccurate
>> output, since some tasks may have already died before their data is read.
>>
>> Then, before the next sample window, the info of tasks that died during the
>> window can't be acquired anymore.
>>
>> We need the kernel's help to preserve the data, since a tool can't catch it
>> in time before it is lost. We also have to avoid rapid proc reading, which
>> really costs a lot and, furthermore, introduces big latency in each sample
>> window.
>>
>
> There is somewhat of a disconnect here. You say that the information must
> be accurate and historical yet are relying on NUMA hinting faults to build
> the picture which may not be accurate at all given that faults are not
> guaranteed to happen. For short-lived tasks, it is also potentially skewed
> information if short-lived tasks dominated remote accesses for whatever
> reason even though it does not matter -- the tasks were short-lived and
> their performance is probably irrelevant. Short-lived tasks may not even
> show up if they do not run longer than sysctl_numa_balancing_scan_delay
> so the data gathered already has holes in it.
>
> While it's a bit more of a stretch, even this could still be done from
> userspace if numa_hint_fault was probed and the event handled (eBPF,
> systemtap etc) to build the picture or add a tracepoint. That would give
> a much higher degree of flexibility on what information is tracked and
> allow flexibility on
>
> So, overall I think this can be done outside the kernel but recognise
> that it may not be suitable in all cases. If you feel it must be done
> inside the kernel, split out the patch that adds information on failed
> page migrations as it stands apart. Put it behind its own kconfig entry
> that is disabled by default -- do not tie it directly to NUMA balancing
> because of the data structure changes. When enabled, it should still be
> disabled by default at runtime and only activated via kernel command line
> parameter so that the only people who pay the cost are those that take
> deliberate action to enable it.

Agreed. We could expose the per-task fault info there, which makes it
possible to implement a practical userland tool, and meanwhile keep the
kernel NUMA data disabled by default; folks who have no tool but want easy
monitoring can just turn on the switch :-)

The next version will have:

  * a separate patch for showing per-task fault info
  * a new CONFIG for the NUMA stat (disabled by default)
  * a dynamic runtime switch for the NUMA stat (disabled by default)
  * doc explaining the NUMA stat and giving hints on how to handle it

Best Regards,
Michale Wang

>
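
For illustration only, the userspace gathering loop described in the quoted
text (list the tasks of a cgroup, then read one proc file per task) could
look roughly like the Python sketch below. It is not taken from the patch or
from this thread. The names read_pids, parse_stat_lines and sample_cgroup
are hypothetical, and using /proc/<pid>/sched as the per-task source is an
assumption: which fields appear there depends on kernel version and
configuration. The sketch only shows where tasks get lost between listing
and reading.

```python
import os
from typing import Dict, List, Tuple


def read_pids(cgroup_dir: str) -> List[int]:
    """Return the PIDs listed in <cgroup_dir>/cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]


def parse_stat_lines(text: str) -> Dict[str, str]:
    """Loosely parse 'key : value' lines; lines without both parts are skipped."""
    stats: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            stats[key.strip()] = value.strip()
    return stats


def sample_cgroup(
    cgroup_dir: str,
    proc_root: str = "/proc",
    stat_file: str = "sched",  # assumption: per-task stats are read from here
) -> Tuple[Dict[int, Dict[str, str]], List[int]]:
    """Take one sample: per-PID stats, plus the PIDs that vanished meanwhile."""
    samples: Dict[int, Dict[str, str]] = {}
    lost: List[int] = []
    for pid in read_pids(cgroup_dir):
        path = os.path.join(proc_root, str(pid), stat_file)
        try:
            # One open/read per task: with hundreds of cgroups holding
            # hundreds of tasks each, this is the expensive part.
            with open(path) as f:
                samples[pid] = parse_stat_lines(f.read())
        except (FileNotFoundError, ProcessLookupError):
            # The task exited after it was listed; its data is gone.
            lost.append(pid)
    return samples, lost
```

Calling sample_cgroup() once per sample window returns the stats that could
still be read plus the list of PIDs that disappeared in between. Tasks that
were born and died entirely between two calls never show up in either.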