Date: Mon, 28 Oct 2019 14:02:22 +0100
From: Peter Zijlstra
To: 王贇
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/numa: advanced per-cgroup numa statistic
Message-ID: <20191028130222.GM4131@hirez.programming.kicks-ass.net>
In-Reply-To: <46b0fd25-7b73-aa80-372a-9fcd025154cb@linux.alibaba.com>

On Thu, Oct 24, 2019 at 11:08:01AM +0800, 王贇 wrote:
> Currently there is no good approach to monitoring per-cgroup NUMA
> efficiency. This is a problem especially when groups are sharing
> CPUs: it's impossible to tell which group caused the remote memory
> accesses just by reading hardware counters, since multiple workloads
> can share the same CPU, which makes it painful to find the root
> cause and fix the issue.
>
> To address this, we introduce new per-cgroup NUMA statistics:
>
>   * the NUMA locality, to reflect NUMA balancing efficiency
>   * the NUMA execution time on each node
>
> Task locality is the local page-access ratio traced on NUMA-balancing
> page faults; group locality is the distribution of task execution
> time, sectioned by locality into 8 regions.
>
> For example, the new entry 'cpu.numa_stat' shows:
>
>   locality 15393 21259 13023 44461 21247 17012 28496 145402
>   exectime 311900 407166
>
> Here we can see the workloads executed for 311900ms on node_0 and
> 407166ms on node_1; tasks with locality around 0~12% executed for
> 15393ms, while tasks with locality around 88~100% executed for
> 145402ms, which implies that most of this group's memory accesses
> are local.
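[ For reference, the region index in the patch quoted further below
  works out as follows: with 8 regions, a task whose recent NUMA
  faults were, say, 90% local is accounted as

	idx = NR_NL_INTERVAL * local / (remote + local + 1)
	    = 8 * 90 / (10 + 90 + 1) = 7

  i.e. the 88~100% bucket, while a task with only 10% local faults
  lands in region 0. Illustrative numbers only, not from the patch. ]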
> By monitoring these new statistics, we can tell the NUMA efficiency
> of each cgroup's workloads on the machine, whether they share CPUs
> or not, and find out which one introduces most of the remote
> accesses.
>
> Besides, the per-node memory topology from 'memory.numa_stat'
> becomes more useful once we have per-node execution time: a workload
> that always executes on node_0 while its memory is all on node_1 is
> usually a bad case.
>
> Cc: Peter Zijlstra
> Cc: Michal Koutný
> Signed-off-by: Michael Wang

Mel, can you have a peek at this too?

So this is the part I like least:

> +DEFINE_PER_CPU(struct numa_stat, root_numa_stat);
> +
> +int alloc_tg_numa_stat(struct task_group *tg)
> +{
> +	tg->numa_stat = alloc_percpu(struct numa_stat);
> +	if (!tg->numa_stat)
> +		return 0;
> +
> +	return 1;
> +}
> +
> +void free_tg_numa_stat(struct task_group *tg)
> +{
> +	free_percpu(tg->numa_stat);
> +}
> +
> +static void update_tg_numa_stat(struct task_struct *p)
> +{
> +	struct task_group *tg;
> +	unsigned long remote = p->numa_faults_locality[3];
> +	unsigned long local = p->numa_faults_locality[4];
> +	int idx = -1;
> +
> +	/* To be scaled? */
> +	if (remote || local)
> +		idx = NR_NL_INTERVAL * local / (remote + local + 1);
> +
> +	rcu_read_lock();
> +
> +	tg = task_group(p);
> +	while (tg) {
> +		/* skip accounting when there are no fault records */
> +		if (idx != -1)
> +			this_cpu_inc(tg->numa_stat->locality[idx]);
> +
> +		this_cpu_inc(tg->numa_stat->jiffies);
> +
> +		tg = tg->parent;
> +	}
> +
> +	rcu_read_unlock();
> +}

Thing is, we already have a cgroup hierarchy walk in the tick; see
task_tick_fair().

On top of that, you're walking an entirely different set of pointers:
instead of cfs_rq, you're walking tg->parent, which pretty much
guarantees you're adding even more cache misses.

How about you stick those numa_stats in cfs_rq (with cacheline
alignment) and see if you can frob your update loop into the cgroup
walk we already do.
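A minimal, completely untested sketch of that direction below;
task_locality_idx() is a made-up helper standing in for the idx
computation from the patch, and the struct layout is only
illustrative:

	/*
	 * In struct cfs_rq: a cfs_rq is already per-CPU per-group,
	 * so plain counters suffice and no alloc_percpu() is needed.
	 */
	struct numa_stat {
		u64	locality[NR_NL_INTERVAL];
		u64	jiffies;
	} numa_stat ____cacheline_aligned;

	static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
	{
		struct cfs_rq *cfs_rq;
		struct sched_entity *se = &curr->se;
		int idx = task_locality_idx(curr);	/* -1 when no fault data */

		for_each_sched_entity(se) {
			cfs_rq = cfs_rq_of(se);
			entity_tick(cfs_rq, se, queued);

			/* reuse the cfs_rq line the tick already touches */
			if (idx != -1)
				cfs_rq->numa_stat.locality[idx]++;
			cfs_rq->numa_stat.jiffies++;
		}

		/* ... rest of task_tick_fair() unchanged ... */
	}

That way the accounting piggybacks on the for_each_sched_entity()
walk that already runs every tick, instead of chasing a second,
cold tg->parent chain.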