Date: Mon, 28 Oct 2019 14:02:22 +0100
From: Peter Zijlstra
To: 王贇
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/numa: advanced per-cgroup numa statistic
Message-ID: <20191028130222.GM4131@hirez.programming.kicks-ass.net>
In-Reply-To: <46b0fd25-7b73-aa80-372a-9fcd025154cb@linux.alibaba.com>

On Thu, Oct 24, 2019 at 11:08:01AM +0800, 王贇 wrote:
> Currently there is no good approach to monitoring per-cgroup NUMA
> efficiency. This is a problem especially when groups are sharing
> CPUs: it's impossible to tell which group caused the remote memory
> accesses just by reading hardware counters, since multiple workloads
> can share the same CPU, which makes it painful to find the root
> cause and fix the issue.
>
> To address this, we introduce new per-cgroup NUMA statistics:
>
>   * the NUMA locality, to reflect NUMA balancing efficiency
>   * the NUMA execution time on each node
>
> Task locality is the local page-access ratio traced on NUMA-balancing
> page faults; group locality is the distribution of task execution
> time, sectioned by locality into 8 regions.
>
> For example, the new entry 'cpu.numa_stat' shows:
>
>   locality 15393 21259 13023 44461 21247 17012 28496 145402
>   exectime 311900 407166
>
> Here we can see the workloads executed for 311900ms on node_0 and
> 407166ms on node_1; tasks with locality around 0~12% executed for
> 15393ms, while tasks with locality around 88~100% executed for
> 145402ms, which implies that most of this group's memory accesses
> are local.
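[ For reference, the region index in the patch quoted further below
  works out as follows: with 8 regions, a task whose recent NUMA
  faults were, say, 90% local is accounted as

	idx = NR_NL_INTERVAL * local / (remote + local + 1)
	    = 8 * 90 / (10 + 90 + 1) = 7

  i.e. the 88~100% bucket, while a task with only 10% local faults
  lands in region 0. Illustrative numbers only, not from the patch. ]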
> By monitoring these new statistics, we can tell the NUMA efficiency
> of each cgroup's workloads on the machine, whether they share CPUs
> or not, and find out which one introduces most of the remote
> accesses.
>
> Besides, the per-node memory topology from 'memory.numa_stat'
> becomes more useful once we have per-node execution time: a workload
> that always executes on node_0 while its memory is all on node_1 is
> usually a bad case.
>
> Cc: Peter Zijlstra
> Cc: Michal Koutný
> Signed-off-by: Michael Wang

Mel, can you have a peek at this too?

So this is the part I like least:

> +DEFINE_PER_CPU(struct numa_stat, root_numa_stat);
> +
> +int alloc_tg_numa_stat(struct task_group *tg)
> +{
> +	tg->numa_stat = alloc_percpu(struct numa_stat);
> +	if (!tg->numa_stat)
> +		return 0;
> +
> +	return 1;
> +}
> +
> +void free_tg_numa_stat(struct task_group *tg)
> +{
> +	free_percpu(tg->numa_stat);
> +}
> +
> +static void update_tg_numa_stat(struct task_struct *p)
> +{
> +	struct task_group *tg;
> +	unsigned long remote = p->numa_faults_locality[3];
> +	unsigned long local = p->numa_faults_locality[4];
> +	int idx = -1;
> +
> +	/* To be scaled? */
> +	if (remote || local)
> +		idx = NR_NL_INTERVAL * local / (remote + local + 1);
> +
> +	rcu_read_lock();
> +
> +	tg = task_group(p);
> +	while (tg) {
> +		/* skip accounting when there are no fault records */
> +		if (idx != -1)
> +			this_cpu_inc(tg->numa_stat->locality[idx]);
> +
> +		this_cpu_inc(tg->numa_stat->jiffies);
> +
> +		tg = tg->parent;
> +	}
> +
> +	rcu_read_unlock();
> +}

Thing is, we already have a cgroup hierarchy walk in the tick; see
task_tick_fair().

On top of that, you're walking an entirely different set of pointers:
instead of cfs_rq, you're walking tg->parent, which pretty much
guarantees you're adding even more cache misses.

How about you stick those numa_stats in cfs_rq (with cacheline
alignment) and see if you can frob your update loop into the cgroup
walk we already do.
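A minimal, completely untested sketch of that direction below;
task_locality_idx() is a made-up helper standing in for the idx
computation from the patch, and the struct layout is only
illustrative:

	/*
	 * In struct cfs_rq: a cfs_rq is already per-CPU per-group,
	 * so plain counters suffice and no alloc_percpu() is needed.
	 */
	struct numa_stat {
		u64	locality[NR_NL_INTERVAL];
		u64	jiffies;
	} numa_stat ____cacheline_aligned;

	static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
	{
		struct cfs_rq *cfs_rq;
		struct sched_entity *se = &curr->se;
		int idx = task_locality_idx(curr);	/* -1 when no fault data */

		for_each_sched_entity(se) {
			cfs_rq = cfs_rq_of(se);
			entity_tick(cfs_rq, se, queued);

			/* reuse the cfs_rq line the tick already touches */
			if (idx != -1)
				cfs_rq->numa_stat.locality[idx]++;
			cfs_rq->numa_stat.jiffies++;
		}

		/* ... rest of task_tick_fair() unchanged ... */
	}

That way the accounting piggybacks on the for_each_sched_entity()
walk that already runs every tick, instead of chasing a second,
cold tg->parent chain.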