Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Sat, 26 Mar 2022 21:47:54 +0800
From:   Leo Yan <leo.yan@linaro.org>
To:     Ali Saidi <alisaidi@amazon.com>
Cc:     linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
        linux-arm-kernel@lists.infradead.org, german.gomez@arm.com,
        acme@kernel.org, benh@kernel.crashing.org, Nick.Forrington@arm.com,
        alexander.shishkin@linux.intel.com, andrew.kilroy@arm.com,
        james.clark@arm.com, john.garry@huawei.com, jolsa@kernel.org,
        kjain@linux.ibm.com, lihuafei1@huawei.com, mark.rutland@arm.com,
        mathieu.poirier@linaro.org, mingo@redhat.com, namhyung@kernel.org,
        peterz@infradead.org, will@kernel.org
Subject: Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse
 cores
Message-ID: <20220326134754.GD20556@leoy-ThinkPad-X240s>
References: <20220324183323.31414-1-alisaidi@amazon.com>
 <20220324183323.31414-3-alisaidi@amazon.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20220324183323.31414-3-alisaidi@amazon.com>
Precedence: bulk

Hi Ali, German,

On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:

[...]

> +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> +						union perf_mem_data_src *data_src)
>  {
> -	union perf_mem_data_src	data_src = { 0 };
> +	/*
> +	 * Even though four levels of cache hierarchy are possible, no known
> +	 * production Neoverse systems currently include more than three levels
> +	 * so for the time being we assume three exist. If a production system
> +	 * is built with four the this function would have to be changed to
> +	 * detect the number of levels for reporting.
> +	 */
>  
> -	if (record->op == ARM_SPE_LD)
> -		data_src.mem_op = PERF_MEM_OP_LOAD;
> -	else
> -		data_src.mem_op = PERF_MEM_OP_STORE;

Firstly, apologize that I didn't give clear idea when Ali sent patch sets
v2 and v3.

IMHO, we need to consider two kinds of information which can guide us
for a reliable implementation.  The first thing is to summarize the data
source configuration for x86 PEBS, we can dive in more details for this
part; the second thing is we can refer to the AMBA architecture document
ARM IHI 0050E.b, section 11.1.2 'Crossing a chip-to-chip interface' and
its sub section 'Suggested DataSource values', which would help us
much for mapping the cache topology to Arm SPE data source.

As a result, I summarized the data source configurations for PEBS and
Arm SPE Neoverse in the spreadsheet:
https://docs.google.com/spreadsheets/d/11YmjG0TyRjH7IXgvsREFgTg3AVtxh2dvLloRK1EdNjU/edit?usp=sharing

Please see below comments.

> +	switch (record->source) {
> +	case ARM_SPE_NV_L1D:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> +		break;

I think we need to set the field 'mem_snoop' for L1 cache hit:

        data_src->mem_snoop = PERF_MEM_SNOOP_NONE;

For L1 cache hit, it doesn't involve snooping.

> +	case ARM_SPE_NV_L2:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> +		break;

Ditto:

        data_src->mem_snoop = PERF_MEM_SNOOP_NONE;

> +	case ARM_SPE_NV_PEER_CORE:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;

Peer core contains its local L1 cache, so I think we can set the
memory level L1 to indicate this case.

For this data source type and below types, though they indicate
the snooping happens, but it doesn't mean the data in the cache line
is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
think this will mislead users when report the result.

I prefer we set below fields for ARM_SPE_NV_PEER_CORE:

        data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;

> +		break;
> +	/*
> +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> +	 * transfer, so set SNOOP_HITM
> +	 */
> +	case ARM_SPE_NV_LCL_CLSTR:

For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
the cluster level, it should happen in L2 cache:

        data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;

> +	case ARM_SPE_NV_PEER_CLSTR:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> +		break;

This type can snoop from L1 or L2 cache in the peer cluster, so it
makes sense to set cache level as PERF_MEM_LVLNUM_ANY_CACHE.  But here
should use the snoop type PERF_MEM_SNOOP_HIT, so:

        data_src->mem_lvl = PERF_MEM_LVL_HIT
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;

> +	/*
> +	 * System cache is assumed to be L3
> +	 */
> +	case ARM_SPE_NV_SYS_CACHE:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> +		break;

        data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L3;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;

> +	/*
> +	 * We don't know what level it hit in, except it came from the other
> +	 * socket
> +	 */
> +	case ARM_SPE_NV_REMOTE:
> +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> +		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> +		break;

The type ARM_SPE_NV_REMOTE is a snooping operation and it can happen
in any cache levels in remote chip:

        data_src->mem_lvl = PERF_MEM_LVL_HIT;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;

> +	case ARM_SPE_NV_DRAM:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> +		break;

We can set snoop as PERF_MEM_SNOOP_MISS for DRAM data source:

        data_src->mem_lvl = PERF_MEM_LVL_HIT;
        data_src->mem_snoop = PERF_MEM_SNOOP_MISS;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;

The rest of this patch looks good to me.

Thanks,
Leo

> +	default:
> +		break;
> +	}
> +}
>  
> +static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
> +						union perf_mem_data_src *data_src)
> +{
>  	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
> -		data_src.mem_lvl = PERF_MEM_LVL_L3;
> +		data_src->mem_lvl = PERF_MEM_LVL_L3;
>  
>  		if (record->type & ARM_SPE_LLC_MISS)
> -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
>  		else
> -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
>  	} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
> -		data_src.mem_lvl = PERF_MEM_LVL_L1;
> +		data_src->mem_lvl = PERF_MEM_LVL_L1;
>  
>  		if (record->type & ARM_SPE_L1D_MISS)
> -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
>  		else
> -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
>  	}
>  
>  	if (record->type & ARM_SPE_REMOTE_ACCESS)
> -		data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> +		data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> +}
> +
> +static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
> +{
> +	union perf_mem_data_src	data_src = { 0 };
> +	bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
> +
> +	if (record->op & ARM_SPE_LD)
> +		data_src.mem_op = PERF_MEM_OP_LOAD;
> +	else
> +		data_src.mem_op = PERF_MEM_OP_STORE;
> +
> +	if (is_neoverse)
> +		arm_spe__synth_data_source_neoverse(record, &data_src);
> +	else
> +		arm_spe__synth_data_source_generic(record, &data_src);
>  
>  	if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
>  		data_src.mem_dtlb = PERF_MEM_TLB_WK;
> @@ -446,7 +525,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
>  	u64 data_src;
>  	int err;
>  
> -	data_src = arm_spe__synth_data_source(record);
> +	data_src = arm_spe__synth_data_source(record, spe->midr);
>  
>  	if (spe->sample_flc) {
>  		if (record->type & ARM_SPE_L1D_MISS) {
> @@ -1183,6 +1262,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
>  	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
>  	size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
>  	struct perf_record_time_conv *tc = &session->time_conv;
> +	const char *cpuid = perf_env__cpuid(session->evlist->env);
> +	u64 midr = strtol(cpuid, NULL, 16);
>  	struct arm_spe *spe;
>  	int err;
>  
> @@ -1202,6 +1283,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
>  	spe->machine = &session->machines.host; /* No kvm support */
>  	spe->auxtrace_type = auxtrace_info->type;
>  	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
> +	spe->midr = midr;
>  
>  	spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
>  
> -- 
> 2.32.0
>