Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Sat, 26 Mar 2022 10:52:01 -0300
From:   Arnaldo Carvalho de Melo <acme@kernel.org>
To:     Leo Yan <leo.yan@linaro.org>
Cc:     Ali Saidi <alisaidi@amazon.com>, linux-kernel@vger.kernel.org,
        linux-perf-users@vger.kernel.org,
        linux-arm-kernel@lists.infradead.org, german.gomez@arm.com,
        benh@kernel.crashing.org, Nick.Forrington@arm.com,
        alexander.shishkin@linux.intel.com, andrew.kilroy@arm.com,
        james.clark@arm.com, john.garry@huawei.com, jolsa@kernel.org,
        kjain@linux.ibm.com, lihuafei1@huawei.com, mark.rutland@arm.com,
        mathieu.poirier@linaro.org, mingo@redhat.com, namhyung@kernel.org,
        peterz@infradead.org, will@kernel.org
Subject: Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse
 cores
Message-ID: <Yj8age/PSIouUiKy@kernel.org>
References: <20220324183323.31414-1-alisaidi@amazon.com>
 <20220324183323.31414-3-alisaidi@amazon.com>
 <20220326134754.GD20556@leoy-ThinkPad-X240s>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20220326134754.GD20556@leoy-ThinkPad-X240s>
Precedence: bulk

Em Sat, Mar 26, 2022 at 09:47:54PM +0800, Leo Yan escreveu:
> Hi Ali, German,
> 
> On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:
> 
> [...]
> 
> > +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> > +						union perf_mem_data_src *data_src)
> >  {
> > -	union perf_mem_data_src	data_src = { 0 };
> > +	/*
> > +	 * Even though four levels of cache hierarchy are possible, no known
> > +	 * production Neoverse systems currently include more than three levels
> > +	 * so for the time being we assume three exist. If a production system
> > +	 * is built with four the this function would have to be changed to
> > +	 * detect the number of levels for reporting.
> > +	 */
> >  
> > -	if (record->op == ARM_SPE_LD)
> > -		data_src.mem_op = PERF_MEM_OP_LOAD;
> > -	else
> > -		data_src.mem_op = PERF_MEM_OP_STORE;
> 
> Firstly, apologize that I didn't give clear idea when Ali sent patch sets
> v2 and v3.

Ok, removing this as well.

Thanks for reviewing.

- Arnaldo
 
> IMHO, we need to consider two kinds of information which can guide us
> for a reliable implementation.  The first thing is to summarize the data
> source configuration for x86 PEBS, we can dive in more details for this
> part; the second thing is we can refer to the AMBA architecture document
> ARM IHI 0050E.b, section 11.1.2 'Crossing a chip-to-chip interface' and
> its sub section 'Suggested DataSource values', which would help us
> much for mapping the cache topology to Arm SPE data source.
> 
> As a result, I summarized the data source configurations for PEBS and
> Arm SPE Neoverse in the spreadsheet:
> https://docs.google.com/spreadsheets/d/11YmjG0TyRjH7IXgvsREFgTg3AVtxh2dvLloRK1EdNjU/edit?usp=sharing
> 
> Please see below comments.
> 
> > +	switch (record->source) {
> > +	case ARM_SPE_NV_L1D:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > +		break;
> 
> I think we need to set the field 'mem_snoop' for L1 cache hit:
> 
>         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> 
> For L1 cache hit, it doesn't involve snooping.
> 
> > +	case ARM_SPE_NV_L2:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > +		break;
> 
> Ditto:
> 
>         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> 
> > +	case ARM_SPE_NV_PEER_CORE:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> Peer core contains its local L1 cache, so I think we can set the
> memory level L1 to indicate this case.
> 
> For this data source type and below types, though they indicate
> the snooping happens, but it doesn't mean the data in the cache line
> is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> think this will mislead users when report the result.
> 
> I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> 
> > +		break;
> > +	/*
> > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > +	 * transfer, so set SNOOP_HITM
> > +	 */
> > +	case ARM_SPE_NV_LCL_CLSTR:
> 
> For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> the cluster level, it should happen in L2 cache:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> 
> > +	case ARM_SPE_NV_PEER_CLSTR:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > +		break;
> 
> This type can snoop from L1 or L2 cache in the peer cluster, so it
> makes sense to set cache level as PERF_MEM_LVLNUM_ANY_CACHE.  But here
> should use the snoop type PERF_MEM_SNOOP_HIT, so:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> > +	/*
> > +	 * System cache is assumed to be L3
> > +	 */
> > +	case ARM_SPE_NV_SYS_CACHE:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> > +		break;
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L3;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> 
> > +	/*
> > +	 * We don't know what level it hit in, except it came from the other
> > +	 * socket
> > +	 */
> > +	case ARM_SPE_NV_REMOTE:
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> > +		break;
> 
> The type ARM_SPE_NV_REMOTE is a snooping operation and it can happen
> in any cache levels in remote chip:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> > +	case ARM_SPE_NV_DRAM:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> > +		break;
> 
> We can set snoop as PERF_MEM_SNOOP_MISS for DRAM data source:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT;
>         data_src->mem_snoop = PERF_MEM_SNOOP_MISS;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> 
> The rest of this patch looks good to me.
> 
> Thanks,
> Leo
> 
> > +	default:
> > +		break;
> > +	}
> > +}
> >  
> > +static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
> > +						union perf_mem_data_src *data_src)
> > +{
> >  	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
> > -		data_src.mem_lvl = PERF_MEM_LVL_L3;
> > +		data_src->mem_lvl = PERF_MEM_LVL_L3;
> >  
> >  		if (record->type & ARM_SPE_LLC_MISS)
> > -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> > +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
> >  		else
> > -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> > +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
> >  	} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
> > -		data_src.mem_lvl = PERF_MEM_LVL_L1;
> > +		data_src->mem_lvl = PERF_MEM_LVL_L1;
> >  
> >  		if (record->type & ARM_SPE_L1D_MISS)
> > -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> > +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
> >  		else
> > -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> > +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
> >  	}
> >  
> >  	if (record->type & ARM_SPE_REMOTE_ACCESS)
> > -		data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> > +		data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> > +}
> > +
> > +static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
> > +{
> > +	union perf_mem_data_src	data_src = { 0 };
> > +	bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
> > +
> > +	if (record->op & ARM_SPE_LD)
> > +		data_src.mem_op = PERF_MEM_OP_LOAD;
> > +	else
> > +		data_src.mem_op = PERF_MEM_OP_STORE;
> > +
> > +	if (is_neoverse)
> > +		arm_spe__synth_data_source_neoverse(record, &data_src);
> > +	else
> > +		arm_spe__synth_data_source_generic(record, &data_src);
> >  
> >  	if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
> >  		data_src.mem_dtlb = PERF_MEM_TLB_WK;
> > @@ -446,7 +525,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
> >  	u64 data_src;
> >  	int err;
> >  
> > -	data_src = arm_spe__synth_data_source(record);
> > +	data_src = arm_spe__synth_data_source(record, spe->midr);
> >  
> >  	if (spe->sample_flc) {
> >  		if (record->type & ARM_SPE_L1D_MISS) {
> > @@ -1183,6 +1262,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
> >  	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
> >  	size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
> >  	struct perf_record_time_conv *tc = &session->time_conv;
> > +	const char *cpuid = perf_env__cpuid(session->evlist->env);
> > +	u64 midr = strtol(cpuid, NULL, 16);
> >  	struct arm_spe *spe;
> >  	int err;
> >  
> > @@ -1202,6 +1283,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
> >  	spe->machine = &session->machines.host; /* No kvm support */
> >  	spe->auxtrace_type = auxtrace_info->type;
> >  	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
> > +	spe->midr = midr;
> >  
> >  	spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
> >  
> > -- 
> > 2.32.0
> > 

-- 

- Arnaldo