Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) client-ip=2620:137:e000::1:18;
Date:   Mon, 28 Mar 2022 21:05:47 +0800
From:   Leo Yan <leo.yan@linaro.org>
To:     Ali Saidi <alisaidi@amazon.com>
Cc:     Nick.Forrington@arm.com, acme@kernel.org,
        alexander.shishkin@linux.intel.com, andrew.kilroy@arm.com,
        benh@kernel.crashing.org, german.gomez@arm.com,
        james.clark@arm.com, john.garry@huawei.com, jolsa@kernel.org,
        kjain@linux.ibm.com, lihuafei1@huawei.com,
        linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
        linux-perf-users@vger.kernel.org, mark.rutland@arm.com,
        mathieu.poirier@linaro.org, mingo@redhat.com, namhyung@kernel.org,
        peterz@infradead.org, will@kernel.org
Subject: Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse
 cores
Message-ID: <20220328130547.GA360814@leoy-ThinkPad-X240s>
References: <20220326194327.32239-1-alisaidi@amazon.com>
 <20220328030805.14175-1-alisaidi@amazon.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20220328030805.14175-1-alisaidi@amazon.com>
Precedence: bulk

Hi Ali,

On Mon, Mar 28, 2022 at 03:08:05AM +0000, Ali Saidi wrote:

[...]

> > > > > +	case ARM_SPE_NV_PEER_CORE:
> > > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > > 
> > > > Peer core contains its local L1 cache, so I think we can set the
> > > > memory level L1 to indicate this case.
> > > It could be either the L1 or the L2. All the neoverse cores have private L2
> > > caches and we don't know. 
> > 
> > How about set both L1 and L2 cache together for this case?
> > 
> > Although 'L1 | L2' cannot tell the exact cache level, I think it's
> > better than use ANY_CACHE, at least this can help us to distinguish
> > from other data source types if we avoid to use ANY_CACHE for all of
> > them.
> 
> This seems much more confusing then the ambiguity of where the line came from
> and is only possible with the deprecated mem_lvl enconding.  It will make
> perf_mem__lvl_scnprintf() print the wrong thing and anyone who is trying to
> attribute a line to a single cache will find that the sum of the number of hits
> is greater than the number of accesses which also seems terribly confusing.

Understand.  I considered the potential issue for
perf_mem__lvl_scnprintf(), actually it supports printing multipl cache
levels for 'mem_lvl', by conjuncting with 'or' it composes the multiple
cache levels.  We might need to extend a bit for another field
'mem_lvlnum'.

Agreed that there would have inaccurate issue for statistics, it's fine
for me to use ANY_CACHE in this patch set.

I still think we should consider to extend the memory levels to
demonstrate clear momory hierarchy on Arm archs, I personally like the
definitions for "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE",
though these cache levels are not precise like L1/L2/L3 levels, they can
help us to map very well for the cache topology on Arm archs and without
any confusion.  We could take this as an enhancement if you don't want
to bother the current patch set's upstreaming.

> > > > For this data source type and below types, though they indicate
> > > > the snooping happens, but it doesn't mean the data in the cache line
> > > > is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> > > > think this will mislead users when report the result.
> > > 
> > > I'm of the opposite opinion. If the data wasn't modified, it will likely be
> > > found in the lower-level shared cache and the transaction wouldn't require a
> > > cache-to-cache transfer of the modified data, so the most common case when we
> > > source a line out of another cores cache will be if it was "modifiable" in that
> > > cache. 
> > 
> > Let's still use MOSI protocol as example.  I think there have two
> > cases: on case is the peer cache line is in 'Shared' state and another
> > case is the peer cache line is in 'Owned' state.
> > 
> > Quotes the wiki page for these two cases:
> > 
> > "When the cache block is in the Shared (S) state and there is a
> > snooped bus read (BusRd) transaction, then the block stays in the same
> > state and generates no more transactions as all the cache blocks have
> > the same value including the main memory and it is only being read,
> > not written into."
> > 
> > "While in the Owner (O) state and there is a snooped read request
> > (BusRd), the block remains in the same state while flushing (Flush)
> > the data for the other processor to read from it."
> > 
> > Seems to me, it's reasonable to set HTIM flag when the snooping happens
> > for the cache line line is in the Modified (M) state.
> > 
> > Again, my comment is based on the literal understanding; so please
> > correct if have any misunderstanding at here.
> 
> Per the CMN TRM, "The SLC allocation policy is exclusive for data lines, except
> where sharing patterns are detected," so if a line is shared among caches it
> will likely also be in the SLC (and thus we'd get an L3 hit). 
> 
> If there is one copy of the cache line and that cache line needs to transition
> from one core to another core, I don't believe it matters if it was truly
> modified or not because the core already had permission to modify it and the
> transaction is just as costly (ping ponging between core caches). This is the
> one thing I really want to be able to uniquely identify as any cacheline doing
> this that isn't a lock is a place where optimization is likely possible. 

I understood that your point that if big amount of transitions within
multiple cores hit the same cache line, it would be very likely caused
by the cache line's Modified state so we set PERF_MEM_SNOOP_HITM flag
for easier reviewing.

Alternatively, I think it's good to pick up the patch series "perf c2c:
Sort cacheline with all loads" [1], rather than relying on HITM tag, the
patch series extends a new option "-d all" for perf c2c, so it displays
the suspecious false sharing cache lines based on load/store ops and
thread infos.  The main reason for holding on th patch set is due to we
cannot verify it with Arm SPE at that time point, as the time being Arm
SPE trace data was absent both store ops and data source packets.

I perfer to set PERF_MEM_SNOOP_HIT flag in this patch set and we can
upstream the patch series "perf c2c: Sort cacheline with all loads"
(only needs upstreaming patches 01, 02, 03, 10, 11, the rest patches
have been merged in the mainline kernel).

If this is fine for you, I can respin the patch series for "perf c2c".
Or any other thoughts?

[1] https://lore.kernel.org/lkml/20201213133850.10070-1-leo.yan@linaro.org/

> > > > I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> > > > 
> > > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
> > > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > > > 
> > > > > +		break;
> > > > > +	/*
> > > > > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > > > > +	 * transfer, so set SNOOP_HITM
> > > > > +	 */
> > > > > +	case ARM_SPE_NV_LCL_CLSTR:
> > > > 
> > > > For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> > > > the cluster level, it should happen in L2 cache:
> > > > 
> > > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
> > > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > > 
> > > We don't know if this is coming from the cluster cache, or the private L1 or L2
> > > core caches. The description above about why we'll be transferring the line from
> > > cache-to-cache applies here too. 
> > 
> > I think a minor difference between PEER_CORE and LCL_CLSTR is:
> > if data source is PEER_CORE, the snooping happens on the peer core's
> > local cache (Core's L1 or Core's L2 when core contains L2); for the data
> > source LCL_CLSTR, the snooping occurs on the cluster level's cache
> > (cluster's L2 cache or cluster's L3 cache when cluster contains L3).
> > 
> > So can we set both 'L2 | L3' for LCL_CLSTR case?
> 
> Just as above, I believe setting two options will lead to more confusion.
> Additionally, we agreed in the previous discussion that the system cache shall
> be the L3, so i don't see how this helps alleviate any confusion. That said, the
> systems I'm most familiar with never set LCL_CLSTR as a source. 

Okay, let's rollback to PERF_MEM_LVLNUM_ANY_CACHE as cache level for
LCL_CLSTR and PEER_CLSTR.

Thanks,
Leo