Hi!
From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
(and newer) because Intel added some feature where *clean* cachelines
can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
treats this mostly the same as snoop-forwarding of modified cache
lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
rodata section in "perf c2c report".
This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
Facility", table "Table 20-101. Data Source Encoding for Memory
Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
"XCORE FWD. This request was satisfied by a sibling core where either
a modified (cross-core HITM) or a non-modified (cross-core FWD)
cache-line copy was found."
I don't see anything about this in arch/x86/events/intel/ds.c - if I
understand correctly, the kernel's PEBS data source decoding assumes
that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
to be adjusted somehow - and maybe it just isn't possible to actually
distinguish between HitM and cross-core FWD in PEBS events on these
CPUs (without big-hammer chicken bit trickery)? Maybe someone from
Intel can clarify?
(The SDM describes that E-cores on the newer 12th Gen have more
precise PEBS encodings that distinguish between "L3 HITM" and "L3
HITF"; but I guess the P-cores there maybe still don't let you
distinguish HITM/HITF?)
I think https://perfmon-events.intel.com/tigerLake.html is also
outdated, or at least it uses ambiguous grammar: The
MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD event (EventSel=D2H UMask=04H) is
documented as "Counts retired load instructions where a cross-core
snoop hit in another cores caches on this socket, the data was
forwarded back to the requesting core as the data was modified
(SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD)" -
from what I understand, a "cross-core FWD" should be a case where the
L3 does have the data, unless L3 has become non-inclusive on Ice Lake?
On a Tiger Lake CPU, I can see this event trigger for the
sys_call_table, which is located in the rodata region and probably
shouldn't contain Modified cache lines:
# grep -A1 -w sys_call_table /proc/kallsyms
ffffffff82800280 D sys_call_table
ffffffff82801100 d vdso_mapping
# perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c 100 --data
^C[ perf record: Woken up 11 times to write data ]
[ perf record: Captured and wrote 22.851 MB perf.data (43176 samples) ]
# perf script -F event,ip,sym,addr | egrep --color 'ffffffff828002[89abcdef]'
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002d8
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800288
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
ffffffff82526275 do_syscall_64
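The address check done by the egrep above can be sketched as a small helper. This is only an illustration: it assumes that sys_call_table extends from its own address up to the next symbol listed by kallsyms (vdso_mapping), the bounds are hard-coded from this one boot and will differ on other kernels or with KASLR, and the helper names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounds copied from the kallsyms output above; specific to this boot. */
#define SYS_CALL_TABLE_START 0xffffffff82800280ULL
/* Assumption: the table ends where the next listed symbol (vdso_mapping) starts. */
#define SYS_CALL_TABLE_END   0xffffffff82801100ULL

/* Hypothetical helper: does a sampled data address fall inside the table? */
static bool addr_in_sys_call_table(uint64_t addr)
{
	return addr >= SYS_CALL_TABLE_START && addr < SYS_CALL_TABLE_END;
}

/* Hypothetical helper: which 8-byte table slot does the address belong to? */
static uint64_t sys_call_table_index(uint64_t addr)
{
	return (addr - SYS_CALL_TABLE_START) / sizeof(uint64_t);
}
```

Unlike the egrep pattern, which only matches the first few slots, a range check like this would cover the whole table under the stated assumption.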
(For what it's worth, there is a thread on LKML where "cross-core FWD"
got mentioned: <https://lore.kernel.org/lkml/[email protected]/>)
On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <[email protected]> wrote:
>
> Hi!
>
> From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
> (and newer) because Intel added some feature where *clean* cachelines
> can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
> treats this mostly the same as snoop-forwarding of modified cache
> lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
> rodata section in "perf c2c report".
>
> This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
> Facility", table "Table 20-101. Data Source Encoding for Memory
> Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
> "XCORE FWD. This request was satisfied by a sibling core where either
> a modified (cross-core HITM) or a non-modified (cross-core FWD)
> cache-line copy was found."
>
> I don't see anything about this in arch/x86/events/intel/ds.c - if I
> understand correctly, the kernel's PEBS data source decoding assumes
> that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
> to be adjusted somehow - and maybe it just isn't possible to actually
> distinguish between HitM and cross-core FWD in PEBS events on these
> CPUs (without big-hammer chicken bit trickery)? Maybe someone from
> Intel can clarify?
>
> (The SDM describes that E-cores on the newer 12th Gen have more
> precise PEBS encodings that distinguish between "L3 HITM" and "L3
> HITF"; but I guess the P-cores there maybe still don't let you
> distinguish HITM/HITF?)
>
>
> I think https://perfmon-events.intel.com/tigerLake.html is also
> outdated, or at least it uses ambiguous grammar: The
> MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD event (EventSel=D2H UMask=04H) is
> documented as "Counts retired load instructions where a cross-core
> snoop hit in another cores caches on this socket, the data was
> forwarded back to the requesting core as the data was modified
> (SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD)" -
> from what I understand, a "cross-core FWD" should be a case where the
> L3 does have the data, unless L3 has become non-inclusive on Ice Lake?
>
> On a Tiger Lake CPU, I can see this event trigger for the
> sys_call_table, which is located in the rodata region and probably
> shouldn't be containing Modified cache lines:
>
> # grep -A1 -w sys_call_table /proc/kallsyms
> ffffffff82800280 D sys_call_table
> ffffffff82801100 d vdso_mapping
> # perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c 100 --data
> ^C[ perf record: Woken up 11 times to write data ]
> [ perf record: Captured and wrote 22.851 MB perf.data (43176 samples) ]
> # perf script -F event,ip,sym,addr | egrep --color 'ffffffff828002[89abcdef]'
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
> ffffffff82526275 do_syscall_64
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002d8
> ffffffff82526275 do_syscall_64
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
> ffffffff82526275 do_syscall_64
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
> ffffffff82526275 do_syscall_64
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
> ffffffff82526275 do_syscall_64
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
> ffffffff82526275 do_syscall_64
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
> ffffffff82526275 do_syscall_64
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800288
> ffffffff82526275 do_syscall_64
> mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
> ffffffff82526275 do_syscall_64
>
>
> (For what it's worth, there is a thread on LKML where "cross-core FWD"
> got mentioned: <https://lore.kernel.org/lkml/[email protected]/>)
+others better qualified than me to respond.
Hi Jann,
I'm not overly familiar with the issue, but it appears a similar issue
has been reported for Broadwell Xeon here:
https://community.intel.com/t5/Software-Tuning-Performance/Broadwell-Xeon-perf-c2c-showing-remote-HITM-but-remote-socket-is/td-p/1172120
I'm not sure that thread will be particularly useful; looping in the
Intel people, who are better qualified than me to answer, is probably
the most useful thing this email can do.
Thanks,
Ian
Just adding Joe Mario to the CC list.
On Mon, Feb 19, 2024 at 03:20:00PM -0800, Ian Rogers wrote:
> On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <[email protected]> wrote:
> >
> > Hi!
> >
> > From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
> > (and newer) because Intel added some feature where *clean* cachelines
> > can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
> > treats this mostly the same as snoop-forwarding of modified cache
> > lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
> > rodata section in "perf c2c report".
> >
> > This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
> > Facility", table "Table 20-101. Data Source Encoding for Memory
> > Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
> > "XCORE FWD. This request was satisfied by a sibling core where either
> > a modified (cross-core HITM) or a non-modified (cross-core FWD)
> > cache-line copy was found."
> >
> > I don't see anything about this in arch/x86/events/intel/ds.c - if I
> > understand correctly, the kernel's PEBS data source decoding assumes
> > that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
> > to be adjusted somehow - and maybe it just isn't possible to actually
> > distinguish between HitM and cross-core FWD in PEBS events on these
> > CPUs (without big-hammer chicken bit trickery)? Maybe someone from
> > Intel can clarify?
> >
> > (The SDM describes that E-cores on the newer 12th Gen have more
> > precise PEBS encodings that distinguish between "L3 HITM" and "L3
> > HITF"; but I guess the P-cores there maybe still don't let you
> > distinguish HITM/HITF?)
> >
> >
> > I think https://perfmon-events.intel.com/tigerLake.html is also
> > outdated, or at least it uses ambiguous grammar: The
> > MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD event (EventSel=D2H UMask=04H) is
> > documented as "Counts retired load instructions where a cross-core
> > snoop hit in another cores caches on this socket, the data was
> > forwarded back to the requesting core as the data was modified
> > (SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD)" -
> > from what I understand, a "cross-core FWD" should be a case where the
> > L3 does have the data, unless L3 has become non-inclusive on Ice Lake?
> >
> > On a Tiger Lake CPU, I can see this event trigger for the
> > sys_call_table, which is located in the rodata region and probably
> > shouldn't be containing Modified cache lines:
> >
> > # grep -A1 -w sys_call_table /proc/kallsyms
> > ffffffff82800280 D sys_call_table
> > ffffffff82801100 d vdso_mapping
> > # perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c 100 --data
> > ^C[ perf record: Woken up 11 times to write data ]
> > [ perf record: Captured and wrote 22.851 MB perf.data (43176 samples) ]
> > # perf script -F event,ip,sym,addr | egrep --color 'ffffffff828002[89abcdef]'
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
> > ffffffff82526275 do_syscall_64
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002d8
> > ffffffff82526275 do_syscall_64
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
> > ffffffff82526275 do_syscall_64
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
> > ffffffff82526275 do_syscall_64
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
> > ffffffff82526275 do_syscall_64
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
> > ffffffff82526275 do_syscall_64
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
> > ffffffff82526275 do_syscall_64
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800288
> > ffffffff82526275 do_syscall_64
> > mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
> > ffffffff82526275 do_syscall_64
> >
> >
> > (For what it's worth, there is a thread on LKML where "cross-core FWD"
> > got mentioned: <https://lore.kernel.org/lkml/[email protected]/>)
>
> +others better qualified than me to respond.
>
> Hi Jann,
>
> I'm not overly familiar with the issue, but it appears a similar issue
> has been reported for Broadwell Xeon here:
> https://community.intel.com/t5/Software-Tuning-Performance/Broadwell-Xeon-perf-c2c-showing-remote-HITM-but-remote-socket-is/td-p/1172120
> I'm not sure that thread will be particularly useful, but having the
> Intel people better qualified than me to answer is probably the better
> service of this email.
>
> Thanks,
> Ian
On Thu, Feb 22, 2024 at 9:05 PM Liang, Kan <[email protected]> wrote:
>
> Hi Jann,
>
> Sorry for the late response.
>
> On 2024-02-20 10:42 a.m., Arnaldo Carvalho de Melo wrote:
> > Just adding Joe Mario to the CC list.
> >
> > On Mon, Feb 19, 2024 at 03:20:00PM -0800, Ian Rogers wrote:
> >> On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <[email protected]> wrote:
> >>>
> >>> Hi!
> >>>
> >>> From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
> >>> (and newer) because Intel added some feature where *clean* cachelines
> >>> can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
> >>> treats this mostly the same as snoop-forwarding of modified cache
> >>> lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
> >>> rodata section in "perf c2c report".
> >>>
> >>> This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
> >>> Facility", table "Table 20-101. Data Source Encoding for Memory
> >>> Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
> >>> "XCORE FWD. This request was satisfied by a sibling core where either
> >>> a modified (cross-core HITM) or a non-modified (cross-core FWD)
> >>> cache-line copy was found."
> >>>
> >>> I don't see anything about this in arch/x86/events/intel/ds.c - if I
> >>> understand correctly, the kernel's PEBS data source decoding assumes
> >>> that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
> >>> to be adjusted somehow - and maybe it just isn't possible to actually
> >>> distinguish between HitM and cross-core FWD in PEBS events on these
> >>> CPUs (without big-hammer chicken bit trickery)? Maybe someone from
> >>> Intel can clarify?
> >>>
> >>> (The SDM describes that E-cores on the newer 12th Gen have more
> >>> precise PEBS encodings that distinguish between "L3 HITM" and "L3
> >>> HITF"; but I guess the P-cores there maybe still don't let you
> >>> distinguish HITM/HITF?)
>
> Right, there is no way to distinguish HITM/HITF on Tiger Lake.
Aah, okay, thank you very much for the clarification!
> I think what we can do is to add both HITM and HITF for the 0x07 to
> match the SDM description.
>
> How about the below patch (not tested yet)?
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index d49d661ec0a7..8c966b5b23cb 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -84,7 +84,7 @@ static u64 pebs_data_source[] = {
>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, NONE), /* 0x04: L3 hit */
>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, MISS), /* 0x05: L3 hit, snoop miss */
>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HIT), /* 0x06: L3 hit, snoop hit */
> -	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM), /* 0x07: L3 hit, snoop hitm */
> +	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM) | P(SNOOPX, FWD), /* 0x07: L3 hit, snoop hitm & fwd */
>  	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HIT), /* 0x08: L3 miss snoop hit */
>  	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/
>  	OP_LH | P(LVL, LOC_RAM) | LEVEL(RAM) | P(SNOOP, HIT), /* 0x0a: L3 miss, shared */
(I'm not familiar enough with the perf semantics to know how the event
encoding works, maybe someone else can have a look?)
>
>
> >>>
> >>>
> >>> I think https://perfmon-events.intel.com/tigerLake.html is also
> >>> outdated, or at least it uses ambiguous grammar: The
> >>> MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD event (EventSel=D2H UMask=04H) is
> >>> documented as "Counts retired load instructions where a cross-core
> >>> snoop hit in another cores caches on this socket, the data was
> >>> forwarded back to the requesting core as the data was modified
> >>> (SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD)" -
> >>> from what I understand, a "cross-core FWD" should be a case where the
> >>> L3 does have the data, unless L3 has become non-inclusive on Ice Lake?
> >>>
>
> For the event, the BriefDescription in the event list json file gives a
> more accurate description.
> "BriefDescription": "Snoop hit a modified(HITM) or clean line(HIT_W_FWD)
> in another on-pkg core which forwarded the data back due to a retired
> load instruction.",
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/pmu-events/arch/x86/tigerlake/cache.json#n286
Ah, right, that's clearer.
Hi Jann,
Sorry for the late response.
On 2024-02-20 10:42 a.m., Arnaldo Carvalho de Melo wrote:
> Just adding Joe Mario to the CC list.
>
> On Mon, Feb 19, 2024 at 03:20:00PM -0800, Ian Rogers wrote:
>> On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <[email protected]> wrote:
>>>
>>> Hi!
>>>
>>> From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
>>> (and newer) because Intel added some feature where *clean* cachelines
>>> can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
>>> treats this mostly the same as snoop-forwarding of modified cache
>>> lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
>>> rodata section in "perf c2c report".
>>>
>>> This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
>>> Facility", table "Table 20-101. Data Source Encoding for Memory
>>> Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
>>> "XCORE FWD. This request was satisfied by a sibling core where either
>>> a modified (cross-core HITM) or a non-modified (cross-core FWD)
>>> cache-line copy was found."
>>>
>>> I don't see anything about this in arch/x86/events/intel/ds.c - if I
>>> understand correctly, the kernel's PEBS data source decoding assumes
>>> that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
>>> to be adjusted somehow - and maybe it just isn't possible to actually
>>> distinguish between HitM and cross-core FWD in PEBS events on these
>>> CPUs (without big-hammer chicken bit trickery)? Maybe someone from
>>> Intel can clarify?
>>>
>>> (The SDM describes that E-cores on the newer 12th Gen have more
>>> precise PEBS encodings that distinguish between "L3 HITM" and "L3
>>> HITF"; but I guess the P-cores there maybe still don't let you
>>> distinguish HITM/HITF?)
Right, there is no way to distinguish HITM/HITF on Tiger Lake.
I think what we can do is to report both HITM and HITF for encoding 0x07,
to match the SDM description.
How about the below patch (not tested yet)?
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index d49d661ec0a7..8c966b5b23cb 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -84,7 +84,7 @@ static u64 pebs_data_source[] = {
 	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, NONE), /* 0x04: L3 hit */
 	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, MISS), /* 0x05: L3 hit, snoop miss */
 	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HIT), /* 0x06: L3 hit, snoop hit */
-	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM), /* 0x07: L3 hit, snoop hitm */
+	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM) | P(SNOOPX, FWD), /* 0x07: L3 hit, snoop hitm & fwd */
 	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HIT), /* 0x08: L3 miss snoop hit */
 	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/
 	OP_LH | P(LVL, LOC_RAM) | LEVEL(RAM) | P(SNOOP, HIT), /* 0x0a: L3 miss, shared */
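To illustrate what the changed 0x07 entry would look like to a consumer of the sample's data source word, here is a minimal user-space sketch. It assumes the snoop and snoopx fields sit at the bit positions defined in the uapi perf_event.h header; the constants are copied here from memory so that the snippet is self-contained and should be checked against the real header. The helper names are hypothetical and are not part of perf.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed to mirror PERF_MEM_SNOOP_HITM, PERF_MEM_SNOOP_SHIFT,
 * PERF_MEM_SNOOPX_FWD and PERF_MEM_SNOOPX_SHIFT from
 * include/uapi/linux/perf_event.h; verify before relying on them.
 */
#define MEM_SNOOP_HITM   0x10ULL
#define MEM_SNOOP_SHIFT  19
#define MEM_SNOOPX_FWD   0x01ULL
#define MEM_SNOOPX_SHIFT 38

/* True if the snoop field of the data source word has the HITM bit set. */
static bool src_has_hitm(uint64_t data_src)
{
	return ((data_src >> MEM_SNOOP_SHIFT) & MEM_SNOOP_HITM) != 0;
}

/* True if the extended snoop field has the FWD bit set. */
static bool src_has_fwd(uint64_t data_src)
{
	return ((data_src >> MEM_SNOOPX_SHIFT) & MEM_SNOOPX_FWD) != 0;
}

/*
 * With both bits set for 0x07, a tool could treat the sample as
 * "modified or clean forward" rather than as a definite HITM.
 */
static bool src_is_hitm_or_fwd(uint64_t data_src)
{
	return src_has_hitm(data_src) && src_has_fwd(data_src);
}
```

Whether perf c2c should count such samples as HITM, skip them, or report them in a separate column is a separate design question that this sketch does not answer.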
>>>
>>>
>>> I think https://perfmon-events.intel.com/tigerLake.html is also
>>> outdated, or at least it uses ambiguous grammar: The
>>> MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD event (EventSel=D2H UMask=04H) is
>>> documented as "Counts retired load instructions where a cross-core
>>> snoop hit in another cores caches on this socket, the data was
>>> forwarded back to the requesting core as the data was modified
>>> (SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD)" -
>>> from what I understand, a "cross-core FWD" should be a case where the
>>> L3 does have the data, unless L3 has become non-inclusive on Ice Lake?
>>>
For the event, the BriefDescription in the event list json file gives a
more accurate description.
"BriefDescription": "Snoop hit a modified(HITM) or clean line(HIT_W_FWD)
in another on-pkg core which forwarded the data back due to a retired
load instruction.",
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/pmu-events/arch/x86/tigerlake/cache.json#n286
Thanks,
Kan
On 2024-02-22 3:07 p.m., Jann Horn wrote:
> On Thu, Feb 22, 2024 at 9:05 PM Liang, Kan <[email protected]> wrote:
>>
>> Hi Jann,
>>
>> Sorry for the late response.
>>
>> On 2024-02-20 10:42 a.m., Arnaldo Carvalho de Melo wrote:
>>> Just adding Joe Mario to the CC list.
>>>
>>> On Mon, Feb 19, 2024 at 03:20:00PM -0800, Ian Rogers wrote:
>>>> On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <[email protected]> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
>>>>> (and newer) because Intel added some feature where *clean* cachelines
>>>>> can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
>>>>> treats this mostly the same as snoop-forwarding of modified cache
>>>>> lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
>>>>> rodata section in "perf c2c report".
>>>>>
>>>>> This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
>>>>> Facility", table "Table 20-101. Data Source Encoding for Memory
>>>>> Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
>>>>> "XCORE FWD. This request was satisfied by a sibling core where either
>>>>> a modified (cross-core HITM) or a non-modified (cross-core FWD)
>>>>> cache-line copy was found."
>>>>>
>>>>> I don't see anything about this in arch/x86/events/intel/ds.c - if I
>>>>> understand correctly, the kernel's PEBS data source decoding assumes
>>>>> that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
>>>>> to be adjusted somehow - and maybe it just isn't possible to actually
>>>>> distinguish between HitM and cross-core FWD in PEBS events on these
>>>>> CPUs (without big-hammer chicken bit trickery)? Maybe someone from
>>>>> Intel can clarify?
>>>>>
>>>>> (The SDM describes that E-cores on the newer 12th Gen have more
>>>>> precise PEBS encodings that distinguish between "L3 HITM" and "L3
>>>>> HITF"; but I guess the P-cores there maybe still don't let you
>>>>> distinguish HITM/HITF?)
>>
>> Right, there is no way to distinguish HITM/HITF on Tiger Lake.
>
> Aah, okay, thank you very much for the clarification!
>
>> I think what we can do is to add both HITM and HITF for the 0x07 to
>> match the SDM description.
>>
>> How about the below patch (not tested yet)?
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index d49d661ec0a7..8c966b5b23cb 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -84,7 +84,7 @@ static u64 pebs_data_source[] = {
>>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, NONE), /* 0x04: L3 hit */
>>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, MISS), /* 0x05: L3 hit, snoop miss */
>>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HIT), /* 0x06: L3 hit, snoop hit */
>> -	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM), /* 0x07: L3 hit, snoop hitm */
>> +	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM) | P(SNOOPX, FWD), /* 0x07: L3 hit, snoop hitm & fwd */
>>  	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HIT), /* 0x08: L3 miss snoop hit */
>>  	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/
>>  	OP_LH | P(LVL, LOC_RAM) | LEVEL(RAM) | P(SNOOP, HIT), /* 0x0a: L3 miss, shared */
>
> (I'm not familiar enough with the perf semantics to know how the event
> encoding works, maybe someone else can have a look?)
>
I can test the patch to verify the settings and the perf c2c output, but I
don't have a benchmark. Could you please share the workload you used, for
example the one behind the data in your example?
# perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c 100 --data
Thanks,
Kan
On Fri, Feb 23, 2024 at 4:52 PM Liang, Kan <[email protected]> wrote:
> On 2024-02-22 3:07 p.m., Jann Horn wrote:
> > On Thu, Feb 22, 2024 at 9:05 PM Liang, Kan <[email protected]> wrote:
> >>
> >> Hi Jann,
> >>
> >> Sorry for the late response.
> >>
> >> On 2024-02-20 10:42 a.m., Arnaldo Carvalho de Melo wrote:
> >>> Just adding Joe Mario to the CC list.
> >>>
> >>> On Mon, Feb 19, 2024 at 03:20:00PM -0800, Ian Rogers wrote:
> >>>> On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <[email protected]> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
> >>>>> (and newer) because Intel added some feature where *clean* cachelines
> >>>>> can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
> >>>>> treats this mostly the same as snoop-forwarding of modified cache
> >>>>> lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
> >>>>> rodata section in "perf c2c report".
> >>>>>
> >>>>> This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
> >>>>> Facility", table "Table 20-101. Data Source Encoding for Memory
> >>>>> Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
> >>>>> "XCORE FWD. This request was satisfied by a sibling core where either
> >>>>> a modified (cross-core HITM) or a non-modified (cross-core FWD)
> >>>>> cache-line copy was found."
> >>>>>
> >>>>> I don't see anything about this in arch/x86/events/intel/ds.c - if I
> >>>>> understand correctly, the kernel's PEBS data source decoding assumes
> >>>>> that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
> >>>>> to be adjusted somehow - and maybe it just isn't possible to actually
> >>>>> distinguish between HitM and cross-core FWD in PEBS events on these
> >>>>> CPUs (without big-hammer chicken bit trickery)? Maybe someone from
> >>>>> Intel can clarify?
> >>>>>
> >>>>> (The SDM describes that E-cores on the newer 12th Gen have more
> >>>>> precise PEBS encodings that distinguish between "L3 HITM" and "L3
> >>>>> HITF"; but I guess the P-cores there maybe still don't let you
> >>>>> distinguish HITM/HITF?)
> >>
> >> Right, there is no way to distinguish HITM/HITF on Tiger Lake.
> >
> > Aah, okay, thank you very much for the clarification!
> >
> >> I think what we can do is to add both HITM and HITF for the 0x07 to
> >> match the SDM description.
> >>
> >> How about the below patch (not tested yet)?
> >> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> >> index d49d661ec0a7..8c966b5b23cb 100644
> >> --- a/arch/x86/events/intel/ds.c
> >> +++ b/arch/x86/events/intel/ds.c
> >> @@ -84,7 +84,7 @@ static u64 pebs_data_source[] = {
> >>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, NONE), /* 0x04: L3 hit */
> >>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, MISS), /* 0x05: L3 hit, snoop miss */
> >>  	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HIT), /* 0x06: L3 hit, snoop hit */
> >> -	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM), /* 0x07: L3 hit, snoop hitm */
> >> +	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM) | P(SNOOPX, FWD), /* 0x07: L3 hit, snoop hitm & fwd */
> >>  	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HIT), /* 0x08: L3 miss snoop hit */
> >>  	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/
> >>  	OP_LH | P(LVL, LOC_RAM) | LEVEL(RAM) | P(SNOOP, HIT), /* 0x0a: L3 miss, shared */
> >
> > (I'm not familiar enough with the perf semantics to know how the event
> > encoding works, maybe someone else can have a look?)
> >
>
> I can do the test to verify the settings and perf c2c. But I don't have
> a benchmark. Could you please share your benchmark with me?
> For example, the data you used in your example.
> # perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c
> 100 --data
It seems to be happening at a low rate in the background when I'm just
clicking around on websites or such; but it seems like compiling the
kernel with "make -j8" (where 8 is the number of hyperthreads my
Tiger Lake laptop has) causes it to happen at a somewhat higher rate,
a few times per second.
Sorry, I don't really have a particularly good microbenchmark or such
that makes this happen at an abnormally high rate...