2023-09-12 20:08:24

by Luck, Tony

Subject: Re: [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems

On Mon, Sep 11, 2023 at 01:23:35PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 8/29/2023 4:44 PM, Tony Luck wrote:
> > The Sub-NUMA cluster feature on some Intel processors partitions
> > the CPUs that share an L3 cache into two or more sets. This plays
> > havoc with the Resource Director Technology (RDT) monitoring features.
> > Prior to this patch Intel has advised that SNC and RDT are incompatible.
> >
> > Some of these CPUs support an MSR that can partition the RMID
> > counters in the same way. This allows for monitoring features
> > to be used (with the caveat that memory accesses between different
> > SNC NUMA nodes may still not be counted accuratlely).
>
> Same typo as in V4.

Sorry. Will fix and re-post.

> >
> > Note that this patch series improves resctrl reporting considerably
> > on systems with SNC enabled, but there will still be some anomalies
> > for processes accessing memory from other sub-NUMA nodes.
>
> I have the same question as with V4, which was not answered in that email
> thread nor in this new version:
> https://lore.kernel.org/lkml/[email protected]/

Non-SNC systems already have an issue when reporting memory bandwidth
for a task: Linux may migrate the task to a CPU on a different node,
which means that logging for that task will also move to different files
in the mon_data/mon_L3_*/ directory for the new node.
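
To illustrate, here is a minimal sketch (not from this series) of what
a per-domain monitoring reader sees, assuming resctrl is mounted at
/sys/fs/resctrl and two L3 domains exist. After a migration, the
counts for the task start accumulating in the other domain's file:

#include <stdio.h>

int main(void)
{
	const char *files[] = {
		"/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes",
		"/sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes",
	};
	unsigned long long bytes;
	int i;

	for (i = 0; i < 2; i++) {
		FILE *f = fopen(files[i], "r");

		if (!f)
			continue;	/* domain not present on this system */
		if (fscanf(f, "%llu", &bytes) == 1)
			printf("%s: %llu\n", files[i], bytes);
		fclose(f);
	}
	return 0;
}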

With SNC enabled, migration between NUMA nodes on the same socket may happen
much more frequently because:
1) The CPUs on the other NUMA nodes in the socket are in the same Linux
L3 cache domain, so Linux regards the migration as "cheap".
2) The ACPI SLIT table on SNC-enabled systems may also report the
latency for remote access to another NUMA node on the same socket
as significantly lower than the latency for cross-socket access. On
my test system the SLIT distance for same-socket nodes is 0xC,
compared to 0x15 for cross-socket distance. This also makes Linux
more likely to migrate a task to a CPU on another SNC NUMA node
in the same socket. The distances can be checked from sysfs, as in
the sketch after this list.
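
A minimal sketch of reading those distances (not part of this series;
node0 is just an example, and each value in the file is the decimal
distance from node 0 to the node at that index):

#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/node/node0/distance", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* One decimal distance per node, e.g. "10 12 21 21" on the
	 * SNC system described above (0xC = 12, 0x15 = 21). */
	if (fgets(buf, sizeof(buf), f))
		printf("node0 distances: %s", buf);
	fclose(f);
	return 0;
}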

To avoid migration issues, users may use sched_setaffinity(2) to bind
tasks to the subset of CPUs that share an SNC NUMA node.
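
For example, a minimal sketch (not from this series; the CPU range is
hypothetical and would in practice come from
/sys/devices/system/node/nodeN/cpulist):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	/* Hypothetical: CPUs 0-13 make up one SNC NUMA node */
	for (cpu = 0; cpu < 14; cpu++)
		CPU_SET(cpu, &set);

	/* Pin the current task so it cannot migrate off the node */
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	/* ... run the monitored workload here ... */
	return 0;
}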

I can write this up in a new cover letter.

> I stop my review of this series here.

> Reinette

Should I repost the whole series as v6 with the new cover letter? The
only change to the patches so far is the selftest fix for the issue
reported by Shaopeng Tan[1].

-Tony

[1] https://lore.kernel.org/all/TYAPR01MB633033C489AAC0E514CBC6688BEEA@TYAPR01MB6330.jpnprd01.prod.outlook.com/


2023-09-12 22:25:21

by Reinette Chatre

Subject: Re: [PATCH v5 0/8] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 9/12/2023 9:01 AM, Tony Luck wrote:
> On Mon, Sep 11, 2023 at 01:23:35PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 8/29/2023 4:44 PM, Tony Luck wrote:
>>> The Sub-NUMA cluster feature on some Intel processors partitions
>>> the CPUs that share an L3 cache into two or more sets. This plays
>>> havoc with the Resource Director Technology (RDT) monitoring features.
>>> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>>>
>>> Some of these CPUs support an MSR that can partition the RMID
>>> counters in the same way. This allows for monitoring features
>>> to be used (with the caveat that memory accesses between different
>>> SNC NUMA nodes may still not be counted accuratlely).
>>
>> Same typo as in V4.
>
> Sorry. Will fix and re-post.
>
>>>
>>> Note that this patch series improves resctrl reporting considerably
>>> on systems with SNC enabled, but there will still be some anomalies
>>> for processes accessing memory from other sub-NUMA nodes.
>>
>> I have the same question as with V4, which was not answered in that email
>> thread nor in this new version:
>> https://lore.kernel.org/lkml/[email protected]/
>
> Non-SNC systems already have an issue when reporting memory bandwidth
> for a task: Linux may migrate the task to a CPU on a different node,
> which means that logging for that task will also move to different files
> in the mon_data/mon_L3_*/ directory for the new node.

It is not obvious to me that this is an issue. From what I understand,
the data remains accurate.

How does this map to the earlier "may still not be counted
accurately"?

>
> With SNC enabled, migration between NUMA nodes on the same socket may happen
> much more frequently because:
> 1) The CPUs on the other NUMA nodes in the socket are in the same Linux
> L3 cache domain, so Linux regards the migration as "cheap".
> 2) The ACPI SLIT table on SNC-enabled systems may also report the
> latency for remote access to another NUMA node on the same socket
> as significantly lower than the latency for cross-socket access. On
> my test system the SLIT distance for same-socket nodes is 0xC,
> compared to 0x15 for cross-socket distance. This also makes Linux
> more likely to migrate a task to a CPU on another SNC NUMA node
> in the same socket.
>
> To avoid migration issues, users may use sched_setaffinity(2) to bind
> tasks to the subset of CPUs that share an SNC NUMA node.
>
> I can write this up in a new cover letter.
>
>> I stop my review of this series here.
>
>> Reinette
>
> Should I repost the whole series as v6 with the new cover letter? The
> only change to the patches so far is the selftest fix for the issue
> reported by Shaopeng Tan[1].
>

Is this an assurance that the cover letter in no way reflects how
feedback was addressed in the rest of this series?

Reinette