by Luck, Tony

Subject: Re: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

On Tue, Jul 11, 2023 at 01:50:02PM -0700, Reinette Chatre wrote:
> Hi Tony,
> > This is expected. When SNC is enabled, CAT still supports the same number of
> > bits in the allocation cache mask. But each bit represents half as much cache.
> >
> > Think of the cache as a 2-D matrix with the cache-ways (bits in the CAT mask)
> > as the columns, and the rows are the hashed index of the physical address.
> > When SNC is turned on the hash function for physical addresses from one
> > of the SNC number nodes will only pick half of those rows (and the other
> > SNC node gets the other half of the rows).
>
> If a test is expected to fail in a particular scenario then I think
> the test failure should be communicated as a "pass". If not this will
> reduce confidence in accuracy of tests. Even so, from the description
> it sounds as though this test can be made more accurate to indeed pass
> in the scenario when SNC is enabled?

Hi Reinette,

Yes. This could be done. The resctrl tests would need to determine
whether SNC mode is enabled, but I think that is possible by comparing
the output of sysfs files. E.g. with SNC disabled, the list of CPUs for a
node and the list of CPUs sharing a cache with a CPU on that node will
match like this:

$ cat /sys/devices/system/node/node0/cpulist
0-35,72-107
$ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
0-35,72-107

but with SNC enabled, the CPUs sharing a cache will be divided across
two or four nodes.
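
A minimal sketch of that comparison, assuming the ratio of CPUs sharing
the cache to CPUs on the node gives the number of SNC nodes per cache.
The helper names count_cpus and snc_ratio are hypothetical and are not
part of the existing selftests:

```shell
#!/bin/sh
# count_cpus LIST: print the number of CPUs in a sysfs-style list such
# as "0-35,72-107". The function body runs in a subshell so that the
# IFS change does not leak to the caller.
count_cpus() (
	IFS=,
	total=0
	for range in $1; do
		case $range in
		*-*)
			lo=${range%-*}
			hi=${range#*-}
			total=$((total + hi - lo + 1))
			;;
		*)
			total=$((total + 1))
			;;
		esac
	done
	echo "$total"
)

# snc_ratio NODE_LIST CACHE_LIST: print how many node-sized groups of
# CPUs share the cache. Assumption: 1 means SNC is disabled, 2 or 4
# means SNC is enabled with that many nodes per cache.
snc_ratio() {
	node_cpus=$(count_cpus "$1")
	cache_cpus=$(count_cpus "$2")
	[ "$node_cpus" -gt 0 ] || return 1
	echo $((cache_cpus / node_cpus))
}

# Possible use on a live system (paths as in the example above):
#   snc_ratio "$(cat /sys/devices/system/node/node0/cpulist)" \
#       "$(cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list)"
```

With the SNC-disabled lists shown above, both arguments are identical
and the ratio is 1.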

It looks like the existing tests may print a warning. I see
this code in:

tools/testing/selftests/resctrl/resctrl_tests.c

	res = cmt_resctrl_val(cpu_no, 5, benchmark_cmd);
	ksft_test_result(!res, "CMT: test\n");
	if ((get_vendor() == ARCH_INTEL) && res)
		ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");

but at first glance that warning doesn't appear to try and
check if SNC was the actual problem.

-Tony

2023-07-11 22:22:25

by Reinette Chatre

Subject: Re: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 7/11/2023 2:23 PM, Tony Luck wrote:
> On Tue, Jul 11, 2023 at 01:50:02PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>> This is expected. When SNC is enabled, CAT still supports the same number of
>>> bits in the allocation cache mask. But each bit represents half as much cache.
>>>
>>> Think of the cache as a 2-D matrix with the cache-ways (bits in the CAT mask)
>>> as the columns, and the rows are the hashed index of the physical address.
>>> When SNC is turned on the hash function for physical addresses from one
>>> of the SNC number nodes will only pick half of those rows (and the other
>>> SNC node gets the other half of the rows).
>>
>> If a test is expected to fail in a particular scenario then I think
>> the test failure should be communicated as a "pass". If not this will
>> reduce confidence in accuracy of tests. Even so, from the description
>> it sounds as though this test can be made more accurate to indeed pass
>> in the scenario when SNC is enabled?
>
> Hi Reinette,
>
> Yes. This could be done. The resctrl tests would need to determine
> whether SNC mode is enabled, but I think that is possible by comparing
> the output of sysfs files. E.g. with SNC disabled, the list of CPUs for a
> node and the list of CPUs sharing a cache with a CPU on that node will
> match like this:
>
> $ cat /sys/devices/system/node/node0/cpulist
> 0-35,72-107
> $ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
> 0-35,72-107
>
> but with SNC enabled, the CPUs sharing a cache will be divided across
> two or four nodes.
>
> It looks like the existing tests may print a warning. I see
> this code in:
>
> tools/testing/selftests/resctrl/resctrl_tests.c
>
> 	res = cmt_resctrl_val(cpu_no, 5, benchmark_cmd);
> 	ksft_test_result(!res, "CMT: test\n");
> 	if ((get_vendor() == ARCH_INTEL) && res)
> 		ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
>
> but at first glance that warning doesn't appear to try and
> check if SNC was the actual problem.

Your first glance is accurate. This message was added after we found
that the tests fail on SNC systems but could not find a correct way to
enumerate whether SNC is enabled. At that time it was still recommended
that SNC not be enabled, so the test failures continued to be accurate.
This work changes that.

Reinette