2023-09-19 12:13:24

by Maciej Wieczor-Retman

[permalink] [raw]
Subject: Re: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

On 2023-09-07 at 16:19:37 +0000, Luck, Tony wrote:
>> > + if (4 * node_cpus >= cache_cpus)
>> > + return 4;
>> > + else if (2 * node_cpus >= cache_cpus)
>> > + return 2;
>>
>>
>> If "4 * node_cpus >= cache_cpus " is not true,
>> "2 * node_cpus >= cache_cpus" will never be true.
>> Is it the following code?
>>
>> + if (2 * node_cpus >= cache_cpus)
>> + return 2;
>> + else if (4 * node_cpus >= cache_cpus)
>> + return 4;
>
>
>Shaopeng TAN,
>
>Good catch. Your solution is the correct one.
>
>Will fix in next post.

I played around with this code a little and I think the logical
expressions are returning wrong values.

On a system that has SNC disabled the function reports both "node_cpus"
and "cache_cpus" equal to 56. In this case snc_ways() returns "2". It is
the same on a system with SNC enabled that reports the previously mentioned
variables to be different by a factor of two (36 and 72).

Is it possible for node_cpus and cache_cpus to not be multiples of each
other? (as in for example cache_cpus being 10 and node_cpus being 21?).
If not I'd suggest using "==" instead of ">=".

If yes then I guess something like this could work? :

+ if (node_cpus >= cache_cpus)
+ return 1;
+ else if (2 * node_cpus >= cache_cpus)
+ return 2;
+ else if (4 * node_cpus >= cache_cpus)
+ return 4;

PS. I did my tests on two Intel Ice Lakes.

--
Kind regards
Maciej Wieczór-Retman


2023-09-19 21:59:34

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

> On a system that has SNC disabled the function reports both "node_cpus"
> and "cache_cpus" equal to 56. In this case snc_ways() returns "2". It is
> the same on a system with SNC enabled that reports the previously mentioned
> variables to be different by a factor of two (36 and 72).

> Is it possible for node_cpus and cache_cpus to not be multiples of each
> other? (as in for example cache_cpus being 10 and node_cpus being 21?).
> If not I'd suggest using "==" instead of ">=".

Some CPUs may be offline when the test is run. E.g. with one CPU offline on SNC
node 0, you'd see node_cpus = 35 and cache_cpus = 71. But with one CPU offline
on node 1, you'd have node_cpus = 36, cache_cpus = 71.



> If yes then I guess something like this could work? :

+ if (node_cpus >= cache_cpus)
+ return 1;
+ else if (2 * node_cpus >= cache_cpus)
+ return 2;
+ else if (4 * node_cpus >= cache_cpus)
+ return 4;

This returns "4" for the 36 71 case. But should still be "2".

>> PS. I did my tests on two Intel Ice Lakes.

Perhaps easier to play with the algorithm in user code?

#include <stdio.h>
#include <stdlib.h>

static int snc(int node_cpus, int cache_cpus)
{
if (node_cpus >= cache_cpus)
return 1;
else if (2 * node_cpus >= cache_cpus)
return 2;
else if (4 * node_cpus >= cache_cpus)
return 4;
return -1;
}

int main(int argc, char **argv)
{
printf("%d\n", snc(atoi(argv[1]), atoi(argv[2])));

return 0;
}

N.B. it's probably not possible to handle the case where somebody took ALL the CPUs in SNC
node 1 offline (or SNC nodes 1,2,3 for the SNC 4 case).

I think it reasonable that the code handle some simple "small number of CPUs offline" cases.
But don't worry too much about cases where the user has done something extreme.

-Tony


2023-09-20 16:17:45

by Maciej Wieczor-Retman

[permalink] [raw]
Subject: Re: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Hi, thanks for the reply

On 2023-09-19 at 14:36:06 +0000, Luck, Tony wrote:
>> On a system that has SNC disabled the function reports both "node_cpus"
>> and "cache_cpus" equal to 56. In this case snc_ways() returns "2". It is
>> the same on a system with SNC enabled that reports the previously mentioned
>> variables to be different by a factor of two (36 and 72).
>
>> Is it possible for node_cpus and cache_cpus to not be multiples of each
>> other? (as in for example cache_cpus being 10 and node_cpus being 21?).
>> If not I'd suggest using "==" instead of ">=".
>
>Some CPUs may be offline when the test is run. E.g. with one CPU offline on SNC
>node 0, you'd see node_cpus = 35 and cache_cpus = 71. But with one CPU offline
>on node 1, you'd have node_cpus = 36, cache_cpus = 71.

Okay, thanks, good to know. On systems with disabled SNC that number
should be equal even if some CPUs were offline, right? I was mostly
concerned that the previous version was returning the same number
whether SNC was enabled with 2 nodes or disabled.

>> If yes then I guess something like this could work? :
>
>+ if (node_cpus >= cache_cpus)
>+ return 1;
>+ else if (2 * node_cpus >= cache_cpus)
>+ return 2;
>+ else if (4 * node_cpus >= cache_cpus)
>+ return 4;
>
>This returns "4" for the 36 71 case. But should still be "2".
>
>>> PS. I did my tests on two Intel Ice Lakes.
>
>Perhaps easier to play with the algorithm in user code?
>
>#include <stdio.h>
>#include <stdlib.h>
>
>static int snc(int node_cpus, int cache_cpus)
>{
> if (node_cpus >= cache_cpus)
> return 1;
> else if (2 * node_cpus >= cache_cpus)
> return 2;
> else if (4 * node_cpus >= cache_cpus)
> return 4;
> return -1;
>}
>
>int main(int argc, char **argv)
>{
> printf("%d\n", snc(atoi(argv[1]), atoi(argv[2])));
>
> return 0;
>}

My previous understanding was that the presence of ">=" comparison
implied the number of node_cpus could somehow get larger. So I
assumed that keeping it that way would be sufficient but now I can see
that wouldn't be the case.

>
>N.B. it's probably not possible to handle the case where somebody took ALL the CPUs in SNC
>node 1 offline (or SNC nodes 1,2,3 for the SNC 4 case).
>
>I think it reasonable that the code handle some simple "small number of CPUs offline" cases.
>But don't worry too much about cases where the user has done something extreme.
>
>-Tony

What about outputing this value to userspace from resctrl? The ratio is
already saved inside snc_nodes_per_l3_cache variable. And that would
help avoid these difficult cases when some cpus are offline which could
cause snc_ways() to return a wrong value. Or are there some pitfalls
to that approach?

--
Kind regards
Maciej Wiecz?r-Retman

2023-09-20 17:43:08

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v5 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

> What about outputing this value to userspace from resctrl? The ratio is
> already saved inside snc_nodes_per_l3_cache variable. And that would
> help avoid these difficult cases when some cpus are offline which could
> cause snc_ways() to return a wrong value. Or are there some pitfalls
> to that approach?

My original patch series added an "snc_ways" file to the info/ directory
to make this visible. But I was talked out of it because of a lack of clear
user mode use case that needs it.

https://lore.kernel.org/all/[email protected]/

I don't know if the resctrl self tests constitute a valid use case. Perhaps not
as they can figure this out.

But ... maybe the difficulties that user mode has (because of possibly offline
CPUs) are an indicator that the kernel should expose this. If only to
make the kernel's view clear in case there are situations where it got the
calculation wrong. E.g. kernel booted with a maxcpus=N parameter.

-Tony