2023-07-13 16:54:51

by Tony Luck

Subject: [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Clustering enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.
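
Roughly, with illustrative names (the real calculation is in
find_snc_ways() in patch 7 of this series, which also skips
memory-only NUMA nodes):

	snc_ways = num_numa_nodes_with_cpus / num_distinct_l3_caches;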

When Sub-NUMA clustering mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
0 on the socket, and RMID counter 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters within the
SNC node are updated. If reconfigured from default, RMIDs are divided
evenly across clusters.

Also initialize a per-cpu RMID offset value. Use this to calculate the
value to write to the IA32_QM_EVTSEL MSR when reading RMID event values.
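
In outline (these two lines are from patch 7 of this series):

	/* each CPU records which slice of the RMID space its SNC node owns */
	this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);

	/* ... and every event read applies that offset */
	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));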

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.

Signed-off-by: Tony Luck <[email protected]>

---

Changes since v2:

* Rebased to v6.5-rc1

Peter Newman: Found that I'd reversed the actions writing to the new
MSR to enable/disable RMID remapping for SNC mode.
* Fixed.

Peter Newman: Provided Reviewed-by: and Tested-by: tags

* Included in this series.

Randy Dunlap: Reported a run-on sentence in the documentation.

* Broke the sentence into two as suggested.

Shaopeng Tan: Reported that the CMT resctrl self-test failed

* Added extra patch to the series to make the resctrl test detect when SNC
mode is enabled and adjust effective cache size.

* I also patched the rdtgroup_cbm_to_size() function to adjust the cache
size reported in the "size" files in resctrl groups when SNC is active.

Shaopeng Tan: Noted the for_each_capable_rdt_resource() macro is no longer used.

* Deleted the definition of this macro.

Tony Luck (8):
x86/resctrl: Refactor in preparation for node-scoped resources
x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
x86/resctrl: Add package scoped resource
x86/resctrl: Update documentation with Sub-NUMA cluster changes
x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Documentation/arch/x86/resctrl.rst | 10 +-
include/linux/resctrl.h | 5 +-
arch/x86/include/asm/resctrl.h | 2 +
arch/x86/kernel/cpu/resctrl/internal.h | 20 ++-
tools/testing/selftests/resctrl/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 154 ++++++++++++++++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 24 +--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 +-
tools/testing/selftests/resctrl/resctrlfs.c | 57 ++++++++
10 files changed, 248 insertions(+), 33 deletions(-)


base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
--
2.40.1



2023-07-13 16:56:54

by Tony Luck

Subject: [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.

When Sub-NUMA cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
instead of RDT_RESOURCE_L3 for all monitoring operations.

The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
scaled down by the number of Sub-NUMA Clusters.

A subsequent change will detect sub-NUMA cluster mode and set
"snc_ways". For now set to one (meaning each L3 cache spans one
node).

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 7 +++++++
arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 243017096ddf..38bac0062c82 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -430,6 +430,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern int snc_ways;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
@@ -447,6 +449,11 @@ enum resctrl_scope {
SCOPE_NODE,
};

+static inline int get_mbm_res_level(void)
+{
+ return snc_ways > 1 ? RDT_RESOURCE_NODE : RDT_RESOURCE_L3;
+}
+
static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e4bd3072927c..6fe9f87d4403 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,11 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * How many Sub-NUMA Cluster nodes share a single L3 cache
+ */
+int snc_ways = 1;
+
static void
mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
@@ -831,7 +836,7 @@ static __init bool get_rdt_alloc_resources(void)

static __init bool get_rdt_mon_resources(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;

if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 9be6ffdd01ae..da3f36212898 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -787,8 +787,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 418658f0a9ad..d037f3da9e55 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2524,7 +2524,7 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);

if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
--
2.40.1


2023-07-13 16:58:59

by Tony Luck

Subject: [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c

Monitoring may be scoped at L3 cache granularity (legacy) or at the
node level (systems with Sub-NUMA Cluster enabled).

Save the struct rdt_resource pointer that was used to initialize
the monitoring code and use it instead of the hard-coded
RDT_RESOURCE_L3.

No functional change.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..9be6ffdd01ae 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -30,6 +30,8 @@ struct rmid_entry {
struct list_head list;
};

+static struct rdt_resource *mon_resource;
+
/**
* @rmid_free_lru A least recently used list of free RMIDs
* These RMIDs are guaranteed to have an occupancy less than the
@@ -268,7 +270,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*/
void __check_limbo(struct rdt_domain *d, bool force_free)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = mon_resource;
struct rmid_entry *entry;
u32 crmid = 1, nrmid;
bool rmid_dirty;
@@ -333,7 +335,7 @@ int alloc_rmid(void)

static void add_rmid_to_limbo(struct rmid_entry *entry)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = mon_resource;
struct rdt_domain *d;
int cpu, err;
u64 val = 0;
@@ -645,7 +647,7 @@ void cqm_handle_limbo(struct work_struct *work)

mutex_lock(&rdtgroup_mutex);

- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = mon_resource;
d = container_of(work, struct rdt_domain, cqm_limbo.work);

__check_limbo(d, false);
@@ -681,7 +683,7 @@ void mbm_handle_overflow(struct work_struct *work)
if (!static_branch_likely(&rdt_mon_enable_key))
goto out_unlock;

- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = mon_resource;
d = container_of(work, struct rdt_domain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -759,9 +761,9 @@ static struct mon_evt mbm_local_event = {
/*
* Initialize the event list for the resource.
*
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
+ * Monitor events can either be part of RDT_RESOURCE_L3 resource,
+ * or they may be per NUMA node on systems with sub-NUMA cluster
+ * enabled and are then in the RDT_RESOURCE_NODE resource.
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
@@ -773,6 +775,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
list_add_tail(&mbm_total_event.list, &r->evt_list);
if (is_mbm_local_enabled())
list_add_tail(&mbm_local_event.list, &r->evt_list);
+
+ mon_resource = r;
}

int __init rdt_get_mon_l3_config(struct rdt_resource *r)
--
2.40.1


2023-07-13 17:06:22

by Tony Luck

Subject: [PATCH v3 8/8] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two or four nodes.

When SNC mode is enabled, the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.

Detect which SNC mode is active by comparing the number of CPUs that
share a cache with CPU0 against the number of CPUs on node0. E.g. if
CPU0's L3 cache is shared by 48 CPUs but only 24 CPUs are on node0,
then two SNC nodes share each L3 cache.

Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com
Signed-off-by: Tony Luck <[email protected]>
---
tools/testing/selftests/resctrl/resctrl.h | 1 +
tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
2 files changed, 58 insertions(+)

diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 87e39456dee0..a8b43210b573 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -13,6 +13,7 @@
#include <signal.h>
#include <dirent.h>
#include <stdbool.h>
+#include <ctype.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index fb00245dee92..79eecbf9f863 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
return 0;
}

+/*
+ * Count the number of CPUs in a /sys bitmap
+ */
+static int count_sys_bitmap_bits(char *name)
+{
+ FILE *fp = fopen(name, "r");
+ int count = 0, c;
+
+ if (!fp)
+ return 0;
+
+ while ((c = fgetc(fp)) != EOF) {
+ if (!isxdigit(c))
+ continue;
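+ /* each case below falls through: count the set bits in this hex digit */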
+ switch (c) {
+ case 'f':
+ count++;
+ case '7': case 'b': case 'd': case 'e':
+ count++;
+ case '3': case '5': case '6': case '9': case 'a': case 'c':
+ count++;
+ case '1': case '2': case '4': case '8':
+ count++;
+ }
+ }
+ fclose(fp);
+
+ return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
+ * Try to get this right, even if a few CPUs are offline so that the number
+ * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
+ * LLC of CPU0.
+ */
+static int snc_ways(void)
+{
+ int node_cpus, cache_cpus;
+
+ node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+ cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+ if (!node_cpus || !cache_cpus) {
+ fprintf(stderr, "Warning: could not determine Sub-NUMA Cluster mode\n");
+ return 1;
+ }
+
+ /*
+ * Use a rounded ratio so that a few offline CPUs (in node0 or
+ * elsewhere on the socket) don't skew the result.
+ */
+ return (cache_cpus + node_cpus / 2) / node_cpus;
+}
+
/*
* get_cache_size - Get cache size for a specified CPU
* @cpu_no: CPU number
@@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
break;
}

+ if (cache_num == 3)
+ *cache_size /= snc_ways();
return 0;
}

--
2.40.1


2023-07-13 17:06:43

by Tony Luck

Subject: [PATCH v3 5/8] x86/resctrl: Add package scoped resource

Some Intel features require setting a package scoped model specific
register.

Add a new resource that builds domains for each package.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 6 ++++--
arch/x86/kernel/cpu/resctrl/core.c | 23 +++++++++++++++++++----
3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 25051daa6655..f504f6263fec 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -167,6 +167,7 @@ struct rdt_resource {
int rid;
bool alloc_capable;
bool mon_capable;
+ bool pkg_actions;
int num_rmid;
int scope;
struct resctrl_cache cache;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 38bac0062c82..67340c83392f 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -438,6 +438,7 @@ enum resctrl_res_level {
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
RDT_RESOURCE_NODE,
+ RDT_RESOURCE_PKG,

/* Must be the last */
RDT_NUM_RESOURCES,
@@ -447,6 +448,7 @@ enum resctrl_scope {
SCOPE_L2_CACHE = 2,
SCOPE_L3_CACHE = 3,
SCOPE_NODE,
+ SCOPE_PKG,
};

static inline int get_mbm_res_level(void)
@@ -478,9 +480,9 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
r <= &rdt_resources_all[RDT_NUM_RESOURCES - 1].r_resctrl; \
r = resctrl_inc(r))

-#define for_each_capable_rdt_resource(r) \
+#define for_each_domain_needed_rdt_resource(r) \
for_each_rdt_resource(r) \
- if (r->alloc_capable || r->mon_capable)
+ if (r->alloc_capable || r->mon_capable || r->pkg_actions)

#define for_each_alloc_capable_rdt_resource(r) \
for_each_rdt_resource(r) \
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6fe9f87d4403..af3be3c2db96 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -127,6 +127,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
.fflags = 0,
},
},
+ [RDT_RESOURCE_PKG] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_PKG,
+ .name = "PKG",
+ .scope = SCOPE_PKG,
+ .domains = domain_init(RDT_RESOURCE_PKG),
+ .fflags = 0,
+ },
+ },
};

/*
@@ -504,9 +514,14 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)

static int get_domain_id(int cpu, enum resctrl_scope scope)
{
- if (scope == SCOPE_NODE)
+ switch (scope) {
+ case SCOPE_NODE:
return cpu_to_node(cpu);
- return get_cpu_cacheinfo_id(cpu, scope);
+ case SCOPE_PKG:
+ return topology_physical_package_id(cpu);
+ default:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ }
}

/*
@@ -630,7 +645,7 @@ static int resctrl_online_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_capable_rdt_resource(r)
+ for_each_domain_needed_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
@@ -657,7 +672,7 @@ static int resctrl_offline_cpu(unsigned int cpu)
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
- for_each_capable_rdt_resource(r)
+ for_each_domain_needed_rdt_resource(r)
domain_remove_cpu(cpu, r);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
--
2.40.1


2023-07-13 17:07:47

by Tony Luck

Subject: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Add a placeholder in the array of struct rdt_hw_resource to be used
for event monitoring of systems with Sub-NUMA Cluster enabled.

Update get_domain_id() to handle SCOPE_NODE.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 4 +++-
arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++++
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8275b8a74f7e..243017096ddf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -435,6 +435,7 @@ enum resctrl_res_level {
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
RDT_RESOURCE_SMBA,
+ RDT_RESOURCE_NODE,

/* Must be the last */
RDT_NUM_RESOURCES,
@@ -442,7 +443,8 @@ enum resctrl_res_level {

enum resctrl_scope {
SCOPE_L2_CACHE = 2,
- SCOPE_L3_CACHE = 3
+ SCOPE_L3_CACHE = 3,
+ SCOPE_NODE,
};

static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6571514752f3..e4bd3072927c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -112,6 +112,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
.fflags = RFTYPE_RES_MB,
},
},
+ [RDT_RESOURCE_NODE] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_NODE,
+ .name = "L3",
+ .scope = SCOPE_NODE,
+ .domains = domain_init(RDT_RESOURCE_NODE),
+ .fflags = 0,
+ },
+ },
};

/*
@@ -489,6 +499,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)

static int get_domain_id(int cpu, enum resctrl_scope scope)
{
+ if (scope == SCOPE_NODE)
+ return cpu_to_node(cpu);
return get_cpu_cacheinfo_id(cpu, scope);
}

--
2.40.1


2023-07-13 17:07:59

by Tony Luck

Subject: [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources

Sub-NUMA cluster systems provide monitoring resources at the NUMA
node scope instead of the L3 cache scope.

Rename the cache_level field in struct rdt_resource to the more
generic "scope" and add symbolic names and a helper function.

No functional change.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
include/linux/resctrl.h | 4 ++--
arch/x86/kernel/cpu/resctrl/internal.h | 5 +++++
arch/x86/kernel/cpu/resctrl/core.c | 17 +++++++++++------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
5 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..25051daa6655 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -150,7 +150,7 @@ struct resctrl_schema;
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Scope of this resource (cache level or NUMA node)
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
@@ -168,7 +168,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ int scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..8275b8a74f7e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -440,6 +440,11 @@ enum resctrl_res_level {
RDT_NUM_RESOURCES,
};

+enum resctrl_scope {
+ SCOPE_L2_CACHE = 2,
+ SCOPE_L3_CACHE = 3
+};
+
static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..6571514752f3 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = SCOPE_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = SCOPE_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = SCOPE_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = 3,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -487,6 +487,11 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static int get_domain_id(int cpu, enum resctrl_scope scope)
+{
+ return get_cpu_cacheinfo_id(cpu, scope);
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -502,7 +507,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
@@ -552,7 +557,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_domain_id(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 458cb7419502..42f124ffb968 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);

for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == plr->s->res->scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..418658f0a9ad 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.40.1


2023-07-13 17:08:16

by Tony Luck

Subject: [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.

When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
0 on the socket, and RMID counter 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated.

Also initialize a per-cpu RMID offset value. Use this
to calculate the value to write to the IA32_QM_EVTSEL MSR when
reading RMID event values.

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
Tested-by: Peter Newman <[email protected]>
---
arch/x86/include/asm/resctrl.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 99 +++++++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
4 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 255a78d9d906..f95e69bacc65 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);

+DECLARE_PER_CPU(int, rmid_offset);
+
/*
* __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index af3be3c2db96..a03ff1a95624 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -524,6 +527,39 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
}
}

+DEFINE_PER_CPU(int, rmid_offset);
+
+static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
+{
+ this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
+}
+
+/*
+ * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
+ * systems.
+ * Bit0 = 1 (default) For legacy configuration
+ * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
+ */
+#define MSR_RMID_SNC_CONFIG 0xCA0
+
+static void snc_add_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
+static void snc_remove_pkg(void)
+{
+ u64 msrval;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
+ msrval |= BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -555,6 +591,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
cpumask_set_cpu(cpu, &d->cpu_mask);
if (r->cache.arch_has_per_cpu_cfg)
rdt_domain_reconfigure_cdp(r);
+ if (r->mon_capable)
+ set_per_cpu_rmid_offset(cpu, r);
return;
}

@@ -573,11 +611,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
- return;
+ if (r->mon_capable) {
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ domain_free(hw_dom);
+ return;
+ }
+ set_per_cpu_rmid_offset(cpu, r);
}

+ if (r->pkg_actions)
+ snc_add_pkg();
+
list_add_tail(&d->list, add_pos);

err = resctrl_online_domain(r, d);
@@ -613,6 +657,9 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
d->plr->d = NULL;
domain_free(hw_dom);

+ if (r->pkg_actions)
+ snc_remove_pkg();
+
return;
}

@@ -899,11 +946,57 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple enumeration bit to show whether SNC mode
+ * is enabled. Look at the ratio of number of NUMA nodes to the
+ * number of distinct L3 caches. Take care to skip memory-only nodes.
+ */
+static __init int find_snc_ways(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids)
+ set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+ else
+ mem_only_nodes++;
+ }
+ cpus_read_unlock();
+
+ ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
+ kfree(node_caches);
+
+ if (ret > 1)
+ rdt_resources_all[RDT_RESOURCE_PKG].r_resctrl.pkg_actions = true;
+
+ return ret;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_ways = find_snc_ways();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index da3f36212898..74db99d299e1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
rdmsrl(MSR_IA32_QM_CTR, msr_val);

if (msr_val & RMID_VAL_ERROR)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d037f3da9e55..1a9c38b018ba 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}

- return size;
+ return size / snc_ways;
}

/**
--
2.40.1


2023-07-18 20:46:02

by Reinette Chatre

Subject: Re: [PATCH v3 1/8] x86/resctrl: Refactor in preparation for node-scoped resources

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Sub-NUMA cluster systems provide monitoring resources at the NUMA
> node scope instead of the L3 cache scope.
>
> Rename the cache_level field in struct rdt_resource to the more
> generic "scope" and add symbolic names and a helper function.

Can the changelog elaborate how the helper function is intended
to be used? When the changelog just states "add a helper function" it
is unnecessary since that is clear from the code.

>
> No functional change.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> ---

...

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 030d3b409768..6571514752f3 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L3,
> .name = "L3",
> - .cache_level = 3,
> + .scope = SCOPE_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_L3),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -79,7 +79,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L2,
> .name = "L2",
> - .cache_level = 2,
> + .scope = SCOPE_L2_CACHE,
> .domains = domain_init(RDT_RESOURCE_L2),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -93,7 +93,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_MBA,
> .name = "MB",
> - .cache_level = 3,
> + .scope = SCOPE_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_MBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",
> @@ -105,7 +105,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_SMBA,
> .name = "SMBA",
> - .cache_level = 3,
> + .scope = 3,

Should this be SCOPE_L3_CACHE?

> .domains = domain_init(RDT_RESOURCE_SMBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",

...

> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 458cb7419502..42f124ffb968 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -297,7 +297,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == plr->s->res->cache_level) {
> + if (ci->info_list[i].level == plr->s->res->scope) {
> plr->line_size = ci->info_list[i].coherency_line_size;
> return 0;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 725344048f85..418658f0a9ad 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1348,7 +1348,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == r->cache_level) {
> + if (ci->info_list[i].level == r->scope) {
> size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> break;
> }

The last two hunks are red flags to me. Clearly the "cache_level"->"scope"
change is done in preparation for "scope" to be assigned more values than
2 or 3. Yet the code continues to use these values as cache levels, comparing
them to cacheinfo->level, for which I only expect cache levels 2 or 3 to be valid.
The above two hunks thus now have potential for errors when rdt_resource->scope
has a value that is not 2 or 3.

Even if these functions may not be called if rdt_resource->scope is not 2 or 3,
this change makes the code harder to understand and maintain because now it
requires users to know in which flows particular functions can be called and/or
when code paths with invalid values are "ok".

Reinette


2023-07-18 20:55:44

by Reinette Chatre

Subject: Re: [PATCH v3 7/8] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.

This changelog appears to _almost_ be identical to the cover letter. There
is no problem with changelog and cover letter being identical but it makes
things confusing when there are slight differences between the text.
For example, "Sub-NUMA Cluster" vs "Sub-NUMA Clustering". With this difference
between the two the reader is left wondering what behind the difference is.

>
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
>
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
>
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
>
> A model specific MSR (0xca0) can change the configuration of the RMIDs
> when SNC mode is enabled.
>
> The MSR controls the interpretation of the RMID field in the
> IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> within the SNC node are updated.
>
> Also initialize a per-cpu RMID offset value. Use this
> to calculate the value to write to the IA32_QM_EVTSEL MSR when
> reading RMID event values.
>
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> Tested-by: Peter Newman <[email protected]>
> ---
> arch/x86/include/asm/resctrl.h | 2 +
> arch/x86/kernel/cpu/resctrl/core.c | 99 +++++++++++++++++++++++++-
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 4 files changed, 100 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
> index 255a78d9d906..f95e69bacc65 100644
> --- a/arch/x86/include/asm/resctrl.h
> +++ b/arch/x86/include/asm/resctrl.h
> @@ -35,6 +35,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
>
> +DECLARE_PER_CPU(int, rmid_offset);
> +
> /*
> * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
> *
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index af3be3c2db96..a03ff1a95624 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>
> #define pr_fmt(fmt) "resctrl: " fmt
>
> +#include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/err.h>
> #include <linux/cacheinfo.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>
> +#include <asm/cpu_device_id.h>
> #include <asm/intel-family.h>
> #include <asm/resctrl.h>
> #include "internal.h"
> @@ -524,6 +527,39 @@ static int get_domain_id(int cpu, enum resctrl_scope scope)
> }
> }
>
> +DEFINE_PER_CPU(int, rmid_offset);
> +
> +static void set_per_cpu_rmid_offset(int cpu, struct rdt_resource *r)
> +{
> + this_cpu_write(rmid_offset, (cpu_to_node(cpu) % snc_ways) * r->num_rmid);
> +}

Does this mean that per-cpu data is used as a way to keep "per SNC node" data?
Why is it required for this to be per-cpu data instead of, for example, the
offset computed when it is needed?

> +
> +/*
> + * This MSR provides for configuration of RMIDs on Sub-NUMA Cluster
> + * systems.
> + * Bit0 = 1 (default) For legacy configuration
> + * Bit0 = 0 RMIDs are divided evenly between SNC nodes.
> + */
> +#define MSR_RMID_SNC_CONFIG 0xCA0

Please move to msr-index.h. For reference:
97fa21f65c3e ("x86/resctrl: Move MSR defines into msr-index.h")

> +
> +static void snc_add_pkg(void)
> +{
> + u64 msrval;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
> + msrval &= ~BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +}
> +
> +static void snc_remove_pkg(void)
> +{
> + u64 msrval;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, msrval);
> + msrval |= BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, msrval);
> +}
> +
> /*
> * domain_add_cpu - Add a cpu to a resource's domain list.
> *
> @@ -555,6 +591,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> cpumask_set_cpu(cpu, &d->cpu_mask);
> if (r->cache.arch_has_per_cpu_cfg)
> rdt_domain_reconfigure_cdp(r);
> + if (r->mon_capable)
> + set_per_cpu_rmid_offset(cpu, r);
> return;
> }
>
> @@ -573,11 +611,17 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> - if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> - domain_free(hw_dom);
> - return;
> + if (r->mon_capable) {
> + if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> + domain_free(hw_dom);
> + return;
> + }
> + set_per_cpu_rmid_offset(cpu, r);
> }
>
> + if (r->pkg_actions)
> + snc_add_pkg();
> +

This seems like an unnecessary use of a resctrl resource.


> list_add_tail(&d->list, add_pos);
>
> err = resctrl_online_domain(r, d);
> @@ -613,6 +657,9 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> d->plr->d = NULL;
> domain_free(hw_dom);
>
> + if (r->pkg_actions)
> + snc_remove_pkg();
> +
> return;
> }
>
> @@ -899,11 +946,57 @@ static __init bool get_rdt_resources(void)
> return (rdt_mon_capable || rdt_alloc_capable);
> }
>
> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> + {}
> +};
> +
> +/*
> + * There isn't a simple enumeration bit to show whether SNC mode
> + * is enabled. Look at the ratio of number of NUMA nodes to the
> + * number of distinct L3 caches. Take care to skip memory-only nodes.
> + */
> +static __init int find_snc_ways(void)

Based on the function comment and function name it is not clear what
this function is intended to do. In caches, "ways" has a particular meaning;
what does "ways" mean in this context?

> +{
> + unsigned long *node_caches;
> + int mem_only_nodes = 0;
> + int cpu, node, ret;
> +
> + if (!x86_match_cpu(snc_cpu_ids))
> + return 1;
> +
> + node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> + if (!node_caches)
> + return 1;
> +
> + cpus_read_lock();
> + for_each_node(node) {
> + cpu = cpumask_first(cpumask_of_node(node));
> + if (cpu < nr_cpu_ids)
> + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> + else
> + mem_only_nodes++;
> + }
> + cpus_read_unlock();
> +
> + ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> + kfree(node_caches);
> +
> + if (ret > 1)
> + rdt_resources_all[RDT_RESOURCE_PKG].r_resctrl.pkg_actions = true;
> +
> + return ret;
> +}
> +
> static __init void rdt_init_res_defs_intel(void)
> {
> struct rdt_hw_resource *hw_res;
> struct rdt_resource *r;
>
> + snc_ways = find_snc_ways();
> +
> for_each_rdt_resource(r) {
> hw_res = resctrl_to_arch_res(r);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index da3f36212898..74db99d299e1 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -160,7 +160,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + this_cpu_read(rmid_offset));
> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> if (msr_val & RMID_VAL_ERROR)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index d037f3da9e55..1a9c38b018ba 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> }
> }
>
> - return size;
> + return size / snc_ways;
> }
>
> /**

The last hunk does not seem to be covered in the changelog.

Reinette

2023-07-18 21:02:06

by Reinette Chatre

Subject: Re: [PATCH v3 2/8] x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c

Hi Tony,

Regarding subject and change: Why is the focus on just monitor.c?
Hardcoding of RDT_RESOURCE_L3 as monitoring resource is done
elsewhere also (rdtgroup.c:rdt_get_tree()) - why not treat all
hardcoding?

On 7/13/2023 9:32 AM, Tony Luck wrote:

...

> @@ -759,9 +761,9 @@ static struct mon_evt mbm_local_event = {
> /*
> * Initialize the event list for the resource.
> *
> - * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> - * because as per the SDM the total and local memory bandwidth
> - * are enumerated as part of L3 monitoring.
> + * Monitor events can either be part of RDT_RESOURCE_L3 resource,
> + * or they may be per NUMA node on systems with sub-NUMA cluster
> + * enabled and are then in the RDT_RESOURCE_NODE resource.
> */
> static void l3_mon_evt_init(struct rdt_resource *r)
> {
> @@ -773,6 +775,8 @@ static void l3_mon_evt_init(struct rdt_resource *r)
> list_add_tail(&mbm_total_event.list, &r->evt_list);
> if (is_mbm_local_enabled())
> list_add_tail(&mbm_local_event.list, &r->evt_list);
> +
> + mon_resource = r;
> }
>

This does not seem like the right place for this initialization.
l3_mon_evt_init() has a single job that the function comment clearly
states: "Initialize the event list for the resource". What does
the global mon_resource have to do with the event list?

Would get_rdt_mon_resources() not be more appropriate?

Although, looking ahead it is not clear to me why this is needed.
I'll try to focus my responses to the individual patches in this
regard.

> int __init rdt_get_mon_l3_config(struct rdt_resource *r)

Reinette

2023-07-18 21:09:23

by Reinette Chatre

Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Add a placeholder in the array of struct rdt_hw_resource to be used
> for event monitoring of systems with Sub-NUMA Cluster enabled.

Could you please elaborate why a new resource is required?


>
> Update get_domain_id() to handle SCOPE_NODE.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 4 +++-
> arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++++
> 2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 8275b8a74f7e..243017096ddf 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -435,6 +435,7 @@ enum resctrl_res_level {
> RDT_RESOURCE_L2,
> RDT_RESOURCE_MBA,
> RDT_RESOURCE_SMBA,
> + RDT_RESOURCE_NODE,
>
> /* Must be the last */
> RDT_NUM_RESOURCES,
> @@ -442,7 +443,8 @@ enum resctrl_res_level {
>
> enum resctrl_scope {
> SCOPE_L2_CACHE = 2,
> - SCOPE_L3_CACHE = 3
> + SCOPE_L3_CACHE = 3,
> + SCOPE_NODE,
> };

A new resource _and_ a new scope are added. Could the changelog please
explain why this is required?

>
> static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6571514752f3..e4bd3072927c 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -112,6 +112,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .fflags = RFTYPE_RES_MB,
> },
> },
> + [RDT_RESOURCE_NODE] =
> + {
> + .r_resctrl = {
> + .rid = RDT_RESOURCE_NODE,
> + .name = "L3",
> + .scope = SCOPE_NODE,
> + .domains = domain_init(RDT_RESOURCE_NODE),
> + .fflags = 0,
> + },
> + },
> };

So the new resource has the same name, from user perspective,
as RDT_RESOURCE_L3. From this perspective it thus seems to be a
shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
of the actual RDT_RESOURCE_L3? This is starting to look as though this
solution is wrenching itself into current architecture.

From what I can tell the monitoring in SNC environment needs a different
domain list because of the change in scope. What else is needed in the
resource that is different from the existing L3 resource? Could the
monitoring scope of a resource not instead be made distinct from its
allocation scope? By default monitoring and allocation scope will be
the same and thus use the same domain list but when SNC is enabled
then monitoring uses a different domain list.

> /*
> @@ -489,6 +499,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>
> static int get_domain_id(int cpu, enum resctrl_scope scope)
> {
> + if (scope == SCOPE_NODE)
> + return cpu_to_node(cpu);
> return get_cpu_cacheinfo_id(cpu, scope);
> }
>

Reinette

2023-07-18 21:10:15

by Reinette Chatre

Subject: Re: [PATCH v3 5/8] x86/resctrl: Add package scoped resource

Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Some Intel features require setting a package scoped model specific
> register.
>
> Add a new resource that builds domains for each package.

If I understand correctly the only purpose of this new resource
is to know when the first CPU associated with a package
comes online. Am I not reading this right? Using a resctrl resource
for this purpose seems inappropriate and unnecessary while also
making the code very hard to follow.

Reinette

2023-07-18 21:25:46

by Reinette Chatre

Subject: Re: [PATCH v3 4/8] x86/resctrl: Add code to setup monitoring at L3 or NODE scope.

Hi Tony,

Regarding subject: "Add code" is not necessary.

On 7/13/2023 9:32 AM, Tony Luck wrote:
> When Sub-NUMA cluster is enabled (snc_ways > 1) use the RDT_RESOURCE_NODE
> instead of RDT_RESOURCE_L3 for all monitoring operations.

This duplication of resource does not look right to me.
RDT_RESOURCE_NODE now contains the monitoring data for RDT_RESOURCE_L3
with related structures within RDT_RESOURCE_L3 going unused.

>
> The mon_scale and num_rmid values from CPUID(0xf,0x1),(EBX,ECX) must be
> scaled down by the number of Sub-NUMA Clusters.
>
> A subsequent change will detect sub-NUMA cluster mode and set
> "snc_ways". For now set to one (meaning each L3 cache spans one
> node).
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 7 +++++++
> arch/x86/kernel/cpu/resctrl/core.c | 7 ++++++-
> arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 4 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 243017096ddf..38bac0062c82 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -430,6 +430,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
> extern struct dentry *debugfs_resctrl;
>
> +extern int snc_ways;
> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L2,
> @@ -447,6 +449,11 @@ enum resctrl_scope {
> SCOPE_NODE,
> };
>
> +static inline int get_mbm_res_level(void)
> +{
> + return snc_ways > 1 ? RDT_RESOURCE_NODE : RDT_RESOURCE_L3;
> +}

Need to return the enum here? It may be simpler for this helper to
just return a pointer to the resource. (Although the need for a
separate resource is still not clear to me.)

> +
> static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
> {
> struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res);
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index e4bd3072927c..6fe9f87d4403 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,11 @@ int max_name_width, max_data_width;
> */
> bool rdt_alloc_capable;
>
> +/*
> + * How many Sub-NUMA Cluster nodes share a single L3 cache
> + */
> +int snc_ways = 1;
> +

Since snc_ways is always used I think the comment should provide
more detail on the possible values it may have. For example, to
a reader it may not be obvious what the value of snc_ways should be
if SNC is disabled. Also, what does "ways" refer to? (Also mentioned
later).

> static void
> mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> @@ -831,7 +836,7 @@ static __init bool get_rdt_alloc_resources(void)
>
> static __init bool get_rdt_mon_resources(void)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
>
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
> rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 9be6ffdd01ae..da3f36212898 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -787,8 +787,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> int ret;
>
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_ways;
> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_ways;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 418658f0a9ad..d037f3da9e55 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2524,7 +2524,7 @@ static int rdt_get_tree(struct fs_context *fc)
> static_branch_enable_cpuslocked(&rdt_enable_key);
>
> if (is_mbm_enabled()) {
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + r = &rdt_resources_all[get_mbm_res_level()].r_resctrl;
> list_for_each_entry(dom, &r->domains, list)
> mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
> }

This final hunk makes me wonder why the monitor.c:
mon_resource is necessary at all. A single helper used everywhere
may be simpler.

Reinette

2023-07-18 23:49:57

by Tony Luck

Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

On Tue, Jul 18, 2023 at 01:40:32PM -0700, Reinette Chatre wrote:
> > + [RDT_RESOURCE_NODE] =
> > + {
> > + .r_resctrl = {
> > + .rid = RDT_RESOURCE_NODE,
> > + .name = "L3",
> > + .scope = SCOPE_NODE,
> > + .domains = domain_init(RDT_RESOURCE_NODE),
> > + .fflags = 0,
> > + },
> > + },
> > };
>
> So the new resource has the same name, from user perspective,
> as RDT_RESOURCE_L3. From this perspective it thus seems to be a
> shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
> of the actual RDT_RESOURCE_L3? This is starting to look as though this
> solution is wrenching itself into current architecture.
>
> From what I can tell the monitoring in SNC environment needs a different
> domain list because of the change in scope. What else is needed in the
> resource that is different from the existing L3 resource? Could the
> monitoring scope of a resource not instead be made distinct from its
> allocation scope? By default monitoring and allocation scope will be
> the same and thus use the same domain list but when SNC is enabled
> then monitoring uses a different domain list.

Answering this part first, because my choice here affects a bunch
of the code that also raised comments from you.

The crux of the issue is that when SNC mode is enabled the scope
for L3 monitoring functions changes to "node" scope, while the
scope of L3 control functions (CAT, CDP) remains at L3 cache scope.

My solution was to just create a new resource. But you have an
interesting alternate solution. Add an extra domain list to the
resource structure to allow creation of distinct domain lists
for this case where the scope for control and monitor functions
differs.

So change the resource structure like this:

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..01590aa59a67 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -168,10 +168,12 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ int ctrl_scope;
+ int mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
- struct list_head domains;
+ struct list_head ctrl_domains;
+ struct list_head mon_domains;
char *name;
int data_width;
u32 default_ctrl;

and build/use separate domain lists for when this resource is
being referenced for allocation/monitoring. E.g. domain_add_cpu()
would check "r->alloc_capable" and add a cpu to the ctrl_domains
list based on the ctrl_scope value. It would do the same with
mon_capable / mon_domains / mon_scope.

If ctrl_scope == mon_scope, just build one list as you suggest above.
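
As a rough sketch (domain_add_cpu_to_list() is an assumed helper that
resolves the scope to a domain id and inserts the CPU; when
ctrl_scope == mon_scope it could be taught to share one list):

	static void domain_add_cpu(int cpu, struct rdt_resource *r)
	{
		if (r->alloc_capable)
			domain_add_cpu_to_list(cpu, r, r->ctrl_scope,
					       &r->ctrl_domains);
		if (r->mon_capable)
			domain_add_cpu_to_list(cpu, r, r->mon_scope,
					       &r->mon_domains);
	}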

Maybe there are more places that walk the list of control domains than
walk the list of monitor domains. Need to audit this set:

$ git grep list_for_each.*domains -- arch/x86/kernel/cpu/resctrl
arch/x86/kernel/cpu/resctrl/core.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/core.c: list_for_each(l, &r->domains) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/monitor.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/pseudo_lock.c: list_for_each_entry(d_i, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list)
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r_l->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list)
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &s->res->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {

Maybe "domains" can keep its name and make a "list_for_each_monitor_domain()" macro
to pick the right list to walk?
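
Such a macro might look like this (sketch; assumes the monitor list is
named "mon_domains" as in the structure change above):

	#define list_for_each_monitor_domain(d, r)				\
		list_for_each_entry(d, (r)->mon_scope ? &(r)->mon_domains	\
						      : &(r)->domains, list)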


I don't think this will reduce the amount of code change in a
significant way. But it may be conceptually easier to follow
what is going on.

-Tony

2023-07-18 23:50:49

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] x86/resctrl: Add package scoped resource

On Tue, Jul 18, 2023 at 01:43:45PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 7/13/2023 9:32 AM, Tony Luck wrote:
> > Some Intel features require setting a package scoped model specific
> > register.
> >
> > Add a new resource that builds domains for each package.
>
> If I understand correctly the only purpose of this new resource
> is to know when the first CPU associated with a package
> comes online. Am I not reading this right? Using a resctrl resource
> for this purpose seems inappropriate and unnecessary while also
> making the code very hard to follow.

Reinette,

Yes. You understand.

I agree that this is blatant abuse of the resource structures. I can find
another way to perform an action when the first CPU of a package comes
online, and when the last CPU of a package goes offline.
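
For example (a sketch, not tested; snc_remap_rmids() is a hypothetical
helper that writes the SNC configuration MSR), a cpuhp online callback
could check whether this CPU is alone in its package:

	static int snc_cpu_online(unsigned int cpu)
	{
		/*
		 * topology_core_cpumask() covers the online CPUs in this
		 * CPU's package, so a weight of 1 means this is the first
		 * CPU of the package to come online.
		 */
		if (cpumask_weight(topology_core_cpumask(cpu)) == 1)
			snc_remap_rmids(cpu);
		return 0;
	}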

-Tony

2023-07-19 00:05:33

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Hi Tony,

On 7/18/2023 3:57 PM, Tony Luck wrote:
> On Tue, Jul 18, 2023 at 01:40:32PM -0700, Reinette Chatre wrote:
>>> + [RDT_RESOURCE_NODE] =
>>> + {
>>> + .r_resctrl = {
>>> + .rid = RDT_RESOURCE_NODE,
>>> + .name = "L3",
>>> + .scope = SCOPE_NODE,
>>> + .domains = domain_init(RDT_RESOURCE_NODE),
>>> + .fflags = 0,
>>> + },
>>> + },
>>> };
>>
>> So the new resource has the same name, from user perspective,
>> as RDT_RESOURCE_L3. From this perspective it thus seems to be a
>> shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
>> of the actual RDT_RESOURCE_L3? This is starting to look as though this
>> solution is wrenching itself into current architecture.
>>
>> From what I can tell the monitoring in an SNC environment needs a different
>> domain list because of the change in scope. What else is needed in the
>> resource that is different from the existing L3 resource? Could the
>> monitoring scope of a resource not instead be made distinct from its
>> allocation scope? By default monitoring and allocation scope will be
>> the same and thus use the same domain list but when SNC is enabled
>> then monitoring uses a different domain list.
>
> Answering this part first, because my choice here affects a bunch
> of the code that also raised comments from you.

Indeed.

>
> The crux of the issue is that when SNC mode is enabled the scope
> for L3 monitoring functions changes to "node" scope, while the
> scope of L3 control functions (CAT, CDP) remains at L3 cache scope.
>
> My solution was to just create a new resource. But you have an
> interesting alternate solution: add an extra domain list to the
> resource structure to allow creation of distinct domain lists
> for this case where the scope for control and monitor functions
> differs.
>
> So change the resource structure like this:
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..01590aa59a67 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -168,10 +168,12 @@ struct rdt_resource {
> bool alloc_capable;
> bool mon_capable;
> int num_rmid;
> - int cache_level;
> + int ctrl_scope;
> + int mon_scope;

I am not sure about getting rid of cache_level so fast.
I see regarding the current problem being solved that
ctrl_scope would have the same values as cache_level but
I find that adding this level of indirection while keeping
the comparison with cacheinfo->level to create a trap
for future mistakes.


> struct resctrl_cache cache;
> struct resctrl_membw membw;
> - struct list_head domains;
> + struct list_head ctrl_domains;
> + struct list_head mon_domains;
> char *name;
> int data_width;
> u32 default_ctrl;
>
> and build/use separate domain lists for when this resource is
> being referenced for allocation/monitoring. E.g. domain_add_cpu()
> would check "r->alloc_capable" and add a cpu to the ctrl_domains
> list based on the ctrl_scope value. It would do the same with
> mon_capable / mon_domains / mon_scope.
>
> If ctrl_scope == mon_scope, just build one list as you suggest above.

Yes, this is the idea. Thank you for considering it. Something else
to consider that may make this even cleaner/simpler would be to review
struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
usage. These structs could potentially be split further into separate
"control" and "monitor" variants. For example, "struct rdt_domain" split into
"struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
split then resctrl can always create two lists with the unnecessary duplication
eliminated when two domain lists are created. This would also
eliminate the need to scatter ctrl_scope == mon_scope checks throughout.

>
> Maybe there are more places that walk the list of control domains than
> walk the list of monitor domains. Need to audit this set:
>
> $ git grep list_for_each.*domains -- arch/x86/kernel/cpu/resctrl
> arch/x86/kernel/cpu/resctrl/core.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/core.c: list_for_each(l, &r->domains) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/monitor.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c: list_for_each_entry(d_i, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list)
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r_l->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list)
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &s->res->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
>
> Maybe "domains" can keep its name and make a "list_for_each_monitor_domain()" macro
> to pick the right list to walk?

It is not clear to me how "domains" can keep its name. If I understand
the macro would be useful if scope always needs to be considered. I wonder
if the list walkers may not mostly just walk the appropriate list directly
if resctrl always creates separate "control domain" and "monitor domain"
lists.
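
That is, with both lists always present a walker could just pick its
list directly, e.g. (sketch, assuming a "mon_domains" field):

	static void setup_mbm_overflow(struct rdt_resource *r)
	{
		struct rdt_domain *dom;

		list_for_each_entry(dom, &r->mon_domains, list)
			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
	}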

> I don't think this will reduce the amount of code change in a
> significant way. But it may be conceptually easier to follow
> what is going on.

Reducing the amount of code changed is not a goal to me. If I understand
correctly I think that adapting resctrl to support different monitor and
control scope could create a foundation into which SNC can slot in smoothly.

Reinette



2023-07-19 00:36:55

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

> Yes, this is the idea. Thank you for considering it. Something else
> to consider that may make this even cleaner/simpler would be to review
> struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
> usage. These structs could potentially be split further into separate
> "control" and "monitor" variants. For example, "struct rdt_domain" split into
> "struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
> split then resctrl can always create two lists with the unnecessary duplication
> eliminated when two domain lists are created. This would also
> eliminate the need to scatter ctrl_scope == mon_scope checks throughout.

You might like what I'm doing in the "resctrl2" re-write[1]. Arch independent code
that maintains the domain lists for a resource via a cpuhp notifier just has this
for the domain structure:

struct resctrl_domain {
struct list_head list;
struct cpumask cpu_mask;
int id;
int cache_size;
};

Each module managing a resource decides what extra information it wants to
carry in the domain. So the above structure is common to all, but it is followed
by whatever the resource module wants. E.g. the CBM masks for each CLOSid
for the CAT module. The module tells core code the size to allocate.
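
As a sketch of that layout (names hypothetical), a CAT module might
wrap the common structure like this and tell the core to allocate
sizeof(struct cat_domain) plus the per-CLOSID array:

	struct cat_domain {
		struct resctrl_domain	common;	/* must be first */
		u32			cbm[];	/* per-CLOSID masks */
	};

	static inline struct cat_domain *to_cat_domain(struct resctrl_domain *d)
	{
		return container_of(d, struct cat_domain, common);
	}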

"cache_size" is only there because the cache topology bits needed to discover
sizes of caches aren't exported. Both the "size" file and pseudo-locking need
to know the size.

It's also possible that you may hate it. There is zero sharing of resource structures
even if they have the same scope. This is because all modules are independently
loadable.

-Tony

[1] WIP snapshot at git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
branch resctrl2_v65rc1. That doesn't have pseudo-locking, but most of the rest
of existing resctrl functionality is there.

2023-07-19 02:56:25

by Shaopeng Tan (Fujitsu)

[permalink] [raw]
Subject: RE: [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

I ran selftests/resctrl in my environment;
the test result is "not ok".

Processor in my environment:
Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz

kernel:
$ uname -r
6.5.0-rc1+

Result :
Sub-NUMA enable:
xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
make: Entering directory '/.../tools/testing/selftests/resctrl'
TAP version 13
1..1
# timeout set to 120
# selftests: resctrl: resctrl_tests
# TAP version 13
# # Pass: Check kernel supports resctrl filesystem
# # Pass: Check resctrl mountpoint "/sys/fs/resctrl" exists
# # resctrl filesystem not mounted
# # dmesg: [ 3.060018] resctrl: L3 allocation detected
# # dmesg: [ 3.098180] resctrl: MB allocation detected
# # dmesg: [ 3.118507] resctrl: L3 monitoring detected
# 1..4
# # Starting MBM BW change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 14784
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Checking for pass/fail
# # Fail: Check MBM diff within 5%
# # avg_diff_per: 100%
# # Span (MB): 250
# # avg_bw_imc: 14185
# # avg_bw_resc: 28389
# not ok 1 MBM: bw change
# # Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.
# # Starting MBA Schemata change ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Benchmark PID: 14787
# # Writing benchmark parameters to resctrl FS
# # Write schema "MB:0=100" to resctrl FS
# # Write schema "MB:0=90" to resctrl FS
# # Write schema "MB:0=80" to resctrl FS
# # Write schema "MB:0=70" to resctrl FS
# # Write schema "MB:0=60" to resctrl FS
# # Write schema "MB:0=50" to resctrl FS
# # Write schema "MB:0=40" to resctrl FS
# # Write schema "MB:0=30" to resctrl FS
# # Write schema "MB:0=20" to resctrl FS
# # Write schema "MB:0=10" to resctrl FS
# # Results are displayed in (MB)
# # Fail: Check MBA diff within 5% for schemata 100
# # avg_diff_per: 99%
# # avg_bw_imc: 14179
# # avg_bw_resc: 28340
# # Fail: Check MBA diff within 5% for schemata 90
# # avg_diff_per: 100%
# # avg_bw_imc: 9244
# # avg_bw_resc: 18497
# # Fail: Check MBA diff within 5% for schemata 80
# # avg_diff_per: 100%
# # avg_bw_imc: 9249
# # avg_bw_resc: 18504
# # Fail: Check MBA diff within 5% for schemata 70
# # avg_diff_per: 100%
# # avg_bw_imc: 9250
# # avg_bw_resc: 18506
# # Fail: Check MBA diff within 5% for schemata 60
# # avg_diff_per: 100%
# # avg_bw_imc: 7521
# # avg_bw_resc: 15055
# # Fail: Check MBA diff within 5% for schemata 50
# # avg_diff_per: 100%
# # avg_bw_imc: 7455
# # avg_bw_resc: 14917
# # Fail: Check MBA diff within 5% for schemata 40
# # avg_diff_per: 100%
# # avg_bw_imc: 5962
# # avg_bw_resc: 11934
# # Fail: Check MBA diff within 5% for schemata 30
# # avg_diff_per: 100%
# # avg_bw_imc: 4208
# # avg_bw_resc: 8436
# # Fail: Check MBA diff within 5% for schemata 20
# # avg_diff_per: 98%
# # avg_bw_imc: 2972
# # avg_bw_resc: 5909
# # Fail: Check MBA diff within 5% for schemata 10
# # avg_diff_per: 99%
# # avg_bw_imc: 1715
# # avg_bw_resc: 3426
# # Fail: Check schemata change using MBA
# # At least one test failed
# not ok 2 MBA: schemata change
# # Starting CMT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Benchmark PID: 14793
# # Writing benchmark parameters to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 15%
# # Percent diff=91
# # Number of bits: 5
# # Average LLC val: 5640192
# # Cache span (bytes): 2949120
# not ok 3 CMT: test
# # Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.
# # Starting CAT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Writing benchmark parameters to resctrl FS
# # Write schema "L3:0=3f" to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 4%
# # Percent diff=6
# # Number of bits: 6
# # Average LLC val: 51475
# # Cache span (lines): 55296
# not ok 4 CAT: test
# # Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: resctrl: resctrl_tests # exit=1
make: Leaving directory '/...l/tools/testing/selftests/resctrl'

Sub-NUMA disable:
xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
...
# # Starting CAT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :6488064
# # Writing benchmark parameters to resctrl FS
# # Write schema "L3:0=3f" to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 4%
# # Percent diff=6
# # Number of bits: 6
# # Average LLC val: 51899
# # Cache span (lines): 55296
# not ok 4 CAT: test
# # Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: resctrl: resctrl_tests # exit=1
make: Leaving directory '/.../tools/testing/selftests/resctrl'

Best regards,
Shaopeng TAN

2023-07-20 00:48:03

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Here's a quick hack to see how things might look with
separate domain lists in the "L3" resource.

For testing purposes on a non-SNC system I set ->mon_scope =
MON_SCOPE_NODE, but made domain_add_cpu() allocate the mondomains
list based on L3 scope ... just so I could check that I found all
the places where monitoring needs to use the mondomains list.
The kernel doesn't crash when running tools/testing/selftests/resctrl,
and the tests all pass. But that doesn't mean I didn't miss something.

Some restructuring of control vs. monitoring initialization might
avoid some of the code I duplicated in domain_add_cpu(). But this
is intended just as an "Is this what you meant?" before I dig deeper.

Overall, I think it is a cleaner approach than making a new
"L3" resource with different scope just for the SNC monitoring.

-Tony

---

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..e4b653088a22 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -151,9 +151,11 @@ struct resctrl_schema;
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
* @cache_level: Which cache level defines scope of this resource
+ * @mon_scope: Scope of this resource if different from cache_level
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
+ * @mondomains: Monitor domains for this resource (if mon_scope != 0)
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -169,9 +171,11 @@ struct rdt_resource {
bool mon_capable;
int num_rmid;
int cache_level;
+ int mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
+ struct list_head mondomains;
char *name;
int data_width;
u32 default_ctrl;
@@ -184,6 +188,8 @@ struct rdt_resource {
bool cdp_capable;
};

+#define MON_SCOPE_NODE 1
+
/**
* struct resctrl_schema - configuration abilities of a resource presented to
* user-space
@@ -217,8 +223,8 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_setup);
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_teardown);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..c5e2ac2a60cf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,7 +511,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..545d563ba956 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,7 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)

struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -66,7 +66,9 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .mon_scope = MON_SCOPE_NODE, //FAKE
+ .domains = domain_init(RDT_RESOURCE_L3, domains),
+ .mondomains = domain_init(RDT_RESOURCE_L3, mondomains),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -80,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
.cache_level = 2,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .domains = domain_init(RDT_RESOURCE_L2, domains),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -94,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .domains = domain_init(RDT_RESOURCE_MBA, domains),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -106,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .domains = domain_init(RDT_RESOURCE_SMBA, domains),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
}

/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of the lists for a resource that
+ * matches input resource id
*
* Search resource r's domain list to find the resource id. If the resource
* id is found in a domain, return the domain. Otherwise, if requested by
* caller, return the first domain whose id is bigger than the input id.
* The domain list is sorted by id in ascending order.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
struct list_head **pos)
{
struct rdt_domain *d;
@@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
if (id < 0)
return ERR_PTR(-ENODEV);

- list_for_each(l, &r->domains) {
+ list_for_each(l, h) {
d = list_entry(l, struct rdt_domain, list);
/* When id is found, return its domain. */
if (id == d->id)
@@ -508,7 +511,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
struct rdt_domain *d;
int err;

- d = rdt_find_domain(r, id, &add_pos);
+ d = rdt_find_domain(&r->domains, id, &add_pos);
if (IS_ERR(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
return;
@@ -536,6 +539,44 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
}

+ if (!r->mon_scope && r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ domain_free(hw_dom);
+ return;
+ }
+
+ list_add_tail(&d->list, add_pos);
+
+ err = resctrl_online_domain(r, d, r->mon_scope == 0);
+ if (err) {
+ list_del(&d->list);
+ domain_free(hw_dom);
+ }
+
+ if (r->mon_scope != MON_SCOPE_NODE)
+ return;
+
+ //id = cpu_to_node(cpu);
+ id = get_cpu_cacheinfo_id(cpu, r->cache_level); // FAKE
+ add_pos = NULL;
+ d = rdt_find_domain(&r->mondomains, id, &add_pos);
+ if (IS_ERR(d)) {
+ pr_warn("Couldn't find node id for CPU %d\n", cpu);
+ return;
+ }
+
+ if (d) {
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->id = id;
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+
if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
domain_free(hw_dom);
return;
@@ -543,7 +584,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)

list_add_tail(&d->list, add_pos);

- err = resctrl_online_domain(r, d);
+ err = resctrl_online_domain(r, d, true);
if (err) {
list_del(&d->list);
domain_free(hw_dom);
@@ -556,7 +597,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

- d = rdt_find_domain(r, id, NULL);
+ d = rdt_find_domain(&r->domains, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
return;
@@ -565,7 +606,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_domain(r, d, r->mon_scope == 0);
list_del(&d->list);

/*
@@ -579,7 +620,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
return;
}

- if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
+ if (r->mon_scope == 0 && r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d, 0);
@@ -590,6 +631,23 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
cqm_setup_limbo_handler(d, 0);
}
}
+
+ if (r->mon_scope != MON_SCOPE_NODE)
+ return;
+
+ id = cpu_to_node(cpu);
+ d = rdt_find_domain(&r->mondomains, id, NULL);
+ if (IS_ERR_OR_NULL(d)) {
+ pr_warn("Couldn't find node id for CPU %d\n", cpu);
+ return;
+ }
+
+ cpumask_clear_cpu(cpu, &d->cpu_mask);
+ if (cpumask_empty(&d->cpu_mask)) {
+ resctrl_offline_domain(r, d, true);
+ list_del(&d->list);
+ domain_free(hw_dom);
+ }
}

static void clear_closid_rmid(int cpu)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..80033cb698d0 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -545,6 +545,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
struct rdt_resource *r;
union mon_data_bits md;
struct rdt_domain *d;
+ struct list_head *h;
struct rmid_read rr;
int ret = 0;

@@ -560,7 +561,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;

r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ d = rdt_find_domain(h, domid, NULL);
if (IS_ERR_OR_NULL(d)) {
ret = -ENOENT;
goto out;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..08085202582a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -335,12 +335,14 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_domain *d;
+ struct list_head *h;
int cpu, err;
u64 val = 0;

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, list) {
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(d, h, list) {
if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..fb5b23fcb6d4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1492,11 +1492,13 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
{
struct mon_config_info mon_info = {0};
struct rdt_domain *dom;
+ struct list_head *h;
bool sep = false;

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, list) {
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(dom, h, list) {
if (sep)
seq_puts(s, ";");

@@ -1599,6 +1601,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
struct rdt_domain *d;
+ struct list_head *h;
int ret = 0;

next:
@@ -1619,7 +1622,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, list) {
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(d, h, list) {
if (d->id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
if (ret)
@@ -2465,6 +2469,7 @@ static int rdt_get_tree(struct fs_context *fc)
struct rdt_fs_context *ctx = rdt_fc2context(fc);
struct rdt_domain *dom;
struct rdt_resource *r;
+ struct list_head *h;
int ret;

cpus_read_lock();
@@ -2525,7 +2530,8 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(dom, h, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2917,9 +2923,11 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdtgroup *prgrp)
{
struct rdt_domain *dom;
+ struct list_head *h;
int ret;

- list_for_each_entry(dom, &r->domains, list) {
+ h = r->mon_scope ? &r->mondomains : &r->domains;
+ list_for_each_entry(dom, h, list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3708,14 +3716,14 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_teardown)
{
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);

- if (!r->mon_capable)
+ if (!mon_teardown || !r->mon_capable)
return;

/*
@@ -3773,7 +3781,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_setup)
{
int err;

@@ -3783,7 +3791,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
/* RDT_RESOURCE_MBA is never mon_capable */
return mba_sc_domain_allocate(r, d);

- if (!r->mon_capable)
+ if (!mon_setup || !r->mon_capable)
return 0;

err = domain_setup_mon_state(r, d);

2023-07-20 18:12:48

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Hi Tony,

On 7/18/2023 5:11 PM, Luck, Tony wrote:
>> Yes, this is the idea. Thank you for considering it. Something else
>> to consider that may make this even cleaner/simpler would be to review
>> struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
>> usage. These structs could potentially be split further into separate
>> "control" and "monitor" variants. For example, "struct rdt_domain" split into
>> "struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
>> split then resctrl can always create two lists with the unnecessary duplication
>> eliminated when two domain lists are created. This would also
>> eliminate the need to scatter ctrl_scope == mon_scope checks throughout.
>
> You might like what I'm doing in the "resctrl2" re-write[1]. Arch independent code
> that maintains the domain lists for a resource via a cpuhp notifier just has this
> for the domain structure:
>
> struct resctrl_domain {
> struct list_head list;
> struct cpumask cpu_mask;
> int id;
> int cache_size;
> };
>
> Each module managing a resource decides what extra information it wants to
> carry in the domain. So the above structure is common to all, but it is followed
> by whatever the resource module wants. E.g. the CBM masks for each CLOSid
> for the CAT module. The module tells core code the size to allocate.

hmmm ... what I am *hearing* you say is that the goodness from the
rewrite can be added to resctrl? :)

> "cache_size" is only there because the cache topology bits needed to discover
> sizes of caches aren't exported. Both the "size" file and pseudo-locking need
> to know the size.
>
> It's also possible that you may hate it. There is zero sharing of resource structures
> even if they have the same scope. This is because all modules are independently
> loadable.

Apologies but I am still unable to understand the problem statement that
motivates the rewrite.

Reinette

2023-07-20 18:13:34

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Hi Tony,

On 7/19/2023 5:20 PM, Tony Luck wrote:
> Here's a quick hack to see how things might look with
> separate domain lists in the "L3" resource.
>
> For testing purposes on a non-SNC system I set ->mon_scope =
> MON_SCOPE_NODE, but made domain_add_cpu() allocate the mondomains
> list based on L3 scope ... just so I could check that I found all
> the places where monitoring needs to use the mondomains list.
> The kernel doesn't crash when running tools/testing/selftests/resctrl,
> and the tests all pass. But that doesn't mean I didn't miss something.
>
> Some restructuring of control vs. monitoring initialization might
> avoid some of the code I duplicated in domain_add_cpu(). But this
> is intended just as an "Is this what you meant?" before I dig deeper.

Thank you for considering the approach. I find that this sample moves
towards the idea while also highlighting what else can be considered.
I do not know if you already considered these ideas and found it flawed
so I will try to make it explicit so that you can point out to me where
things will fall apart.

The sample code introduces a new list "mondomains" that is intended to
be used when the monitoring scope is different from the allocation scope.
This introduces duplication when the monitoring and allocation scope is
different. Each list, "domains" and "mondomains" will host structures
that can accommodate both monitoring and allocation data, with the data
not relevant to the list going unused as it is unnecessarily duplicated.
Additionally this forces significant portions of resctrl to now always
consider whether the monitoring and allocation scope is different ...
note how this sample now has code like below scattered throughout.
h = r->mon_scope ? &r->mondomains : &r->domains;

I also find the domain_add_cpu() becoming intricate as it needs to
navigate all the different scenarios.

This unnecessary duplication, new environment checks needed throughout,
and additional code complexities are red flags to me that this solution
is not well integrated.

To deal with these complexities I would like to consider if it may
make things simpler to always (irrespective of allocation and
monitoring scope) maintain allocation and monitoring domain lists.
Each list need only carry data appropriate to its use ... the allocation
list only has data relevant to allocation, the monitoring list only
has data relevant to monitoring. This is the struct rdt_domain related
split I mentioned previously.
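
A sketch of the suggested split (fields chosen for illustration only):

	struct rdt_ctrl_domain {
		struct list_head	list;
		int			id;
		struct cpumask		cpu_mask;
		u32			*mbps_val;	/* control-only state */
	};

	struct rdt_mon_domain {
		struct list_head	list;
		int			id;
		struct cpumask		cpu_mask;
		struct mbm_state	*mbm_total;	/* monitor-only state */
		struct mbm_state	*mbm_local;
	};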

Code could become something like:

resctrl_online_cpu()
{
...
for_each_alloc_capable_rdt_resource(r)
alloc_domain_add_cpu(...)
for_each_mon_capable_rdt_resource(r)
mon_domain_add_cpu(...)
...
}

This would reduce complication in domain_add_cpu() since each domain list
only need to concern itself with monitoring or allocation.

Even resctrl_online_domain() can be simplified significantly by
making it specific to allocation or monitoring. For example,
resctrl_online_mon_domain() would only and always just run
the monitoring related code.

With the separate allocation and monitoring domain lists there
may no longer be a need for scattering code with checks like:
h = r->mon_scope ? &r->mondomains : &r->domains;
This would be because the code can directly pick the domain
list it is operating on.

What do you think? The above is just refactoring of existing
code and from what I can tell this would make supporting
SNC straightforward.

Reinette

2023-07-20 22:18:27

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

> To deal with these complexities I would like to consider if it may
> make things simpler to always (irrespective of allocation and
> monitoring scope) maintain allocation and monitoring domain lists.
> Each list need only carry data appropriate to its use ... the allocation
> list only has data relevant to allocation, the monitoring list only
> has data relevant to monitoring. This is the struct rdt_domain related
> split I mentioned previously.
>
> Code could become something like:

> resctrl_online_cpu()
> {
> ...
> for_each_alloc_capable_rdt_resource(r)
> alloc_domain_add_cpu(...)
> for_each_mon_capable_rdt_resource(r)
> mon_domain_add_cpu(...)
> ...
> }

> This would reduce complication in domain_add_cpu() since each domain list
> only need to concern itself with monitoring or allocation.

This does seem a worthy target.

I started on a patch to do this ... but I'm not sure I have the stamina or the time
to see it through.

I split struct rdt_domain into rdt_ctrl_domain and rdt_mon_domain. But that
led to also splitting the rdt_hw_domain structure into two, and then splitting
the resctrl_to_arch_dom() function, and then another and another.

That process will eventually converge (there are a finite number of lines
of code) ... but it will be a big patch. I don't see how to stage it a piece
at a time.
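
For example, the arch-private wrapper and its accessor each end up with
two variants; a sketch of the monitor side (field names illustrative):

	struct rdt_hw_mon_domain {
		struct rdt_mon_domain	d_resctrl;
		struct arch_mbm_state	*arch_mbm_total;
		struct arch_mbm_state	*arch_mbm_local;
	};

	static inline struct rdt_hw_mon_domain *
	resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
	{
		return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
	}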

-Tony

2023-07-22 19:12:57

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems

The Sub-NUMA cluster feature on some Intel processors partitions
the CPUs that share an L3 cache into two or more sets. This plays
havoc with the Resource Director Technology (RDT) monitoring features.
Prior to this patch Intel has advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID
counters in the same way. This allows the monitoring features
to be used (with the caveat that memory accesses between different
SNC NUMA nodes may still not be counted accurately).

Signed-off-by: Tony Luck <[email protected]>

---

Changes since v3:

Reinette provided the most excellent suggestion that this series
could better achieve its objective if it enabled separate domain
lists for control & monitoring within a resource, rather than
creating a whole new resource to support the separate node scope needed
for SNC monitoring. Thus all the pre-amble patches from the previous
version have gone, replaced by patches 1-4 of this new series.

Note to anyone backporting this to some older Linux kernel version.
You may be able to skip parts 2-4. These provide separate domain
structures for control and monitor with just the fields needed for
each. But this is largely cosmetic.

Of the code from v3 that survived to v4 the following changes have
been made (also from Reinette's review of v3).

1) Rename "snc_ways" to "snc_nodes_per_l3_cache" to avoid the confusing
use of "ways" which means something entirely different when talking
about caches.
2) Move the #define for MSR_RMID_SNC_CONFIG to <asm/msr-index.h> along
with all the other RDT MSRs.
3) Don't use a per-CPU variable "rmid_offset". Just calculate the value
needed at the one place where it is used (a sketch follows this list).
4) Don't create an entire resource structure with package scoped domains
just to set the SNC MSR.
5) Add comment in the commit message about adjusting the value shown in
the "size" files in each resctrl ctrl_mon directory.

This one not from Reinette:
6) Prevent mounting in "mba_MBps" mode when SNC mode is enabled. This
would just be confusing since monitoring is done at the node scope while
control is still at package scope.
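
For item 3, a sketch of calculating the offset at the point of use
(illustrative only, not the literal patch code; assumes SNC nodes on a
socket are numbered consecutively):

	static u32 snc_adjusted_rmid(struct rdt_resource *r, u32 rmid)
	{
		if (snc_nodes_per_l3_cache > 1)
			rmid += (cpu_to_node(smp_processor_id()) %
				 snc_nodes_per_l3_cache) * r->num_rmid;
		return rmid;
	}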

Tony Luck (7):
x86/resctrl: Create separate domains for control and monitoring
x86/resctrl: Split the rdt_domain structures
x86/resctrl: Change monitor code to use rdt_mondomain
x86/resctrl: Delete unused fields from struct rdt_domain
x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.
x86/resctrl: Update documentation with Sub-NUMA cluster changes
selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Documentation/arch/x86/resctrl.rst | 10 +-
include/linux/resctrl.h | 50 +++-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 40 ++-
tools/testing/selftests/resctrl/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 289 ++++++++++++++++----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 6 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 58 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 54 ++--
tools/testing/selftests/resctrl/resctrlfs.c | 57 ++++
10 files changed, 427 insertions(+), 139 deletions(-)


base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
--
2.40.1


2023-07-22 19:31:25

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring

First step towards supporting resource control where the scope of
control operations is not the same as monitor operations.

Add an extra list in the rdt_resource structure. For now this will
just duplicate the existing list of domains based on the L3 cache
scope.

Refactor the domain_add_cpu() and domain_remove_cpu() functions to
build separate lists for r->alloc_capable and r->mon_capable
resources. Note that only the "L3" domain currently supports
both types.

Change all places where monitoring functions walk the list of
domains to use the new "mondomains" list instead of the old
"domains" list.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 10 +-
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/core.c | 195 +++++++++++++++-------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 ++--
6 files changed, 167 insertions(+), 74 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..1267d56f9e76 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -151,9 +151,11 @@ struct resctrl_schema;
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
* @cache_level: Which cache level defines scope of this resource
+ * @mon_scope: Scope of this resource if different from cache_level
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
+ * @mondomains: Monitor domains for this resource
* @name: Name to use in "schemata" file.
* @data_width: Character width of data when displaying
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
@@ -169,9 +171,11 @@ struct rdt_resource {
bool mon_capable;
int num_rmid;
int cache_level;
+ int mon_scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
+ struct list_head mondomains;
char *name;
int data_width;
u32 default_ctrl;
@@ -217,8 +221,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,

u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..c5e2ac2a60cf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,7 +511,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..274605aaa026 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,7 @@ static void
mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);

-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)

struct rdt_hw_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
@@ -66,7 +66,9 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_L3),
+ .mon_scope = 3,
+ .domains = domain_init(RDT_RESOURCE_L3, domains),
+ .mondomains = domain_init(RDT_RESOURCE_L3, mondomains),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -80,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
.cache_level = 2,
- .domains = domain_init(RDT_RESOURCE_L2),
+ .domains = domain_init(RDT_RESOURCE_L2, domains),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
@@ -94,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_MBA),
+ .domains = domain_init(RDT_RESOURCE_MBA, domains),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -106,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
.cache_level = 3,
- .domains = domain_init(RDT_RESOURCE_SMBA),
+ .domains = domain_init(RDT_RESOURCE_SMBA, domains),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
@@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
}

/*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of the lists for a resource that
+ * matches input resource id
*
* Search resource r's domain list to find the resource id. If the resource
* id is found in a domain, return the domain. Otherwise, if requested by
* caller, return the first domain whose id is bigger than the input id.
* The domain list is sorted by id in ascending order.
*/
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
struct list_head **pos)
{
struct rdt_domain *d;
@@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
if (id < 0)
return ERR_PTR(-ENODEV);

- list_for_each(l, &r->domains) {
+ list_for_each(l, h) {
d = list_entry(l, struct rdt_domain, list);
/* When id is found, return its domain. */
if (id == d->id)
@@ -487,6 +490,94 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}

+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+ int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain *d;
+ int err;
+
+ d = rdt_find_domain(&r->domains, id, &add_pos);
+ if (IS_ERR(d)) {
+ pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ return;
+ }
+
+ if (d) {
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+ if (r->cache.arch_has_per_cpu_cfg)
+ rdt_domain_reconfigure_cdp(r);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->id = id;
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+
+ rdt_domain_reconfigure_cdp(r);
+
+ if (domain_setup_ctrlval(r, d)) {
+ domain_free(hw_dom);
+ return;
+ }
+
+ list_add_tail(&d->list, add_pos);
+
+ err = resctrl_online_ctrl_domain(r, d);
+ if (err) {
+ list_del(&d->list);
+ domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain *d;
+ int err;
+
+ d = rdt_find_domain(&r->mondomains, id, &add_pos);
+ if (IS_ERR(d)) {
+ pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ return;
+ }
+
+ if (d) {
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+ if (r->cache.arch_has_per_cpu_cfg)
+ rdt_domain_reconfigure_cdp(r);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->id = id;
+ cpumask_set_cpu(cpu, &d->cpu_mask);
+
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+ domain_free(hw_dom);
+ return;
+ }
+
+ list_add_tail(&d->list, add_pos);
+
+ err = resctrl_online_mon_domain(r, d);
+ if (err) {
+ list_del(&d->list);
+ domain_free(hw_dom);
+ }
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
- struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
- struct rdt_domain *d;
- int err;
-
- d = rdt_find_domain(r, id, &add_pos);
- if (IS_ERR(d)) {
- pr_warn("Couldn't find cache id for CPU %d\n", cpu);
- return;
- }
-
- if (d) {
- cpumask_set_cpu(cpu, &d->cpu_mask);
- if (r->cache.arch_has_per_cpu_cfg)
- rdt_domain_reconfigure_cdp(r);
- return;
- }
-
- hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
- if (!hw_dom)
- return;
-
- d = &hw_dom->d_resctrl;
- d->id = id;
- cpumask_set_cpu(cpu, &d->cpu_mask);
-
- rdt_domain_reconfigure_cdp(r);
-
- if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
- domain_free(hw_dom);
- return;
- }
-
- if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
- return;
- }
-
- list_add_tail(&d->list, add_pos);
-
- err = resctrl_online_domain(r, d);
- if (err) {
- list_del(&d->list);
- domain_free(hw_dom);
- }
+ if (r->alloc_capable)
+ domain_add_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_add_cpu_mon(cpu, r);
}

-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
{
int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;

- d = rdt_find_domain(r, id, NULL);
+ d = rdt_find_domain(&r->domains, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
return;
@@ -565,7 +614,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
- resctrl_offline_domain(r, d);
+ resctrl_offline_ctrl_domain(r, d);
list_del(&d->list);

/*
@@ -578,6 +627,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)

return;
}
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ struct rdt_hw_domain *hw_dom;
+ struct rdt_domain *d;
+
+ d = rdt_find_domain(&r->mondomains, id, NULL);
+ if (IS_ERR_OR_NULL(d)) {
+ pr_warn("Couldn't find cache id for CPU %d\n", cpu);
+ return;
+ }
+ hw_dom = resctrl_to_arch_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->cpu_mask);
+ if (cpumask_empty(&d->cpu_mask)) {
+ resctrl_offline_mon_domain(r, d);
+ list_del(&d->list);
+
+ domain_free(hw_dom);
+
+ return;
+ }

if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
@@ -592,6 +665,14 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}
}

+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_remove_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_remove_cpu_mon(cpu, r);
+}
+
static void clear_closid_rmid(int cpu)
{
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..839df83d1a0a 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -560,7 +560,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
evtid = md.u.evtid;

r = &rdt_resources_all[resid].r_resctrl;
- d = rdt_find_domain(r, domid, NULL);
+ d = rdt_find_domain(&r->mondomains, domid, NULL);
if (IS_ERR_OR_NULL(d)) {
ret = -ENOENT;
goto out;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..66beca785535 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -340,7 +340,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)

entry->busy = 0;
cpu = get_cpu();
- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->mondomains, list) {
if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
err = resctrl_arch_rmid_read(r, d, entry->rmid,
QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..27753eb5d513 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1496,7 +1496,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid

mutex_lock(&rdtgroup_mutex);

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->mondomains, list) {
if (sep)
seq_puts(s, ";");

@@ -1619,7 +1619,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
return -EINVAL;
}

- list_for_each_entry(d, &r->domains, list) {
+ list_for_each_entry(d, &r->mondomains, list) {
if (d->id == dom_id) {
ret = mbm_config_write_domain(r, d, evtid, val);
if (ret)
@@ -2525,7 +2525,7 @@ static int rdt_get_tree(struct fs_context *fc)

if (is_mbm_enabled()) {
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- list_for_each_entry(dom, &r->domains, list)
+ list_for_each_entry(dom, &r->mondomains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}

@@ -2919,7 +2919,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_domain *dom;
int ret;

- list_for_each_entry(dom, &r->domains, list) {
+ list_for_each_entry(dom, &r->mondomains, list) {
ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
if (ret)
return ret;
@@ -3708,15 +3708,17 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
kfree(d->mbm_local);
}

-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
mba_sc_domain_destroy(r, d);
+}

- if (!r->mon_capable)
- return;
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ lockdep_assert_held(&rdtgroup_mutex);

/*
* If resctrl is mounted, remove all the
@@ -3773,18 +3775,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
{
- int err;
-
lockdep_assert_held(&rdtgroup_mutex);

if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
/* RDT_RESOURCE_MBA is never mon_capable */
return mba_sc_domain_allocate(r, d);

- if (!r->mon_capable)
- return 0;
+ return 0;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+{
+ int err;
+
+ lockdep_assert_held(&rdtgroup_mutex);

err = domain_setup_mon_state(r, d);
if (err)
--
2.40.1


2023-07-22 19:31:37

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two or four nodes.

When SNC mode is enabled, the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.

Detect which SNC mode is active by comparing the number of CPUs
that share a cache with CPU0 against the number of CPUs on node0.
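
To make the comparison concrete, here is a sketch with made-up
numbers (the real detection is done by snc_ways() in the patch
below):

    /*
     * Illustration only: an SNC-2 system where 56 CPUs share the
     * L3 cache of CPU0.
     *
     *   cache_cpus = 56;  bits set in cpu0/cache/index3/shared_cpu_map
     *   node_cpus  = 28;  bits set in node/node0/cpumap
     *
     * cache_cpus / node_cpus == 2  =>  two SNC nodes per L3 cache
     */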

Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com
Signed-off-by: Tony Luck <[email protected]>
---
tools/testing/selftests/resctrl/resctrl.h | 1 +
tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
2 files changed, 58 insertions(+)

diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 87e39456dee0..a8b43210b573 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -13,6 +13,7 @@
#include <signal.h>
#include <dirent.h>
#include <stdbool.h>
+#include <ctype.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index fb00245dee92..79eecbf9f863 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
return 0;
}

+/*
+ * Count number of CPUs in a /sys bit map
+ */
+static int count_sys_bitmap_bits(char *name)
+{
+ FILE *fp = fopen(name, "r");
+ int count = 0, c;
+
+ if (!fp)
+ return 0;
+
+ while ((c = fgetc(fp)) != EOF) {
+ if (!isxdigit(c))
+ continue;
+ switch (c) {
+ case 'f':
+ count++;
+ case '7': case 'b': case 'd': case 'e':
+ count++;
+ case '3': case '5': case '6': case '9': case 'a': case 'c':
+ count++;
+ case '1': case '2': case '4': case '8':
+ count++;
+ }
+ }
+ fclose(fp);
+
+ return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
+ * Try to get this right, even if a few CPUs are offline so that the number
+ * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
+ * LLC of CPU0.
+ */
+static int snc_ways(void)
+{
+ int node_cpus, cache_cpus;
+
+ node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+ cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+ if (!node_cpus || !cache_cpus) {
+ fprintf(stderr, "Warning: could not determine Sub-NUMA Cluster mode\n");
+ return 1;
+ }
+
+ if (3 * node_cpus < cache_cpus)
+ return 4;
+ else if (3 * node_cpus < 2 * cache_cpus)
+ return 2;
+ return 1;
+}
+
/*
* get_cache_size - Get cache size for a specified CPU
* @cpu_no: CPU number
@@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
break;
}

+ if (cache_num == 3)
+ *cache_size /= snc_ways();
return 0;
}

--
2.40.1


2023-07-22 19:31:37

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Cluster enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.

When Sub-NUMA Cluster mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 on the socket, and RMID counters 200..399 on SNC node 1.

A model specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated.

To read the RMID counters, an offset must be used to get data
from the physical counter associated with the SNC node. As in
the example above with 400 RMID counters, Linux sees only 200
counters. No special action is needed to read a counter from
the first SNC node on a socket. But to read Linux-visible
counter 50 on the second SNC node, the kernel must load 250
into the QM_EVTSEL MSR.
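
The offset arithmetic amounts to the following (a minimal sketch;
the names are illustrative, the actual code in the patch below uses
cpu_to_node() and r->num_rmid):

    /* Illustration only: map a Linux-visible RMID to the physical RMID */
    physical_rmid = rmid + snc_node_on_socket * rmids_per_snc_node;

    /*
     * e.g. Linux RMID 50 on SNC node 1 with 200 RMIDs per node:
     * 50 + 1 * 200 = 250 is loaded into QM_EVTSEL
     */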

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.

The cache allocation feature still provides the same number of
bits in a mask to control allocation into the L3 cache. But each
of those ways has its capacity reduced because the cache is divided
between the SNC nodes. Adjust the value reported in the resctrl
"size" file accordingly.

Mounting the file system with the "mba_MBps" option is disabled
when SNC mode is enabled. This is because the measurement of bandwidth
is per SNC node, while the MBA throttling controls are still at
the L3 cache scope.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 2 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 2 +
arch/x86/kernel/cpu/resctrl/core.c | 82 +++++++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 18 +++++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
6 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 80a89d171eba..576dc21bd990 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -200,6 +200,8 @@ struct rdt_resource {
bool cdp_capable;
};

+#define MON_SCOPE_NODE 100
+
/**
* struct resctrl_schema - configuration abilities of a resource presented to
* user-space
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3aedae61af4f..4b624a37d64a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1087,6 +1087,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 016ef0373c5a..00a330bc5ced 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

extern struct dentry *debugfs_resctrl;

+extern int snc_nodes_per_l3_cache;
+
enum resctrl_res_level {
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0161362b0c3e..1331add347fc 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@

#define pr_fmt(fmt) "resctrl: " fmt

+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>

+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -48,6 +51,13 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;

+/*
+ * Number of SNC nodes that share each L3 cache.
+ * Default is 1 for systems that do not support
+ * SNC, or have SNC disabled.
+ */
+int snc_nodes_per_l3_cache = 1;
+
static void
mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
@@ -543,9 +553,16 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
}
}

+static int get_mon_scope_id(int cpu, int scope)
+{
+ if (scope == MON_SCOPE_NODE)
+ return cpu_to_node(cpu);
+ return get_cpu_cacheinfo_id(cpu, scope);
+}
+
static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
+ int id = get_mon_scope_id(cpu, r->mon_scope);
struct list_head *add_pos = NULL;
struct rdt_hw_mondomain *hw_mondom;
struct rdt_mondomain *d;
@@ -692,11 +709,28 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
}

+static void snc_remap_rmids(int cpu)
+{
+ u64 val;
+
+ /* Only need to enable once per package */
+ if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
static int resctrl_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;

mutex_lock(&rdtgroup_mutex);
+
+ if (snc_nodes_per_l3_cache > 1)
+ snc_remap_rmids(cpu);
+
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
@@ -951,11 +985,57 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}

+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple enumeration bit to show whether SNC mode
+ * is enabled. Look at the ratio of number of NUMA nodes to the
+ * number of distinct L3 caches. Take care to skip memory-only nodes.
+ */
+static __init int get_snc_config(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids)
+ set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
+ else
+ mem_only_nodes++;
+ }
+ cpus_read_unlock();
+
+ ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
+ kfree(node_caches);
+
+ if (ret > 1)
+ rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
+
+ return ret;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;

+ snc_nodes_per_l3_cache = get_snc_config();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 0d9605fccb34..4ca064e62911 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)

static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int cpu = get_cpu();
+ int rmid_offset = 0;
u64 msr_val;

+ /*
+ * When SNC mode is on, need to compute the offset to read the
+ * physical RMID counter for the node to which this CPU belongs
+ */
+ if (snc_nodes_per_l3_cache > 1)
+ rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -158,9 +168,11 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
rdmsrl(MSR_IA32_QM_CTR, msr_val);

+ put_cpu();
+
if (msr_val & RMID_VAL_ERROR)
return -EIO;
if (msr_val & RMID_VAL_UNAVAIL)
@@ -783,8 +795,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;

resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;

if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 4a268df9b456..d831b21f7389 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1354,7 +1354,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}

- return size;
+ return size / snc_nodes_per_l3_cache;
}

/**
@@ -2587,7 +2587,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
ctx->enable_cdpl2 = true;
return 0;
case Opt_mba_mbps:
- if (!supports_mba_mbps())
+ if (!supports_mba_mbps() || snc_nodes_per_l3_cache > 1)
return -EINVAL;
ctx->enable_mba_mbps = true;
return 0;
--
2.40.1


2023-07-22 19:31:41

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 2/7] x86/resctrl: Split the rdt_domain structures

The rdt_domain and rdt_hw_domain structures contain an amalgam of
fields used by control and monitoring features. Now that there
are separate domain lists for control/monitoring, these can be
divided between two structures.

First step: Add new domain structures for monitoring with the
fields that are needed. Leave these fields in the legacy structure
so compilation won't fail. They will be deleted once all the
monitoring code has been converted to use the new structure.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 28 +++++++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/internal.h | 17 +++++++++++++++-
2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 1267d56f9e76..475912662e47 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -53,7 +53,7 @@ struct resctrl_staged_config {
};

/**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain - group of CPUs sharing a resctrl control resource
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which CPUs share this resource
@@ -86,6 +86,32 @@ struct rdt_domain {
u32 *mbps_val;
};

+/**
+ * struct rdt_mondomain - group of CPUs sharing a resctrl monitor resource
+ * @list: all instances of this resource
+ * @id: unique id for this instance
+ * @cpu_mask: which CPUs share this resource
+ * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold
+ * @mbm_total: saved state for MBM total bandwidth
+ * @mbm_local: saved state for MBM local bandwidth
+ * @mbm_over: worker to periodically read MBM h/w counters
+ * @cqm_limbo: worker to periodically read CQM h/w counters
+ * @mbm_work_cpu: worker CPU for MBM h/w counters
+ * @cqm_work_cpu: worker CPU for CQM h/w counters
+ */
+struct rdt_mondomain {
+ struct list_head list;
+ int id;
+ struct cpumask cpu_mask;
+ unsigned long *rmid_busy_llc;
+ struct mbm_state *mbm_total;
+ struct mbm_state *mbm_local;
+ struct delayed_work mbm_over;
+ struct delayed_work cqm_limbo;
+ int mbm_work_cpu;
+ int cqm_work_cpu;
+};
+
/**
* struct resctrl_cache - Cache allocation related data
* @cbm_len: Length of the cache bit mask
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c5e2ac2a60cf..e956090a874e 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -320,7 +320,7 @@ struct arch_mbm_state {

/**
* struct rdt_hw_domain - Arch private attributes of a set of CPUs that share
- * a resource
+ * a control resource
* @d_resctrl: Properties exposed to the resctrl file system
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
* @arch_mbm_total: arch private state for MBM total bandwidth
@@ -335,6 +335,21 @@ struct rdt_hw_domain {
struct arch_mbm_state *arch_mbm_local;
};

+/**
+ * struct rdt_hw_mondomain - Arch private attributes of a set of CPUs that share
+ * a monitor resource
+ * @d_resctrl: Properties exposed to the resctrl file system
+ * @arch_mbm_total: arch private state for MBM total bandwidth
+ * @arch_mbm_local: arch private state for MBM local bandwidth
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_mondomain {
+ struct rdt_mondomain d_resctrl;
+ struct arch_mbm_state *arch_mbm_total;
+ struct arch_mbm_state *arch_mbm_local;
+};
+
static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
{
return container_of(r, struct rdt_hw_domain, d_resctrl);
--
2.40.1


2023-07-22 19:38:26

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes

With Sub-NUMA Cluster mode enabled, the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.

Signed-off-by: Tony Luck <[email protected]>
Reviewed-by: Peter Newman <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index cb05d90111b4..4d9ddb91751d 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:

"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+ or enabled respectively) and by RDT event. E.g. on a system with
+ SNC mode disabled with two L3 domains there will be subdirectories
+ "mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
+ L3 cache id. With SNC enabled the directory names are the same,
+ but the numerical suffix refers to the node id. Each of these
directories have one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
--
2.40.1


2023-07-22 19:38:26

by Tony Luck

[permalink] [raw]
Subject: [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain

A few functions need to be duplicated to provide versions to
operate on control and monitor domains respectively. But most
of the changes are just fixing argument and return value types.

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 10 +++---
arch/x86/kernel/cpu/resctrl/internal.h | 21 +++++++-----
arch/x86/kernel/cpu/resctrl/core.c | 40 ++++++++++++++---------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 4 +--
arch/x86/kernel/cpu/resctrl/monitor.c | 38 ++++++++++-----------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 24 +++++++-------
6 files changed, 75 insertions(+), 62 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 475912662e47..663bbc427c4b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -248,9 +248,9 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, enum resctrl_conf_type type);
int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d);
void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d);

/**
* resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
@@ -266,7 +266,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
*/
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val);

/**
@@ -279,7 +279,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
u32 rmid, enum resctrl_event_id eventid);

/**
@@ -291,7 +291,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
*
* This can be called from any CPU.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d);

extern unsigned int resctrl_rmid_realloc_threshold;
extern unsigned int resctrl_rmid_realloc_limit;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e956090a874e..401af6ccf272 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -106,7 +106,7 @@ union mon_data_bits {
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;
enum resctrl_event_id evtid;
bool first;
int err;
@@ -355,6 +355,11 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
return container_of(r, struct rdt_hw_domain, d_resctrl);
}

+static inline struct rdt_hw_mondomain *resctrl_to_arch_mondom(struct rdt_mondomain *r)
+{
+ return container_of(r, struct rdt_hw_mondomain, d_resctrl);
+}
+
/**
* struct msr_param - set a range of MSRs from a domain
* @res: The resource to use
@@ -526,8 +531,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
umode_t mask);
-struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
- struct list_head **pos);
+void *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -556,17 +561,17 @@ bool __init rdt_cpu_has(int flag);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
int evtid, int first);
-void mbm_setup_overflow_handler(struct rdt_domain *dom,
+void mbm_setup_overflow_handler(struct rdt_mondomain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
void __init intel_rdt_mbm_apply_quirk(void);
bool is_mba_sc(struct rdt_resource *r);
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
+void cqm_setup_limbo_handler(struct rdt_mondomain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
-void __check_limbo(struct rdt_domain *d, bool force_free);
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d);
+void __check_limbo(struct rdt_mondomain *d, bool force_free);
void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
void __init thread_throttle_mode_init(void);
void __init mbm_config_rftype_init(const char *config);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 274605aaa026..0161362b0c3e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -393,9 +393,12 @@ void rdt_ctrl_update(void *arg)
* id is found in a domain, return the domain. Otherwise, if requested by
* caller, return the first domain whose id is bigger than the input id.
* The domain list is sorted by id in ascending order.
+ *
+ * N.B. Returned value may be either a pointer to "struct rdt_domain" or
+ * to "struct rdt_mondomain" depending on which domain list is scanned.
*/
-struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
- struct list_head **pos)
+void *rdt_find_domain(struct list_head *h, int id,
+ struct list_head **pos)
{
struct rdt_domain *d;
struct list_head *l;
@@ -434,10 +437,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
}

static void domain_free(struct rdt_hw_domain *hw_dom)
+{
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom);
+}
+
+static void mondomain_free(struct rdt_hw_mondomain *hw_dom)
{
kfree(hw_dom->arch_mbm_total);
kfree(hw_dom->arch_mbm_local);
- kfree(hw_dom->ctrl_val);
kfree(hw_dom);
}

@@ -467,7 +475,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
* @num_rmid: The size of the MBM counter array
* @hw_dom: The domain that owns the allocated arrays
*/
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mondomain *hw_dom)
{
size_t tsize;

@@ -539,8 +547,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
struct list_head *add_pos = NULL;
- struct rdt_hw_domain *hw_dom;
- struct rdt_domain *d;
+ struct rdt_hw_mondomain *hw_mondom;
+ struct rdt_mondomain *d;
int err;

d = rdt_find_domain(&r->mondomains, id, &add_pos);
@@ -556,16 +564,16 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
return;
}

- hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
- if (!hw_dom)
+ hw_mondom = kzalloc_node(sizeof(*hw_mondom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_mondom)
return;

- d = &hw_dom->d_resctrl;
+ d = &hw_mondom->d_resctrl;
d->id = id;
cpumask_set_cpu(cpu, &d->cpu_mask);

- if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
- domain_free(hw_dom);
+ if (arch_domain_mbm_alloc(r->num_rmid, hw_mondom)) {
+ mondomain_free(hw_mondom);
return;
}

@@ -574,7 +582,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
err = resctrl_online_mon_domain(r, d);
if (err) {
list_del(&d->list);
- domain_free(hw_dom);
+ mondomain_free(hw_mondom);
}
}

@@ -632,22 +640,22 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
{
int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
- struct rdt_hw_domain *hw_dom;
- struct rdt_domain *d;
+ struct rdt_hw_mondomain *hw_mondom;
+ struct rdt_mondomain *d;

d = rdt_find_domain(&r->mondomains, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
return;
}
- hw_dom = resctrl_to_arch_dom(d);
+ hw_mondom = resctrl_to_arch_mondom(d);

cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
resctrl_offline_mon_domain(r, d);
list_del(&d->list);

- domain_free(hw_dom);
+ mondomain_free(hw_mondom);

return;
}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 839df83d1a0a..86fc5b0e3d39 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -521,7 +521,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
}

void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
- struct rdt_domain *d, struct rdtgroup *rdtgrp,
+ struct rdt_mondomain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
/*
@@ -544,7 +544,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;
struct rmid_read rr;
int ret = 0;

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 66beca785535..0d9605fccb34 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -170,7 +170,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
return 0;
}

-static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mondomain *hw_dom,
u32 rmid,
enum resctrl_event_id eventid)
{
@@ -189,10 +189,10 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom,
return NULL;
}

-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mondomain *d,
u32 rmid, enum resctrl_event_id eventid)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
struct arch_mbm_state *am;

am = get_arch_mbm_state(hw_dom, rmid, eventid);
@@ -208,9 +208,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
* Assumes that hardware counters are also reset and thus that there is
* no need to record initial non-zero counts.
*/
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mondomain *d)
{
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);

if (is_mbm_total_enabled())
memset(hw_dom->arch_mbm_total, 0,
@@ -229,11 +229,11 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
return chunks >> shift;
}

-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mondomain *d,
u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
- struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
+ struct rdt_hw_mondomain *hw_dom = resctrl_to_arch_mondom(d);
struct arch_mbm_state *am;
u64 msr_val, chunks;
int ret;
@@ -266,7 +266,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
-void __check_limbo(struct rdt_domain *d, bool force_free)
+void __check_limbo(struct rdt_mondomain *d, bool force_free)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rmid_entry *entry;
@@ -305,7 +305,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
}
}

-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
+bool has_busy_rmid(struct rdt_resource *r, struct rdt_mondomain *d)
{
return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
}
@@ -334,7 +334,7 @@ int alloc_rmid(void)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;
int cpu, err;
u64 val = 0;

@@ -383,7 +383,7 @@ void free_rmid(u32 rmid)
list_add_tail(&entry->list, &rmid_free_lru);
}

-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
+static struct mbm_state *get_mbm_state(struct rdt_mondomain *d, u32 rmid,
enum resctrl_event_id evtid)
{
switch (evtid) {
@@ -516,7 +516,7 @@ void mon_event_count(void *info)
* throttle MSRs already have low percentage values. To avoid
* unnecessarily restricting such rdtgroups, we also increase the bandwidth.
*/
-static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mondomain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
@@ -600,7 +600,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
}
}

-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
+static void mbm_update(struct rdt_resource *r, struct rdt_mondomain *d, int rmid)
{
struct rmid_read rr;

@@ -641,12 +641,12 @@ void cqm_handle_limbo(struct work_struct *work)
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
int cpu = smp_processor_id();
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;

mutex_lock(&rdtgroup_mutex);

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, cqm_limbo.work);
+ d = container_of(work, struct rdt_mondomain, cqm_limbo.work);

__check_limbo(d, false);

@@ -656,7 +656,7 @@ void cqm_handle_limbo(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void cqm_setup_limbo_handler(struct rdt_mondomain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
@@ -674,7 +674,7 @@ void mbm_handle_overflow(struct work_struct *work)
int cpu = smp_processor_id();
struct list_head *head;
struct rdt_resource *r;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;

mutex_lock(&rdtgroup_mutex);

@@ -682,7 +682,7 @@ void mbm_handle_overflow(struct work_struct *work)
goto out_unlock;

r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- d = container_of(work, struct rdt_domain, mbm_over.work);
+ d = container_of(work, struct rdt_mondomain, mbm_over.work);

list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(r, d, prgrp->mon.rmid);
@@ -701,7 +701,7 @@ void mbm_handle_overflow(struct work_struct *work)
mutex_unlock(&rdtgroup_mutex);
}

-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
+void mbm_setup_overflow_handler(struct rdt_mondomain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 27753eb5d513..4a268df9b456 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1483,7 +1483,7 @@ static void mon_event_config_read(void *info)
mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
}

-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info)
+static void mondata_config_read(struct rdt_mondomain *d, struct mon_config_info *mon_info)
{
smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1);
}
@@ -1491,7 +1491,7 @@ static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mo
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid)
{
struct mon_config_info mon_info = {0};
- struct rdt_domain *dom;
+ struct rdt_mondomain *dom;
bool sep = false;

mutex_lock(&rdtgroup_mutex);
@@ -1548,7 +1548,7 @@ static void mon_event_config_write(void *info)
}

static int mbm_config_write_domain(struct rdt_resource *r,
- struct rdt_domain *d, u32 evtid, u32 val)
+ struct rdt_mondomain *d, u32 evtid, u32 val)
{
struct mon_config_info mon_info = {0};
int ret = 0;
@@ -1598,7 +1598,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
{
char *dom_str = NULL, *id_str;
unsigned long dom_id, val;
- struct rdt_domain *d;
+ struct rdt_mondomain *d;
int ret = 0;

next:
@@ -2463,7 +2463,7 @@ static void schemata_list_destroy(void)
static int rdt_get_tree(struct fs_context *fc)
{
struct rdt_fs_context *ctx = rdt_fc2context(fc);
- struct rdt_domain *dom;
+ struct rdt_mondomain *dom;
struct rdt_resource *r;
int ret;

@@ -2845,7 +2845,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
}

static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
- struct rdt_domain *d,
+ struct rdt_mondomain *d,
struct rdt_resource *r, struct rdtgroup *prgrp)
{
union mon_data_bits priv;
@@ -2894,7 +2894,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
* and "monitor" groups with given domain id.
*/
static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
- struct rdt_domain *d)
+ struct rdt_mondomain *d)
{
struct kernfs_node *parent_kn;
struct rdtgroup *prgrp, *crgrp;
@@ -2916,7 +2916,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
struct rdt_resource *r,
struct rdtgroup *prgrp)
{
- struct rdt_domain *dom;
+ struct rdt_mondomain *dom;
int ret;

list_for_each_entry(dom, &r->mondomains, list) {
@@ -3701,7 +3701,7 @@ static int __init rdtgroup_setup_root(void)
return ret;
}

-static void domain_destroy_mon_state(struct rdt_domain *d)
+static void domain_destroy_mon_state(struct rdt_mondomain *d)
{
bitmap_free(d->rmid_busy_llc);
kfree(d->mbm_total);
@@ -3716,7 +3716,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
mba_sc_domain_destroy(r, d);
}

-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d)
{
lockdep_assert_held(&rdtgroup_mutex);

@@ -3745,7 +3745,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
domain_destroy_mon_state(d);
}

-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mondomain *d)
{
size_t tsize;

@@ -3786,7 +3786,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}

-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mondomain *d)
{
int err;

--
2.40.1


2023-07-22 19:38:32

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

On Thu, Jul 20, 2023 at 09:56:50PM +0000, Luck, Tony wrote:
> This does seem a worthy target.
>
> I started on a patch to do this ... but I'm not sure I have the stamina or the time
> to see it through.

I was being a wuss. I came back at this on Friday from a slightly
different perspective, and it all came together fairly easily.

New series posted here:
https://lore.kernel.org/all/[email protected]/

-Tony

2023-07-22 21:31:52

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

On Wed, Jul 19, 2023 at 02:43:20AM +0000, Shaopeng Tan (Fujitsu) wrote:
> Hi tony,
>
> I ran selftest/resctrl in my environment,
> the test result is "not ok".
>
> Processor in my environment:
> Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
>
> kernel:
> $ uname -r
> 6.5.0-rc1+
>
> Result :
> Sub-NUMA enable:
> xxx@xxx:~/linux_v6.5_rc1l$ sudo make -C tools/testing/selftests/resctrl run_tests
> make: Entering directory '/.../tools/testing/selftests/resctrl'

I see most tests pass. Just one fail on my most recent run with the
v4 patch series:

# # Fail: Check MBA diff within 5% for schemata 10
# # avg_diff_per: 7%
# # avg_bw_imc: 883
# # avg_bw_resc: 815
# # Fail: Check schemata change using MBA

But just missed the 5% target by a small amount,
not the near total failures that you see.

I wonder if there is a cross-SNC node memory
allocation issue. Can you try running the test
bound to a CPU in one node:

$ taskset -c 1 sudo make -C tools/testing/selftests/resctrl run_tests

Try with different "-c" arguments to bind to different nodes. Do you
see different results on different nodes?

-Tony

2023-07-26 03:58:29

by Drew Fustini

[permalink] [raw]
Subject: Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems

On Sat, Jul 22, 2023 at 12:07:33PM -0700, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accuratlely.
>
> Signed-off-by: Tony Luck <[email protected]>
>
> ---
>
> Changes since v3:
>
> Reinette provided the most excellent suggestion that this series
> could better achieve its objective if it enabled separate domain
> lists for control & monitoring within a resource, rather than
> creating a whole new resource to support separate node scope needed
> for SNC monitoring. Thus all the pre-amble patches from the previous
> version have gone, replaced by patches 1-4 of this new series.

[This comment is unrelated to Sub-NUMA support so please disregard if
this is the wrong place to make these comments]

I think that the resctrl interface for RISC-V CBQRI could also benefit
from separate domain lists for control and monitoring.

For example, the bandwidth controller QoS register [1] interface allows
a device to implement both bandwidth usage monitoring and bandwidth
allocation. The resctrl proof-of-concept [2] had to awkwardly create two
domains for each memory controller in our example SoC, one that would
contain the MBA resource and one that would contain the L3 resource to
represent MBM files like local_bytes.

This resulted in a very odd looking schemata that would be hard for the
user to understand:

# cat /sys/fs/resctrl/schemata
MB:4= 80;6= 80;8= 80
L2:0=0fff;1=0fff
L3:2=ffff;3=0000;5=0000;7=0000

Where:

Domain 0 is L2 cache controller 0 capacity allocation
Domain 1 is L2 cache controller 1 capacity allocation
Domain 2 is L3 cache controller capacity allocation

Domain 4 is Memory controller 0 bandwidth allocation
Domain 6 is Memory controller 1 bandwidth allocation
Domain 8 is Memory controller 2 bandwidth allocation

Domain 3 is Memory controller 0 bandwidth monitoring
Domain 5 is Memory controller 1 bandwidth monitoring
Domain 7 is Memory controller 2 bandwidth monitoring

But there is no value in having the domains created for the purposes of
bandwidth monitoring in schemata.
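
If control and monitoring domains were kept on separate lists, the
schemata code could walk only the control list. A rough sketch using
the list names from Tony's series (show_ctrl_entry() is a made-up
helper writing one domain's entry to the seq_file s; this is not
actual resctrl or CBQRI code):

    struct rdt_domain *dom;

    /* Only control domains contribute schemata entries */
    list_for_each_entry(dom, &r->domains, list)
        show_ctrl_entry(s, r, dom);

    /*
     * Monitoring-only domains stay on r->mondomains and are never
     * walked here, so they never appear in the schemata file.
     */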

I've not yet fully understood how the new approach in this patch series
could help the situation for CBQRI, but I thought I would mention that
separate lists for control and monitoring might be useful.

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/blob/main/qos_bandwidth.adoc
[2] https://lore.kernel.org/linux-riscv/[email protected]/

2023-07-26 14:35:06

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems

On Tue, Jul 25, 2023 at 08:10:52PM -0700, Drew Fustini wrote:
> I think that the resctrl interface for RISC-V CBQRI could also benefit
> from separate domain lists for control and monitoring.
>
> For example, the bandwidth controller QoS register [1] interface allows
> a device to implement both bandwidth usage monitoring and bandwidth
> allocation. The resctrl proof-of-concept [2] had to awkwardly create two
> domains for each memory controller in our example SoC, one that would
> contain the MBA resource and one that would contain the L3 resource to
> represent MBM files like local_bytes.
>
> This resulted in a very odd looking schemata that would be hard for the
> user to understand:
>
> # cat /sys/fs/resctrl/schemata
> MB:4= 80;6= 80;8= 80
> L2:0=0fff;1=0fff
> L3:2=ffff;3=0000;5=0000;7=0000
>
> Where:
>
> Domain 0 is L2 cache controller 0 capacity allocation
> Domain 1 is L2 cache controller 1 capacity allocation
> Domain 2 is L3 cache controller capacity allocation
>
> Domain 4 is Memory controller 0 bandwidth allocation
> Domain 6 is Memory controller 1 bandwidth allocation
> Domain 8 is Memory controller 2 bandwidth allocation
>
> Domain 3 is Memory controller 0 bandwidth monitoring
> Domain 5 is Memory controller 1 bandwidth monitoring
> Domain 7 is Memory controller 2 bandwidth monitoring
>
> But there is no value in having the domains created for the purposes of
> bandwidth monitoring in schemata.

There's certainly no value in exposing those domain numbers
in the schemata file. There should also be some way for users
to decode the ids. On x86 the "id" is exposed in sysfs. Though
the user does need to work to get all the details:

$ cat /sys/devices/system/cpu/cpu36/cache/index3/level
3
$ cat /sys/devices/system/cpu/cpu36/cache/index3/id
1
$ cat /sys/devices/system/cpu/cpu36/cache/index3/shared_cpu_list
36-71,108-143

This shows that the L3 cache with id "1" is shared by CPUs 36-71,108-143.

X86 also has independent domain numbers for each resource. So the
L2 ones count 0, 1, 2, ... and so do the L3 ones: 0, 1, 2 and the
MBA ones: 0, 1, 2

That fits well with the /sys decoding ... but maybe your approach of
not repeating domain numbers across different resources is less
confusing?

Note that in my resctrl re-write, where each resource is handled by
a separate loadable module, it may be hard for you to keep the unique
domain numbering scheme as resource modules are unaware of each other.
Though perhaps it's just an arch-specific hook to provide domain numbers.
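
Something like this, perhaps (purely illustrative, no such hook
exists today):

    /* Illustration only: let the arch supply the id for a new domain */
    int resctrl_arch_domain_id(struct rdt_resource *r, int cpu);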

> I've not yet fully understood how the new approach in this patch series
> could help the situation for CBQRI, but I thought I would mention that
> separate lists for control and monitoring might be useful.

Good. It's nice to know there's potentially another use case for
this split besides SNC.

-Tony

2023-08-11 18:28:27

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 1/7] x86/resctrl: Create separate domains for control and monitoring

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> First step towards supporting resource control where the scope of
> control operations is not the same as monitor operations.

Each changelog should stand on its own merit. This changelog
appears to be written as a continuation of the cover letter.

Please do ensure that each patch first establishes the context
before it describes the problem and solution. For example,
as a context this changelog can start by describing what the
resctrl domains list represents.

>
> Add an extra list in the rdt_resource structure. For this will
> just duplicate the existing list of domains based on the L3 cache
> scope.

The above paragraph does not make this change appealing at all.

> Refactor the domain_add_cpu() and domain_remove() functions to

domain_remove() -> domain_remove_cpu()

> build separate lists for r->alloc_capable and r->mon_capable
> resources. Note that only the "L3" domain currently supports
> both types.

"L3" domain -> "L3" resource?

>
> Change all places where monitoring functions walk the list of
> domains to use the new "mondomains" list instead of the old
> "domains" list.

I would not refer to it as "the old domains list" as it creates the
impression that it is being replaced. The changelog makes no mention
that the domains list will remain and be dedicated to control domains.
I think this is important to include in the description of this change.

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 10 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
> arch/x86/kernel/cpu/resctrl/core.c | 195 +++++++++++++++-------
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 ++--
> 6 files changed, 167 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..1267d56f9e76 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -151,9 +151,11 @@ struct resctrl_schema;
> * @mon_capable: Is monitor feature available on this machine
> * @num_rmid: Number of RMIDs available
> * @cache_level: Which cache level defines scope of this resource
> + * @mon_scope: Scope of this resource if different from cache_level

I think this addition should be deferred. As it is here, the "if different
from cache_level" also creates many questions (when will it be different?
how will it be determined that the scope is different in order to know that
mon_scope should be used?)

Looking ahead on how mon_scope is used there does not seem to be an "if"
involved at all ... mon_scope is always the monitoring scope.

> * @cache: Cache allocation related data
> * @membw: If the component has bandwidth controls, their properties.
> * @domains: All domains for this resource

A change to the domains comment would also help - to highlight that it is
now dedicated to control domains.

> + * @mondomains: Monitor domains for this resource
> * @name: Name to use in "schemata" file.
> * @data_width: Character width of data when displaying
> * @default_ctrl: Specifies default cache cbm or memory B/W percent.
> @@ -169,9 +171,11 @@ struct rdt_resource {
> bool mon_capable;
> int num_rmid;
> int cache_level;
> + int mon_scope;
> struct resctrl_cache cache;
> struct resctrl_membw membw;
> struct list_head domains;
> + struct list_head mondomains;
> char *name;
> int data_width;
> u32 default_ctrl;

...

> @@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
> }
>
> /*
> - * rdt_find_domain - Find a domain in a resource that matches input resource id
> + * rdt_find_domain - Find a domain in one of the lists for a resource that
> + * matches input resource id
> *

This change makes the function more vague. I think the original summary is
still accurate; how the list is used can be described in the details below.
I see more changes to this function are upcoming and I will comment more
at those sites.

> * Search resource r's domain list to find the resource id. If the resource
> * id is found in a domain, return the domain. Otherwise, if requested by
> * caller, return the first domain whose id is bigger than the input id.
> * The domain list is sorted by id in ascending order.
> */
> -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> +struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
> struct list_head **pos)
> {
> struct rdt_domain *d;
> @@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
> if (id < 0)
> return ERR_PTR(-ENODEV);
>
> - list_for_each(l, &r->domains) {
> + list_for_each(l, h) {
> d = list_entry(l, struct rdt_domain, list);
> /* When id is found, return its domain. */
> if (id == d->id)
> @@ -487,6 +490,94 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> return 0;
> }
>
> +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> +{
> + int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> + struct list_head *add_pos = NULL;
> + struct rdt_hw_domain *hw_dom;
> + struct rdt_domain *d;
> + int err;
> +
> + d = rdt_find_domain(&r->domains, id, &add_pos);
> + if (IS_ERR(d)) {
> + pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> + return;
> + }
> +
> + if (d) {
> + cpumask_set_cpu(cpu, &d->cpu_mask);
> + if (r->cache.arch_has_per_cpu_cfg)
> + rdt_domain_reconfigure_cdp(r);
> + return;
> + }
> +
> + hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> + if (!hw_dom)
> + return;
> +
> + d = &hw_dom->d_resctrl;
> + d->id = id;
> + cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> + rdt_domain_reconfigure_cdp(r);
> +
> + if (domain_setup_ctrlval(r, d)) {
> + domain_free(hw_dom);
> + return;
> + }
> +
> + list_add_tail(&d->list, add_pos);
> +
> + err = resctrl_online_ctrl_domain(r, d);
> + if (err) {
> + list_del(&d->list);
> + domain_free(hw_dom);
> + }
> +}
> +
> +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> + int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);

Using a different scope variable but continuing to treat it
as a cache level creates unnecessary confusion at this point.

> + struct list_head *add_pos = NULL;
> + struct rdt_hw_domain *hw_dom;
> + struct rdt_domain *d;
> + int err;
> +
> + d = rdt_find_domain(&r->mondomains, id, &add_pos);
> + if (IS_ERR(d)) {
> + pr_warn("Couldn't find cache id for CPU %d\n", cpu);

Note for future change ... this continues to refer to monitor scope as
a cache id. I did not see this changed in the later patch that actually
changes how scope is used.

> + return;
> + }
> +
> + if (d) {
> + cpumask_set_cpu(cpu, &d->cpu_mask);
> + if (r->cache.arch_has_per_cpu_cfg)
> + rdt_domain_reconfigure_cdp(r);

Copy & paste error?

> + return;
> + }
> +
> + hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> + if (!hw_dom)
> + return;
> +
> + d = &hw_dom->d_resctrl;
> + d->id = id;
> + cpumask_set_cpu(cpu, &d->cpu_mask);
> +
> + if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> + domain_free(hw_dom);
> + return;
> + }
> +
> + list_add_tail(&d->list, add_pos);
> +
> + err = resctrl_online_mon_domain(r, d);
> + if (err) {
> + list_del(&d->list);
> + domain_free(hw_dom);
> + }
> +}
> +
> /*
> * domain_add_cpu - Add a cpu to a resource's domain list.
> *
> @@ -502,61 +593,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> */
> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> - struct list_head *add_pos = NULL;
> - struct rdt_hw_domain *hw_dom;
> - struct rdt_domain *d;
> - int err;
> -
> - d = rdt_find_domain(r, id, &add_pos);
> - if (IS_ERR(d)) {
> - pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> - return;
> - }
> -
> - if (d) {
> - cpumask_set_cpu(cpu, &d->cpu_mask);
> - if (r->cache.arch_has_per_cpu_cfg)
> - rdt_domain_reconfigure_cdp(r);
> - return;
> - }
> -
> - hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
> - if (!hw_dom)
> - return;
> -
> - d = &hw_dom->d_resctrl;
> - d->id = id;
> - cpumask_set_cpu(cpu, &d->cpu_mask);
> -
> - rdt_domain_reconfigure_cdp(r);
> -
> - if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
> - domain_free(hw_dom);
> - return;
> - }
> -
> - if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
> - domain_free(hw_dom);
> - return;
> - }
> -
> - list_add_tail(&d->list, add_pos);
> -
> - err = resctrl_online_domain(r, d);
> - if (err) {
> - list_del(&d->list);
> - domain_free(hw_dom);
> - }
> + if (r->alloc_capable)
> + domain_add_cpu_ctrl(cpu, r);
> + if (r->mon_capable)
> + domain_add_cpu_mon(cpu, r);
> }

A resource could be both alloc and mon capable ... both
domain_add_cpu_ctrl() and domain_add_cpu_mon() can fail.
Should domain_add_cpu_mon() still be run for a CPU if
domain_add_cpu_ctrl() failed?

Looking ahead, the CPU should probably also not be added
to the default group's mask if a failure occurred.
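
One possible shape, assuming the two helpers were changed to return
an error (just a sketch to illustrate the concern, not code from the
series):

    static int domain_add_cpu(int cpu, struct rdt_resource *r)
    {
        int err = 0;

        if (r->alloc_capable)
            err = domain_add_cpu_ctrl(cpu, r);

        /*
         * Skip monitoring setup if control setup failed, and let
         * the caller avoid adding the CPU to the default group.
         */
        if (!err && r->mon_capable)
            err = domain_add_cpu_mon(cpu, r);

        return err;
    }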

> -static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
> {
> int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
>
> - d = rdt_find_domain(r, id, NULL);
> + d = rdt_find_domain(&r->domains, id, NULL);
> if (IS_ERR_OR_NULL(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> return;
> @@ -565,7 +614,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>
> cpumask_clear_cpu(cpu, &d->cpu_mask);
> if (cpumask_empty(&d->cpu_mask)) {
> - resctrl_offline_domain(r, d);
> + resctrl_offline_ctrl_domain(r, d);
> list_del(&d->list);
>
> /*
> @@ -578,6 +627,30 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>
> return;
> }
> +}
> +
> +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> +{
> + int id = get_cpu_cacheinfo_id(cpu, r->cache_level);

Introducing mon_scope can really be deferred ... here the monitoring code
is not using mon_scope anyway.

> + struct rdt_hw_domain *hw_dom;
> + struct rdt_domain *d;
> +
> + d = rdt_find_domain(&r->mondomains, id, NULL);
> + if (IS_ERR_OR_NULL(d)) {
> + pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> + return;
> + }
> + hw_dom = resctrl_to_arch_dom(d);
> +
> + cpumask_clear_cpu(cpu, &d->cpu_mask);
> + if (cpumask_empty(&d->cpu_mask)) {
> + resctrl_offline_mon_domain(r, d);
> + list_del(&d->list);
> +
> + domain_free(hw_dom);
> +
> + return;
> + }
>
> if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {

Reinette

2023-08-11 18:53:00

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 0/7] Add support for Sub-NUMA cluster (SNC) systems

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions
> the CPUs that share an L3 cache into two or more sets. This plays
> havoc with the Resource Director Technology (RDT) monitoring features.
> Prior to this patch Intel has advised that SNC and RDT are incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID
> counters in the same way. This allows for monitoring features
> to be used (with the caveat that memory accesses between different
> SNC NUMA nodes may still not be counted accuratlely.

accuratlely. -> accurately).

Is there any guidance on the scenarios under which memory accesses
may not be counted accurately, how users can detect when this
is the case, or any techniques users can use to avoid this?

Since this question has come up during this series I do think it
will help to document the impact of SNC on CAT.

Reinette

2023-08-11 19:06:02

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.
>
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
>
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
>
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
>
> A model specific MSR (0xca0) can change the configuration of the RMIDs
> when SNC mode is enabled.
>
> The MSR controls the interpretation of the RMID field in the
> IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> within the SNC node are updated.
>
> To read the RMID counters an offset must be used to get data
> from the physical counter associated with the SNC node. As in
> the example above with 400 RMID counters Linux sees only 200
> counters. No special action is needed to read a counter from
> the first SNC node on a socket. But to read a Linux visible
> counter 50 on the second SNC node the kernel must load 250
> into the QM_EVTSEL MSR.
>
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.
>
> The cache allocation feature still provides the same number of
> bits in a mask to control allocation into the L3 cache. But each
> of those ways has its capacity reduced because the cache is divided
> between the SNC nodes. Adjust the value reported in the resctrl
> "size" file accordingly.
>
> Mounting the file system with the "mba_MBps" option is disabled
> when SNC mode is enabled. This is because the measurement of bandwidth
> is per SNC node, while the MBA throttling controls are still at
> the L3 cache scope.
>

I'm counting four logical changes in this changelog. Can they be separate
patches?

> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 2 +
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/kernel/cpu/resctrl/internal.h | 2 +
> arch/x86/kernel/cpu/resctrl/core.c | 82 +++++++++++++++++++++++++-
> arch/x86/kernel/cpu/resctrl/monitor.c | 18 +++++-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
> 6 files changed, 103 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 80a89d171eba..576dc21bd990 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -200,6 +200,8 @@ struct rdt_resource {
> bool cdp_capable;
> };
>
> +#define MON_SCOPE_NODE 100
> +

Could you please add a comment to explain what this constant represents
and how it is used?
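
Something along these lines, for example (the wording is only a
suggestion):

	/*
	 * Special value for rdt_resource::mon_scope indicating that
	 * monitoring domains are NUMA-node scoped rather than matching
	 * a cache level. Any value above the valid cache levels (2
	 * and 3) would do.
	 */
	#define MON_SCOPE_NODE	100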

> /**
> * struct resctrl_schema - configuration abilities of a resource presented to
> * user-space
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 3aedae61af4f..4b624a37d64a 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1087,6 +1087,7 @@
> #define MSR_IA32_QM_CTR 0xc8e
> #define MSR_IA32_PQR_ASSOC 0xc8f
> #define MSR_IA32_L3_CBM_BASE 0xc90
> +#define MSR_RMID_SNC_CONFIG 0xca0
> #define MSR_IA32_L2_CBM_BASE 0xd10
> #define MSR_IA32_MBA_THRTL_BASE 0xd50
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 016ef0373c5a..00a330bc5ced 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -446,6 +446,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
> extern struct dentry *debugfs_resctrl;
>
> +extern int snc_nodes_per_l3_cache;
> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L2,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0161362b0c3e..1331add347fc 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -16,11 +16,14 @@
>
> #define pr_fmt(fmt) "resctrl: " fmt
>
> +#include <linux/cpu.h>
> #include <linux/slab.h>
> #include <linux/err.h>
> #include <linux/cacheinfo.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/mod_devicetable.h>
>

What does this include provide?

> +#include <asm/cpu_device_id.h>
> #include <asm/intel-family.h>
> #include <asm/resctrl.h>
> #include "internal.h"
> @@ -48,6 +51,13 @@ int max_name_width, max_data_width;
> */
> bool rdt_alloc_capable;
>
> +/*
> + * Number of SNC nodes that share each L3 cache.
> + * Default is 1 for systems that do not support
> + * SNC, or have SNC disabled.
> + */
> +int snc_nodes_per_l3_cache = 1;
> +
> static void
> mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> @@ -543,9 +553,16 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
> }
> }
>
> +static int get_mon_scope_id(int cpu, int scope)
> +{
> + if (scope == MON_SCOPE_NODE)
> + return cpu_to_node(cpu);
> + return get_cpu_cacheinfo_id(cpu, scope);
> +}
> +
> static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
> + int id = get_mon_scope_id(cpu, r->mon_scope);
> struct list_head *add_pos = NULL;
> struct rdt_hw_mondomain *hw_mondom;
> struct rdt_mondomain *d;
> @@ -692,11 +709,28 @@ static void clear_closid_rmid(int cpu)
> wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
> }
>
> +static void snc_remap_rmids(int cpu)
> +{
> + u64 val;

No need for tab here.

> +
> + /* Only need to enable once per package */
> + if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
> + return;
> +
> + rdmsrl(MSR_RMID_SNC_CONFIG, val);
> + val &= ~BIT_ULL(0);
> + wrmsrl(MSR_RMID_SNC_CONFIG, val);
> +}

Could you please document snc_remap_rmids()
with information on what the bit in the register means
and what the above function does?
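
For example (a sketch based on the changelog's description of the
MSR; the exact wording is a suggestion):

	/*
	 * Bit 0 of MSR_RMID_SNC_CONFIG (0xca0) controls how the RMID
	 * field of IA32_PQR_ASSOC is interpreted when SNC is enabled.
	 * Clearing it selects the mode in which the physical RMID
	 * counters are divided evenly between the SNC nodes on a
	 * socket, with __rmid_read() adding a per-node offset to
	 * address the right physical counter. The MSR only needs to
	 * be written once per package, hence the check against the
	 * first CPU in the package mask.
	 */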

> +
> static int resctrl_online_cpu(unsigned int cpu)
> {
> struct rdt_resource *r;
>
> mutex_lock(&rdtgroup_mutex);
> +
> + if (snc_nodes_per_l3_cache > 1)
> + snc_remap_rmids(cpu);
> +
> for_each_capable_rdt_resource(r)
> domain_add_cpu(cpu, r);
> /* The cpu is set in default rdtgroup after online. */
> @@ -951,11 +985,57 @@ static __init bool get_rdt_resources(void)
> return (rdt_mon_capable || rdt_alloc_capable);
> }
>

I think it will help to add a comment like:
"CPUs that support the model specific MSR_RMID_SNC_CONFIG register."

> +static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
> + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
> + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
> + {}
> +};
> +
> +/*
> + * There isn't a simple enumeration bit to show whether SNC mode
> + * is enabled. Look at the ratio of number of NUMA nodes to the
> + * number of distinct L3 caches. Take care to skip memory-only nodes.
> + */
> +static __init int get_snc_config(void)
> +{
> + unsigned long *node_caches;
> + int mem_only_nodes = 0;
> + int cpu, node, ret;
> +
> + if (!x86_match_cpu(snc_cpu_ids))
> + return 1;
> +
> + node_caches = kcalloc(BITS_TO_LONGS(nr_node_ids), sizeof(*node_caches), GFP_KERNEL);
> + if (!node_caches)
> + return 1;
> +
> + cpus_read_lock();
> + for_each_node(node) {
> + cpu = cpumask_first(cpumask_of_node(node));
> + if (cpu < nr_cpu_ids)
> + set_bit(get_cpu_cacheinfo_id(cpu, 3), node_caches);
> + else
> + mem_only_nodes++;
> + }
> + cpus_read_unlock();

I am not familiar with the numa code at all so please correct me
where I am wrong. I do see that nr_node_ids is initialized with __init code
so it should be accurate at this point. It looks to me like this initialization
assumes that at least one CPU per node will be online at the time it is run.
It is not clear to me that this assumption would always be true.

> +
> + ret = (nr_node_ids - mem_only_nodes) / bitmap_weight(node_caches, nr_node_ids);
> + kfree(node_caches);
> +
> + if (ret > 1)
> + rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = MON_SCOPE_NODE;
> +
> + return ret;
> +}
> +
> static __init void rdt_init_res_defs_intel(void)
> {
> struct rdt_hw_resource *hw_res;
> struct rdt_resource *r;
>
> + snc_nodes_per_l3_cache = get_snc_config();
> +
> for_each_rdt_resource(r) {
> hw_res = resctrl_to_arch_res(r);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 0d9605fccb34..4ca064e62911 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>
> static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + int cpu = get_cpu();

I do not think it is necessary to disable preemption here. Also please note that
James is working on changes that will remove this code block. Would plain
smp_processor_id() do?
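
That is, something like this (a sketch; the error bit checks at the
end of the function are unchanged and elided):

	static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
	{
		struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
		int cpu = smp_processor_id();
		int rmid_offset = 0;
		u64 msr_val;

		/*
		 * When SNC mode is on, compute the offset to read the
		 * physical RMID counter for the node this CPU belongs to.
		 */
		if (snc_nodes_per_l3_cache > 1)
			rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

		wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
		rdmsrl(MSR_IA32_QM_CTR, msr_val);
		...
	}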

> + int rmid_offset = 0;
> u64 msr_val;
>
> + /*
> + * When SNC mode is on, need to compute the offset to read the
> + * physical RMID counter for the node to which this CPU belongs
> + */
> + if (snc_nodes_per_l3_cache > 1)
> + rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +
> /*
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> * with a valid event code for supported resource type and the bits
> @@ -158,9 +168,11 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> + put_cpu();
> +
> if (msr_val & RMID_VAL_ERROR)
> return -EIO;
> if (msr_val & RMID_VAL_UNAVAIL)

Reinette

2023-08-11 20:10:41

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] x86/resctrl: Change monitor code to use rdt_mondomain

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> A few functions need to be duplicated to provide versions to
> operate on control and monitor domains respectively. But most
> of the changes are just fixing argument and return value types.

Could you please add some context in support of this change?

I do not think "duplicated" is appropriate though. Functions
are not duplicated but instead made to be dedicated to
either control or monitoring domains, no?

...

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 274605aaa026..0161362b0c3e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -393,9 +393,12 @@ void rdt_ctrl_update(void *arg)
> * id is found in a domain, return the domain. Otherwise, if requested by
> * caller, return the first domain whose id is bigger than the input id.
> * The domain list is sorted by id in ascending order.
> + *
> + * N.B. Returned value may be either a pointer to "struct rdt_domain" or
> + * to "struct rdt_mondomain" depending on which domain list is scanned.
> */
> -struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
> - struct list_head **pos)
> +void *rdt_find_domain(struct list_head *h, int id,
> + struct list_head **pos)
> {
> struct rdt_domain *d;
> struct list_head *l;

I do not think that void pointers should be passed around. How about two
new functions dedicated to the different domain types with the void pointer
handling contained in a static function? For example,

static void *__rdt_find_domain(struct list_head *h, int id, struct list_head **pos)

struct rdt_mondomain *rdt_find_mondomain(struct rdt_resource *r, int id, struct list_head **pos)
struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r, int id, struct list_head **pos)

rdt_find_mondomain() and rdt_find_ctrldomain() would be what callers use
while they can be wrappers of __rdt_find_domain().
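
For example (a sketch of the above; the existing sorted-list search
moves unchanged into the static helper):

	static void *__rdt_find_domain(struct list_head *h, int id,
				       struct list_head **pos)
	{
		/* existing search over the sorted domain list */
	}

	struct rdt_domain *rdt_find_ctrldomain(struct rdt_resource *r,
					       int id, struct list_head **pos)
	{
		return __rdt_find_domain(&r->domains, id, pos);
	}

	struct rdt_mondomain *rdt_find_mondomain(struct rdt_resource *r,
						 int id, struct list_head **pos)
	{
		return __rdt_find_domain(&r->mondomains, id, pos);
	}

That keeps the void pointer confined to a single static function
while every caller gets a properly typed result.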


> @@ -434,10 +437,15 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
> }
>
> static void domain_free(struct rdt_hw_domain *hw_dom)
> +{
> + kfree(hw_dom->ctrl_val);
> + kfree(hw_dom);
> +}
> +
> +static void mondomain_free(struct rdt_hw_mondomain *hw_dom)
> {
> kfree(hw_dom->arch_mbm_total);
> kfree(hw_dom->arch_mbm_local);
> - kfree(hw_dom->ctrl_val);
> kfree(hw_dom);
> }
>
> @@ -467,7 +475,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
> * @num_rmid: The size of the MBM counter array
> * @hw_dom: The domain that owns the allocated arrays
> */
> -static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> +static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mondomain *hw_dom)
> {
> size_t tsize;
>
> @@ -539,8 +547,8 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
> {
> int id = get_cpu_cacheinfo_id(cpu, r->mon_scope);
> struct list_head *add_pos = NULL;
> - struct rdt_hw_domain *hw_dom;
> - struct rdt_domain *d;
> + struct rdt_hw_mondomain *hw_mondom;
> + struct rdt_mondomain *d;
> int err;
>

Please ensure that reverse fir tree order is maintained in all these changes.
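
That is, local variable declarations ordered from longest line to
shortest, e.g. (ignoring the initializer of "id"):

	struct rdt_hw_mondomain *hw_mondom;
	struct list_head *add_pos = NULL;
	struct rdt_mondomain *d;
	int id, err;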

Reinette

2023-08-11 20:17:18

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] x86/resctrl: Update documentation with Sub-NUMA cluster changes

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
> per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
> their name refer to Sub-NUMA nodes instead of L3 cache ids.
>
> Signed-off-by: Tony Luck <[email protected]>
> Reviewed-by: Peter Newman <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index cb05d90111b4..4d9ddb91751d 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -345,9 +345,13 @@ When control is enabled all CTRL_MON groups will also contain:
> When monitoring is enabled all MON groups will also contain:
>
> "mon_data":
> - This contains a set of files organized by L3 domain and by
> - RDT event. E.g. on a system with two L3 domains there will
> - be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
> + This contains a set of files organized by L3 domain or by NUMA
> + node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
> + or enabled respectively) and by RDT event. E.g. on a system with
> + SNC mode disabled with two L3 domains there will be subdirectories
> + "mon_L3_00" and "mon_L3_01". The numerical suffix refers to the
> + L3 cache id. With SNC enabled the directory names are the same,
> + but the numerical suffix refers to the node id. Each of these
> directories have one file per event (e.g. "llc_occupancy",
> "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
> files provide a read out of the current value of the event for

I think it would be helpful to add a modified version of the snippet
(from previous patch changelog) regarding well-behaved NUMA apps.
With the above it may be confusing that a single cache allocation has
multiple cache occupancy counters.

This also changes the meaning of the numbers in the directory names.
The documentation already provides guidance on how to find the cache
ID of a logical CPU (see section "Cache IDs"). I think it will be
helpful to add a snippet that makes it clear to users how to map
a CPU to its node ID.
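
For example, a snippet along these lines could be added (the wording
is only a suggestion, and it assumes the per-CPU node symlinks in
/sys are available):

	The node ID of a CPU can be found in /sys, e.g. the presence
	of /sys/devices/system/cpu/cpu0/node0 shows that CPU 0 belongs
	to node 0.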

Reinette

2023-08-11 20:21:13

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] selftests/resctrl: Adjust effective L3 cache size when SNC enabled

Hi Tony,

On 7/22/2023 12:07 PM, Tony Luck wrote:
> Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
> nodes. Systems may support splitting into either two or four nodes.
>
> When SNC mode is enabled the effective amount of L3 cache available
> for allocation is divided by the number of nodes per L3.
>
> Detect which SNC mode is active by comparing the number of CPUs
> that share a cache with CPU0, with the number of CPUs on node0.
>
> Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
> Closes: https://lore.kernel.org/r/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com

This does not seem to be the case when looking at
https://lore.kernel.org/all/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01MB6330.jpnprd01.prod.outlook.com/

> Signed-off-by: Tony Luck <[email protected]>
> ---
> tools/testing/selftests/resctrl/resctrl.h | 1 +
> tools/testing/selftests/resctrl/resctrlfs.c | 57 +++++++++++++++++++++
> 2 files changed, 58 insertions(+)
>
> diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
> index 87e39456dee0..a8b43210b573 100644
> --- a/tools/testing/selftests/resctrl/resctrl.h
> +++ b/tools/testing/selftests/resctrl/resctrl.h
> @@ -13,6 +13,7 @@
> #include <signal.h>
> #include <dirent.h>
> #include <stdbool.h>
> +#include <ctype.h>
> #include <sys/stat.h>
> #include <sys/ioctl.h>
> #include <sys/mount.h>
> diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
> index fb00245dee92..79eecbf9f863 100644
> --- a/tools/testing/selftests/resctrl/resctrlfs.c
> +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> @@ -130,6 +130,61 @@ int get_resource_id(int cpu_no, int *resource_id)
> return 0;
> }
>
> +/*
> + * Count number of CPUs in a /sys bit map
> + */
> +static int count_sys_bitmap_bits(char *name)
> +{
> + FILE *fp = fopen(name, "r");
> + int count = 0, c;
> +
> + if (!fp)
> + return 0;
> +
> + while ((c = fgetc(fp)) != EOF) {
> + if (!isxdigit(c))
> + continue;
> + switch (c) {
> + case 'f':
> + count++;
> + case '7': case 'b': case 'd': case 'e':
> + count++;
> + case '3': case '5': case '6': case '9': case 'a': case 'c':
> + count++;
> + case '1': case '2': case '4': case '8':
> + count++;
> + }
> + }
> + fclose(fp);
> +
> + return count;
> +}
> +
> +/*
> + * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0
> + * Try to get this right, even if a few CPUs are offline so that the number
> + * of CPUs in node0 is not exactly half or a quarter of the CPUs sharing the
> + * LLC of CPU0.
> + */
> +static int snc_ways(void)
> +{
> + int node_cpus, cache_cpus;
> +
> + node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
> + cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
> +
> + if (!node_cpus || !cache_cpus) {
> + fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
> + return 1;
> + }
> +
> + if (4 * node_cpus <= cache_cpus)
> + return 4;
> + else if (2 * node_cpus <= cache_cpus)
> + return 2;
> + return 1;
> +}
> +
> /*
> * get_cache_size - Get cache size for a specified CPU
> * @cpu_no: CPU number
> @@ -190,6 +245,8 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
> break;
> }
>
> + if (cache_num == 3)
> + *cache_size /= snc_ways();
> return 0;
> }
>

I am surprised that this small change is sufficient. The resctrl
selftests are definitely not NUMA aware and the CAT and CMT tests
are not taking that into account when picking CPUs to run on. From
what I understand the LLC occupancy counters from all SNC nodes need
to be added together in this scenario, but I do not see that done
either.
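
For reference, summing the per-node counters in the selftests could
look something like this (a sketch only; the helper name is made up
and the paths follow the mon_data layout described earlier in the
thread):

	#include <dirent.h>
	#include <stdio.h>
	#include <string.h>

	/*
	 * Sum llc_occupancy over every mon_L3_* domain directory under
	 * a group's mon_data directory. With SNC enabled each SNC node
	 * reports its own counter, so the total for the group is the
	 * sum across all nodes sharing the L3 cache.
	 */
	static unsigned long long sum_llc_occupancy(const char *mon_data)
	{
		unsigned long long total = 0, val;
		struct dirent *de;
		char path[512];
		FILE *fp;
		DIR *dir;

		dir = opendir(mon_data);
		if (!dir)
			return 0;

		while ((de = readdir(dir)) != NULL) {
			if (strncmp(de->d_name, "mon_L3_", 7))
				continue;
			snprintf(path, sizeof(path), "%s/%s/llc_occupancy",
				 mon_data, de->d_name);
			fp = fopen(path, "r");
			if (!fp)
				continue;
			if (fscanf(fp, "%llu", &val) == 1)
				total += val;
			fclose(fp);
		}
		closedir(dir);

		return total;
	}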

Reinette