Sub-Numa Clustering (SNC) allows splitting CPU cores, caches and memory
into multiple NUMA nodes. When enabled, NUMA-aware applications can
achieve better performance on bigger server platforms.
The series adding SNC support to the kernel is currently in review [1]
but the selftests for resctrl need NUMA-aware adjustments to use these
changes. Issues with resctrl selftests not working properly with SNC
enabled were originally reported by Shaopeng Tan [2][3] and the
following series resolves them.
The main concept currently missing from resctrl selftests is that while
resctrl tracks memory accesses on a single NUMA node (which normally is
the same as the CPU socket) on machines with SNC enabled memory accesses
can leak outside of the local NUMA node and into other NUMA nodes on the
same socket. In that case resctrl could report a diminished value in one
of its monitoring technologies: Cache Monitoring Technology (CMT) or
Memory Bandwidth Monitoring (MBM) .
Implemented solutions for both CMT and MBM follow the same idea which is
to simply sum values reported by different NUMA nodes for a single
Resource Monitoring ID (RMID).
Series was tested on Ice Lake server platforms with SNC disabled, SNC-2
and SNC-4. The tests were also ran with and without kernel support for
SNC.
Series applies cleanly on kselftest/next.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com/
[3] https://lore.kernel.org/lkml/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01MB6330.jpnprd01.prod.outlook.com/
Maciej Wieczor-Retman (4):
selftests/resctrl: Adjust effective L3 cache size with SNC enabled
selftests/resctrl: SNC support for CMT
selftests/resctrl: SNC support for MBM
selftests/resctrl: Adjust SNC support messages
tools/testing/selftests/resctrl/cache.c | 17 ++-
tools/testing/selftests/resctrl/cat_test.c | 2 +-
tools/testing/selftests/resctrl/cmt_test.c | 6 +-
tools/testing/selftests/resctrl/mba_test.c | 3 +-
tools/testing/selftests/resctrl/mbm_test.c | 4 +-
tools/testing/selftests/resctrl/resctrl.h | 13 +-
tools/testing/selftests/resctrl/resctrl_val.c | 46 ++++---
tools/testing/selftests/resctrl/resctrlfs.c | 128 +++++++++++++++++-
8 files changed, 185 insertions(+), 34 deletions(-)
--
2.44.0
Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA
nodes. Systems may support splitting into either two or four nodes.
When SNC mode is enabled the effective amount of L3 cache available
for allocation is divided by the number of nodes per L3.
Detect which SNC mode is active by comparing the number of CPUs
that share a cache with CPU0, with the number of CPUs on node0.
Signed-off-by: Tony Luck <[email protected]>
Co-developed-by: Maciej Wieczor-Retman <[email protected]>
Signed-off-by: Maciej Wieczor-Retman <[email protected]>
---
tools/testing/selftests/resctrl/resctrl.h | 4 ++
tools/testing/selftests/resctrl/resctrlfs.c | 59 +++++++++++++++++++++
2 files changed, 63 insertions(+)
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 2051bd135e0d..41811e87f81c 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -11,6 +11,7 @@
#include <signal.h>
#include <dirent.h>
#include <stdbool.h>
+#include <ctype.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
@@ -43,6 +44,8 @@
#define DEFAULT_SPAN (250 * MB)
+#define MAX_SNC 4
+
#define PARENT_EXIT() \
do { \
kill(ppid, SIGKILL); \
@@ -129,6 +132,7 @@ extern pid_t bm_pid, ppid;
extern char llc_occup_path[1024];
+int snc_ways(void);
int get_vendor(void);
bool check_resctrlfs_support(void);
int filter_dmesg(void);
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index 1cade75176eb..e4d3624a8817 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -156,6 +156,63 @@ int get_domain_id(const char *resource, int cpu_no, int *domain_id)
return 0;
}
+/*
+ * Count number of CPUs in a /sys bit map
+ */
+static unsigned int count_sys_bitmap_bits(char *name)
+{
+ FILE *fp = fopen(name, "r");
+ int count = 0, c;
+
+ if (!fp)
+ return 0;
+
+ while ((c = fgetc(fp)) != EOF) {
+ if (!isxdigit(c))
+ continue;
+ switch (c) {
+ case 'f':
+ count++;
+ case '7': case 'b': case 'd': case 'e':
+ count++;
+ case '3': case '5': case '6': case '9': case 'a': case 'c':
+ count++;
+ case '1': case '2': case '4': case '8':
+ count++;
+ }
+ }
+ fclose(fp);
+
+ return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
+ * If some CPUs are offline the numbers may not be exact multiples of each
+ * other. Any offline CPUs on node0 will be also gone from shared_cpu_map of
+ * CPU0 but offline CPUs from other nodes will only make the cache_cpus value
+ * lower. Still try to get the ratio right by preventing the second possibility.
+ */
+int snc_ways(void)
+{
+ int node_cpus, cache_cpus, i;
+
+ node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+ cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+ if (!node_cpus || !cache_cpus) {
+ fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
+ return 1;
+ }
+
+ for (i = 1; i <= MAX_SNC ; i++) {
+ if (i * node_cpus >= cache_cpus)
+ return i;
+ }
+
+ return 1;
+}
+
/*
* get_cache_size - Get cache size for a specified CPU
* @cpu_no: CPU number
@@ -211,6 +268,8 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
break;
}
+ if (cache_num == 3)
+ *cache_size /= snc_ways();
return 0;
}
--
2.44.0
Cache Monitoring Technology (CMT) works by measuring how much data in L3
cache is occupied by a given process identified by its Resource
Monitoring ID (RMID).
On systems with Sub-Numa Clusters (SNC) enabled, a process can occupy
not only the cache that belongs to its own NUMA node but also pieces of
other NUMA nodes' caches that lie on the same socket.
A simple correction to make the CMT selftest NUMA-aware is to sum values
reported by all nodes on the same socket for a given RMID.
Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
Closes: https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com/
Signed-off-by: Maciej Wieczor-Retman <[email protected]>
---
tools/testing/selftests/resctrl/cache.c | 17 +++++++++++------
tools/testing/selftests/resctrl/resctrl.h | 4 +++-
tools/testing/selftests/resctrl/resctrl_val.c | 9 ++++++---
3 files changed, 20 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
index 1b339d6bbff1..dab81920033b 100644
--- a/tools/testing/selftests/resctrl/cache.c
+++ b/tools/testing/selftests/resctrl/cache.c
@@ -161,16 +161,21 @@ int perf_event_measure(int pe_fd, struct perf_event_read *pe_read,
*
* Return: =0 on success. <0 on failure.
*/
-int measure_llc_resctrl(const char *filename, int bm_pid)
+int measure_llc_resctrl(const char *filename, int bm_pid, const char *ctrlgrp,
+ const char *mongrp, int res_id)
{
- unsigned long llc_occu_resc = 0;
+ unsigned long sum = 0, llc_occu_resc = 0;
int ret;
- ret = get_llc_occu_resctrl(&llc_occu_resc);
- if (ret < 0)
- return ret;
+ for (int i = 0 ; i < snc_ways() ; i++) {
+ set_cmt_path(ctrlgrp, mongrp, res_id + i);
+ ret = get_llc_occu_resctrl(&llc_occu_resc);
+ if (ret < 0)
+ return ret;
+ sum += llc_occu_resc;
+ }
- return print_results_cache(filename, bm_pid, llc_occu_resc);
+ return print_results_cache(filename, bm_pid, sum);
}
/*
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 41811e87f81c..178fb2eab13a 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -133,6 +133,7 @@ extern pid_t bm_pid, ppid;
extern char llc_occup_path[1024];
int snc_ways(void);
+void set_cmt_path(const char *ctrlgrp, const char *mongrp, char sock_num);
int get_vendor(void);
bool check_resctrlfs_support(void);
int filter_dmesg(void);
@@ -182,7 +183,8 @@ int perf_open(struct perf_event_attr *pea, pid_t pid, int cpu_no);
int perf_event_reset_enable(int pe_fd);
int perf_event_measure(int pe_fd, struct perf_event_read *pe_read,
const char *filename, int bm_pid);
-int measure_llc_resctrl(const char *filename, int bm_pid);
+int measure_llc_resctrl(const char *filename, int bm_pid, const char *ctrlgrp,
+ const char *mongrp, int res_id);
void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool lines);
/*
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
index 5a49f07a6c85..e75e3923ebe2 100644
--- a/tools/testing/selftests/resctrl/resctrl_val.c
+++ b/tools/testing/selftests/resctrl/resctrl_val.c
@@ -557,7 +557,7 @@ static int print_results_bw(char *filename, int bm_pid, float bw_imc,
return 0;
}
-static void set_cmt_path(const char *ctrlgrp, const char *mongrp, char sock_num)
+void set_cmt_path(const char *ctrlgrp, const char *mongrp, char sock_num)
{
if (strlen(ctrlgrp) && strlen(mongrp))
sprintf(llc_occup_path, CON_MON_LCC_OCCUP_PATH, RESCTRL_PATH,
@@ -698,8 +698,8 @@ int resctrl_val(const struct resctrl_test *test,
{
char *resctrl_val = param->resctrl_val;
unsigned long bw_resc_start = 0;
+ int res_id, ret = 0, pipefd[2];
struct sigaction sigact;
- int ret = 0, pipefd[2];
char pipe_message = 0;
union sigval value;
@@ -828,6 +828,8 @@ int resctrl_val(const struct resctrl_test *test,
sleep(1);
/* Test runs until the callback setup() tells the test to stop. */
+ get_domain_id("L3", uparams->cpu, &res_id);
+ res_id *= snc_ways();
while (1) {
ret = param->setup(test, uparams, param);
if (ret == END_OF_TESTS) {
@@ -844,7 +846,8 @@ int resctrl_val(const struct resctrl_test *test,
break;
} else if (!strncmp(resctrl_val, CMT_STR, sizeof(CMT_STR))) {
sleep(1);
- ret = measure_llc_resctrl(param->filename, bm_pid);
+ ret = measure_llc_resctrl(param->filename, bm_pid, param->ctrlgrp,
+ param->mongrp, res_id);
if (ret)
break;
}
--
2.44.0
Memory Bandwidth Monitoring (MBM) measures how much data flows between
cache levels. Only the flow for a process specified with a Resource
Monitoring ID (RMID) is measured.
Sub-Numa Clustering (SNC) causes MBM selftest to fail because the
increased amount of NUMA nodes per socket is not taken into account.
That in turn makes the test use incorrect values for the start and end
of the measurement by tracking the wrong node.
For the MBM selftest to be NUMA-aware it needs to track the start and end
values of a measurement not for a single node but for all nodes on the
same socket. Then summing all measured values comes out as the real
bandwidth used by the process.
Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
Closes: https://lore.kernel.org/lkml/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01MB6330.jpnprd01.prod.outlook.com/
Signed-off-by: Maciej Wieczor-Retman <[email protected]>
---
tools/testing/selftests/resctrl/mba_test.c | 1 -
tools/testing/selftests/resctrl/resctrl_val.c | 37 ++++++++++++-------
2 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index 7946e32e85c8..fc31a61dab0c 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -147,7 +147,6 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
struct resctrl_val_param param = {
.resctrl_val = MBA_STR,
.ctrlgrp = "c1",
- .mongrp = "m1",
.filename = RESULT_FILE_NAME,
.bw_report = "reads",
.setup = mba_setup
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
index e75e3923ebe2..2fe9f8bb4a45 100644
--- a/tools/testing/selftests/resctrl/resctrl_val.c
+++ b/tools/testing/selftests/resctrl/resctrl_val.c
@@ -595,9 +595,10 @@ static void initialize_llc_occu_resctrl(const char *ctrlgrp, const char *mongrp,
static int measure_vals(const struct user_params *uparams,
struct resctrl_val_param *param,
- unsigned long *bw_resc_start)
+ unsigned long *bw_resc_start,
+ int res_id)
{
- unsigned long bw_resc, bw_resc_end;
+ unsigned long bw_resc = 0, bw_resc_sum = 0, bw_resc_end;
float bw_imc;
int ret;
@@ -612,17 +613,19 @@ static int measure_vals(const struct user_params *uparams,
if (ret < 0)
return ret;
- ret = get_mem_bw_resctrl(&bw_resc_end);
- if (ret < 0)
- return ret;
+ for (int i = 0 ; i < snc_ways() ; i++) {
+ set_mbm_path(param->ctrlgrp, strlen(param->mongrp) > 0 ? param->mongrp : NULL,
+ res_id + i);
+ ret = get_mem_bw_resctrl(&bw_resc_end);
+ bw_resc = (bw_resc_end - bw_resc_start[i]) / MB;
+ bw_resc_sum += bw_resc;
+ bw_resc_start[i] = bw_resc_end;
+ }
- bw_resc = (bw_resc_end - *bw_resc_start) / MB;
- ret = print_results_bw(param->filename, bm_pid, bw_imc, bw_resc);
+ ret = print_results_bw(param->filename, bm_pid, bw_imc, bw_resc_sum);
if (ret)
return ret;
- *bw_resc_start = bw_resc_end;
-
return 0;
}
@@ -697,12 +700,16 @@ int resctrl_val(const struct resctrl_test *test,
struct resctrl_val_param *param)
{
char *resctrl_val = param->resctrl_val;
- unsigned long bw_resc_start = 0;
int res_id, ret = 0, pipefd[2];
+ unsigned long *bw_resc_start;
struct sigaction sigact;
char pipe_message = 0;
union sigval value;
+ bw_resc_start = calloc(snc_ways(), sizeof(unsigned long));
+ if (!bw_resc_start)
+ return -1;
+
if (strcmp(param->filename, "") == 0)
sprintf(param->filename, "stdio");
@@ -710,7 +717,7 @@ int resctrl_val(const struct resctrl_test *test,
!strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR))) {
ret = validate_bw_report_request(param->bw_report);
if (ret)
- return ret;
+ goto out_free;
}
/*
@@ -721,7 +728,7 @@ int resctrl_val(const struct resctrl_test *test,
if (pipe(pipefd)) {
ksft_perror("Unable to create pipe");
-
+ free(bw_resc_start);
return -1;
}
@@ -733,7 +740,7 @@ int resctrl_val(const struct resctrl_test *test,
bm_pid = fork();
if (bm_pid == -1) {
ksft_perror("Unable to fork");
-
+ free(bw_resc_start);
return -1;
}
@@ -841,7 +848,7 @@ int resctrl_val(const struct resctrl_test *test,
if (!strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR)) ||
!strncmp(resctrl_val, MBA_STR, sizeof(MBA_STR))) {
- ret = measure_vals(uparams, param, &bw_resc_start);
+ ret = measure_vals(uparams, param, bw_resc_start, res_id);
if (ret)
break;
} else if (!strncmp(resctrl_val, CMT_STR, sizeof(CMT_STR))) {
@@ -855,6 +862,8 @@ int resctrl_val(const struct resctrl_test *test,
out:
kill(bm_pid, SIGKILL);
+out_free:
+ free(bw_resc_start);
return ret;
}
--
2.44.0
Resctrl selftest prints a message on test failure that Sub-Numa
Clustering (SNC) could be enabled and points the user to check theirs BIOS
settings. No actual check is performed before printing that message so
it is not very accurate in pinpointing a problem.
Figuring out if SNC is enabled is only one part of the problem, the
other being whether the kernel supports it. As there is no easy
interface that simply states SNC support in the kernel one can find that
information by comparing L3 cache sizes from different sources. Cache
size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size
will always show the full cache size even if it's split by enabled SNC.
On the other hand /sys/fs/resctrl/size has information about L3 size,
that with kernel support is adjusted for enabled SNC.
Add a function to find a cache size from /sys/fs/resctrl/size since
finding that information from the other source is already implemented.
Add a function that compares the two cache sizes and use it to make the
SNC support message more meaningful.
Add the SNC support message just after MBA's check_results() since MBA
shares code with MBM and also can suffer from enabled SNC if there is no
support in the kernel.
Signed-off-by: Maciej Wieczor-Retman <[email protected]>
---
tools/testing/selftests/resctrl/cat_test.c | 2 +-
tools/testing/selftests/resctrl/cmt_test.c | 6 +-
tools/testing/selftests/resctrl/mba_test.c | 2 +
tools/testing/selftests/resctrl/mbm_test.c | 4 +-
tools/testing/selftests/resctrl/resctrl.h | 5 +-
tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++++++++++++++++-
6 files changed, 79 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
index 4cb991be8e31..1cdaadf35f03 100644
--- a/tools/testing/selftests/resctrl/cat_test.c
+++ b/tools/testing/selftests/resctrl/cat_test.c
@@ -253,7 +253,7 @@ static int cat_run_test(const struct resctrl_test *test, const struct user_param
return ret;
/* Get L3/L2 cache size */
- ret = get_cache_size(uparams->cpu, test->resource, &cache_total_size);
+ ret = get_sys_cache_size(uparams->cpu, test->resource, &cache_total_size);
if (ret)
return ret;
ksft_print_msg("Cache size :%lu\n", cache_total_size);
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
index a81f91222a89..b7cada602484 100644
--- a/tools/testing/selftests/resctrl/cmt_test.c
+++ b/tools/testing/selftests/resctrl/cmt_test.c
@@ -112,7 +112,7 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
if (ret)
return ret;
- ret = get_cache_size(uparams->cpu, "L3", &cache_total_size);
+ ret = get_sys_cache_size(uparams->cpu, "L3", &cache_total_size);
if (ret)
return ret;
ksft_print_msg("Cache size :%lu\n", cache_total_size);
@@ -157,8 +157,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
goto out;
ret = check_results(¶m, span, n);
- if (ret && (get_vendor() == ARCH_INTEL))
- ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
+ if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support())
+ ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
out:
cmt_test_cleanup();
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index fc31a61dab0c..89fe3ecbf497 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -160,6 +160,8 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
goto out;
ret = check_results();
+ if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support())
+ ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
out:
mba_test_cleanup();
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index d67ffa3ec63a..e12b4b06f6d5 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -129,8 +129,8 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param
goto out;
ret = check_results(DEFAULT_SPAN);
- if (ret && (get_vendor() == ARCH_INTEL))
- ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
+ if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support())
+ ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
out:
mbm_test_cleanup();
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 178fb2eab13a..038e1269a3fc 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -28,6 +28,7 @@
#define RESCTRL_PATH "/sys/fs/resctrl"
#define PHYS_ID_PATH "/sys/devices/system/cpu/cpu"
#define INFO_PATH "/sys/fs/resctrl/info"
+#define SIZE_PATH "/sys/fs/resctrl/size"
/*
* CPU vendor IDs
@@ -168,14 +169,16 @@ unsigned long create_bit_mask(unsigned int start, unsigned int len);
unsigned int count_contiguous_bits(unsigned long val, unsigned int *start);
int get_full_cbm(const char *cache_type, unsigned long *mask);
int get_mask_no_shareable(const char *cache_type, unsigned long *mask);
-int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size);
int resource_info_unsigned_get(const char *resource, const char *filename, unsigned int *val);
+int get_sys_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size);
+int get_resctrl_cache_size(const char *cache_type, unsigned long *cache_size);
void ctrlc_handler(int signum, siginfo_t *info, void *ptr);
int signal_handler_register(void);
void signal_handler_unregister(void);
void cat_test_cleanup(void);
unsigned int count_bits(unsigned long n);
void cmt_test_cleanup(void);
+int snc_kernel_support(void);
void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config);
void perf_event_initialize_read_format(struct perf_event_read *pe_read);
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index e4d3624a8817..dbd10cb7abf5 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -214,14 +214,14 @@ int snc_ways(void)
}
/*
- * get_cache_size - Get cache size for a specified CPU
+ * get_sys_cache_size - Get cache size for a specified CPU
* @cpu_no: CPU number
* @cache_type: Cache level L2/L3
* @cache_size: pointer to cache_size
*
* Return: = 0 on success, < 0 on failure.
*/
-int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size)
+int get_sys_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size)
{
char cache_path[1024], cache_str[64];
int length, i, cache_num;
@@ -273,6 +273,44 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
return 0;
}
+/*
+ * get_resctrl_cache_size - Get cache size as reported by resctrl
+ * @cache_type: Cache level L2/L3
+ * @cache_size: pointer to cache_size
+ *
+ * Return: = 0 on success, < 0 on failure.
+ */
+int get_resctrl_cache_size(const char *cache_type, unsigned long *cache_size)
+{
+ char line[256], cache_prefix[16], *stripped_line, *token;
+ size_t len;
+ FILE *fp;
+
+ strcpy(cache_prefix, cache_type);
+ strncat(cache_prefix, ":", 1);
+
+ fp = fopen(SIZE_PATH, "r");
+ if (!fp) {
+ ksft_print_msg("Failed to open %s : '%s'\n",
+ SIZE_PATH, strerror(errno));
+ return -1;
+ }
+
+ while (fgets(line, sizeof(line), fp)) {
+ stripped_line = strstr(line, cache_prefix);
+
+ if (stripped_line) {
+ len = strlen(cache_prefix);
+ stripped_line += len;
+ token = strtok(stripped_line, ";");
+ if (sscanf(token, "0=%lu", cache_size) <= 0)
+ return -1;
+ }
+ }
+ fclose(fp);
+ return 0;
+}
+
#define CORE_SIBLINGS_PATH "/sys/bus/cpu/devices/cpu"
/*
@@ -935,3 +973,30 @@ unsigned int count_bits(unsigned long n)
return count;
}
+
+/**
+ * snc_kernel_support - Compare system reported cache size and resctrl
+ * reported cache size to get an idea if SNC is supported on the kernel side.
+ * If SNC is enabled and the kernel does support it the value should be equal.
+ * If the kernel doesn't support SNC the.
+ *
+ * Return: 0 if not supported, 1 if supported, < 0 on failure.
+ */
+int snc_kernel_support(void)
+{
+ unsigned long resctrl_cache_size, node_cache_size;
+ int ret;
+
+ ret = get_sys_cache_size(0, "L3", &node_cache_size);
+ if (ret < 0)
+ return ret;
+
+ ret = get_resctrl_cache_size("L3", &resctrl_cache_size);
+ if (ret < 0)
+ return ret;
+
+ if (resctrl_cache_size == node_cache_size)
+ return 1;
+
+ return 0;
+}
--
2.44.0
> Figuring out if SNC is enabled is only one part of the problem, the
> other being whether the kernel supports it. As there is no easy
> interface that simply states SNC support in the kernel one can find that
> information by comparing L3 cache sizes from different sources. Cache
> size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size
> will always show the full cache size even if it's split by enabled SNC.
> On the other hand /sys/fs/resctrl/size has information about L3 size,
> that with kernel support is adjusted for enabled SNC.
Early versions of the kernel SNC patch series added an info/l3_MON/snc_ways
file to provide this information. I was talked out of it then:
https://lore.kernel.org/all/[email protected]/
But that discussion didn't consider the question you discuss here: "Does this
instance of the kernel support SNC?"
So you have a clever solution. But it seems like a roundabout way for
an application to discover whether the kernel has configured resctrl for
SNC mode.
Should the kernel provide an info/ file listing the SNC configuration?
If so, what should it be named? "snc_ways" as a kernel variable was
later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
-Tony
On 2024-03-06 at 21:54:02 +0000, Luck, Tony wrote:
>> Figuring out if SNC is enabled is only one part of the problem, the
>> other being whether the kernel supports it. As there is no easy
>> interface that simply states SNC support in the kernel one can find that
>> information by comparing L3 cache sizes from different sources. Cache
>> size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size
>> will always show the full cache size even if it's split by enabled SNC.
>> On the other hand /sys/fs/resctrl/size has information about L3 size,
>> that with kernel support is adjusted for enabled SNC.
>
>Early versions of the kernel SNC patch series added an info/l3_MON/snc_ways
>file to provide this information. I was talked out of it then:
>
>https://lore.kernel.org/all/[email protected]/
>
>But that discussion didn't consider the question you discuss here: "Does this
>instance of the kernel support SNC?"
>
>So you have a clever solution. But it seems like a roundabout way for
>an application to discover whether the kernel has configured resctrl for
>SNC mode.
>
>Should the kernel provide an info/ file listing the SNC configuration?
I suppose it would be a benefit for other numa aware applications to have an
easy access to this kernel support information.
>
>If so, what should it be named? "snc_ways" as a kernel variable was
>later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
"snc_nodes_per_l3_cache" seems okay to me.
And I understand that the file content would show SNC mode and the presence or
absence of this file would tell if kernel supports SNC?
>
>-Tony
--
Kind regards
Maciej Wiecz?r-Retman
>>If so, what should it be named? "snc_ways" as a kernel variable was
>>later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
>
> "snc_nodes_per_l3_cache" seems okay to me.
>
> And I understand that the file content would show SNC mode and the presence or
> absence of this file would tell if kernel supports SNC?
Yes. The existence of the file indicates the kernel is SNC aware.
The value read from the file would give the number of nodes per L3 (obviously :-) )
SNC not supported by this platform or not enabled:
$ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
1
SNC2 enabled:
$ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
2
etc.
-Tony
Hi Tony,
On 3/7/2024 9:18 AM, Luck, Tony wrote:
>>> If so, what should it be named? "snc_ways" as a kernel variable was
>>> later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
>>
>> "snc_nodes_per_l3_cache" seems okay to me.
>>
>> And I understand that the file content would show SNC mode and the presence or
>> absence of this file would tell if kernel supports SNC?
>
> Yes. The existence of the file indicates the kernel is SNC aware.
>
> The value read from the file would give the number of nodes per L3 (obviously :-) )
>
> SNC not supported by this platform or not enabled:
>
> $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
> 1
>
> SNC2 enabled:
>
> $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
> 2
>
This would be useful. I believe "SNC" is architecture specific?
What if the file always exists and is named "nodes_per_l3_cache"?
I assume that the internals of handling more nodes per L3 cache should
be hidden from user space and it should not be necessary for user space
to know if this is because of SNC or potentially some other mechanism on
another platform?
I think that may reduce fragmentation of resctrl .... not having
resctrl look so different on different architectures but maintains
the promise of a generic interface.
I am not sure if this is specific to monitoring though,
why not host file in /sys/fs/resctrl/info/L3 ?
Reinette
> > SNC2 enabled:
> >
> > $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
> > 2
> >
>
> This would be useful. I believe "SNC" is architecture specific?
> What if the file always exists and is named "nodes_per_l3_cache"?
>
> I assume that the internals of handling more nodes per L3 cache should
> be hidden from user space and it should not be necessary for user space
> to know if this is because of SNC or potentially some other mechanism on
> another platform?
>
> I think that may reduce fragmentation of resctrl .... not having
> resctrl look so different on different architectures but maintains
> the promise of a generic interface.
>
> I am not sure if this is specific to monitoring though,
> why not host file in /sys/fs/resctrl/info/L3 ?
Reinette,
On the name change - sure. It doesn't need the "snc_" prefix.
The Intel implementation of SNC has far more effect on monitoring
than on control. The user can read separate cache occupancy and
memory bandwidth values for each SNC node. But cache allocation
bitmasks and memory throttling still have a single control point for
each L3 cache instance, not for each node. There are still some
impacts on control, e.g. each bit in a CAT bitmask represents
less actual space in the L3 cache.
Maybe move it to the top level of the info/ directory:
$ cat /sys/fs/resctrl/info/nodes_per_l3_cache
3
-Tony
Hi Tony,
On 3/7/2024 9:57 AM, Luck, Tony wrote:
>>> SNC2 enabled:
>>>
>>> $ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache
>>> 2
>>>
>>
>> This would be useful. I believe "SNC" is architecture specific?
>> What if the file always exists and is named "nodes_per_l3_cache"?
>>
>> I assume that the internals of handling more nodes per L3 cache should
>> be hidden from user space and it should not be necessary for user space
>> to know if this is because of SNC or potentially some other mechanism on
>> another platform?
>>
>> I think that may reduce fragmentation of resctrl .... not having
>> resctrl look so different on different architectures but maintains
>> the promise of a generic interface.
>>
>> I am not sure if this is specific to monitoring though,
>> why not host file in /sys/fs/resctrl/info/L3 ?
>
> Reinette,
>
> On the name change - sure. It doesn't need the "snc_" prefix.
>
> The Intel implementation of SNC has far more effect on monitoring
> than on control. The user can read separate cache occupancy and
> memory bandwidth values for each SNC node. But cache allocation
> bitmasks and memory throttling still have a single control point for
> each L3 cache instance, not for each node. There are still some
> impacts on control, e.g. each bit in a CAT bitmask represents
> less actual space in the L3 cache.
I understand the impact but I am trying to view this conceptually.
The info directory exists to "contain information about the enabled
resources" (as per documentation) and it seems appropriate to make
this a property of the L3 resource.
>
> Maybe move it to the top level of the info/ directory:
>
> $ cat /sys/fs/resctrl/info/nodes_per_l3_cache
> 3
Thinking about it even differently. The goal is to give information
to userspace so we need to think about what would help user space?
For example, what if there is a file in info that shows
which CPUs are associated with each domain?
Reinette
> Thinking about it even differently. The goal is to give information
> to userspace so we need to think about what would help user space?
> For example, what if there is a file in info that shows
> which CPUs are associated with each domain?
Reinette,
Interesting idea. That would save users from having to chase through
/sys/devices/system/cpu/cpu*/cache/index?/* to figure out what the domain
numbers in schemata files and the mon_data/mon_L3_XX values mean.
May be extra useful for ARM which seems to have big random-looking numbers
for domains that came out of ACPI tables.
For SNC it would get the user directly to what they probably care about
(which CPUs are in which domain).
So something like this for an SNC 2 system:
$ cat /sys/fs/resctrl/info/L3/cpus
0: 0-35,72-107
1: 36-71,108-143
$ cat /sys/fs/resctrl/info/L3_MON/cpus
0: 0-17,72-89
1: 18-35,90-107
2: 36-53,108-125
3: 54-71,126-143
[maybe there is a better name than "cpus" for this file?]
-Tony
Hi Tony,
On 3/7/2024 1:14 PM, Luck, Tony wrote:
>> Thinking about it even differently. The goal is to give information
>> to userspace so we need to think about what would help user space?
>> For example, what if there is a file in info that shows
>> which CPUs are associated with each domain?
>
> Reinette,
>
> Interesting idea. That would save users from having to chase through
> /sys/devices/system/cpu/cpu*/cache/index?/* to figure out what the domain
> numbers in schemata files and the mon_data/mon_L3_XX values mean.
>
> May be extra useful for ARM which seems to have big random-looking numbers
> for domains that came out of ACPI tables.
>
> For SNC it would get the user directly to what they probably care about
> (which CPUs are in which domain).
I agree.
>
> So something like this for an SNC 2 system:
>
> $ cat /sys/fs/resctrl/info/L3/cpus
> 0: 0-35,72-107
> 1: 36-71,108-143
>
> $ cat /sys/fs/resctrl/info/L3_MON/cpus
> 0: 0-17,72-89
> 1: 18-35,90-107
> 2: 36-53,108-125
> 3: 54-71,126-143
>
> [maybe there is a better name than "cpus" for this file?]
Thank you for the example. I find that significantly easier to
understand than a single number in a generic "nodes_per_l3_cache".
Especially with potential confusion surrounding inconsistent "nodes"
between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Reinette
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
> Thank you for the example. I find that significantly easier to
> understand than a single number in a generic "nodes_per_l3_cache".
> Especially with potential confusion surrounding inconsistent "nodes"
> between allocation and monitoring.
>
> How about domain_cpu_list and domain_cpu_map ?
Reinette,
Like this (my test system doesn't have SNC, so all domains are the same):
$ cd /sys/fs/resctrl/info/
$ grep . */domain*
L3/domain_cpu_list:0: 0-35,72-107
L3/domain_cpu_list:1: 36-71,108-143
L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
L3_MON/domain_cpu_list:0: 0-35,72-107
L3_MON/domain_cpu_list:1: 36-71,108-143
L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
MB/domain_cpu_list:0: 0-35,72-107
MB/domain_cpu_list:1: 36-71,108-143
MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
The patch to do this is pretty straightforward.
-Tony
---
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index ae80170a0d1b..c180b80640e3 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -957,6 +957,20 @@ static int rdt_num_closids_show(struct kernfs_open_file *of,
return 0;
}
+static int rdt_ctrl_cpus_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct resctrl_schema *s = of->kn->parent->priv;
+ struct rdt_resource *r = s->res;
+ struct rdt_ctrl_domain *d;
+
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list)
+ seq_printf(seq, is_cpu_list(of) ? "%d: %*pbl\n" : "%d: %*pb\n",
+ d->hdr.id, cpumask_pr_args(&d->hdr.cpu_mask));
+
+ return 0;
+}
+
static int rdt_default_ctrl_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
@@ -1103,6 +1117,19 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of,
return 0;
}
+static int rdt_mon_cpus_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ struct rdt_mon_domain *d;
+
+ list_for_each_entry(d, &r->mon_domains, hdr.list)
+ seq_printf(seq, is_cpu_list(of) ? "%d: %*pbl\n" : "%d: %*pb\n",
+ d->hdr.id, cpumask_pr_args(&d->hdr.cpu_mask));
+
+ return 0;
+}
+
static int rdt_mon_features_show(struct kernfs_open_file *of,
struct seq_file *seq, void *v)
{
@@ -1810,6 +1837,21 @@ static struct rftype res_common_files[] = {
.seq_show = rdt_num_closids_show,
.fflags = RFTYPE_CTRL_INFO,
},
+ {
+ .name = "domain_cpu_list",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_ctrl_cpus_show,
+ .flags = RFTYPE_FLAGS_CPUS_LIST,
+ .fflags = RFTYPE_CTRL_INFO,
+ },
+ {
+ .name = "domain_cpu_map",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_ctrl_cpus_show,
+ .fflags = RFTYPE_CTRL_INFO,
+ },
{
.name = "mon_features",
.mode = 0444,
@@ -1824,6 +1866,21 @@ static struct rftype res_common_files[] = {
.seq_show = rdt_num_rmids_show,
.fflags = RFTYPE_MON_INFO,
},
+ {
+ .name = "domain_cpu_list",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_mon_cpus_show,
+ .flags = RFTYPE_FLAGS_CPUS_LIST,
+ .fflags = RFTYPE_MON_INFO,
+ },
+ {
+ .name = "domain_cpu_map",
+ .mode = 0444,
+ .kf_ops = &rdtgroup_kf_single_ops,
+ .seq_show = rdt_mon_cpus_show,
+ .fflags = RFTYPE_MON_INFO,
+ },
{
.name = "cbm_mask",
.mode = 0444,
--
2.43.0
Hi Tony,
On 3/7/2024 3:16 PM, Tony Luck wrote:
> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>> Thank you for the example. I find that significantly easier to
>> understand than a single number in a generic "nodes_per_l3_cache".
>> Especially with potential confusion surrounding inconsistent "nodes"
>> between allocation and monitoring.
>>
>> How about domain_cpu_list and domain_cpu_map ?
>
> Reinette,
>
> Like this (my test system doesn't have SNC, so all domains are the same):
>
> $ cd /sys/fs/resctrl/info/
> $ grep . */domain*
> L3/domain_cpu_list:0: 0-35,72-107
> L3/domain_cpu_list:1: 36-71,108-143
> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> L3_MON/domain_cpu_list:0: 0-35,72-107
> L3_MON/domain_cpu_list:1: 36-71,108-143
> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> MB/domain_cpu_list:0: 0-35,72-107
> MB/domain_cpu_list:1: 36-71,108-143
> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>
>
> The patch to do this is pretty straightforward.
>
Thank you for looking into this. This looks like valuable information
for user space. Back to what started this discussion ... I expect user
space can compare CPUs associated with control and monitoring domains
to learn if SNC is enabled? (And now existence of domain_cpu_list and/or
domain_cpu_map can also be used to determine if kernel supports SNC).
Reinette
On Fri, 8 Mar 2024, Ilpo J?rvinen wrote:
> On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
>
> > Cache Monitoring Technology (CMT) works by measuring how much data in L3
> > cache is occupied by a given process identified by its Resource
> > Monitoring ID (RMID).
> >
> > On systems with Sub-Numa Clusters (SNC) enabled, a process can occupy
> > not only the cache that belongs to its own NUMA node but also pieces of
> > other NUMA nodes' caches that lie on the same socket.
> >
> > A simple correction to make the CMT selftest NUMA-aware is to sum values
> > reported by all nodes on the same socket for a given RMID.
> >
> > Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
> > Closes: https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com/
> > Signed-off-by: Maciej Wieczor-Retman <[email protected]>
> > ---
> > @@ -828,6 +828,8 @@ int resctrl_val(const struct resctrl_test *test,
> > sleep(1);
> >
> > /* Test runs until the callback setup() tells the test to stop. */
> > + get_domain_id("L3", uparams->cpu, &res_id);
>
> Hardcoding L3 here limits the genericness of this function. You don't even
> need to do it, get_domain_id() does "MB" -> "L3" transformation implicitly
> for you so you can just pass test->resource instead.
>
> Also, I don't understand why you now again make the naming inconsistent
> with "res_id".
>
> If you based this on top of the patches I just posted, resctl_val()
> already the domain_id variable.
Ah, I retract what I said. I see you actually want it only from L3.
> > + res_id *= snc_ways();
I don't understand what this is trying to achieve and how.
--
i.
On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
> Memory Bandwidth Monitoring (MBM) measures how much data flows between
> cache levels. Only the flow for a process specified with a Resource
> Monitoring ID (RMID) is measured.
>
> Sub-Numa Clustering (SNC) causes MBM selftest to fail because the
> increased amount of NUMA nodes per socket is not taken into account.
> That in turn makes the test use incorrect values for the start and end
> of the measurement by tracking the wrong node.
>
> For the MBM selftest to be NUMA-aware it needs to track the start and end
> values of a measurement not for a single node but for all nodes on the
> same socket. Then summing all measured values comes out as the real
> bandwidth used by the process.
>
> Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
> Closes: https://lore.kernel.org/lkml/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01MB6330.jpnprd01.prod.outlook.com/
> Signed-off-by: Maciej Wieczor-Retman <[email protected]>
> ---
> tools/testing/selftests/resctrl/mba_test.c | 1 -
> tools/testing/selftests/resctrl/resctrl_val.c | 37 ++++++++++++-------
> 2 files changed, 23 insertions(+), 15 deletions(-)
>
> diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
> index 7946e32e85c8..fc31a61dab0c 100644
> --- a/tools/testing/selftests/resctrl/mba_test.c
> +++ b/tools/testing/selftests/resctrl/mba_test.c
> @@ -147,7 +147,6 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
> struct resctrl_val_param param = {
> .resctrl_val = MBA_STR,
> .ctrlgrp = "c1",
> - .mongrp = "m1",
> .filename = RESULT_FILE_NAME,
> .bw_report = "reads",
> .setup = mba_setup
> diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
> index e75e3923ebe2..2fe9f8bb4a45 100644
> --- a/tools/testing/selftests/resctrl/resctrl_val.c
> +++ b/tools/testing/selftests/resctrl/resctrl_val.c
> @@ -595,9 +595,10 @@ static void initialize_llc_occu_resctrl(const char *ctrlgrp, const char *mongrp,
>
> static int measure_vals(const struct user_params *uparams,
> struct resctrl_val_param *param,
> - unsigned long *bw_resc_start)
> + unsigned long *bw_resc_start,
> + int res_id)
> {
> - unsigned long bw_resc, bw_resc_end;
> + unsigned long bw_resc = 0, bw_resc_sum = 0, bw_resc_end;
> float bw_imc;
> int ret;
>
> @@ -612,17 +613,19 @@ static int measure_vals(const struct user_params *uparams,
> if (ret < 0)
> return ret;
>
> - ret = get_mem_bw_resctrl(&bw_resc_end);
> - if (ret < 0)
> - return ret;
> + for (int i = 0 ; i < snc_ways() ; i++) {
> + set_mbm_path(param->ctrlgrp, strlen(param->mongrp) > 0 ? param->mongrp : NULL,
> + res_id + i);
> + ret = get_mem_bw_resctrl(&bw_resc_end);
> + bw_resc = (bw_resc_end - bw_resc_start[i]) / MB;
> + bw_resc_sum += bw_resc;
> + bw_resc_start[i] = bw_resc_end;
> + }
>
> - bw_resc = (bw_resc_end - *bw_resc_start) / MB;
> - ret = print_results_bw(param->filename, bm_pid, bw_imc, bw_resc);
> + ret = print_results_bw(param->filename, bm_pid, bw_imc, bw_resc_sum);
> if (ret)
> return ret;
>
> - *bw_resc_start = bw_resc_end;
> -
> return 0;
> }
>
> @@ -697,12 +700,16 @@ int resctrl_val(const struct resctrl_test *test,
> struct resctrl_val_param *param)
> {
> char *resctrl_val = param->resctrl_val;
> - unsigned long bw_resc_start = 0;
> int res_id, ret = 0, pipefd[2];
> + unsigned long *bw_resc_start;
> struct sigaction sigact;
> char pipe_message = 0;
> union sigval value;
>
> + bw_resc_start = calloc(snc_ways(), sizeof(unsigned long));
While correct, this seems a bit overkill given is MAX_SNC = 4, not
something large or unbounded.
This patch would be be much simpler on top of my resctrl_val() cleanup
patches because bw_resc_start is only local to the measurement function.
--
i.
> + if (!bw_resc_start)
> + return -1;
> +
> if (strcmp(param->filename, "") == 0)
> sprintf(param->filename, "stdio");
>
> @@ -710,7 +717,7 @@ int resctrl_val(const struct resctrl_test *test,
> !strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR))) {
> ret = validate_bw_report_request(param->bw_report);
> if (ret)
> - return ret;
> + goto out_free;
> }
>
> /*
> @@ -721,7 +728,7 @@ int resctrl_val(const struct resctrl_test *test,
>
> if (pipe(pipefd)) {
> ksft_perror("Unable to create pipe");
> -
> + free(bw_resc_start);
> return -1;
> }
>
> @@ -733,7 +740,7 @@ int resctrl_val(const struct resctrl_test *test,
> bm_pid = fork();
> if (bm_pid == -1) {
> ksft_perror("Unable to fork");
> -
> + free(bw_resc_start);
> return -1;
> }
>
> @@ -841,7 +848,7 @@ int resctrl_val(const struct resctrl_test *test,
>
> if (!strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR)) ||
> !strncmp(resctrl_val, MBA_STR, sizeof(MBA_STR))) {
> - ret = measure_vals(uparams, param, &bw_resc_start);
> + ret = measure_vals(uparams, param, bw_resc_start, res_id);
> if (ret)
> break;
> } else if (!strncmp(resctrl_val, CMT_STR, sizeof(CMT_STR))) {
> @@ -855,6 +862,8 @@ int resctrl_val(const struct resctrl_test *test,
>
> out:
> kill(bm_pid, SIGKILL);
> +out_free:
> + free(bw_resc_start);
>
> return ret;
> }
On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
> Cache Monitoring Technology (CMT) works by measuring how much data in L3
> cache is occupied by a given process identified by its Resource
> Monitoring ID (RMID).
>
> On systems with Sub-Numa Clusters (SNC) enabled, a process can occupy
> not only the cache that belongs to its own NUMA node but also pieces of
> other NUMA nodes' caches that lie on the same socket.
>
> A simple correction to make the CMT selftest NUMA-aware is to sum values
> reported by all nodes on the same socket for a given RMID.
>
> Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
> Closes: https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com/
> Signed-off-by: Maciej Wieczor-Retman <[email protected]>
> ---
> tools/testing/selftests/resctrl/cache.c | 17 +++++++++++------
> tools/testing/selftests/resctrl/resctrl.h | 4 +++-
> tools/testing/selftests/resctrl/resctrl_val.c | 9 ++++++---
> 3 files changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c
> index 1b339d6bbff1..dab81920033b 100644
> --- a/tools/testing/selftests/resctrl/cache.c
> +++ b/tools/testing/selftests/resctrl/cache.c
> @@ -161,16 +161,21 @@ int perf_event_measure(int pe_fd, struct perf_event_read *pe_read,
> *
> * Return: =0 on success. <0 on failure.
> */
> -int measure_llc_resctrl(const char *filename, int bm_pid)
> +int measure_llc_resctrl(const char *filename, int bm_pid, const char *ctrlgrp,
> + const char *mongrp, int res_id)
> {
> - unsigned long llc_occu_resc = 0;
> + unsigned long sum = 0, llc_occu_resc = 0;
> int ret;
>
> - ret = get_llc_occu_resctrl(&llc_occu_resc);
> - if (ret < 0)
> - return ret;
> + for (int i = 0 ; i < snc_ways() ; i++) {
Spaces as per usual coding style:
for (int i = 0; i < snc_ways(); i++) {
> + set_cmt_path(ctrlgrp, mongrp, res_id + i);
> + ret = get_llc_occu_resctrl(&llc_occu_resc);
> + if (ret < 0)
> + return ret;
> + sum += llc_occu_resc;
> + }
>
> - return print_results_cache(filename, bm_pid, llc_occu_resc);
> + return print_results_cache(filename, bm_pid, sum);
> }
>
> /*
> @@ -828,6 +828,8 @@ int resctrl_val(const struct resctrl_test *test,
> sleep(1);
>
> /* Test runs until the callback setup() tells the test to stop. */
> + get_domain_id("L3", uparams->cpu, &res_id);
Hardcoding L3 here limits the genericness of this function. You don't even
need to do it, get_domain_id() does "MB" -> "L3" transformation implicitly
for you so you can just pass test->resource instead.
Also, I don't understand why you now again make the naming inconsistent
with "res_id".
If you based this on top of the patches I just posted, resctl_val()
already the domain_id variable.
--
i.
> + res_id *= snc_ways();
> while (1) {
> ret = param->setup(test, uparams, param);
> if (ret == END_OF_TESTS) {
> @@ -844,7 +846,8 @@ int resctrl_val(const struct resctrl_test *test,
> break;
> } else if (!strncmp(resctrl_val, CMT_STR, sizeof(CMT_STR))) {
> sleep(1);
> - ret = measure_llc_resctrl(param->filename, bm_pid);
> + ret = measure_llc_resctrl(param->filename, bm_pid, param->ctrlgrp,
> + param->mongrp, res_id);
> if (ret)
> break;
> }
>
Hi guys,
On 07/03/2024 23:16, Tony Luck wrote:
> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>> Thank you for the example. I find that significantly easier to
>> understand than a single number in a generic "nodes_per_l3_cache".
>> Especially with potential confusion surrounding inconsistent "nodes"
>> between allocation and monitoring.
>>
>> How about domain_cpu_list and domain_cpu_map ?
> Like this (my test system doesn't have SNC, so all domains are the same):
>
> $ cd /sys/fs/resctrl/info/
> $ grep . */domain*
> L3/domain_cpu_list:0: 0-35,72-107
> L3/domain_cpu_list:1: 36-71,108-143
> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> L3_MON/domain_cpu_list:0: 0-35,72-107
> L3_MON/domain_cpu_list:1: 36-71,108-143
> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> MB/domain_cpu_list:0: 0-35,72-107
> MB/domain_cpu_list:1: 36-71,108-143
> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
really because that information is, er, wrong on SNC systems. Is it possible to fix that?
From Tony's earlier description of how SNC changes things, the MB controls remain
per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
scoped.
This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
views of 'the' cache hierarchy.
(I also think that this be over the threshold on 'funny machines look funny' - but I bet
someone builds an arm machine that looks like this too!)
Thanks,
James
On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
> Hi guys,
>
> On 07/03/2024 23:16, Tony Luck wrote:
> > On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
> >> Thank you for the example. I find that significantly easier to
> >> understand than a single number in a generic "nodes_per_l3_cache".
> >> Especially with potential confusion surrounding inconsistent "nodes"
> >> between allocation and monitoring.
> >>
> >> How about domain_cpu_list and domain_cpu_map ?
>
> > Like this (my test system doesn't have SNC, so all domains are the same):
> >
> > $ cd /sys/fs/resctrl/info/
> > $ grep . */domain*
> > L3/domain_cpu_list:0: 0-35,72-107
> > L3/domain_cpu_list:1: 36-71,108-143
> > L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> > L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> > L3_MON/domain_cpu_list:0: 0-35,72-107
> > L3_MON/domain_cpu_list:1: 36-71,108-143
> > L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> > L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
> > MB/domain_cpu_list:0: 0-35,72-107
> > MB/domain_cpu_list:1: 36-71,108-143
> > MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
> > MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>
> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
> really because that information is, er, wrong on SNC systems. Is it possible to fix that?
On an SNC system the resctrl domain for L3_MON becomes the SNC node
instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
Even without the SNC issue this duplication may be a useful
convienience. On Intel to get from a resctrl domain is a multi-step
process to first find which of the indexY directories has level=3
and then look for the "id" that matches the domain.
> >From Tony's earlier description of how SNC changes things, the MB controls remain
> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
> scoped.
> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
> views of 'the' cache hierarchy.
I almost went partly in that direction when I started this epic voyage.
The "almost" part was to change the names of the monitoring directories
under mon_data from (legacy non-SNC system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01
to (2 socket, SNC=2 system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03
While that is in some ways a more accurate view, it breaks a lot of
legacy monitoring applications that expect the "L3" names.
> (I also think that this be over the threshold on 'funny machines look funny' - but I bet
> someone builds an arm machine that looks like this too!)
-Tony
On 2024-03-08 at 15:59:02 +0200, Ilpo J?rvinen wrote:
>On Fri, 8 Mar 2024, Ilpo J?rvinen wrote:
>
>> On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
>>
>> > Cache Monitoring Technology (CMT) works by measuring how much data in L3
>> > cache is occupied by a given process identified by its Resource
>> > Monitoring ID (RMID).
>> >
>> > On systems with Sub-Numa Clusters (SNC) enabled, a process can occupy
>> > not only the cache that belongs to its own NUMA node but also pieces of
>> > other NUMA nodes' caches that lie on the same socket.
>> >
>> > A simple correction to make the CMT selftest NUMA-aware is to sum values
>> > reported by all nodes on the same socket for a given RMID.
>> >
>> > Reported-by: "Shaopeng Tan (Fujitsu)" <[email protected]>
>> > Closes: https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB6330.jpnprd01.prod.outlook.com/
>> > Signed-off-by: Maciej Wieczor-Retman <[email protected]>
>> > ---
>
>> > @@ -828,6 +828,8 @@ int resctrl_val(const struct resctrl_test *test,
>> > sleep(1);
>> >
>> > /* Test runs until the callback setup() tells the test to stop. */
>> > + get_domain_id("L3", uparams->cpu, &res_id);
>>
>> Hardcoding L3 here limits the genericness of this function. You don't even
>> need to do it, get_domain_id() does "MB" -> "L3" transformation implicitly
>> for you so you can just pass test->resource instead.
>>
>> Also, I don't understand why you now again make the naming inconsistent
>> with "res_id".
>>
>> If you based this on top of the patches I just posted, resctl_val()
>> already the domain_id variable.
>
>Ah, I retract what I said. I see you actually want it only from L3.
>
>> > + res_id *= snc_ways();
>
>I don't understand what this is trying to achieve and how.
We exchanged some private messages on this but I'll post the explanation here
too if anyone else was looking for it.
get_domain_id("L3"...) essentially gives us the number of the socket (on
platforms that have one L3 cache per socket - I still have too look into other
ones). The problem here is that to get an accurate reading with SNC enabled we
need to collect values from all nodes on a single socket that has the CPU our
test is running on. So we need to find the first node on that socket so then we
can loop through all the nodes on that socket. To do that we multiply res_id by
the amount of SNC nodes per socket (res_id *= snc_ways()) and that's it.
I'll add some helper with an explaining comment on what it does in the next
version.
>
>--
> i.
--
Kind regards
Maciej Wiecz?r-Retman
On 2024-03-08 at 16:07:05 +0200, Ilpo J?rvinen wrote:
>On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
>> @@ -697,12 +700,16 @@ int resctrl_val(const struct resctrl_test *test,
>> struct resctrl_val_param *param)
>> {
>> char *resctrl_val = param->resctrl_val;
>> - unsigned long bw_resc_start = 0;
>> int res_id, ret = 0, pipefd[2];
>> + unsigned long *bw_resc_start;
>> struct sigaction sigact;
>> char pipe_message = 0;
>> union sigval value;
>>
>> + bw_resc_start = calloc(snc_ways(), sizeof(unsigned long));
>
>While correct, this seems a bit overkill given is MAX_SNC = 4, not
>something large or unbounded.
>
>This patch would be be much simpler on top of my resctrl_val() cleanup
>patches because bw_resc_start is only local to the measurement function.
You're right, the series will get a lot simpler rebased onto yours. I'll try out
some different approaches and comment any relevant points under your thread [1].
[1] https://lore.kernel.org/all/[email protected]/
>
>--
> i.
--
Kind regards
Maciej Wiecz?r-Retman
Hi Tony,
On 08/03/2024 18:42, Tony Luck wrote:
> On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
>> Hi guys,
>>
>> On 07/03/2024 23:16, Tony Luck wrote:
>>> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>>>> Thank you for the example. I find that significantly easier to
>>>> understand than a single number in a generic "nodes_per_l3_cache".
>>>> Especially with potential confusion surrounding inconsistent "nodes"
>>>> between allocation and monitoring.
>>>>
>>>> How about domain_cpu_list and domain_cpu_map ?
>>
>>> Like this (my test system doesn't have SNC, so all domains are the same):
>>>
>>> $ cd /sys/fs/resctrl/info/
>>> $ grep . */domain*
>>> L3/domain_cpu_list:0: 0-35,72-107
>>> L3/domain_cpu_list:1: 36-71,108-143
>>> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>> L3_MON/domain_cpu_list:0: 0-35,72-107
>>> L3_MON/domain_cpu_list:1: 36-71,108-143
>>> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>> MB/domain_cpu_list:0: 0-35,72-107
>>> MB/domain_cpu_list:1: 36-71,108-143
>>> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>
>> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
>> really because that information is, er, wrong on SNC systems. Is it possible to fix that?
>
> On an SNC system the resctrl domain for L3_MON becomes the SNC node
> instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
>
> Even without the SNC issue this duplication may be a useful
> convienience. On Intel to get from a resctrl domain is a multi-step
> process to first find which of the indexY directories has level=3
> and then look for the "id" that matches the domain.
>
>> >From Tony's earlier description of how SNC changes things, the MB controls remain
>> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
>> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
>> scoped.
>> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
>> views of 'the' cache hierarchy.
>
> I almost went partly in that direction when I started this epic voyage.
> The "almost" part was to change the names of the monitoring directories
> under mon_data from (legacy non-SNC system):
>
> $ ls -l mon_data
> total 0
> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00
> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01
>
> to (2 socket, SNC=2 system):
>
> $ ls -l mon_data
> total 0
> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00
> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01
> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02
> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03
This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM
controls on the 'L3', but describe it as a single global L3. The MPAM driver currently
hides this by summing the NUMA node counters and reporting it as the global L3's value.
> While that is in some ways a more accurate view, it breaks a lot of
> legacy monitoring applications that expect the "L3" names.
True - but the behaviour is different from a non SNC system, if this software can read the
file - but goes wrong because the contents of the file represent something different, its
still broken.
Thanks,
James
On 3/15/2024 11:02 AM, James Morse wrote:
> On 08/03/2024 18:42, Tony Luck wrote:
>> On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
>>> Hi guys,
>>>
>>> On 07/03/2024 23:16, Tony Luck wrote:
>>>> On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
>>>>> Thank you for the example. I find that significantly easier to
>>>>> understand than a single number in a generic "nodes_per_l3_cache".
>>>>> Especially with potential confusion surrounding inconsistent "nodes"
>>>>> between allocation and monitoring.
>>>>>
>>>>> How about domain_cpu_list and domain_cpu_map ?
>>>
>>>> Like this (my test system doesn't have SNC, so all domains are the same):
>>>>
>>>> $ cd /sys/fs/resctrl/info/
>>>> $ grep . */domain*
>>>> L3/domain_cpu_list:0: 0-35,72-107
>>>> L3/domain_cpu_list:1: 36-71,108-143
>>>> L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>> L3_MON/domain_cpu_list:0: 0-35,72-107
>>>> L3_MON/domain_cpu_list:1: 36-71,108-143
>>>> L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>> MB/domain_cpu_list:0: 0-35,72-107
>>>> MB/domain_cpu_list:1: 36-71,108-143
>>>> MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
>>>> MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
>>>
>>> This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this
>>> really because that information is, er, wrong on SNC systems. Is it possible to fix that?
>>
>> On an SNC system the resctrl domain for L3_MON becomes the SNC node
>> instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
>>
>> Even without the SNC issue this duplication may be a useful
>> convienience. On Intel to get from a resctrl domain is a multi-step
>> process to first find which of the indexY directories has level=3
>> and then look for the "id" that matches the domain.
>>
>>> >From Tony's earlier description of how SNC changes things, the MB controls remain
>>> per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to
>>> describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA
>>> scoped.
>>> This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different
>>> views of 'the' cache hierarchy.
>>
>> I almost went partly in that direction when I started this epic voyage.
>> The "almost" part was to change the names of the monitoring directories
>> under mon_data from (legacy non-SNC system):
>>
>> $ ls -l mon_data
>> total 0
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01
>>
>> to (2 socket, SNC=2 system):
>>
>> $ ls -l mon_data
>> total 0
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02
>> dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03
>
> This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM
> controls on the 'L3', but describe it as a single global L3. The MPAM driver currently
> hides this by summing the NUMA node counters and reporting it as the global L3's value.
>
>
>> While that is in some ways a more accurate view, it breaks a lot of
>> legacy monitoring applications that expect the "L3" names.
>
> True - but the behaviour is different from a non SNC system, if this software can read the
> file - but goes wrong because the contents of the file represent something different, its
> still broken.
This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think
what to do about that makes me go in circles about when user space may expect resctrl to indicate
the resource and when user space may expect resctrl to indicate the scope. For example,
/sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the
"L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it
switches the meaning of the middle term to be "scope" while it still contains the monitoring
data of the "L3" resource. So does that mean user space would need to rely on
/sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files
(/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then
match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know
which resource it applies to and learn from the directory name what scope measurement is at?
Reinette
> >> While that is in some ways a more accurate view, it breaks a lot of
> >> legacy monitoring applications that expect the "L3" names.
> >
> > True - but the behaviour is different from a non SNC system, if this software can read the
> > file - but goes wrong because the contents of the file represent something different, its
> > still broken.
>
> This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think
> what to do about that makes me go in circles about when user space may expect resctrl to indicate
> the resource and when user space may expect resctrl to indicate the scope. For example,
> /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the
> "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it
> switches the meaning of the middle term to be "scope" while it still contains the monitoring
> data of the "L3" resource. So does that mean user space would need to rely on
> /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files
> (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then
> match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know
> which resource it applies to and learn from the directory name what scope measurement is at?
Reinette,
It's both a wave and a particle, depending on the observer.
In SNC systems resources on each socket are divided into 2, 3, 4 nodes. But the
division is complicated. Memory and CPU cores are easy. They are each assigned
to an SNC node. The cache is more complicated. The hash function for memory
address to cache index is the part that is SNC aware. So memory on SNC node1
will allocate in the cache indices assigned to SNC node1. But that function has to
be independent of which CPU is doing the access. That's why I keep mentioning
"well behaved NUMA applications when talking about SNC.
So the resctrl monitoring operations still work on the L3 cache, but in SNC mode
they work on a portion of the L3 cache. As long as all accesses are NUMA local you
can think of the cache as partitioned between the SNC nodes.
But not everything is well behaved from a NUMA perspective. It would be misleading
to describe the occupancy and bandwidth as belonging to an SNC node.
It's also a bit misleading to describe in terms of an L3 cache instance. But doing
so doesn't require application changes.
-Tony
Hi Tony,
On 3/18/2024 12:34 PM, Luck, Tony wrote:
>>>> While that is in some ways a more accurate view, it breaks a lot of
>>>> legacy monitoring applications that expect the "L3" names.
>>>
>>> True - but the behaviour is different from a non SNC system, if this software can read the
>>> file - but goes wrong because the contents of the file represent something different, its
>>> still broken.
>>
>> This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider and trying to think
>> what to do about that makes me go in circles about when user space may expect resctrl to indicate
>> the resource and when user space may expect resctrl to indicate the scope. For example,
>> /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the
>> "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it
>> switches the meaning of the middle term to be "scope" while it still contains the monitoring
>> data of the "L3" resource. So does that mean user space would need to rely on
>> /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files
>> (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource and then
>> match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know
>> which resource it applies to and learn from the directory name what scope measurement is at?
>
> Reinette,
>
> It's both a wave and a particle, depending on the observer.
>
> In SNC systems resources on each socket are divided into 2, 3, 4 nodes. But the
> division is complicated. Memory and CPU cores are easy. They are each assigned
> to an SNC node. The cache is more complicated. The hash function for memory
> address to cache index is the part that is SNC aware. So memory on SNC node1
> will allocate in the cache indices assigned to SNC node1. But that function has to
> be independent of which CPU is doing the access. That's why I keep mentioning
> "well behaved NUMA applications when talking about SNC.
>
> So the resctrl monitoring operations still work on the L3 cache, but in SNC mode
> they work on a portion of the L3 cache. As long as all accesses are NUMA local you
> can think of the cache as partitioned between the SNC nodes.
>
> But not everything is well behaved from a NUMA perspective. It would be misleading
> to describe the occupancy and bandwidth as belonging to an SNC node.
>
> It's also a bit misleading to describe in terms of an L3 cache instance. But doing
> so doesn't require application changes.
What is the use case for needing to expose the individual cluster counts? What if
resctrl just summed the cluster counts and presented the data as before - per L3
cache instance? I doubt that resctrl would be what applications would use to verify
whether they are "well behaved" wrt NUMA.
Reinette
> What is the use case for needing to expose the individual cluster counts? What if
> resctrl just summed the cluster counts and presented the data as before - per L3
> cache instance? I doubt that resctrl would be what applications would use to verify
> whether they are "well behaved" wrt NUMA.
Reinette,
My (perhaps naïve) belief is that in a cloud server environment there are many
well behaved NUMA applications. Only presenting the sum would lose the detailed
information from each SNC node.
-Tony
> > What is the use case for needing to expose the individual cluster counts? What if
> > resctrl just summed the cluster counts and presented the data as before - per L3
> > cache instance? I doubt that resctrl would be what applications would use to verify
> > whether they are "well behaved" wrt NUMA.
>
> Reinette,
>
> My (perhaps naïve) belief is that in a cloud server environment there are many
> well behaved NUMA applications. Only presenting the sum would lose the detailed
> information from each SNC node.
Is the answer to "A" or "B" ... why not provide both:
$ ls -l /sys/fs/resctrl/mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_01
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_03
The "L3" entries provide the sum across all SNC nodes sharing the cache. The NODE ones
give the broken out counts.
-Tony
Hi Tony,
On 3/18/2024 1:47 PM, Luck, Tony wrote:
>> What is the use case for needing to expose the individual cluster counts? What if
>> resctrl just summed the cluster counts and presented the data as before - per L3
>> cache instance? I doubt that resctrl would be what applications would use to verify
>> whether they are "well behaved" wrt NUMA.
>
> Reinette,
>
> My (perhaps naïve) belief is that in a cloud server environment there are many
> well behaved NUMA applications. Only presenting the sum would lose the detailed
> information from each SNC node.
Yes ... I understand by providing a sum the values that contributed to the sum
are lost.
Could you please help me understand the details by answering my first
question: What is the use case for needing to expose the individual cluster
counts?
This is a model specific feature so if this is something needed for just a
couple of systems I think we should be less inclined to make changes to
resctrl interface. I am starting to be concerned about something similar
becoming architectural later and then we need to wrangle this model specific
resctrl support (which has then become ABI) again to support whatever that
may look like.
Reinette
On 3/18/2024 2:04 PM, Luck, Tony wrote:
>>> What is the use case for needing to expose the individual cluster counts? What if
>>> resctrl just summed the cluster counts and presented the data as before - per L3
>>> cache instance? I doubt that resctrl would be what applications would use to verify
>>> whether they are "well behaved" wrt NUMA.
>>
>> Reinette,
>>
>> My (perhaps naïve) belief is that in a cloud server environment there are many
>> well behaved NUMA applications. Only presenting the sum would lose the detailed
>> information from each SNC node.
>
> Is the answer to "A" or "B" ... why not provide both:
>
> $ ls -l /sys/fs/resctrl/mon_data
> total 0
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_00
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_01
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_00
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_01
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_02
> dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_03
>
> The "L3" entries provide the sum across all SNC nodes sharing the cache. The NODE ones
> give the broken out counts.
Perhaps ... in this case it may make things easier to understand if
those "mon_NODE_*" directories are sub-directories of the appropriate
"mon_L3_*" directories.
Reinette
> Perhaps ... in this case it may make things easier to understand if
> those "mon_NODE_*" directories are sub-directories of the appropriate
> "mon_L3_*" directories.
Reinette,
Like this?
$ tree mon_data/
mon_data/
├── mon_L3_00
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ ├── mbm_total_bytes
│ ├── mon_NODE_00
│ │ ├── llc_occupancy
│ │ ├── mbm_local_bytes
│ │ └── mbm_total_bytes
│ └── mon_NODE_01
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
└── mon_L3_01
├── llc_occupancy
├── mbm_local_bytes
├── mbm_total_bytes
├── mon_NODE_02
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
└── mon_NODE_03
├── llc_occupancy
├── mbm_local_bytes
└── mbm_total_bytes
-Tony
> Could you please help me understand the details by answering my first
> question: What is the use case for needing to expose the individual cluster
> counts?
>
> This is a model specific feature so if this is something needed for just a
> couple of systems I think we should be less inclined to make changes to
> resctrl interface. I am starting to be concerned about something similar
> becoming architectural later and then we need to wrangle this model specific
> resctrl support (which has then become ABI) again to support whatever that
> may look like.
Reinette,
Model specific. But present in multiple consecutive generations (Sapphire Rapids,
Emerald Rapids, Granite Rapids, Sierra Forest).
Adding Peter Newman for a resctrl user perspective on SNC, rather than me
continue to speculate on possible ways this might be used.
Peter: You will need to dig back a few messages on lore.kernel.org to
get context.
-Tony
Hi Tony,
On 3/18/2024 3:00 PM, Luck, Tony wrote:
>> Perhaps ... in this case it may make things easier to understand if
>> those "mon_NODE_*" directories are sub-directories of the appropriate
>> "mon_L3_*" directories.
>
> Reinette,
>
> Like this?
>
> $ tree mon_data/
> mon_data/
> ├── mon_L3_00
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ ├── mbm_total_bytes
> │ ├── mon_NODE_00
> │ │ ├── llc_occupancy
> │ │ ├── mbm_local_bytes
> │ │ └── mbm_total_bytes
> │ └── mon_NODE_01
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> └── mon_L3_01
> ├── llc_occupancy
> ├── mbm_local_bytes
> ├── mbm_total_bytes
> ├── mon_NODE_02
> │ ├── llc_occupancy
> │ ├── mbm_local_bytes
> │ └── mbm_total_bytes
> └── mon_NODE_03
> ├── llc_occupancy
> ├── mbm_local_bytes
> └── mbm_total_bytes
>
Yes.
Pro:
* This is familiar to users when compared to existing
CTRL_MON group counts that are the sum of the MON groups within it.
* Users continue to see the clear connection between files listed in
/sys/fs/resctrl/info/L3_MON/mon_features found in "mon_L3*" directories.
* If I understand correctly it also would continue to be useful to
Arm [1] while not breaking existing user space since the
mon_L3* counts continue to reflect the entire L3 resource.
* This addresses your comment of maintaining the detailed information
from each SNC node.
* I do assume that if there is only one SNC node per L3 cache then those
mon_NODE_* directories will not be present and, to address the issue
that triggered this thread, user space can use presence of
multiple "mon_NODE_*" directories per cache instance to know if
SNC is enabled.
Con:
* Unclear how this may need to change if/when SNC becomes architectural.
* Continues to "muddy" the naming of the directories: mon_<resource>_<id>
vs mon_<scope>_<id>. This cannot be turned into agreement with user space
where first directory is mon_<resource>_<id> and second directory is
mon_<scope>_<id> because then we would need to have, for example,
mon_L3_00/mon_L3_00 where the first directory is for the resource and the
second is for the scope, which appears redundant.
* Things may get confusing if there is ever a "node" resource.
This is starting to look like an interface that "checks" all the
requirements I've seen so far. Looking at the "con" it is difficult for me
to see how doing something like this now may cause frustrations later.
Reinette
[1] https://lore.kernel.org/lkml/[email protected]/
Hi Tony,
On Mon, Mar 18, 2024 at 3:05 PM Luck, Tony <[email protected]> wrote:
>
> > Could you please help me understand the details by answering my first
> > question: What is the use case for needing to expose the individual cluster
> > counts?
> >
> > This is a model specific feature so if this is something needed for just a
> > couple of systems I think we should be less inclined to make changes to
> > resctrl interface. I am starting to be concerned about something similar
> > becoming architectural later and then we need to wrangle this model specific
> > resctrl support (which has then become ABI) again to support whatever that
> > may look like.
>
> Reinette,
>
> Model specific. But present in multiple consecutive generations (Sapphire Rapids,
> Emerald Rapids, Granite Rapids, Sierra Forest).
>
> Adding Peter Newman for a resctrl user perspective on SNC, rather than me
> continue to speculate on possible ways this might be used.
>
> Peter: You will need to dig back a few messages on lore.kernel.org to
> get context.
Our main concern with supporting SNC in resctrl is all of the
monitoring groups successfully recording memory bandwidth from all
CPUs, regardless of the RMIDs they're assigned.
I would prefer that we don't complicate the model of resctrl
monitoring domains for this feature. On ARM SoCs there will be a
plethora of technologies influencing the layout of resources, so we
shouldn't start cluttering the model with special cases for each.
I think it's valid for the number of domains in the L3 resource to
increase or stay the same when the system is configured for SNC. I
don't think the details of how the domains came about is relevant at
the resctrl interface level so long as the user has enough information
to deduce what the domain is referring to based on knowledge of their
system configuration.
I would prefer per-cluster as more information could prove useful in
some future investigation, but if you feel the data is misleading,
providing the clusters combined is also fine. I would prefer that the
choice remains consistent from this point forward on any particular
implementation to avoid breaking existing controller software
developed for that implementation.
In our main use case, we sum mon_data/*/mbm_total_bytes to determine a
group's total bandwidth, so please don't cause this logic to produce
the wrong answer.
Thanks!
-Peter