2017-08-21 09:27:39

Subject: [PATCH v2 0/2] Fixup for discontiguous/sparse numa nodes

From: Satheesh Rajendran <[email protected]>

Certain systems have sparse/discontiguous
NUMA nodes.
perf bench numa doesn't work well on such systems:
1. It shows wrong values.
2. It can hang.
3. It can show redundant information for non-existent nodes.

# numactl -H
available: 2 nodes (0,8)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 61352 MB
node 0 free: 57168 MB
node 8 cpus: 8 9 10 11 12 13 14 15
node 8 size: 65416 MB
node 8 free: 36593 MB
node distances:
node   0   8
  0:  10  40
  8:  40  10

Scenario 1:

Before Fix:
# perf bench numa mem --no-data_rand_walk -p 2 -t 20 -G 0 -P 3072 -T 0 -l 50 -c -s 1000
...
...
# 40 tasks will execute (on 9 nodes, 16 CPUs): ----> Wrong number of nodes
...
# 2.0% [0.2 mins] 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 4/1 [ 4/2 ] l: 0-0 ( 0) ----> Shows info on non-existent nodes.

After Fix:
# ./perf bench numa mem --no-data_rand_walk -p 2 -t 20 -G 0 -P 3072 -T 0 -l 50 -c -s 1000
...
...
# 40 tasks will execute (on 2 nodes, 16 CPUs):
...
# 2.0% [0.2 mins] 9/1 0/0 [ 9/1 ] l: 0-0 ( 0)
# 4.0% [0.4 mins] 21/2 19/1 [ 2/3 ] l: 0-1 ( 1) {1-2}

Scenario 2:

Before Fix:
# perf bench numa all
# Running numa/mem benchmark...
....
...
# Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
perf: bench/numa.c:306: bind_to_memnode: Assertion `!(ret)' failed. ------------> Got hung

After Fix:
# ./perf bench numa all
# Running numa/mem benchmark...
....
...
# Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"

# NOTE: ignoring bind NODEs starting at NODE#1
# NOTE: 0 tasks mem-bound, 1 tasks unbound
20.017 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.006 secs average thread-runtime
0.043 % difference between max/avg runtime
413.794 GB data processed, per thread
413.794 GB data processed, total
0.048 nsecs/byte/thread runtime
20.672 GB/sec/thread speed
20.672 GB/sec total speed

Changes in v2:
Addressed review comments on function names and allocation-failure handling.

Satheesh Rajendran (2):
perf/bench/numa: Add functions to detect sparse numa nodes
perf/bench/numa: Handle discontiguous/sparse numa nodes

tools/perf/bench/numa.c | 52 ++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 45 insertions(+), 7 deletions(-)

--
2.7.4
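
The "9 nodes" in the Before output follows from counting node ids rather than nodes: with nodes 0 and 8 present, the highest id plus one is 9, while only 2 nodes exist. A minimal sketch of that difference, assuming a plain 64-bit mask as a stand-in for libnuma's node bitmask (node_mask_t, max_node_plus_one and count_present_nodes are hypothetical names, not part of perf or libnuma):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a node bitmask: bit i set means node i exists. */
typedef uint64_t node_mask_t;

/* Number of node ids to scan: highest set bit + 1 (0 for an empty mask). */
static int max_node_plus_one(node_mask_t present)
{
	int n = 0;

	while (present) {
		n++;
		present >>= 1;
	}
	return n;
}

/* Count only the ids whose bit is set, i.e. the nodes that really exist. */
static int count_present_nodes(node_mask_t present, int nr_ids)
{
	int i, count = 0;

	assert(nr_ids >= 0 && nr_ids <= 64);
	for (i = 0; i < nr_ids; i++) {
		if (present & ((node_mask_t)1 << i))
			count++;
	}
	return count;
}
```

For the topology shown by numactl -H above (bits 0 and 8 set), max_node_plus_one() gives 9 and count_present_nodes() gives 2, which matches the Before and After summary lines.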

2017-08-21 09:28:06

by Satheesh Rajendran

Subject: [PATCH v2 1/2] perf/bench/numa: Add functions to detect sparse numa nodes

From: Satheesh Rajendran <[email protected]>

Add functions to 1) get a count of all nodes that are exposed to
userspace (these could be memoryless CPU nodes or CPU-less memory
nodes), 2) check whether a given node is present, and 3) check whether
a given node has CPUs.

This information can be used to handle sparse/discontiguous nodes.

Reviewed-by: Arnaldo Carvalho de Melo <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Satheesh Rajendran <[email protected]>
Signed-off-by: Balamuruhan S <[email protected]>
---
tools/perf/bench/numa.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
index 469d65b..300faba1 100644
--- a/tools/perf/bench/numa.c
+++ b/tools/perf/bench/numa.c
@@ -215,6 +215,41 @@ static const char * const numa_usage[] = {
NULL
};

+// To get number of numa nodes present.
+static int nr_numa_nodes(void)
+{
+ int i, nr_nodes = 0;
+
+ for (i = 0; i < g->p.nr_nodes; i++) {
+ if (numa_bitmask_isbitset(numa_nodes_ptr, i))
+ nr_nodes++;
+ }
+ return nr_nodes;
+}
+
+// To check if given numa node is present.
+static int is_node_present(int node)
+{
+ return numa_bitmask_isbitset(numa_nodes_ptr, node);
+}
+
+// To check given numa node has cpus.
+static bool node_has_cpus(int node)
+{
+ struct bitmask *cpu = numa_allocate_cpumask();
+ unsigned int i;
+
+ if (cpu == NULL)
+ return false; // lets fall back to nocpus safely
+ if (numa_node_to_cpus(node, cpu) == 0) {
+ for (i = 0; i < cpu->size; i++) {
+ if (numa_bitmask_isbitset(cpu, i))
+ return true;
+ }
+ }
+ return false;
+}
+
static cpu_set_t bind_to_cpu(int target_cpu)
{
cpu_set_t orig_mask, mask;
--
2.7.4
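
To illustrate how a "has CPUs" check lets a caller stop at the first unusable node id, here is a small sketch that does not use libnuma. It assumes a hypothetical per-node array in which cpus_of_node[n] has bit c set when CPU c belongs to node n; toy_node_has_cpus and first_cpuless_node are illustrative names only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: cpus_of_node[n] is a CPU mask for node id n.
 * An id that is absent or CPU-less simply has an empty mask.
 */
static bool toy_node_has_cpus(const uint64_t *cpus_of_node, size_t nr_ids,
			      size_t node)
{
	if (node >= nr_ids)
		return false;	/* out-of-range ids are treated as CPU-less */
	return cpus_of_node[node] != 0;
}

/* First node id at or after 'start' without CPUs, or nr_ids if none. */
static size_t first_cpuless_node(const uint64_t *cpus_of_node, size_t nr_ids,
				 size_t start)
{
	size_t node;

	for (node = start; node < nr_ids; node++) {
		if (!toy_node_has_cpus(cpus_of_node, nr_ids, node))
			break;
	}
	return node;
}
```

With a 9-entry array where only entries 0 and 8 are non-empty, first_cpuless_node() starting from 0 returns 1, the same id that appears in the "ignoring bind NODEs starting at NODE#1" note in the cover letter.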

2017-08-21 09:28:36

by Satheesh Rajendran

Subject: [PATCH v2 2/2] perf/bench/numa: Handle discontiguous/sparse numa nodes

From: Satheesh Rajendran <[email protected]>

Certain systems are designed to have sparse/discontiguous nodes.
On such systems, perf bench numa hangs, shows the wrong number of nodes
and shows values for non-existent nodes. Handle this by only
taking nodes that are exposed by the kernel to userspace.

Reviewed-by: Arnaldo Carvalho de Melo <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Satheesh Rajendran <[email protected]>
Signed-off-by: Balamuruhan S <[email protected]>
---
tools/perf/bench/numa.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
index 300faba1..a3deee2 100644
--- a/tools/perf/bench/numa.c
+++ b/tools/perf/bench/numa.c
@@ -278,12 +278,12 @@ static cpu_set_t bind_to_cpu(int target_cpu)

static cpu_set_t bind_to_node(int target_node)
{
- int cpus_per_node = g->p.nr_cpus/g->p.nr_nodes;
+ int cpus_per_node = g->p.nr_cpus/nr_numa_nodes();
cpu_set_t orig_mask, mask;
int cpu;
int ret;

- BUG_ON(cpus_per_node*g->p.nr_nodes != g->p.nr_cpus);
+ BUG_ON(cpus_per_node*nr_numa_nodes() != g->p.nr_cpus);
BUG_ON(!cpus_per_node);

ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
@@ -683,7 +683,7 @@ static int parse_setup_node_list(void)
int i;

for (i = 0; i < mul; i++) {
- if (t >= g->p.nr_tasks) {
+ if (t >= g->p.nr_tasks || !node_has_cpus(bind_node)) {
printf("\n# NOTE: ignoring bind NODEs starting at NODE#%d\n", bind_node);
goto out;
}
@@ -964,6 +964,7 @@ static void calc_convergence(double runtime_ns_max, double *convergence)
int node;
int cpu;
int t;
+ int processes;

if (!g->p.show_convergence && !g->p.measure_convergence)
return;
@@ -998,13 +999,14 @@ static void calc_convergence(double runtime_ns_max, double *convergence)
sum = 0;

for (node = 0; node < g->p.nr_nodes; node++) {
+ if (!is_node_present(node))
+ continue;
nr = nodes[node];
nr_min = min(nr, nr_min);
nr_max = max(nr, nr_max);
sum += nr;
}
BUG_ON(nr_min > nr_max);
-
BUG_ON(sum > g->p.nr_tasks);

if (0 && (sum < g->p.nr_tasks))
@@ -1018,8 +1020,9 @@ static void calc_convergence(double runtime_ns_max, double *convergence)
process_groups = 0;

for (node = 0; node < g->p.nr_nodes; node++) {
- int processes = count_node_processes(node);
-
+ if (!is_node_present(node))
+ continue;
+ processes = count_node_processes(node);
nr = nodes[node];
tprintf(" %2d/%-2d", nr, processes);

@@ -1325,7 +1328,7 @@ static void print_summary(void)

printf("\n ###\n");
printf(" # %d %s will execute (on %d nodes, %d CPUs):\n",
- g->p.nr_tasks, g->p.nr_tasks == 1 ? "task" : "tasks", g->p.nr_nodes, g->p.nr_cpus);
+ g->p.nr_tasks, g->p.nr_tasks == 1 ? "task" : "tasks", nr_numa_nodes(), g->p.nr_cpus);
printf(" # %5dx %5ldMB global shared mem operations\n",
g->p.nr_loops, g->p.bytes_global/1024/1024);
printf(" # %5dx %5ldMB process shared mem operations\n",
--
2.7.4
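
The bind_to_node() hunk changes only the divisor, but the arithmetic is worth spelling out. With 16 CPUs and a node count of 9 (the figure printed in the Before output), integer division gives 16/9 = 1 and 1*9 != 16, so the first BUG_ON() condition would be true whenever that function is reached. With the 2 nodes that actually exist, 16/2 = 8 and 8*2 == 16. A minimal sketch of the same check as a pure function (even_cpus_per_node is a hypothetical name; it returns -1 where the patch would hit BUG_ON()):

```c
/*
 * Returns the number of CPUs per node, or -1 when the CPUs do not
 * divide evenly across the nodes or the result would be zero.
 * This mirrors the two BUG_ON() conditions in bind_to_node().
 */
static int even_cpus_per_node(int nr_cpus, int nr_nodes)
{
	int per_node;

	if (nr_cpus <= 0 || nr_nodes <= 0)
		return -1;

	per_node = nr_cpus / nr_nodes;
	if (per_node == 0 || per_node * nr_nodes != nr_cpus)
		return -1;

	return per_node;
}
```

Here even_cpus_per_node(16, 9) returns -1, while even_cpus_per_node(16, 2) returns 8.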