2017-08-15 08:46:51

by Kemi Wang

Subject: [PATCH 0/2] Separate NUMA statistics from zone statistics

Each page allocation updates a set of per-zone statistics with a call to
zone_statistics(). As discussed in 2017 MM submit, these are a substantial
source of overhead in the page allocator and are very rarely consumed. There
is significant overhead from cache bouncing caused by the zone counters (NUMA
associated counters) being updated in parallel in multi-threaded page
allocation (pointed out by Dave Hansen).

To mitigate this overhead, this patchset separates NUMA statistics from the
zone statistics framework and updates the NUMA counter threshold to a fixed
size of 32765, as a small threshold greatly increases the frequency of
updating the global counter from the local per-cpu counters (suggested by
Ying Huang). The rationale is that these statistics counters don't need to be
read often, unlike other VM counters, so it's not a problem to use a large
threshold and make readers more expensive.

With this patchset, we see a 26.6% drop in CPU cycles (537-->394, see below)
per single page allocation and reclaim on Jesper's page_bench03 benchmark.
Meanwhile, this patchset keeps the same style of virtual memory statistics
with little end-user-visible effect (see the first patch for details), except
that the per-cpu number of NUMA items (vm_numa_stat_diff[]) is added to
zone->vm_numa_stat[] when a user *reads* the value of a NUMA counter, to
eliminate deviation.
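
For reference, the read-side folding described above boils down to a helper
along these lines (shown here in simplified form; the full version is in the
second patch):

        static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
                                        enum zone_numa_stat_item item)
        {
                /* The (possibly slightly stale) global value for this item */
                long x = atomic_long_read(&zone->vm_numa_stat[item]);
                int cpu;

                /* Fold in the per-cpu deltas that have not been flushed yet */
                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];

                return x;
        }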

I ran an experiment of single page allocation and reclaim concurrently, using
Jesper's page_bench03 benchmark on a 2-socket Broadwell-based server (88
processors with 126G memory), with different sizes of the pcp counter
threshold.

Benchmark provided by Jesper D Broucer (loop count increased to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

Threshold   CPU cycles   Throughput(88 threads)
32          799          241760478
64          640          301628829
125         537          358906028   <==> system by default
256         468          412397590
512         428          450550704
4096        399          482520943
20000       394          489009617
30000       395          488017817
32765       394(-26.6%)  488932078(+36.2%)   <==> with this patchset
N/A         342(-36.3%)  562900157(+56.8%)   <==> disable zone_statistics

Kemi Wang (2):
mm: Change the call sites of numa statistics items
mm: Update NUMA counter threshold size

drivers/base/node.c | 22 ++++---
include/linux/mmzone.h | 25 +++++---
include/linux/vmstat.h | 33 ++++++++++
mm/page_alloc.c | 10 +--
mm/vmstat.c | 162 +++++++++++++++++++++++++++++++++++++++++++++++--
5 files changed, 227 insertions(+), 25 deletions(-)

--
2.7.4


2017-08-15 08:46:58

by Kemi Wang

Subject: [PATCH 2/2] mm: Update NUMA counter threshold size

There is significant overhead from cache bouncing caused by the zone counters
(NUMA associated counters) being updated in parallel in multi-threaded page
allocation (suggested by Dave Hansen).

This patch updates the NUMA counter threshold to a fixed size of 32765, as a
small threshold greatly increases the frequency of updating the global
counter from the local per-cpu counters. The per-cpu number of NUMA items
(vm_numa_stat_diff[]) is added to zone->vm_numa_stat[] when a user *reads*
the value of a NUMA counter, to eliminate deviation (suggested by Ying
Huang).

The rationale is that these statistics counters don't need to be read often,
unlike other VM counters, so it's not a problem to use a large threshold and
make readers more expensive.

With this patchset, we see a 26.6% drop in CPU cycles (537-->394) per single
page allocation and reclaim on Jesper's page_bench03 benchmark.

Benchmark provided by Jesper D Broucer (loop count increased to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

Threshold   CPU cycles   Throughput(88 threads)
32          799          241760478
64          640          301628829
125         537          358906028   <==> system by default (base)
256         468          412397590
512         428          450550704
4096        399          482520943
20000       394          489009617
30000       395          488017817
32765       394(-26.6%)  488932078(+36.2%)   <==> with this patchset
N/A         342(-36.3%)  562900157(+56.8%)   <==> disable zone_statistics

Signed-off-by: Kemi Wang <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Ying Huang <[email protected]>
---
include/linux/mmzone.h | 4 ++--
include/linux/vmstat.h | 6 +++++-
mm/vmstat.c | 23 ++++++++++-------------
3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0b11ba7..7eaf0e8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -282,8 +282,8 @@ struct per_cpu_pageset {
struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
s8 expire;
- s8 numa_stat_threshold;
- s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
+ s16 numa_stat_threshold;
+ s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
#endif
#ifdef CONFIG_SMP
s8 stat_threshold;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1e19379..d97cc34 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
return x;
}

-static inline unsigned long zone_numa_state(struct zone *zone,
+static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
enum zone_numa_stat_item item)
{
long x = atomic_long_read(&zone->vm_numa_stat[item]);
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];

return x;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5a7fa30..c7f50ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,6 +30,8 @@

#include "internal.h"

+#define NUMA_STAT_THRESHOLD 32765
+
#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
EXPORT_PER_CPU_SYMBOL(vm_event_states);
@@ -196,7 +198,7 @@ void refresh_zone_stat_thresholds(void)
= threshold;
#ifdef CONFIG_NUMA
per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
- = threshold;
+ = NUMA_STAT_THRESHOLD;
#endif
/* Base nodestat threshold on the largest populated zone. */
pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
@@ -231,14 +233,9 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
continue;

threshold = (*calculate_pressure)(zone);
- for_each_online_cpu(cpu) {
+ for_each_online_cpu(cpu)
per_cpu_ptr(zone->pageset, cpu)->stat_threshold
= threshold;
-#ifdef CONFIG_NUMA
- per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
- = threshold;
-#endif
- }
}
}

@@ -872,13 +869,13 @@ void __inc_zone_numa_state(struct zone *zone,
enum zone_numa_stat_item item)
{
struct per_cpu_pageset __percpu *pcp = zone->pageset;
- s8 __percpu *p = pcp->vm_numa_stat_diff + item;
- s8 v, t;
+ s16 __percpu *p = pcp->vm_numa_stat_diff + item;
+ s16 v, t;

v = __this_cpu_inc_return(*p);
t = __this_cpu_read(pcp->numa_stat_threshold);
if (unlikely(v > t)) {
- s8 overstep = t >> 1;
+ s16 overstep = t >> 1;

zone_numa_state_add(v + overstep, zone, item);
__this_cpu_write(*p, -overstep);
@@ -914,7 +911,7 @@ unsigned long sum_zone_node_numa_state(int node,
unsigned long count = 0;

for (i = 0; i < MAX_NR_ZONES; i++)
- count += zone_numa_state(zones + i, item);
+ count += zone_numa_state_snapshot(zones + i, item);

return count;
}
@@ -1536,7 +1533,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
seq_printf(m, "\n %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
- zone_numa_state(zone, i));
+ zone_numa_state_snapshot(zone, i));
#endif

seq_printf(m, "\n pagesets");
@@ -1795,7 +1792,7 @@ static bool need_update(int cpu)

BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
#ifdef CONFIG_NUMA
- BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 1);
+ BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2);
#endif
/*
* The fast way of checking if there are any vmstat diffs.
--
2.7.4

2017-08-15 08:47:11

by Kemi Wang

Subject: [PATCH 1/2] mm: Change the call sites of numa statistics items

In this patch, NUMA statistics are separated from the zone statistics
framework, and all the call sites of NUMA stats are changed to use
numa-stats-specific functions. There is no functional change, except that the
values of the NUMA stats are shown behind the zone page stats, and the
threshold size of the NUMA stats is shown behind the pcp threshold, when
users *read* the zone info.

E.g. cat /proc/zoneinfo
***Base***                          ***With this patch***
nr_free_pages 3976                  nr_free_pages 3976
nr_zone_inactive_anon 0             nr_zone_inactive_anon 0
nr_zone_active_anon 0               nr_zone_active_anon 0
nr_zone_inactive_file 0             nr_zone_inactive_file 0
nr_zone_active_file 0               nr_zone_active_file 0
nr_zone_unevictable 0               nr_zone_unevictable 0
nr_zone_write_pending 0             nr_zone_write_pending 0
nr_mlock 0                          nr_mlock 0
nr_page_table_pages 0               nr_page_table_pages 0
nr_kernel_stack 0                   nr_kernel_stack 0
nr_bounce 0                         nr_bounce 0
nr_zspages 0                        nr_zspages 0
numa_hit 0                          *nr_free_cma 0*
numa_miss 0                         numa_hit 0
numa_foreign 0                      numa_miss 0
numa_interleave 0                   numa_foreign 0
numa_local 0                        numa_interleave 0
numa_other 0                        numa_local 0
*nr_free_cma 0*                     numa_other 0
...                                 ...
vm stats threshold: 10              vm stats threshold: 10
...                                 *vm numa stats threshold: 10*
                                    ...

The next patch updates the numa stats counter size and threshold.

Signed-off-by: Kemi Wang <[email protected]>
---
drivers/base/node.c | 22 ++++---
include/linux/mmzone.h | 25 +++++---
include/linux/vmstat.h | 29 +++++++++
mm/page_alloc.c | 10 +--
mm/vmstat.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++--
5 files changed, 227 insertions(+), 26 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index d8dc830..12080c6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -160,12 +160,12 @@ static ssize_t node_read_numastat(struct device *dev,
"interleave_hit %lu\n"
"local_node %lu\n"
"other_node %lu\n",
- sum_zone_node_page_state(dev->id, NUMA_HIT),
- sum_zone_node_page_state(dev->id, NUMA_MISS),
- sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
- sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
- sum_zone_node_page_state(dev->id, NUMA_LOCAL),
- sum_zone_node_page_state(dev->id, NUMA_OTHER));
+ sum_zone_node_numa_state(dev->id, NUMA_HIT),
+ sum_zone_node_numa_state(dev->id, NUMA_MISS),
+ sum_zone_node_numa_state(dev->id, NUMA_FOREIGN),
+ sum_zone_node_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
+ sum_zone_node_numa_state(dev->id, NUMA_LOCAL),
+ sum_zone_node_numa_state(dev->id, NUMA_OTHER));
}
static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);

@@ -181,9 +181,17 @@ static ssize_t node_read_vmstat(struct device *dev,
n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
sum_zone_node_page_state(nid, i));

- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+#ifdef CONFIG_NUMA
+ for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+ sum_zone_node_numa_state(nid, i));
+#endif
+
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+ n += sprintf(buf+n, "%s %lu\n",
+ vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
+ NR_VM_ZONE_NUMA_STAT_ITEMS],
node_page_state(pgdat, i));

return n;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b..0b11ba7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -114,6 +114,20 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif

+#ifdef CONFIG_NUMA
+enum zone_numa_stat_item {
+ NUMA_HIT, /* allocated in intended node */
+ NUMA_MISS, /* allocated in non intended node */
+ NUMA_FOREIGN, /* was intended here, hit elsewhere */
+ NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
+ NUMA_LOCAL, /* allocation from local node */
+ NUMA_OTHER, /* allocation from other node */
+ NR_VM_ZONE_NUMA_STAT_ITEMS
+};
+#else
+#define NR_VM_ZONE_NUMA_STAT_ITEMS 0
+#endif
+
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
@@ -132,14 +146,6 @@ enum zone_stat_item {
#if IS_ENABLED(CONFIG_ZSMALLOC)
NR_ZSPAGES, /* allocated in zsmalloc */
#endif
-#ifdef CONFIG_NUMA
- NUMA_HIT, /* allocated in intended node */
- NUMA_MISS, /* allocated in non intended node */
- NUMA_FOREIGN, /* was intended here, hit elsewhere */
- NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
- NUMA_LOCAL, /* allocation from local node */
- NUMA_OTHER, /* allocation from other node */
-#endif
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

@@ -276,6 +282,8 @@ struct per_cpu_pageset {
struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
s8 expire;
+ s8 numa_stat_threshold;
+ s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
#endif
#ifdef CONFIG_SMP
s8 stat_threshold;
@@ -496,6 +504,7 @@ struct zone {
ZONE_PADDING(_pad3_)
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+ atomic_long_t vm_numa_stat[NR_VM_ZONE_NUMA_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;

enum pgdat_flags {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index b3d85f3..1e19379 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -107,8 +107,33 @@ static inline void vm_events_fold_cpu(int cpu)
* Zone and node-based page accounting with per cpu differentials.
*/
extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_zone_numa_stat[NR_VM_ZONE_NUMA_STAT_ITEMS];
extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];

+#ifdef CONFIG_NUMA
+static inline void zone_numa_state_add(long x, struct zone *zone,
+ enum zone_numa_stat_item item)
+{
+ atomic_long_add(x, &zone->vm_numa_stat[item]);
+ atomic_long_add(x, &vm_zone_numa_stat[item]);
+}
+
+static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
+{
+ long x = atomic_long_read(&vm_zone_numa_stat[item]);
+
+ return x;
+}
+
+static inline unsigned long zone_numa_state(struct zone *zone,
+ enum zone_numa_stat_item item)
+{
+ long x = atomic_long_read(&zone->vm_numa_stat[item]);
+
+ return x;
+}
+#endif /* CONFIG_NUMA */
+
static inline void zone_page_state_add(long x, struct zone *zone,
enum zone_stat_item item)
{
@@ -194,8 +219,12 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,


#ifdef CONFIG_NUMA
+extern void __inc_zone_numa_state(struct zone *,
+ enum zone_numa_stat_item);
extern unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item);
+extern unsigned long sum_zone_node_numa_state(int node,
+ enum zone_numa_stat_item item);
extern unsigned long node_page_state(struct pglist_data *pgdat,
enum node_stat_item item);
#else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d30e91..7222c47 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2720,18 +2720,18 @@ int __isolate_free_page(struct page *page, unsigned int order)
static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
#ifdef CONFIG_NUMA
- enum zone_stat_item local_stat = NUMA_LOCAL;
+ enum zone_numa_stat_item local_stat = NUMA_LOCAL;

if (z->node != numa_node_id())
local_stat = NUMA_OTHER;

if (z->node == preferred_zone->node)
- __inc_zone_state(z, NUMA_HIT);
+ __inc_zone_numa_state(z, NUMA_HIT);
else {
- __inc_zone_state(z, NUMA_MISS);
- __inc_zone_state(preferred_zone, NUMA_FOREIGN);
+ __inc_zone_numa_state(z, NUMA_MISS);
+ __inc_zone_numa_state(preferred_zone, NUMA_FOREIGN);
}
- __inc_zone_state(z, local_stat);
+ __inc_zone_numa_state(z, local_stat);
#endif
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9a4441b..5a7fa30 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -87,8 +87,10 @@ void vm_events_fold_cpu(int cpu)
* vm_stat contains the global counters
*/
atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
+atomic_long_t vm_zone_numa_stat[NR_VM_ZONE_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
EXPORT_SYMBOL(vm_zone_stat);
+EXPORT_SYMBOL(vm_zone_numa_stat);
EXPORT_SYMBOL(vm_node_stat);

#ifdef CONFIG_SMP
@@ -192,7 +194,10 @@ void refresh_zone_stat_thresholds(void)

per_cpu_ptr(zone->pageset, cpu)->stat_threshold
= threshold;
-
+#ifdef CONFIG_NUMA
+ per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
+ = threshold;
+#endif
/* Base nodestat threshold on the largest populated zone. */
pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
@@ -226,9 +231,14 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
continue;

threshold = (*calculate_pressure)(zone);
- for_each_online_cpu(cpu)
+ for_each_online_cpu(cpu) {
per_cpu_ptr(zone->pageset, cpu)->stat_threshold
= threshold;
+#ifdef CONFIG_NUMA
+ per_cpu_ptr(zone->pageset, cpu)->numa_stat_threshold
+ = threshold;
+#endif
+ }
}
}

@@ -604,6 +614,32 @@ EXPORT_SYMBOL(dec_node_page_state);
* Fold a differential into the global counters.
* Returns the number of counters updated.
*/
+#ifdef CONFIG_NUMA
+static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
+{
+ int i;
+ int changes = 0;
+
+ for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+ if (zone_diff[i]) {
+ atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
+ changes++;
+ }
+
+ for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+ if (numa_diff[i]) {
+ atomic_long_add(numa_diff[i], &vm_zone_numa_stat[i]);
+ changes++;
+ }
+
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+ if (node_diff[i]) {
+ atomic_long_add(node_diff[i], &vm_node_stat[i]);
+ changes++;
+ }
+ return changes;
+}
+#else
static int fold_diff(int *zone_diff, int *node_diff)
{
int i;
@@ -622,6 +658,7 @@ static int fold_diff(int *zone_diff, int *node_diff)
}
return changes;
}
+#endif /* CONFIG_NUMA */

/*
* Update the zone counters for the current cpu.
@@ -645,6 +682,9 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+#ifdef CONFIG_NUMA
+ int global_numa_diff[NR_VM_ZONE_NUMA_STAT_ITEMS] = {0, };
+#endif
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
int changes = 0;

@@ -666,6 +706,18 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
}
#ifdef CONFIG_NUMA
+ for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++) {
+ int v;
+
+ v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0);
+ if (v) {
+
+ atomic_long_add(v, &zone->vm_numa_stat[i]);
+ global_numa_diff[i] += v;
+ __this_cpu_write(p->expire, 3);
+ }
+ }
+
if (do_pagesets) {
cond_resched();
/*
@@ -712,7 +764,11 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
}

+#ifdef CONFIG_NUMA
+ changes += fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
+#else
changes += fold_diff(global_zone_diff, global_node_diff);
+#endif
return changes;
}

@@ -727,6 +783,9 @@ void cpu_vm_stats_fold(int cpu)
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+#ifdef CONFIG_NUMA
+ int global_numa_diff[NR_VM_ZONE_NUMA_STAT_ITEMS] = {0, };
+#endif
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };

for_each_populated_zone(zone) {
@@ -743,6 +802,18 @@ void cpu_vm_stats_fold(int cpu)
atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
}
+
+#ifdef CONFIG_NUMA
+ for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+ if (p->vm_numa_stat_diff[i]) {
+ int v;
+
+ v = p->vm_numa_stat_diff[i];
+ p->vm_numa_stat_diff[i] = 0;
+ atomic_long_add(v, &zone->vm_numa_stat[i]);
+ global_numa_diff[i] += v;
+ }
+#endif
}

for_each_online_pgdat(pgdat) {
@@ -761,7 +832,11 @@ void cpu_vm_stats_fold(int cpu)
}
}

+#ifdef CONFIG_NUMA
+ fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
+#else
fold_diff(global_zone_diff, global_node_diff);
+#endif
}

/*
@@ -779,10 +854,37 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
atomic_long_add(v, &zone->vm_stat[i]);
atomic_long_add(v, &vm_zone_stat[i]);
}
+
+#ifdef CONFIG_NUMA
+ for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+ if (pset->vm_numa_stat_diff[i]) {
+ int v = pset->vm_numa_stat_diff[i];
+ pset->vm_numa_stat_diff[i] = 0;
+ atomic_long_add(v, &zone->vm_numa_stat[i]);
+ atomic_long_add(v, &vm_zone_numa_stat[i]);
+ }
+#endif
}
#endif

#ifdef CONFIG_NUMA
+void __inc_zone_numa_state(struct zone *zone,
+ enum zone_numa_stat_item item)
+{
+ struct per_cpu_pageset __percpu *pcp = zone->pageset;
+ s8 __percpu *p = pcp->vm_numa_stat_diff + item;
+ s8 v, t;
+
+ v = __this_cpu_inc_return(*p);
+ t = __this_cpu_read(pcp->numa_stat_threshold);
+ if (unlikely(v > t)) {
+ s8 overstep = t >> 1;
+
+ zone_numa_state_add(v + overstep, zone, item);
+ __this_cpu_write(*p, -overstep);
+ }
+}
+
/*
* Determine the per node value of a stat item. This function
* is called frequently in a NUMA machine, so try to be as
@@ -801,6 +903,22 @@ unsigned long sum_zone_node_page_state(int node,
return count;
}

+/* Determine the per node value of a numa stat item. To avoid deviation,
+ * the per-cpu stat number in vm_numa_stat_diff[] is also included.
+ */
+unsigned long sum_zone_node_numa_state(int node,
+ enum zone_numa_stat_item item)
+{
+ struct zone *zones = NODE_DATA(node)->node_zones;
+ int i;
+ unsigned long count = 0;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ count += zone_numa_state(zones + i, item);
+
+ return count;
+}
+
/*
* Determine the per node value of a stat item.
*/
@@ -934,6 +1052,9 @@ const char * const vmstat_text[] = {
#if IS_ENABLED(CONFIG_ZSMALLOC)
"nr_zspages",
#endif
+ "nr_free_cma",
+
+ /* enum zone_numa_stat_item counters */
#ifdef CONFIG_NUMA
"numa_hit",
"numa_miss",
@@ -942,7 +1063,6 @@ const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
- "nr_free_cma",

/* Node-based counters */
"nr_inactive_anon",
@@ -1097,7 +1217,6 @@ const char * const vmstat_text[] = {
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */

-
#if (defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)) || \
defined(CONFIG_PROC_FS)
static void *frag_start(struct seq_file *m, loff_t *pos)
@@ -1375,7 +1494,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
seq_printf(m, "\n per-node stats");
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
seq_printf(m, "\n %-12s %lu",
- vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+ vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
+ NR_VM_ZONE_NUMA_STAT_ITEMS],
node_page_state(pgdat, i));
}
}
@@ -1412,6 +1532,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
seq_printf(m, "\n %-12s %lu", vmstat_text[i],
zone_page_state(zone, i));

+#ifdef CONFIG_NUMA
+ for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+ seq_printf(m, "\n %-12s %lu",
+ vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+ zone_numa_state(zone, i));
+#endif
+
seq_printf(m, "\n pagesets");
for_each_online_cpu(i) {
struct per_cpu_pageset *pageset;
@@ -1430,6 +1557,11 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
seq_printf(m, "\n vm stats threshold: %d",
pageset->stat_threshold);
#endif
+
+#ifdef CONFIG_NUMA
+ seq_printf(m, "\n vm numa stats threshold: %d",
+ pageset->numa_stat_threshold);
+#endif
}
seq_printf(m,
"\n node_unreclaimable: %u"
@@ -1488,6 +1620,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
if (*pos >= ARRAY_SIZE(vmstat_text))
return NULL;
stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
+ NR_VM_ZONE_NUMA_STAT_ITEMS * sizeof(unsigned long) +
NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);

@@ -1503,6 +1636,12 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
v[i] = global_page_state(i);
v += NR_VM_ZONE_STAT_ITEMS;

+#ifdef CONFIG_NUMA
+ for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
+ v[i] = global_numa_state(i);
+ v += NR_VM_ZONE_NUMA_STAT_ITEMS;
+#endif
+
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
v[i] = global_node_page_state(i);
v += NR_VM_NODE_STAT_ITEMS;
@@ -1604,6 +1743,16 @@ int vmstat_refresh(struct ctl_table *table, int write,
err = -EINVAL;
}
}
+#ifdef CONFIG_NUMA
+ for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++) {
+ val = atomic_long_read(&vm_zone_numa_stat[i]);
+ if (val < 0) {
+ pr_warn("%s: %s %ld\n",
+ __func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val);
+ err = -EINVAL;
+ }
+ }
+#endif
if (err)
return err;
if (write)
@@ -1645,13 +1794,19 @@ static bool need_update(int cpu)
struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);

BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
+#ifdef CONFIG_NUMA
+ BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 1);
+#endif
/*
* The fast way of checking if there are any vmstat diffs.
* This works because the diffs are byte sized items.
*/
if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
return true;
-
+#ifdef CONFIG_NUMA
+ if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_ZONE_NUMA_STAT_ITEMS))
+ return true;
+#endif
}
return false;
}
--
2.7.4

2017-08-15 09:49:44

by Mel Gorman

Subject: Re: [PATCH 1/2] mm: Change the call sites of numa statistics items

On Tue, Aug 15, 2017 at 04:45:35PM +0800, Kemi Wang wrote:
> In this patch, NUMA statistics is separated from zone statistics
> framework, all the call sites of NUMA stats are changed to use
> numa-stats-specific functions, it does not have any functionality change
> except that the value of NUMA stats is shown behind zone page stats, and
> the threshold size of NUMA stats is shown behind pcp threshold when users
> *read* the zone info.
>
> E.g. cat /proc/zoneinfo
> ***Base*** ***With this patch***
> nr_free_pages 3976 nr_free_pages 3976
> nr_zone_inactive_anon 0 nr_zone_inactive_anon 0
> nr_zone_active_anon 0 nr_zone_active_anon 0
> nr_zone_inactive_file 0 nr_zone_inactive_file 0
> nr_zone_active_file 0 nr_zone_active_file 0
> nr_zone_unevictable 0 nr_zone_unevictable 0
> nr_zone_write_pending 0 nr_zone_write_pending 0
> nr_mlock 0 nr_mlock 0
> nr_page_table_pages 0 nr_page_table_pages 0
> nr_kernel_stack 0 nr_kernel_stack 0
> nr_bounce 0 nr_bounce 0
> nr_zspages 0 nr_zspages 0
> numa_hit 0 *nr_free_cma 0*
> numa_miss 0 numa_hit 0
> numa_foreign 0 numa_miss 0
> numa_interleave 0 numa_foreign 0
> numa_local 0 numa_interleave 0
> numa_other 0 numa_local 0
> *nr_free_cma 0* numa_other 0
> ... ...
> vm stats threshold: 10 vm stats threshold: 10
> ... *vm numa stats threshold: 10*
> ...
>
> The next patch updates the numa stats counter size and threshold.
>
> Signed-off-by: Kemi Wang <[email protected]>
> ---
> drivers/base/node.c | 22 ++++---
> include/linux/mmzone.h | 25 +++++---
> include/linux/vmstat.h | 29 +++++++++
> mm/page_alloc.c | 10 +--
> mm/vmstat.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++--
> 5 files changed, 227 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index d8dc830..12080c6 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -160,12 +160,12 @@ static ssize_t node_read_numastat(struct device *dev,
> "interleave_hit %lu\n"
> "local_node %lu\n"
> "other_node %lu\n",
> - sum_zone_node_page_state(dev->id, NUMA_HIT),
> - sum_zone_node_page_state(dev->id, NUMA_MISS),
> - sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
> - sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
> - sum_zone_node_page_state(dev->id, NUMA_LOCAL),
> - sum_zone_node_page_state(dev->id, NUMA_OTHER));
> + sum_zone_node_numa_state(dev->id, NUMA_HIT),
> + sum_zone_node_numa_state(dev->id, NUMA_MISS),
> + sum_zone_node_numa_state(dev->id, NUMA_FOREIGN),
> + sum_zone_node_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
> + sum_zone_node_numa_state(dev->id, NUMA_LOCAL),
> + sum_zone_node_numa_state(dev->id, NUMA_OTHER));
> }

The names are very similar and it would be preferred if the names were
visually different like sum_zone_numa_stat() which is hard to confuse with
the zone stat fields.

> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>
> @@ -181,9 +181,17 @@ static ssize_t node_read_vmstat(struct device *dev,
> n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
> sum_zone_node_page_state(nid, i));
>
> - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> +#ifdef CONFIG_NUMA
> + for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
> n += sprintf(buf+n, "%s %lu\n",
> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> + sum_zone_node_numa_state(nid, i));
> +#endif

Similar with NR_VM_ZONE_NUMA_STAT_ITEMS, it's too similar to
NR_VM_NODE_STAT_ITEMS.

> +
> + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> + n += sprintf(buf+n, "%s %lu\n",
> + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
> + NR_VM_ZONE_NUMA_STAT_ITEMS],
> node_page_state(pgdat, i));
>
> return n;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fc14b8b..0b11ba7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -114,6 +114,20 @@ struct zone_padding {
> #define ZONE_PADDING(name)
> #endif
>
> +#ifdef CONFIG_NUMA
> +enum zone_numa_stat_item {
> + NUMA_HIT, /* allocated in intended node */
> + NUMA_MISS, /* allocated in non intended node */
> + NUMA_FOREIGN, /* was intended here, hit elsewhere */
> + NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
> + NUMA_LOCAL, /* allocation from local node */
> + NUMA_OTHER, /* allocation from other node */
> + NR_VM_ZONE_NUMA_STAT_ITEMS
> +};
> +#else
> +#define NR_VM_ZONE_NUMA_STAT_ITEMS 0
> +#endif
> +
> enum zone_stat_item {
> /* First 128 byte cacheline (assuming 64 bit words) */
> NR_FREE_PAGES,
> @@ -132,14 +146,6 @@ enum zone_stat_item {
> #if IS_ENABLED(CONFIG_ZSMALLOC)
> NR_ZSPAGES, /* allocated in zsmalloc */
> #endif
> -#ifdef CONFIG_NUMA
> - NUMA_HIT, /* allocated in intended node */
> - NUMA_MISS, /* allocated in non intended node */
> - NUMA_FOREIGN, /* was intended here, hit elsewhere */
> - NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
> - NUMA_LOCAL, /* allocation from local node */
> - NUMA_OTHER, /* allocation from other node */
> -#endif
> NR_FREE_CMA_PAGES,
> NR_VM_ZONE_STAT_ITEMS };
>
> @@ -276,6 +282,8 @@ struct per_cpu_pageset {
> struct per_cpu_pages pcp;
> #ifdef CONFIG_NUMA
> s8 expire;
> + s8 numa_stat_threshold;
> + s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> #endif
> #ifdef CONFIG_SMP
> s8 stat_threshold;

Ok, this slightly increases the size of the per_cpu_pageset due to
numa_stat_threshold. The structure occupies 2 cache lines and still occupies
2 cache lines afterwards so that is ok but consider hard-coding the value
of it. The locality stats are never used as part of a decision made by the
kernel and they get summed when reading proc unconditionally. There is little
benefit to tuning that threshold at all and there should be a very small
performance gain if it's removed because it'll be a compile-time constant.
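
For illustration, the update path could then compare against a plain
compile-time constant instead of loading a per-cpu field; something like the
following (a sketch only, not the posted patch, and the value is just an
example):

        #define NUMA_STAT_THRESHOLD     125     /* example value only */

        void __inc_zone_numa_state(struct zone *zone,
                                enum zone_numa_stat_item item)
        {
                struct per_cpu_pageset __percpu *pcp = zone->pageset;
                s8 __percpu *p = pcp->vm_numa_stat_diff + item;
                s8 v;

                v = __this_cpu_inc_return(*p);
                if (unlikely(v > NUMA_STAT_THRESHOLD)) {
                        s8 overstep = NUMA_STAT_THRESHOLD >> 1;

                        zone_numa_state_add(v + overstep, zone, item);
                        __this_cpu_write(*p, -overstep);
                }
        }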

The rest of the patch is mostly mechanical.

--
Mel Gorman
SUSE Labs

2017-08-15 09:58:22

by Mel Gorman

Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size

On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
> Threshold CPU cycles Throughput(88 threads)
> 32 799 241760478
> 64 640 301628829
> 125 537 358906028 <==> system by default (base)
> 256 468 412397590
> 512 428 450550704
> 4096 399 482520943
> 20000 394 489009617
> 30000 395 488017817
> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
> N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>
> Signed-off-by: Kemi Wang <[email protected]>
> Suggested-by: Dave Hansen <[email protected]>
> Suggested-by: Ying Huang <[email protected]>
> ---
> include/linux/mmzone.h | 4 ++--
> include/linux/vmstat.h | 6 +++++-
> mm/vmstat.c | 23 ++++++++++-------------
> 3 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0b11ba7..7eaf0e8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
> struct per_cpu_pages pcp;
> #ifdef CONFIG_NUMA
> s8 expire;
> - s8 numa_stat_threshold;
> - s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> + s16 numa_stat_threshold;
> + s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];

I'm fairly sure this pushes the size of that structure into the next
cache line which is not welcome.

vm_numa_stat_diff is an always incrementing field. How much do you gain
if this becomes a u8 code and remove any code that deals with negative
values? That would double the threshold without consuming another cache line.

Furthermore, the stats in question are only ever incremented by one.
That means that any calculation related to overlap can be removed and
special cased that it'll never overlap by more than 1. That potentially
removes code that is required for other stats but not locality stats.
This may give enough savings to avoid moving to s16.

Very broadly speaking, I like what you're doing but I would like to see
more work on reducing any unnecessary code in that path (such as dealing
with overlaps for single increments) and treat increasing the cache footprint
only as a very last resort.
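
i.e. something like this for the per_cpu_pageset fields, keeping the diff
array byte-sized but unsigned (sketch only; whether it still fits in two
cache lines would need checking with pahole):

        #ifdef CONFIG_NUMA
                s8 expire;
                u8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS]; /* increment-only, 0..255 */
        #endif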

> #endif
> #ifdef CONFIG_SMP
> s8 stat_threshold;
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 1e19379..d97cc34 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
> return x;
> }
>
> -static inline unsigned long zone_numa_state(struct zone *zone,
> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
> enum zone_numa_stat_item item)
> {
> long x = atomic_long_read(&zone->vm_numa_stat[item]);
> + int cpu;
> +
> + for_each_online_cpu(cpu)
> + x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>
> return x;
> }

This does not appear to be related to the current patch. It either
should be merged with the previous patch or stand on its own.

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 5a7fa30..c7f50ed 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -30,6 +30,8 @@
>
> #include "internal.h"
>
> +#define NUMA_STAT_THRESHOLD 32765
> +

This should be expressed in terms of the type and not a hard-coded value.

--
Mel Gorman
SUSE Labs

2017-08-15 10:37:50

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/2] Separate NUMA statistics from zone statistics

On Tue, 15 Aug 2017 16:45:34 +0800
Kemi Wang <[email protected]> wrote:

> Each page allocation updates a set of per-zone statistics with a call to
> zone_statistics(). As discussed in 2017 MM submit, these are a substantial
^^^^^^ should be "summit"
> source of overhead in the page allocator and are very rarely consumed. This
> significant overhead in cache bouncing caused by zone counters (NUMA
> associated counters) update in parallel in multi-threaded page allocation
> (pointed out by Dave Hansen).

Hi Kemi

Thanks a lot for following up on this work. A link to the MM summit slides:
http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

> To mitigate this overhead, this patchset separates NUMA statistics from
> zone statistics framework, and update NUMA counter threshold to a fixed
> size of 32765, as a small threshold greatly increases the update frequency
> of the global counter from local per cpu counter (suggested by Ying Huang).
> The rationality is that these statistics counters don't need to be read
> often, unlike other VM counters, so it's not a problem to use a large
> threshold and make readers more expensive.
>
> With this patchset, we see 26.6% drop of CPU cycles(537-->394, see below)
> for per single page allocation and reclaim on Jesper's page_bench03
> benchmark. Meanwhile, this patchset keeps the same style of virtual memory
> statistics with little end-user-visible effects (see the first patch for
> details), except that the number of NUMA items in each cpu
> (vm_numa_stat_diff[]) is added to zone->vm_numa_stat[] when a user *reads*
> the value of NUMA counter to eliminate deviation.

I'm very happy to see that you found my kernel module for benchmarking useful :-)

> I did an experiment of single page allocation and reclaim concurrently
> using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server
> (88 processors with 126G memory) with different size of threshold of pcp
> counter.
>
> Benchmark provided by Jesper D Broucer(increase loop times to 10000000):
^^^^^^^
You mis-spelled my last name, it is "Brouer".

> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
>
> Threshold CPU cycles Throughput(88 threads)
> 32 799 241760478
> 64 640 301628829
> 125 537 358906028 <==> system by default
> 256 468 412397590
> 512 428 450550704
> 4096 399 482520943
> 20000 394 489009617
> 30000 395 488017817
> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
> N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>
> Kemi Wang (2):
> mm: Change the call sites of numa statistics items
> mm: Update NUMA counter threshold size
>
> drivers/base/node.c | 22 ++++---
> include/linux/mmzone.h | 25 +++++---
> include/linux/vmstat.h | 33 ++++++++++
> mm/page_alloc.c | 10 +--
> mm/vmstat.c | 162 +++++++++++++++++++++++++++++++++++++++++++++++--
> 5 files changed, 227 insertions(+), 25 deletions(-)
>



--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2017-08-15 16:55:47

by Tim Chen

Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size

On 08/15/2017 02:58 AM, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>> Threshold CPU cycles Throughput(88 threads)
>> 32 799 241760478
>> 64 640 301628829
>> 125 537 358906028 <==> system by default (base)
>> 256 468 412397590
>> 512 428 450550704
>> 4096 399 482520943
>> 20000 394 489009617
>> 30000 395 488017817
>> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
>> N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>
>> Signed-off-by: Kemi Wang <[email protected]>
>> Suggested-by: Dave Hansen <[email protected]>
>> Suggested-by: Ying Huang <[email protected]>
>> ---
>> include/linux/mmzone.h | 4 ++--
>> include/linux/vmstat.h | 6 +++++-
>> mm/vmstat.c | 23 ++++++++++-------------
>> 3 files changed, 17 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 0b11ba7..7eaf0e8 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>> struct per_cpu_pages pcp;
>> #ifdef CONFIG_NUMA
>> s8 expire;
>> - s8 numa_stat_threshold;
>> - s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>> + s16 numa_stat_threshold;
>> + s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>
> I'm fairly sure this pushes the size of that structure into the next
> cache line which is not welcome.
>
> vm_numa_stat_diff is an always incrementing field. How much do you gain
> if this becomes a u8 code and remove any code that deals with negative
> values? That would double the threshold without consuming another cache line.

Doubling the threshold and counter size will help, but not as much
as making them above u8 limit as seen in Kemi's data:

125 537 358906028 <==> system by default (base)
256 468 412397590
32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset

For small system making them u8 makes sense. For larger ones the
frequent local counter overflow into the global counter still
causes a lot of cache bounce. Kemi can perhaps collect some data
to see what is the gain from making the counters u8.

>
> Furthermore, the stats in question are only ever incremented by one.
> That means that any calcluation related to overlap can be removed and
> special cased that it'll never overlap by more than 1. That potentially
> removes code that is required for other stats but not locality stats.
> This may give enough savings to avoid moving to s16.
>
> Very broadly speaking, I like what you're doing but I would like to see
> more work on reducing any unnecessary code in that path (such as dealing
> with overlaps for single increments) and treat incrasing the cache footprint
> only as a very last resort.
>
>> #endif
>> #ifdef CONFIG_SMP
>> s8 stat_threshold;
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 1e19379..d97cc34 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
>> return x;
>> }
>>
>> -static inline unsigned long zone_numa_state(struct zone *zone,
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> enum zone_numa_stat_item item)
>> {
>> long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> + int cpu;
>> +
>> + for_each_online_cpu(cpu)
>> + x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>>
>> return x;
>> }
>
> This does not appear to be related to the current patch. It either
> should be merged with the previous patch or stand on its own.
>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 5a7fa30..c7f50ed 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>
>> #include "internal.h"
>>
>> +#define NUMA_STAT_THRESHOLD 32765
>> +
>
> This should be expressed in terms of the type and not a hard-coded value.
>

2017-08-15 17:30:53

by Mel Gorman

Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size

On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:
> On 08/15/2017 02:58 AM, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
> >> Threshold CPU cycles Throughput(88 threads)
> >> 32 799 241760478
> >> 64 640 301628829
> >> 125 537 358906028 <==> system by default (base)
> >> 256 468 412397590
> >> 512 428 450550704
> >> 4096 399 482520943
> >> 20000 394 489009617
> >> 30000 395 488017817
> >> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >> N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> >>
> >> Signed-off-by: Kemi Wang <[email protected]>
> >> Suggested-by: Dave Hansen <[email protected]>
> >> Suggested-by: Ying Huang <[email protected]>
> >> ---
> >> include/linux/mmzone.h | 4 ++--
> >> include/linux/vmstat.h | 6 +++++-
> >> mm/vmstat.c | 23 ++++++++++-------------
> >> 3 files changed, 17 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 0b11ba7..7eaf0e8 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
> >> struct per_cpu_pages pcp;
> >> #ifdef CONFIG_NUMA
> >> s8 expire;
> >> - s8 numa_stat_threshold;
> >> - s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> >> + s16 numa_stat_threshold;
> >> + s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> >
> > I'm fairly sure this pushes the size of that structure into the next
> > cache line which is not welcome.
> >
> > vm_numa_stat_diff is an always incrementing field. How much do you gain
> > if this becomes a u8 code and remove any code that deals with negative
> > values? That would double the threshold without consuming another cache line.
>
> Doubling the threshold and counter size will help, but not as much
> as making them above u8 limit as seen in Kemi's data:
>
> 125 537 358906028 <==> system by default (base)
> 256 468 412397590
> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
>
> For small system making them u8 makes sense. For larger ones the
> frequent local counter overflow into the global counter still
> causes a lot of cache bounce. Kemi can perhaps collect some data
> to see what is the gain from making the counters u8.
>

The same comments hold. The increase of a cache line is undesirable but
there are other places where the overall cost can be reduced by special
casing based on how this counter is used (always incrementing by one).
It would be preferred if those were addressed to see how close that gets
to the same performance of doubling the necessary storage for a counter.
doubling the storage

--
Mel Gorman
SUSE Labs

2017-08-15 17:51:23

by Tim Chen

Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size

On 08/15/2017 10:30 AM, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:

>>
>> Doubling the threshold and counter size will help, but not as much
>> as making them above u8 limit as seen in Kemi's data:
>>
>> 125 537 358906028 <==> system by default (base)
>> 256 468 412397590
>> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
>>
>> For small system making them u8 makes sense. For larger ones the
>> frequent local counter overflow into the global counter still
>> causes a lot of cache bounce. Kemi can perhaps collect some data
>> to see what is the gain from making the counters u8.
>>
>
> The same comments hold. The increase of a cache line is undesirable but
> there are other places where the overall cost can be reduced by special
> casing based on how this counter is used (always incrementing by one).

Can you be more explicit about what optimization you are suggesting here and
what changes to inc/dec_zone_page_state? Seems to me like we will still overflow
the local counter with the same frequency unless the threshold and
counter size is changed.

Thanks.

Tim

> It would be preferred if those were addressed to see how close that gets
> to the same performance of doubling the necessary storage for a counter.
> doubling the storage
>

2017-08-15 19:05:55

by Mel Gorman

Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size

On Tue, Aug 15, 2017 at 10:51:21AM -0700, Tim Chen wrote:
> On 08/15/2017 10:30 AM, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 09:55:39AM -0700, Tim Chen wrote:
>
> >>
> >> Doubling the threshold and counter size will help, but not as much
> >> as making them above u8 limit as seen in Kemi's data:
> >>
> >> 125 537 358906028 <==> system by default (base)
> >> 256 468 412397590
> >> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >>
> >> For small system making them u8 makes sense. For larger ones the
> >> frequent local counter overflow into the global counter still
> >> causes a lot of cache bounce. Kemi can perhaps collect some data
> >> to see what is the gain from making the counters u8.
> >>
> >
> > The same comments hold. The increase of a cache line is undesirable but
> > there are other places where the overall cost can be reduced by special
> > casing based on how this counter is used (always incrementing by one).
>
> Can you be more explicit of what optimization you suggest here and changes
> to inc/dec_zone_page_state? Seems to me like we will still overflow
> the local counter with the same frequency unless the threshold and
> counter size is changed.

One of the helpers added is __inc_zone_numa_state which doesn't have a
symmetrical __dec_zone_numa_state because the counter is always
incrementing. Because of this, there is little or no motivation to
update the global value by threshold >> 1 because with both inc/dec, you
want to avoid a corner case whereby a loop of inc/dec would do an
overflow every time. Instead, you can always apply the full threshold
and clear it which is fewer operations and halves the frequency at which
the global value needs to be updated.
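
A rough sketch of what that would look like on top of the posted
__inc_zone_numa_state() (illustrative only, not a tested patch):

        void __inc_zone_numa_state(struct zone *zone,
                                enum zone_numa_stat_item item)
        {
                struct per_cpu_pageset __percpu *pcp = zone->pageset;
                s16 __percpu *p = pcp->vm_numa_stat_diff + item;
                s16 v;

                v = __this_cpu_inc_return(*p);
                if (unlikely(v > NUMA_STAT_THRESHOLD)) {
                        /* Only ever incremented by one, so no overstep is
                         * needed: fold the whole batch into the global
                         * counter and restart the local counter from zero. */
                        zone_numa_state_add(v, zone, item);
                        __this_cpu_write(*p, 0);
                }
        }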

--
Mel Gorman
SUSE Labs

2017-08-16 02:13:22

by Kemi Wang

Subject: Re: [PATCH 1/2] mm: Change the call sites of numa statistics items



On 2017-08-15 17:49, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 04:45:35PM +0800, Kemi Wang wrote:
>> In this patch, NUMA statistics is separated from zone statistics
>> framework, all the call sites of NUMA stats are changed to use
>> numa-stats-specific functions, it does not have any functionality change
>> except that the value of NUMA stats is shown behind zone page stats, and
>> the threshold size of NUMA stats is shown behind pcp threshold when users
>> *read* the zone info.
>>
>> E.g. cat /proc/zoneinfo
>> ***Base*** ***With this patch***
>> nr_free_pages 3976 nr_free_pages 3976
>> nr_zone_inactive_anon 0 nr_zone_inactive_anon 0
>> nr_zone_active_anon 0 nr_zone_active_anon 0
>> nr_zone_inactive_file 0 nr_zone_inactive_file 0
>> nr_zone_active_file 0 nr_zone_active_file 0
>> nr_zone_unevictable 0 nr_zone_unevictable 0
>> nr_zone_write_pending 0 nr_zone_write_pending 0
>> nr_mlock 0 nr_mlock 0
>> nr_page_table_pages 0 nr_page_table_pages 0
>> nr_kernel_stack 0 nr_kernel_stack 0
>> nr_bounce 0 nr_bounce 0
>> nr_zspages 0 nr_zspages 0
>> numa_hit 0 *nr_free_cma 0*
>> numa_miss 0 numa_hit 0
>> numa_foreign 0 numa_miss 0
>> numa_interleave 0 numa_foreign 0
>> numa_local 0 numa_interleave 0
>> numa_other 0 numa_local 0
>> *nr_free_cma 0* numa_other 0
>> ... ...
>> vm stats threshold: 10 vm stats threshold: 10
>> ... *vm numa stats threshold: 10*
>> ...
>>
>> The next patch updates the numa stats counter size and threshold.
>>
>> Signed-off-by: Kemi Wang <[email protected]>
>> ---
>> drivers/base/node.c | 22 ++++---
>> include/linux/mmzone.h | 25 +++++---
>> include/linux/vmstat.h | 29 +++++++++
>> mm/page_alloc.c | 10 +--
>> mm/vmstat.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++--
>> 5 files changed, 227 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index d8dc830..12080c6 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -160,12 +160,12 @@ static ssize_t node_read_numastat(struct device *dev,
>> "interleave_hit %lu\n"
>> "local_node %lu\n"
>> "other_node %lu\n",
>> - sum_zone_node_page_state(dev->id, NUMA_HIT),
>> - sum_zone_node_page_state(dev->id, NUMA_MISS),
>> - sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
>> - sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
>> - sum_zone_node_page_state(dev->id, NUMA_LOCAL),
>> - sum_zone_node_page_state(dev->id, NUMA_OTHER));
>> + sum_zone_node_numa_state(dev->id, NUMA_HIT),
>> + sum_zone_node_numa_state(dev->id, NUMA_MISS),
>> + sum_zone_node_numa_state(dev->id, NUMA_FOREIGN),
>> + sum_zone_node_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
>> + sum_zone_node_numa_state(dev->id, NUMA_LOCAL),
>> + sum_zone_node_numa_state(dev->id, NUMA_OTHER));
>> }
>
> The names are very similar and it would be preferred if the names were
> visually different like sum_zone_numa_stat() which is hard to confuse with
> the zone stat fields.
>
Agree. Thanks for your suggestion.

>> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>>
>> @@ -181,9 +181,17 @@ static ssize_t node_read_vmstat(struct device *dev,
>> n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>> sum_zone_node_page_state(nid, i));
>>
>> - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> +#ifdef CONFIG_NUMA
>> + for (i = 0; i < NR_VM_ZONE_NUMA_STAT_ITEMS; i++)
>> n += sprintf(buf+n, "%s %lu\n",
>> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> + sum_zone_node_numa_state(nid, i));
>> +#endif
>
> Similar with NR_VM_ZONE_NUMA_STAT_ITEMS, it's too similar to
> NR_VM_NODE_STAT_ITEMS.
How about NR_VM_NUMA_STAT_ITEMS? Does anyone have a better idea?

>> +
>> + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> + n += sprintf(buf+n, "%s %lu\n",
>> + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
>> + NR_VM_ZONE_NUMA_STAT_ITEMS],
>> node_page_state(pgdat, i));
>>
>> return n;
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index fc14b8b..0b11ba7 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -114,6 +114,20 @@ struct zone_padding {
>> #define ZONE_PADDING(name)
>> #endif
>>
>> +#ifdef CONFIG_NUMA
>> +enum zone_numa_stat_item {
>> + NUMA_HIT, /* allocated in intended node */
>> + NUMA_MISS, /* allocated in non intended node */
>> + NUMA_FOREIGN, /* was intended here, hit elsewhere */
>> + NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
>> + NUMA_LOCAL, /* allocation from local node */
>> + NUMA_OTHER, /* allocation from other node */
>> + NR_VM_ZONE_NUMA_STAT_ITEMS
>> +};
>> +#else
>> +#define NR_VM_ZONE_NUMA_STAT_ITEMS 0
>> +#endif
>> +
>> enum zone_stat_item {
>> /* First 128 byte cacheline (assuming 64 bit words) */
>> NR_FREE_PAGES,
>> @@ -132,14 +146,6 @@ enum zone_stat_item {
>> #if IS_ENABLED(CONFIG_ZSMALLOC)
>> NR_ZSPAGES, /* allocated in zsmalloc */
>> #endif
>> -#ifdef CONFIG_NUMA
>> - NUMA_HIT, /* allocated in intended node */
>> - NUMA_MISS, /* allocated in non intended node */
>> - NUMA_FOREIGN, /* was intended here, hit elsewhere */
>> - NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
>> - NUMA_LOCAL, /* allocation from local node */
>> - NUMA_OTHER, /* allocation from other node */
>> -#endif
>> NR_FREE_CMA_PAGES,
>> NR_VM_ZONE_STAT_ITEMS };
>>
>> @@ -276,6 +282,8 @@ struct per_cpu_pageset {
>> struct per_cpu_pages pcp;
>> #ifdef CONFIG_NUMA
>> s8 expire;
>> + s8 numa_stat_threshold;
>> + s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>> #endif
>> #ifdef CONFIG_SMP
>> s8 stat_threshold;
>
> Ok. this slightly increases the size of the per_cpu_pageset due to
> numa_stat_threshold. The structure occupes 2 cache lines and still occupies
> 2 cache lines afterwards so that is ok but consider hard-coding the value
> of it. The locality stats are never used as part of a decision made by the
> kernel and they get summed when reading proc unconditionally. There is little
> benefit to tuning that threshold at all and there should be a very small
> performance gain if it's removed because it'll be a compile-time constant.
>
Agreed, thanks. I will remove numa_stat_threshold in the next version.

> The rest of the patch is mostly mechanical.
>

2017-08-16 02:32:33

by Kemi Wang

Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size


>>
>> -static inline unsigned long zone_numa_state(struct zone *zone,
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> enum zone_numa_stat_item item)
>> {
>> long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> + int cpu;
>> +
>> + for_each_online_cpu(cpu)
>> + x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>>
>> return x;
>> }
>
> This does not appear to be related to the current patch. It either
> should be merged with the previous patch or stand on its own.
>
OK. I can move it to an individual patch if that does not make anyone unhappy,
since it is not graceful to introduce any functional change in the first patch.

>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 5a7fa30..c7f50ed 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>
>> #include "internal.h"
>>
>> +#define NUMA_STAT_THRESHOLD 32765
>> +
>
> This should be expressed in terms of the type and not a hard-coded value.
>
OK, Thanks. I will follow it.

2017-08-16 03:03:57

by Kemi Wang

Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size



On 2017-08-16 00:55, Tim Chen wrote:
> On 08/15/2017 02:58 AM, Mel Gorman wrote:
>> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:

>> I'm fairly sure this pushes the size of that structure into the next
>> cache line which is not welcome.
>>
>> vm_numa_stat_diff is an always incrementing field. How much do you gain
>> if this becomes a u8 code and remove any code that deals with negative
>> values? That would double the threshold without consuming another cache line.
>
> Doubling the threshold and counter size will help, but not as much
> as making them above u8 limit as seen in Kemi's data:
>
> 125 537 358906028 <==> system by default (base)
> 256 468 412397590
> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
>
> For small system making them u8 makes sense. For larger ones the
> frequent local counter overflow into the global counter still
> causes a lot of cache bounce. Kemi can perhaps collect some data
> to see what is the gain from making the counters u8.
>
Tim, thanks for your answer. That is what I wanted to clarify.

Also, please note that -threshold/2 is written to the cpu-local counter
(e.g. vm_numa_stat_diff[]) once the per-zone counter is updated in the
current code path. This weakens the benefit of changing s8 to u8 in this case.
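
(Rough numbers to illustrate, assuming the default threshold of 125: after a
flush the local counter is reset to -62, so today the global counter is only
touched about once per 188 increments; a u8 counter cleared to zero at 255
would stretch that to roughly 256 increments, i.e. the flush frequency drops
by a factor of about 1.36 rather than 2.)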
>>
>> Furthermore, the stats in question are only ever incremented by one.
>> That means that any calcluation related to overlap can be removed and
>> special cased that it'll never overlap by more than 1. That potentially
>> removes code that is required for other stats but not locality stats.
>> This may give enough savings to avoid moving to s16.
>>
>> Very broadly speaking, I like what you're doing but I would like to see
>> more work on reducing any unnecessary code in that path (such as dealing
>> with overlaps for single increments) and treat incrasing the cache footprint
>> only as a very last resort.
>>
Agree. I will think about it more.

>>> #endif
>>> #ifdef CONFIG_SMP
>>> s8 stat_threshold;
>>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>>> index 1e19379..d97cc34 100644
>>> --- a/include/linux/vmstat.h
>>> +++ b/include/linux/vmstat.h
>>> @@ -125,10 +125,14 @@ static inline unsigned long global_numa_state(enum zone_numa_stat_item item)
>>> return x;
>>> }
>>>

2017-08-16 03:24:59

by Kemi Wang

Subject: Re: [PATCH 0/2] Separate NUMA statistics from zone statistics



On 2017-08-15 18:36, Jesper Dangaard Brouer wrote:
> On Tue, 15 Aug 2017 16:45:34 +0800
> Kemi Wang <[email protected]> wrote:
>
>> Each page allocation updates a set of per-zone statistics with a call to
>> zone_statistics(). As discussed in 2017 MM submit, these are a substantial
> ^^^^^^ should be "summit"

Hi Jesper,
Thanks for reporting this issue and for providing the benchmark to test the raw
performance of page allocation. It was really helpful for figuring out the
root cause.
>> source of overhead in the page allocator and are very rarely consumed. This
>> significant overhead in cache bouncing caused by zone counters (NUMA
>> associated counters) update in parallel in multi-threaded page allocation
>> (pointed out by Dave Hansen).
>
> Hi Kemi
>
> Thanks a lot for following up on this work. A link to the MM summit slides:
> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
>
Thanks for adding the link here. I should have done that in the cover letter.

>> To mitigate this overhead, this patchset separates NUMA statistics from
>> zone statistics framework, and update NUMA counter threshold to a fixed
>> size of 32765, as a small threshold greatly increases the update frequency
>> of the global counter from local per cpu counter (suggested by Ying Huang).
>> The rationality is that these statistics counters don't need to be read
>> often, unlike other VM counters, so it's not a problem to use a large
>> threshold and make readers more expensive.
>>
>> With this patchset, we see 26.6% drop of CPU cycles(537-->394, see below)
>> for per single page allocation and reclaim on Jesper's page_bench03
>> benchmark. Meanwhile, this patchset keeps the same style of virtual memory
>> statistics with little end-user-visible effects (see the first patch for
>> details), except that the number of NUMA items in each cpu
>> (vm_numa_stat_diff[]) is added to zone->vm_numa_stat[] when a user *reads*
>> the value of NUMA counter to eliminate deviation.
>
> I'm very happy to see that you found my kernel module for benchmarking useful :-)
>
>> I did an experiment of single page allocation and reclaim concurrently
>> using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server
>> (88 processors with 126G memory) with different size of threshold of pcp
>> counter.
>>
>> Benchmark provided by Jesper D Broucer(increase loop times to 10000000):
> ^^^^^^^
> You mis-spelled my last name, it is "Brouer".
>
Dear Jesper, I am so sorry about that; please forgive me :)

>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
>>
>> Threshold CPU cycles Throughput(88 threads)
>> 32 799 241760478
>> 64 640 301628829
>> 125 537 358906028 <==> system by default
>> 256 468 412397590
>> 512 428 450550704
>> 4096 399 482520943
>> 20000 394 489009617
>> 30000 395 488017817
>> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
>> N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>
>> Kemi Wang (2):
>> mm: Change the call sites of numa statistics items
>> mm: Update NUMA counter threshold size
>>
>> drivers/base/node.c | 22 ++++---
>> include/linux/mmzone.h | 25 +++++---
>> include/linux/vmstat.h | 33 ++++++++++
>> mm/page_alloc.c | 10 +--
>> mm/vmstat.c | 162 +++++++++++++++++++++++++++++++++++++++++++++++--
>> 5 files changed, 227 insertions(+), 25 deletions(-)
>>
>
>
>

2017-08-22 03:22:48

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size



On 2017-08-15 17:58, Mel Gorman wrote:
> On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
>> Threshold CPU cycles Throughput(88 threads)
>> 32 799 241760478
>> 64 640 301628829
>> 125 537 358906028 <==> system by default (base)
>> 256 468 412397590
>> 512 428 450550704
>> 4096 399 482520943
>> 20000 394 489009617
>> 30000 395 488017817
>> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
>> N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
>>
>> Signed-off-by: Kemi Wang <[email protected]>
>> Suggested-by: Dave Hansen <[email protected]>
>> Suggested-by: Ying Huang <[email protected]>
>> ---
>> include/linux/mmzone.h | 4 ++--
>> include/linux/vmstat.h | 6 +++++-
>> mm/vmstat.c | 23 ++++++++++-------------
>> 3 files changed, 17 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 0b11ba7..7eaf0e8 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
>> struct per_cpu_pages pcp;
>> #ifdef CONFIG_NUMA
>> s8 expire;
>> - s8 numa_stat_threshold;
>> - s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>> + s16 numa_stat_threshold;
>> + s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
>
> I'm fairly sure this pushes the size of that structure into the next
> cache line which is not welcome.
>
Hi Mel
I am refreshing this patch. Would you please be more explicit about what "that
structure" refers to?
If you mean "struct per_cpu_pageset": for a 64-bit machine, this structure
still occupies two cache lines after extending s8 to s16/u16, so that should
not be a problem. For a 32-bit machine, we probably do not need to extend
the size of vm_numa_stat_diff[], since a 32-bit OS is rarely used on a large
NUMA system and s8/u8 is large enough for it; in that case, we can keep the
same size of "struct per_cpu_pageset".

If you mean "s16 vm_numa_stat_diff[]" and want to keep it within a single cache
line, we can probably add some padding after "s8 expire" to achieve that.

Again, thanks for your comments; they help make this patch more graceful.
> vm_numa_stat_diff is an always-incrementing field. How much do you gain
> if this becomes a u8 and you remove any code that deals with negative
> values? That would double the threshold without consuming another cache line.
>
> Furthermore, the stats in question are only ever incremented by one.
> That means that any calculation related to overlap can be removed and
> special-cased, given that it'll never overlap by more than 1. That potentially
> removes code that is required for other stats but not locality stats.
> This may give enough savings to avoid moving to s16.
>
> Very broadly speaking, I like what you're doing but I would like to see
> more work on reducing any unnecessary code in that path (such as dealing
> with overlaps for single increments) and treat increasing the cache footprint
> only as a very last resort.
>
>>

2017-08-22 08:39:51

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: Update NUMA counter threshold size

On Tue, Aug 22, 2017 at 11:21:31AM +0800, kemi wrote:
>
>
On 2017-08-15 17:58, Mel Gorman wrote:
> > On Tue, Aug 15, 2017 at 04:45:36PM +0800, Kemi Wang wrote:
> >> Threshold CPU cycles Throughput(88 threads)
> >> 32 799 241760478
> >> 64 640 301628829
> >> 125 537 358906028 <==> system by default (base)
> >> 256 468 412397590
> >> 512 428 450550704
> >> 4096 399 482520943
> >> 20000 394 489009617
> >> 30000 395 488017817
> >> 32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
> >> N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> >>
> >> Signed-off-by: Kemi Wang <[email protected]>
> >> Suggested-by: Dave Hansen <[email protected]>
> >> Suggested-by: Ying Huang <[email protected]>
> >> ---
> >> include/linux/mmzone.h | 4 ++--
> >> include/linux/vmstat.h | 6 +++++-
> >> mm/vmstat.c | 23 ++++++++++-------------
> >> 3 files changed, 17 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 0b11ba7..7eaf0e8 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -282,8 +282,8 @@ struct per_cpu_pageset {
> >> struct per_cpu_pages pcp;
> >> #ifdef CONFIG_NUMA
> >> s8 expire;
> >> - s8 numa_stat_threshold;
> >> - s8 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> >> + s16 numa_stat_threshold;
> >> + s16 vm_numa_stat_diff[NR_VM_ZONE_NUMA_STAT_ITEMS];
> >
> > I'm fairly sure this pushes the size of that structure into the next
> > cache line which is not welcome.
> >
> Hi Mel
> I am refreshing this patch. Would you please be more explicit about what "that
> structure" refers to?
> If you mean "struct per_cpu_pageset": for a 64-bit machine, this structure
> still occupies two cache lines after extending s8 to s16/u16, so that should
> not be a problem.

You're right, I was in error. I miscalculated badly initially. It still
fits in as expected.

> For a 32-bit machine, we probably do not need to extend
> the size of vm_numa_stat_diff[], since a 32-bit OS is rarely used on a large
> NUMA system and s8/u8 is large enough for it; in that case, we can keep the
> same size of "struct per_cpu_pageset".
>

I don't believe it's worth the complexity of making this
bitness-specific. 32-bit takes penalties in other places and besides,
32-bit does not necessarily mean a change in cache line size.

Fortunately, I think you should still be able to gain a bit more by
special-casing the fact that it's always incrementing and by always doing a
full spill of the counters instead of half. If so, then using u16 instead of
s16 should also reduce the update frequency. However, if you find it's
too complex and the gain is too marginal, then I'll ack without it.
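
Roughly what I have in mind is the untested sketch below; the names follow
your patch, and NUMA_STAT_THRESHOLD just stands in for however the final
limit ends up being expressed (ideally derived from the type):

/* Assumes vm_numa_stat_diff[] has been switched to u16. */
static inline void __inc_numa_state(struct zone *zone,
                                    enum zone_numa_stat_item item)
{
        struct per_cpu_pageset __percpu *pcp = zone->pageset;
        u16 __percpu *p = pcp->vm_numa_stat_diff + item;
        u16 v;

        v = __this_cpu_inc_return(*p);
        if (unlikely(v > NUMA_STAT_THRESHOLD)) {
                /*
                 * The counter only ever moves by +1, so there is no
                 * overstep to deal with: spill the whole thing into the
                 * zone counter and restart from zero.
                 */
                zone_numa_state_add(v, zone, item);
                __this_cpu_write(*p, 0);
        }
}
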

--
Mel Gorman
SUSE Labs

by Christopher Lameter

[permalink] [raw]
Subject: Re: [PATCH 0/2] Separate NUMA statistics from zone statistics

Can we simply get rid of the stats or make them configurable (off by
default)? I agree they are rarely used and have rarely been used in the past.

Maybe some instrumentation for perf etc. would allow
similar statistics these days? Thus it's possible to drop them?

The space in the pcp pageset is precious and we should strive to use no
more than a cacheline for the diffs.


2017-08-22 23:19:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/2] Separate NUMA statistics from zone statistics

Christopher Lameter <[email protected]> writes:

> Can we simply get rid of the stats or make them configurable (off by
> default)? I agree they are rarely used and have rarely been used in the past.
>
> Maybe some instrumentation for perf etc. would allow
> similar statistics these days? Thus it's possible to drop them?
>
> The space in the pcp pageset is precious and we should strive to use no
> more than a cacheline for the diffs.

The statistics are useful and we need them sometimes. And more and more
runtime switches are a pain -- if you need them, they would likely be
turned off. The key is just to make them cheap enough that they're not a
problem.

The only problem was that the standard vmstats, which are also
optimized for readers, are too expensive for this path.

The motivation for the patch was that the frequent atomics
were shown to slow the allocator down, and Kemi's patch
fixes that, as demonstrated with lots of data.

I don't really see the point of so much discussion about a single cache
line.

There are lots of cache lines used all over the VM. Why is this one
special? Adding one more shouldn't be that bad.

But there's no data at all that touching another cache line
here is a problem.

It's next to an already-touched cache line, so it's highly
likely that a prefetcher would catch it anyway.

I can see the point of worrying about the overall cache line footprint
("death of a thousand cuts"), but the right way to address problems like
this is to use a profiler on a realistic workload and systematically
look at the code that actually has cache misses. And I bet we would
find quite a few that could easily be avoided with real
payoff. It would really surprise me if it were this cache line.

But blocking real, demonstrated improvements over a theoretical
cache line concern doesn't really help.

-Andi

2017-08-23 01:15:48

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH 0/2] Separate NUMA statistics from zone statistics



On 2017-08-23 05:22, Christopher Lameter wrote:
> Can we simply get rid of the stats or make them configurable (off by
> default)? I agree they are rarely used and have rarely been used in the past.
>

I agree that we can make the NUMA stats, as well as other rarely used stats
items, configurable. Perhaps we can introduce a general mechanism to hide such
unimportant stats (suggested by *Dave Hansen* initially); it would work like this:

when performance is not important and when you want all tooling to work, you set:

sysctl vm.strict_stats=1

but if you can tolerate some possible tool breakage and some decreased
counter precision, you can do:

sysctl vm.strict_stats=0

What do you think of that? I can help implement it later.
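
For example, the knob itself could probably be wired into the existing vm
sysctl table along these lines. This is only a rough sketch:
sysctl_vm_strict_stats and its users are hypothetical, and the code that
would actually consult the flag in the stats paths is omitted.

/* mm/vmstat.c (hypothetical): 1 = precise counters, 0 = relaxed/cheap. */
int sysctl_vm_strict_stats __read_mostly = 1;

/* kernel/sysctl.c: one more entry in the existing vm_table[], reusing the
 * zero/one min/max helpers already defined there. */
        {
                .procname       = "strict_stats",
                .data           = &sysctl_vm_strict_stats,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &zero,
                .extra2         = &one,
        },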

But it may not be a good idea to simply get rid of these kinds of stats.

> Maybe some instrumentation for perf etc. would allow
> similar statistics these days? Thus it's possible to drop them?
>
> The space in the pcp pageset is precious and we should strive to use no
> more than a cacheline for the diffs.
>
>

Andi has helped to explain it very clearly. Thanks very much.

For a 64-bit OS:
                                     base    with this patch (including numa_stat_threshold)
  sizeof(struct per_cpu_pageset)     88      96

(The extra 8 bytes presumably come from widening vm_numa_stat_diff[] and
numa_stat_threshold from s8 to s16, plus structure alignment padding.)

Copying the earlier discussion from the other email thread in case you missed it:

> Hi Mel
> I am refreshing this patch. Would you please be more explicit about what "that
> structure" refers to?
> If you mean "struct per_cpu_pageset": for a 64-bit machine, this structure
> still occupies two cache lines after extending s8 to s16/u16, so that should
> not be a problem.

You're right, I was in error. I miscalculated badly initially. It still
fits in as expected.

> For a 32-bit machine, we probably do not need to extend
> the size of vm_numa_stat_diff[], since a 32-bit OS is rarely used on a large
> NUMA system and s8/u8 is large enough for it; in that case, we can keep the
> same size of "struct per_cpu_pageset".
>

I don't believe it's worth the complexity of making this
bitness-specific. 32-bit takes penalties in other places and besides,
32-bit does not necessarily mean a change in cache line size.

Fortunately, I think you should still be able to gain a bit more by
special-casing the fact that it's always incrementing and by always doing a
full spill of the counters instead of half. If so, then using u16 instead of
s16 should also reduce the update frequency. However, if you find it's
too complex and the gain is too marginal, then I'll ack without it.

2017-08-23 04:55:43

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 0/2] Separate NUMA statistics from zone statistics

On 08/22/2017 06:14 PM, kemi wrote:
> when performance is not important and when you want all tooling to work, you set:
>
> sysctl vm.strict_stats=1
>
> but if you can tolerate some possible tool breakage and some decreased
> counter precision, you can do:
>
> sysctl vm.strict_stats=0

My other thought was to try to set vm.strict_stats=0 and move to
vm.strict_stats=1 (and issue a printk) when somebody reads
/proc/zoneinfo (or the other files where the expensive stats are displayed).

We'd need three modes for the expensive stats:

1. Off by default
2. On. (#1 transforms to this by default when stats are read)
3. Off permanently. An *actual* tunable that someone could set
on systems that want to be able to read the stat files, don't
care about precision, and want the best performance.

That way, virtually everybody (who falls into mode #1 or #2) gets what
they want. The only folks who have to mess with a tunable are the
really weird, picky ones who use option #3.
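
Roughly, as a sketch (all identifiers here are invented for illustration; the
read-side hook would live wherever /proc/zoneinfo and friends pull these
counters from):

enum numa_stat_mode {
        NUMA_STATS_AUTO_OFF,    /* 1. off by default */
        NUMA_STATS_ON,          /* 2. flipped on the first time the stats are read */
        NUMA_STATS_FORCED_OFF,  /* 3. the real tunable: stay off even when read */
};

static enum numa_stat_mode numa_stats_mode = NUMA_STATS_AUTO_OFF;

/* Called from the read side, e.g. when /proc/zoneinfo is generated. */
static void numa_stats_note_reader(void)
{
        if (numa_stats_mode == NUMA_STATS_AUTO_OFF) {
                numa_stats_mode = NUMA_STATS_ON;
                pr_info("vmstat: NUMA statistics enabled by first reader\n");
        }
}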