2017-12-19 06:41:31

by Kemi Wang

Subject: [PATCH v2 0/5] mm: NUMA stats code cleanup and enhancement

The existing implementation of NUMA counters keeps a per-logical-CPU
differential along with zone->vm_numa_stat[] separated by zone, plus a
global NUMA counter array vm_numa_stat[]. However, unlike the other vmstat
counters, NUMA stats don't affect the system's decisions and are only
consumed when reading from /proc and /sys. Also, nodes usually have only a
single zone (except for node 0), and there isn't really any use case that
needs these hit counts separated by zone.
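
(For reference, the current layout is roughly: a u16 vm_numa_stat_diff[]
array in each zone's per_cpu_pageset, periodically folded into both
zone->vm_numa_stat[] and the global vm_numa_stat[] atomics; patch 1 removes
exactly this machinery.)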

Therefore, we can migrate the implementation of NUMA stats from per-zone
to per-node (as suggested by Andi Kleen) and reuse the existing per-cpu
infrastructure, with a small enhancement, for NUMA stats. This way we get
rid of the special handling for NUMA stats while keeping the performance
gain. With this patch series, about 170 lines of code are saved.
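
To recap how that infrastructure works: every node stat is a per-cpu
differential that is only folded into the shared atomic counter once it
crosses a threshold, roughly as below (simplified sketch of
__inc_node_state() in mm/vmstat.c, shown with the s16 types introduced in
patch 2):

        struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
        s16 __percpu *p = pcp->vm_node_stat_diff + item;
        s16 v, t;

        v = __this_cpu_inc_return(*p);          /* cheap, cpu-local increment */
        t = __this_cpu_read(pcp->stat_threshold);
        if (unlikely(v > t)) {                  /* fold into the shared       */
                s16 overstep = t >> 1;          /* atomic counter only rarely */

                node_page_state_add(v + overstep, pgdat, item);
                __this_cpu_write(*p, -overstep);
        }

The NUMA hit/miss events then simply go through this path instead of the
dedicated vm_numa_stat_diff[]/zone->vm_numa_stat[] arrays.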

The first patch migrates NUMA stats from per-zone to per-node using the
existing per-cpu infrastructure. There is a small user-visible change when
reading /proc/zoneinfo, shown below:
  Before                                 After
  Node 0, zone DMA                       Node 0, zone DMA
    per-node stats                         per-node stats
      nr_inactive_anon     7244              *numa_hit        98665086*
      nr_active_anon     177064              *numa_miss              0*
      ...                                    *numa_foreign           0*
      nr_bounce               0              *numa_interleave    21059*
      nr_free_cma             0              *numa_local      98665086*
      *numa_hit               0*             *numa_other             0*
      *numa_miss              0*             nr_inactive_anon    20055
      *numa_foreign           0*             nr_active_anon     389771
      *numa_interleave        0*             ...
      *numa_local             0*             nr_bounce               0
      *numa_other             0*             nr_free_cma             0

The second patch extends the local cpu counter vm_node_stat_diff from s8
to s16. It does not have any functional change.
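
For context on why the wider type matters: with an s8 differential the
per-cpu threshold is capped at 125 (calculate_normal_threshold() in
mm/vmstat.c clamps it there), so each cpu has to touch the shared atomic
counter roughly once every 125 events; with s16 the threshold used for the
NUMA items in patch 3 can be S16_MAX - 2 = 32765, i.e. roughly 260x fewer
cache-bouncing global updates per cpu for these hot counters (rough
arithmetic from the type limits, not a measurement).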

The third patch uses a large, constant threshold for the NUMA counters to
reduce the frequency of global NUMA counter updates.
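
Concretely, the threshold selection in the per-node update path becomes
something like the following (sketch of the patch 3 change; see the patch
itself for the exact hunks):

        if (item >= NR_VM_NUMA_STAT_ITEMS)      /* ordinary node stats    */
                t = __this_cpu_read(pcp->stat_threshold);
        else                                    /* NUMA hit/miss counters */
                t = VM_NUMA_STAT_THRESHOLD;     /* S16_MAX - 2            */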

The fourth patch uses node_page_state_snapshot instead of node_page_state
when querying node stats (e.g. cat /sys/devices/system/node/node*/vmstat).
The only difference is that node_page_state_snapshot also includes the
values still sitting in the local per-cpu counters.
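
For reference, node_page_state_snapshot() differs from node_page_state()
only in that it also sums the not-yet-folded per-cpu differentials
(simplified sketch of the existing helper in include/linux/vmstat.h):

        static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
                                                enum node_stat_item item)
        {
                long x = atomic_long_read(&pgdat->vm_stat[item]);
        #ifdef CONFIG_SMP
                int cpu;

                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(pgdat->per_cpu_nodestats,
                                         cpu)->vm_node_stat_diff[item];
                if (x < 0)
                        x = 0;
        #endif
                return x;
        }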

The last patch renames zone_statistics() to numa_statistics().

Finally, I want to extend my heartiest appreciation to Michal Hocko for
suggesting reuse of the existing per-cpu infrastructure, which makes this
series much better than before.

Changelog:
v1->v2:
a) enhance the existing per-cpu infrastructure for node page stats by
extending the local cpu counter vm_node_stat_diff from s8 to s16
b) reuse the per-cpu infrastructure for NUMA stats

Kemi Wang (5):
mm: migrate NUMA stats from per-zone to per-node
mm: Extends local cpu counter vm_diff_nodestat from s8 to s16
mm: enlarge NUMA counters threshold size
mm: use node_page_state_snapshot to avoid deviation
mm: Rename zone_statistics() to numa_statistics()

drivers/base/node.c | 28 +++----
include/linux/mmzone.h | 31 ++++----
include/linux/vmstat.h | 31 --------
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 22 +++---
mm/vmstat.c | 206 +++++++++----------------------------------------
6 files changed, 74 insertions(+), 246 deletions(-)

--
2.7.4


2017-12-19 06:41:40

by Kemi Wang

Subject: [PATCH v2 1/5] mm: migrate NUMA stats from per-zone to per-node

There is not really any use for NUMA stats separated by zone, and the
current per-zone NUMA stats are only consumed in /proc/zoneinfo. For code
cleanup purposes, move NUMA stats from per-zone to per-node and reuse the
existing per-cpu infrastructure.

Suggested-by: Andi Kleen <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Kemi Wang <[email protected]>
---
drivers/base/node.c | 23 +++----
include/linux/mmzone.h | 27 ++++----
include/linux/vmstat.h | 31 ---------
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 16 +++--
mm/vmstat.c | 177 +++++--------------------------------------------
6 files changed, 46 insertions(+), 230 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ee090ab..a045ea1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,13 +169,14 @@ static ssize_t node_read_numastat(struct device *dev,
"interleave_hit %lu\n"
"local_node %lu\n"
"other_node %lu\n",
- sum_zone_numa_state(dev->id, NUMA_HIT),
- sum_zone_numa_state(dev->id, NUMA_MISS),
- sum_zone_numa_state(dev->id, NUMA_FOREIGN),
- sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
- sum_zone_numa_state(dev->id, NUMA_LOCAL),
- sum_zone_numa_state(dev->id, NUMA_OTHER));
+ node_page_state(NODE_DATA(dev->id), NUMA_HIT),
+ node_page_state(NODE_DATA(dev->id), NUMA_MISS),
+ node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
+ node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
+ node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
+ node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
}
+
static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);

static ssize_t node_read_vmstat(struct device *dev,
@@ -190,17 +191,9 @@ static ssize_t node_read_vmstat(struct device *dev,
n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
sum_zone_node_page_state(nid, i));

-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- n += sprintf(buf+n, "%s %lu\n",
- vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
- sum_zone_numa_state(nid, i));
-#endif
-
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
- vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
- NR_VM_NUMA_STAT_ITEMS],
+ vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
node_page_state(pgdat, i));

return n;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c..c06d880 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,20 +115,6 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif

-#ifdef CONFIG_NUMA
-enum numa_stat_item {
- NUMA_HIT, /* allocated in intended node */
- NUMA_MISS, /* allocated in non intended node */
- NUMA_FOREIGN, /* was intended here, hit elsewhere */
- NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
- NUMA_LOCAL, /* allocation from local node */
- NUMA_OTHER, /* allocation from other node */
- NR_VM_NUMA_STAT_ITEMS
-};
-#else
-#define NR_VM_NUMA_STAT_ITEMS 0
-#endif
-
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
@@ -151,7 +137,18 @@ enum zone_stat_item {
NR_VM_ZONE_STAT_ITEMS };

enum node_stat_item {
- NR_LRU_BASE,
+#ifdef CONFIG_NUMA
+ NUMA_HIT, /* allocated in intended node */
+ NUMA_MISS, /* allocated in non intended node */
+ NUMA_FOREIGN, /* was intended here, hit elsewhere */
+ NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
+ NUMA_LOCAL, /* allocation from local node */
+ NUMA_OTHER, /* allocation from other node */
+ NR_VM_NUMA_STAT_ITEMS,
+#else
+#define NR_VM_NUMA_STAT_ITEMS 0
+#endif
+ NR_LRU_BASE = NR_VM_NUMA_STAT_ITEMS,
NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1779c98..80bf290 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -118,37 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
* Zone and node-based page accounting with per cpu differentials.
*/
extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
-extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];

-#ifdef CONFIG_NUMA
-static inline void zone_numa_state_add(long x, struct zone *zone,
- enum numa_stat_item item)
-{
- atomic_long_add(x, &zone->vm_numa_stat[item]);
- atomic_long_add(x, &vm_numa_stat[item]);
-}
-
-static inline unsigned long global_numa_state(enum numa_stat_item item)
-{
- long x = atomic_long_read(&vm_numa_stat[item]);
-
- return x;
-}
-
-static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
- enum numa_stat_item item)
-{
- long x = atomic_long_read(&zone->vm_numa_stat[item]);
- int cpu;
-
- for_each_online_cpu(cpu)
- x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
-
- return x;
-}
-#endif /* CONFIG_NUMA */
-
static inline void zone_page_state_add(long x, struct zone *zone,
enum zone_stat_item item)
{
@@ -234,10 +205,8 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,


#ifdef CONFIG_NUMA
-extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
extern unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item);
-extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
extern unsigned long node_page_state(struct pglist_data *pgdat,
enum node_stat_item item);
#else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ce44d3..b2293e3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1920,7 +1920,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
return page;
if (page && page_to_nid(page) == nid) {
preempt_disable();
- __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
+ inc_node_state(page_pgdat(page), NUMA_INTERLEAVE_HIT);
preempt_enable();
}
return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e5e775..81e8d8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2793,22 +2793,24 @@ int __isolate_free_page(struct page *page, unsigned int order)
static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
#ifdef CONFIG_NUMA
- enum numa_stat_item local_stat = NUMA_LOCAL;
+ int preferred_nid = preferred_zone->node;
+ int nid = z->node;
+ enum node_stat_item local_stat = NUMA_LOCAL;

/* skip numa counters update if numa stats is disabled */
if (!static_branch_likely(&vm_numa_stat_key))
return;

- if (z->node != numa_node_id())
+ if (nid != numa_node_id())
local_stat = NUMA_OTHER;

- if (z->node == preferred_zone->node)
- __inc_numa_state(z, NUMA_HIT);
+ if (nid == preferred_nid)
+ inc_node_state(NODE_DATA(nid), NUMA_HIT);
else {
- __inc_numa_state(z, NUMA_MISS);
- __inc_numa_state(preferred_zone, NUMA_FOREIGN);
+ inc_node_state(NODE_DATA(nid), NUMA_MISS);
+ inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN);
}
- __inc_numa_state(z, local_stat);
+ inc_node_state(NODE_DATA(nid), local_stat);
#endif
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 40b2db6..1dd12ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,46 +30,44 @@

#include "internal.h"

-#define NUMA_STATS_THRESHOLD (U16_MAX - 2)
-
#ifdef CONFIG_NUMA
int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;

-/* zero numa counters within a zone */
-static void zero_zone_numa_counters(struct zone *zone)
+/* zero numa stats within a node */
+static void zero_node_numa_stats(int node)
{
int item, cpu;

for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
- atomic_long_set(&zone->vm_numa_stat[item], 0);
+ atomic_long_set(&(NODE_DATA(node)->vm_stat[item]), 0);
for_each_online_cpu(cpu)
- per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]
- = 0;
+ per_cpu_ptr(NODE_DATA(node)->per_cpu_nodestats,
+ cpu)->vm_node_stat_diff[item] = 0;
}
}

-/* zero numa counters of all the populated zones */
-static void zero_zones_numa_counters(void)
+/* zero numa stats of all the online nodes */
+static void zero_nodes_numa_stats(void)
{
- struct zone *zone;
+ int node;

- for_each_populated_zone(zone)
- zero_zone_numa_counters(zone);
+ for_each_online_node(node)
+ zero_node_numa_stats(node);
}

-/* zero global numa counters */
-static void zero_global_numa_counters(void)
+/* zero global numa stats */
+static void zero_global_numa_stats(void)
{
int item;

for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
- atomic_long_set(&vm_numa_stat[item], 0);
+ atomic_long_set(&vm_node_stat[item], 0);
}

static void invalid_numa_statistics(void)
{
- zero_zones_numa_counters();
- zero_global_numa_counters();
+ zero_nodes_numa_stats();
+ zero_global_numa_stats();
}

static DEFINE_MUTEX(vm_numa_stat_lock);
@@ -160,10 +158,8 @@ void vm_events_fold_cpu(int cpu)
* vm_stat contains the global counters
*/
atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
EXPORT_SYMBOL(vm_zone_stat);
-EXPORT_SYMBOL(vm_numa_stat);
EXPORT_SYMBOL(vm_node_stat);

#ifdef CONFIG_SMP
@@ -679,32 +675,6 @@ EXPORT_SYMBOL(dec_node_page_state);
* Fold a differential into the global counters.
* Returns the number of counters updated.
*/
-#ifdef CONFIG_NUMA
-static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
-{
- int i;
- int changes = 0;
-
- for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
- if (zone_diff[i]) {
- atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
- changes++;
- }
-
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- if (numa_diff[i]) {
- atomic_long_add(numa_diff[i], &vm_numa_stat[i]);
- changes++;
- }
-
- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
- if (node_diff[i]) {
- atomic_long_add(node_diff[i], &vm_node_stat[i]);
- changes++;
- }
- return changes;
-}
-#else
static int fold_diff(int *zone_diff, int *node_diff)
{
int i;
@@ -723,7 +693,6 @@ static int fold_diff(int *zone_diff, int *node_diff)
}
return changes;
}
-#endif /* CONFIG_NUMA */

/*
* Update the zone counters for the current cpu.
@@ -747,9 +716,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
- int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
int changes = 0;

@@ -771,18 +737,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
}
#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
- int v;
-
- v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0);
- if (v) {
-
- atomic_long_add(v, &zone->vm_numa_stat[i]);
- global_numa_diff[i] += v;
- __this_cpu_write(p->expire, 3);
- }
- }
-
if (do_pagesets) {
cond_resched();
/*
@@ -829,12 +783,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
}

-#ifdef CONFIG_NUMA
- changes += fold_diff(global_zone_diff, global_numa_diff,
- global_node_diff);
-#else
changes += fold_diff(global_zone_diff, global_node_diff);
-#endif
return changes;
}

@@ -849,9 +798,6 @@ void cpu_vm_stats_fold(int cpu)
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
- int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };

for_each_populated_zone(zone) {
@@ -868,18 +814,6 @@ void cpu_vm_stats_fold(int cpu)
atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
}
-
-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- if (p->vm_numa_stat_diff[i]) {
- int v;
-
- v = p->vm_numa_stat_diff[i];
- p->vm_numa_stat_diff[i] = 0;
- atomic_long_add(v, &zone->vm_numa_stat[i]);
- global_numa_diff[i] += v;
- }
-#endif
}

for_each_online_pgdat(pgdat) {
@@ -898,11 +832,7 @@ void cpu_vm_stats_fold(int cpu)
}
}

-#ifdef CONFIG_NUMA
- fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
-#else
fold_diff(global_zone_diff, global_node_diff);
-#endif
}

/*
@@ -920,36 +850,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
atomic_long_add(v, &zone->vm_stat[i]);
atomic_long_add(v, &vm_zone_stat[i]);
}
-
-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- if (pset->vm_numa_stat_diff[i]) {
- int v = pset->vm_numa_stat_diff[i];
-
- pset->vm_numa_stat_diff[i] = 0;
- atomic_long_add(v, &zone->vm_numa_stat[i]);
- atomic_long_add(v, &vm_numa_stat[i]);
- }
-#endif
}
#endif

#ifdef CONFIG_NUMA
-void __inc_numa_state(struct zone *zone,
- enum numa_stat_item item)
-{
- struct per_cpu_pageset __percpu *pcp = zone->pageset;
- u16 __percpu *p = pcp->vm_numa_stat_diff + item;
- u16 v;
-
- v = __this_cpu_inc_return(*p);
-
- if (unlikely(v > NUMA_STATS_THRESHOLD)) {
- zone_numa_state_add(v, zone, item);
- __this_cpu_write(*p, 0);
- }
-}
-
/*
* Determine the per node value of a stat item. This function
* is called frequently in a NUMA machine, so try to be as
@@ -969,23 +873,6 @@ unsigned long sum_zone_node_page_state(int node,
}

/*
- * Determine the per node value of a numa stat item. To avoid deviation,
- * the per cpu stat number in vm_numa_stat_diff[] is also included.
- */
-unsigned long sum_zone_numa_state(int node,
- enum numa_stat_item item)
-{
- struct zone *zones = NODE_DATA(node)->node_zones;
- int i;
- unsigned long count = 0;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- count += zone_numa_state_snapshot(zones + i, item);
-
- return count;
-}
-
-/*
* Determine the per node value of a stat item.
*/
unsigned long node_page_state(struct pglist_data *pgdat,
@@ -1569,8 +1456,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
seq_printf(m, "\n per-node stats");
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
seq_printf(m, "\n %-12s %lu",
- vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
- NR_VM_NUMA_STAT_ITEMS],
+ vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
node_page_state(pgdat, i));
}
}
@@ -1607,13 +1493,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
seq_printf(m, "\n %-12s %lu", vmstat_text[i],
zone_page_state(zone, i));

-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- seq_printf(m, "\n %-12s %lu",
- vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
- zone_numa_state_snapshot(zone, i));
-#endif
-
seq_printf(m, "\n pagesets");
for_each_online_cpu(i) {
struct per_cpu_pageset *pageset;
@@ -1688,7 +1567,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
if (*pos >= ARRAY_SIZE(vmstat_text))
return NULL;
stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
- NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) +
NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);

@@ -1704,12 +1582,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
v[i] = global_zone_page_state(i);
v += NR_VM_ZONE_STAT_ITEMS;

-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
- v[i] = global_numa_state(i);
- v += NR_VM_NUMA_STAT_ITEMS;
-#endif
-
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
v[i] = global_node_page_state(i);
v += NR_VM_NODE_STAT_ITEMS;
@@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write,
err = -EINVAL;
}
}
-#ifdef CONFIG_NUMA
- for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
- val = atomic_long_read(&vm_numa_stat[i]);
- if (val < 0) {
- pr_warn("%s: %s %ld\n",
- __func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val);
- err = -EINVAL;
- }
- }
-#endif
if (err)
return err;
if (write)
@@ -1862,9 +1724,6 @@ static bool need_update(int cpu)
struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);

BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
-#ifdef CONFIG_NUMA
- BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2);
-#endif

/*
* The fast way of checking if there are any vmstat diffs.
@@ -1872,10 +1731,6 @@ static bool need_update(int cpu)
*/
if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
return true;
-#ifdef CONFIG_NUMA
- if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS))
- return true;
-#endif
}
return false;
}
--
2.7.4

2017-12-19 06:41:42

by Kemi Wang

Subject: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

The s8 type used for the vm_node_stat_diff[] local cpu counters limits how
high the per-cpu update threshold can be, and therefore how infrequently
the global counters are updated. This hurts especially for monotonically
increasing counters like the NUMA counters as the number of cpus/nodes
grows. This patch extends the type of vm_node_stat_diff from s8 to s16
without any functional change.

                                   before   after
sizeof(struct per_cpu_nodestat)        28      68

Signed-off-by: Kemi Wang <[email protected]>
---
include/linux/mmzone.h | 4 ++--
mm/vmstat.c | 16 ++++++++--------
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c06d880..2da6b6f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -289,8 +289,8 @@ struct per_cpu_pageset {
};

struct per_cpu_nodestat {
- s8 stat_threshold;
- s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+ s16 stat_threshold;
+ s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
};

#endif /* !__GENERATING_BOUNDS.H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1dd12ae..9c681cc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
long delta)
{
struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
- s8 __percpu *p = pcp->vm_node_stat_diff + item;
+ s16 __percpu *p = pcp->vm_node_stat_diff + item;
long x;
long t;

@@ -390,13 +390,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
{
struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
- s8 __percpu *p = pcp->vm_node_stat_diff + item;
- s8 v, t;
+ s16 __percpu *p = pcp->vm_node_stat_diff + item;
+ s16 v, t;

v = __this_cpu_inc_return(*p);
t = __this_cpu_read(pcp->stat_threshold);
if (unlikely(v > t)) {
- s8 overstep = t >> 1;
+ s16 overstep = t >> 1;

node_page_state_add(v + overstep, pgdat, item);
__this_cpu_write(*p, -overstep);
@@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
{
struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
- s8 __percpu *p = pcp->vm_node_stat_diff + item;
- s8 v, t;
+ s16 __percpu *p = pcp->vm_node_stat_diff + item;
+ s16 v, t;

v = __this_cpu_dec_return(*p);
t = __this_cpu_read(pcp->stat_threshold);
if (unlikely(v < - t)) {
- s8 overstep = t >> 1;
+ s16 overstep = t >> 1;

node_page_state_add(v - overstep, pgdat, item);
__this_cpu_write(*p, overstep);
@@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
enum node_stat_item item, int delta, int overstep_mode)
{
struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
- s8 __percpu *p = pcp->vm_node_stat_diff + item;
+ s16 __percpu *p = pcp->vm_node_stat_diff + item;
long o, n, t, z;

do {
--
2.7.4

2017-12-19 06:41:53

by Kemi Wang

Subject: [PATCH v2 5/5] mm: Rename zone_statistics() to numa_statistics()

zone_statistics() only updates NUMA counters, and NUMA statistics have now
been separated from the zone statistics framework, so the function name is
confusing. Rename it to numa_statistics() and update its call sites
accordingly.

Signed-off-by: Kemi Wang <[email protected]>
---
mm/page_alloc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e8d8f..f7583de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2790,7 +2790,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
*
* Must be called with interrupts disabled.
*/
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void numa_statistics(struct zone *preferred_zone, struct zone *z)
{
#ifdef CONFIG_NUMA
int preferred_nid = preferred_zone->node;
@@ -2854,7 +2854,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
page = __rmqueue_pcplist(zone, migratetype, pcp, list);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
- zone_statistics(preferred_zone, zone);
+ numa_statistics(preferred_zone, zone);
}
local_irq_restore(flags);
return page;
@@ -2902,7 +2902,7 @@ struct page *rmqueue(struct zone *preferred_zone,
get_pcppage_migratetype(page));

__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
- zone_statistics(preferred_zone, zone);
+ numa_statistics(preferred_zone, zone);
local_irq_restore(flags);

out:
--
2.7.4

2017-12-19 06:41:46

by Kemi Wang

Subject: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

We have seen significant overhead from cache bouncing caused by NUMA
counter updates in multi-threaded page allocation. See commit 1d90ca897cb0
("mm: update NUMA counter threshold size") for more details.

This patch gives the NUMA counters a fixed threshold of (S16_MAX - 2), so
their global counter updates use a much larger threshold than the other
node page stats.

Signed-off-by: Kemi Wang <[email protected]>
---
mm/vmstat.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9c681cc..64e08ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,6 +30,8 @@

#include "internal.h"

+#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
+
#ifdef CONFIG_NUMA
int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;

@@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
s16 v, t;

v = __this_cpu_inc_return(*p);
- t = __this_cpu_read(pcp->stat_threshold);
+ if (item >= NR_VM_NUMA_STAT_ITEMS)
+ t = __this_cpu_read(pcp->stat_threshold);
+ else
+ t = VM_NUMA_STAT_THRESHOLD;
+
if (unlikely(v > t)) {
s16 overstep = t >> 1;

@@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
* Most of the time the thresholds are the same anyways
* for all cpus in a node.
*/
- t = this_cpu_read(pcp->stat_threshold);
+ if (item >= NR_VM_NUMA_STAT_ITEMS)
+ t = this_cpu_read(pcp->stat_threshold);
+ else
+ t = VM_NUMA_STAT_THRESHOLD;

o = this_cpu_read(*p);
n = delta + o;
--
2.7.4

2017-12-19 06:42:12

by Kemi Wang

Subject: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

To avoid deviation, this patch uses node_page_state_snapshot instead of
node_page_state when querying node page stats, e.g. for
cat /proc/zoneinfo
cat /sys/devices/system/node/node*/vmstat
cat /sys/devices/system/node/node*/numastat

As these are slow paths that are not read frequently, I would not worry
about the extra cost of the snapshot.

Signed-off-by: Kemi Wang <[email protected]>
---
drivers/base/node.c | 17 ++++++++++-------
mm/vmstat.c | 2 +-
2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index a045ea1..cf303f8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
"interleave_hit %lu\n"
"local_node %lu\n"
"other_node %lu\n",
- node_page_state(NODE_DATA(dev->id), NUMA_HIT),
- node_page_state(NODE_DATA(dev->id), NUMA_MISS),
- node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
- node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
- node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
- node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
+ node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
+ node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
+ node_page_state_snapshot(NODE_DATA(dev->id),
+ NUMA_FOREIGN),
+ node_page_state_snapshot(NODE_DATA(dev->id),
+ NUMA_INTERLEAVE_HIT),
+ node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
+ node_page_state_snapshot(NODE_DATA(dev->id),
+ NUMA_OTHER));
}

static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
@@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
n += sprintf(buf+n, "%s %lu\n",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
- node_page_state(pgdat, i));
+ node_page_state_snapshot(pgdat, i));

return n;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 64e08ae..d65f28d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
seq_printf(m, "\n %-12s %lu",
vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
- node_page_state(pgdat, i));
+ node_page_state_snapshot(pgdat, i));
}
}
seq_printf(m,
--
2.7.4

2017-12-19 12:28:55

by Michal Hocko

Subject: Re: [PATCH v2 1/5] mm: migrate NUMA stats from per-zone to per-node

On Tue 19-12-17 14:39:22, Kemi Wang wrote:
> There is not really any use for NUMA stats separated by zone, and the
> current per-zone NUMA stats are only consumed in /proc/zoneinfo. For code
> cleanup purposes, move NUMA stats from per-zone to per-node and reuse the
> existing per-cpu infrastructure.

Let's hope that nobody really depends on the per-zone numbers. It would
be really strange, as those counters are inherently per-node and that is
what users should care about, but who knows...

Anyway, I had hoped we could get rid of NR_VM_NUMA_STAT_ITEMS, but your
patch keeps it and follow-up patches even use it further. I will comment
on those separately, but this still makes these few counters really
special, which I think is wrong.

> Suggested-by: Andi Kleen <[email protected]>
> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Kemi Wang <[email protected]>

I have to fully grasp the rest of the series before I give my Ack,
but I _really_ like the simplification this adds to the code. I believe
it can be even simpler.

> [...]

--
Michal Hocko
SUSE Labs

2017-12-19 12:38:36

by Michal Hocko

Subject: Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

On Tue 19-12-17 14:39:23, Kemi Wang wrote:
> The s8 type used for the vm_node_stat_diff[] local cpu counters limits how
> high the per-cpu update threshold can be, and therefore how infrequently
> the global counters are updated. This hurts especially for monotonically
> increasing counters like the NUMA counters as the number of cpus/nodes
> grows. This patch extends the type of vm_node_stat_diff from s8 to s16
> without any functional change.
>
>                                    before   after
> sizeof(struct per_cpu_nodestat)        28      68

So it is 40B * num_cpus * num_nodes. Nothing really catastrophic IMHO,
but the changelog is a bit silent about any numbers. This is a
performance optimization, so it had better give us some.

> [...]

--
Michal Hocko
SUSE Labs

2017-12-19 12:40:48

by Michal Hocko

Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

On Tue 19-12-17 14:39:24, Kemi Wang wrote:
> We have seen significant overhead from cache bouncing caused by NUMA
> counter updates in multi-threaded page allocation. See commit 1d90ca897cb0
> ("mm: update NUMA counter threshold size") for more details.
>
> This patch gives the NUMA counters a fixed threshold of (S16_MAX - 2), so
> their global counter updates use a much larger threshold than the other
> node page stats.

Again, no numbers. To be honest I do not really like the special casing
here. Why are numa counters any different from PGALLOC which is
incremented for _every_ single page allocation?

> [...]

--
Michal Hocko
SUSE Labs

2017-12-19 12:43:21

by Michal Hocko

Subject: Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

On Tue 19-12-17 14:39:25, Kemi Wang wrote:
> To avoid deviation, this patch uses node_page_state_snapshot instead of
> node_page_state when querying node page stats, e.g. for
> cat /proc/zoneinfo
> cat /sys/devices/system/node/node*/vmstat
> cat /sys/devices/system/node/node*/numastat
>
> As these are slow paths that are not read frequently, I would not worry
> about the extra cost of the snapshot.

The changelog doesn't explain why these counters need any special
treatment. The _snapshot variants were used only for internal handling
where the precision really mattered. We do not have any in-tree user and
Jack has removed this by http://lkml.kernel.org/r/[email protected]
which is already sitting in the mmotm tree. We can re-add it but that
would really require a _very good_ reason.

> [...]

--
Michal Hocko
SUSE Labs

2017-12-19 12:44:04

by Michal Hocko

Subject: Re: [PATCH v2 5/5] mm: Rename zone_statistics() to numa_statistics()

On Tue 19-12-17 14:39:26, Kemi Wang wrote:
> zone_statistics() only updates NUMA counters, and NUMA statistics have now
> been separated from the zone statistics framework, so the function name is
> confusing. Rename it to numa_statistics() and update its call sites
> accordingly.
>
> Signed-off-by: Kemi Wang <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> [...]

--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

On Tue, 19 Dec 2017, Kemi Wang wrote:

> The s8 type used for the vm_node_stat_diff[] local cpu counters limits how
> high the per-cpu update threshold can be, and therefore how infrequently
> the global counters are updated. This hurts especially for monotonically
> increasing counters like the NUMA counters as the number of cpus/nodes
> grows. This patch extends the type of vm_node_stat_diff from s8 to s16
> without any functional change.

Well, the reason for s8 was to keep the data structures small so that they
fit in the higher-level cpu caches. The larger these structures become, the
more cachelines are used by the counters, and the larger the performance
impact on code that should not be affected by this overhead.

2017-12-19 16:20:42

by Michal Hocko

Subject: Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

On Tue 19-12-17 10:05:48, Christopher Lameter wrote:
> On Tue, 19 Dec 2017, Kemi Wang wrote:
>
> > The s8 type used for the vm_node_stat_diff[] local cpu counters limits how
> > high the per-cpu update threshold can be, and therefore how infrequently
> > the global counters are updated. This hurts especially for monotonically
> > increasing counters like the NUMA counters as the number of cpus/nodes
> > grows. This patch extends the type of vm_node_stat_diff from s8 to s16
> > without any functional change.
>
> Well, the reason for s8 was to keep the data structures small so that they
> fit in the higher-level cpu caches. The larger these structures become, the
> more cachelines are used by the counters, and the larger the performance
> impact on code that should not be affected by this overhead.

I am not sure I understand. We usually do not access many counters in
a single code path (well, PGALLOC and the NUMA counters are more of an
exception). So it is rarely an advantage that the whole array is in the
same cache line. Besides, this is allocated by the percpu allocator, which
aligns to the type size rather than to cache lines AFAICS.

Maybe it used to be different back when the code was added, but arguing
about cache lines seems to be a bit problematic here. Maybe you have some
specific workloads which can prove me wrong?
--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16

On Tue, 19 Dec 2017, Michal Hocko wrote:

> > Well the reason for s8 was to keep the data structures small so that they
> > fit in the higher level cpu caches. The large these structures become the
> > more cachelines are used by the counters and the larger the performance
> > influence on the code that should not be impacted by the overhead.
>
> I am not sure I understand. We usually do not access more counters in
> the single code path (well, PGALLOC and NUMA counteres is more of an
> exception). So it is rarely an advantage that the whole array is in the
> same cache line. Besides that this is allocated by the percpu allocator
> aligns to the type size rather than cache lines AFAICS.

I thought we were talking about NUMA counters here?

Regardless: A typical fault, system call or OS action will access multiple
zone and node counters when allocating or freeing memory. Enlarging the
fields will increase the number of cachelines touched.

> Maybe it used to be all different back then when the code has been added
> but arguing about cache lines seems to be a bit problematic here. Maybe
> you have some specific workloads which can prove me wrong?

Run a workload that does some page faults? Heavy allocation and freeing of
memory?

Maybe that is no longer relevant since the number of counters is now so
large that the accesses are sparse and each action pulls in a whole
cacheline anyway. That is something we tried to avoid when implementing
the differentials.
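
For what it's worth, the size difference at stake can be illustrated with a
minimal userspace sketch (this is not kernel code; the item count below is an
assumption for illustration only, and the exact sizes depend on the kernel
configuration):

#include <stdio.h>

#define NR_VM_NODE_STAT_ITEMS 33	/* assumed item count, for illustration only */

struct per_cpu_nodestat_s8 {
	signed char stat_threshold;
	signed char vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
};

struct per_cpu_nodestat_s16 {
	short stat_threshold;
	short vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
};

int main(void)
{
	printf("s8  variant: %3zu bytes -> %zu cacheline(s) of 64B\n",
	       sizeof(struct per_cpu_nodestat_s8),
	       (sizeof(struct per_cpu_nodestat_s8) + 63) / 64);
	printf("s16 variant: %3zu bytes -> %zu cacheline(s) of 64B\n",
	       sizeof(struct per_cpu_nodestat_s16),
	       (sizeof(struct per_cpu_nodestat_s16) + 63) / 64);
	return 0;
}

With the assumed item count this prints one cacheline for the s8 variant and
two for the s16 variant, which is the footprint growth being debated here.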


2017-12-20 03:07:31

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16



On 2017年12月19日 20:38, Michal Hocko wrote:
> On Tue 19-12-17 14:39:23, Kemi Wang wrote:
>> The type s8 used for vm_diff_nodestat[] as local cpu counters has the
>> limitation of global counters update frequency, especially for those
>> monotone increasing type of counters like NUMA counters with more and more
>> cpus/nodes. This patch extends the type of vm_diff_nodestat from s8 to s16
>> without any functionality change.
>>
>> before after
>> sizeof(struct per_cpu_nodestat) 28 68
>
> So it is 40B * num_cpus * num_nodes. Nothing really catastrophic IMHO
> but the changelog is a bit silent about any numbers. This is a
> performance optimization so it should better give us some.
>

This patch does not have any functionality change, so there is no performance
gain by itself.
I guess you are talking about the performance gain from the third patch, which
increases the threshold size of the NUMA counters.

>> Signed-off-by: Kemi Wang <[email protected]>
>> ---
>> include/linux/mmzone.h | 4 ++--
>> mm/vmstat.c | 16 ++++++++--------
>> 2 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index c06d880..2da6b6f 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -289,8 +289,8 @@ struct per_cpu_pageset {
>> };
>>
>> struct per_cpu_nodestat {
>> - s8 stat_threshold;
>> - s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
>> + s16 stat_threshold;
>> + s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
>> };
>>
>> #endif /* !__GENERATING_BOUNDS.H */
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 1dd12ae..9c681cc 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>> long delta)
>> {
>> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> - s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> + s16 __percpu *p = pcp->vm_node_stat_diff + item;
>> long x;
>> long t;
>>
>> @@ -390,13 +390,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>> void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>> {
>> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> - s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> - s8 v, t;
>> + s16 __percpu *p = pcp->vm_node_stat_diff + item;
>> + s16 v, t;
>>
>> v = __this_cpu_inc_return(*p);
>> t = __this_cpu_read(pcp->stat_threshold);
>> if (unlikely(v > t)) {
>> - s8 overstep = t >> 1;
>> + s16 overstep = t >> 1;
>>
>> node_page_state_add(v + overstep, pgdat, item);
>> __this_cpu_write(*p, -overstep);
>> @@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>> void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>> {
>> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> - s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> - s8 v, t;
>> + s16 __percpu *p = pcp->vm_node_stat_diff + item;
>> + s16 v, t;
>>
>> v = __this_cpu_dec_return(*p);
>> t = __this_cpu_read(pcp->stat_threshold);
>> if (unlikely(v < - t)) {
>> - s8 overstep = t >> 1;
>> + s16 overstep = t >> 1;
>>
>> node_page_state_add(v - overstep, pgdat, item);
>> __this_cpu_write(*p, overstep);
>> @@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>> enum node_stat_item item, int delta, int overstep_mode)
>> {
>> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>> - s8 __percpu *p = pcp->vm_node_stat_diff + item;
>> + s16 __percpu *p = pcp->vm_node_stat_diff + item;
>> long o, n, t, z;
>>
>> do {
>> --
>> 2.7.4
>>
>

2017-12-20 05:34:19

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] mm: migrate NUMA stats from per-zone to per-node



On 2017年12月19日 20:28, Michal Hocko wrote:
> On Tue 19-12-17 14:39:22, Kemi Wang wrote:
>> There is not really any use to get NUMA stats separated by zone, and
>> current per-zone NUMA stats is only consumed in /proc/zoneinfo. For code
>> cleanup purpose, we move NUMA stats from per-zone to per-node and reuse the
>> existed per-cpu infrastructure.
>
> Let's hope that nobody really depends on the per-zone numbers. It would
> be really strange as those counters are inherently per-node and that is
> what users should care about but who knows...
>
> Anyway, I hoped we could get rid of NR_VM_NUMA_STAT_ITEMS but your patch
> keeps it and follow up patches even use it further. I will comment on
> those separately but this still makes these few counters really special
> which I think is wrong.
>

Well, that's the best I can think of to keep a balance between performance
and simplification. If you have a better idea, please post it and I will
surely follow it.

>> Suggested-by: Andi Kleen <[email protected]>
>> Suggested-by: Michal Hocko <[email protected]>
>> Signed-off-by: Kemi Wang <[email protected]>
>
> I have to fully grasp the rest of the series before I'll give my Ack,
> but I _really_ like the simplification this adds to the code. I believe
> it can be even simpler.
>
>> ---
>> drivers/base/node.c | 23 +++----
>> include/linux/mmzone.h | 27 ++++----
>> include/linux/vmstat.h | 31 ---------
>> mm/mempolicy.c | 2 +-
>> mm/page_alloc.c | 16 +++--
>> mm/vmstat.c | 177 +++++--------------------------------------------
>> 6 files changed, 46 insertions(+), 230 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index ee090ab..a045ea1 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -169,13 +169,14 @@ static ssize_t node_read_numastat(struct device *dev,
>> "interleave_hit %lu\n"
>> "local_node %lu\n"
>> "other_node %lu\n",
>> - sum_zone_numa_state(dev->id, NUMA_HIT),
>> - sum_zone_numa_state(dev->id, NUMA_MISS),
>> - sum_zone_numa_state(dev->id, NUMA_FOREIGN),
>> - sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
>> - sum_zone_numa_state(dev->id, NUMA_LOCAL),
>> - sum_zone_numa_state(dev->id, NUMA_OTHER));
>> + node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>> + node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>> + node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>> + node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>> + node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>> + node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>> }
>> +
>> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>>
>> static ssize_t node_read_vmstat(struct device *dev,
>> @@ -190,17 +191,9 @@ static ssize_t node_read_vmstat(struct device *dev,
>> n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
>> sum_zone_node_page_state(nid, i));
>>
>> -#ifdef CONFIG_NUMA
>> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
>> - n += sprintf(buf+n, "%s %lu\n",
>> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> - sum_zone_numa_state(nid, i));
>> -#endif
>> -
>> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> n += sprintf(buf+n, "%s %lu\n",
>> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
>> - NR_VM_NUMA_STAT_ITEMS],
>> + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> node_page_state(pgdat, i));
>>
>> return n;
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 67f2e3c..c06d880 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -115,20 +115,6 @@ struct zone_padding {
>> #define ZONE_PADDING(name)
>> #endif
>>
>> -#ifdef CONFIG_NUMA
>> -enum numa_stat_item {
>> - NUMA_HIT, /* allocated in intended node */
>> - NUMA_MISS, /* allocated in non intended node */
>> - NUMA_FOREIGN, /* was intended here, hit elsewhere */
>> - NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
>> - NUMA_LOCAL, /* allocation from local node */
>> - NUMA_OTHER, /* allocation from other node */
>> - NR_VM_NUMA_STAT_ITEMS
>> -};
>> -#else
>> -#define NR_VM_NUMA_STAT_ITEMS 0
>> -#endif
>> -
>> enum zone_stat_item {
>> /* First 128 byte cacheline (assuming 64 bit words) */
>> NR_FREE_PAGES,
>> @@ -151,7 +137,18 @@ enum zone_stat_item {
>> NR_VM_ZONE_STAT_ITEMS };
>>
>> enum node_stat_item {
>> - NR_LRU_BASE,
>> +#ifdef CONFIG_NUMA
>> + NUMA_HIT, /* allocated in intended node */
>> + NUMA_MISS, /* allocated in non intended node */
>> + NUMA_FOREIGN, /* was intended here, hit elsewhere */
>> + NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */
>> + NUMA_LOCAL, /* allocation from local node */
>> + NUMA_OTHER, /* allocation from other node */
>> + NR_VM_NUMA_STAT_ITEMS,
>> +#else
>> +#define NR_VM_NUMA_STAT_ITEMS 0
>> +#endif
>> + NR_LRU_BASE = NR_VM_NUMA_STAT_ITEMS,
>> NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
>> NR_ACTIVE_ANON, /* " " " " " */
>> NR_INACTIVE_FILE, /* " " " " " */
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 1779c98..80bf290 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -118,37 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
>> * Zone and node-based page accounting with per cpu differentials.
>> */
>> extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
>> -extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
>>
>> -#ifdef CONFIG_NUMA
>> -static inline void zone_numa_state_add(long x, struct zone *zone,
>> - enum numa_stat_item item)
>> -{
>> - atomic_long_add(x, &zone->vm_numa_stat[item]);
>> - atomic_long_add(x, &vm_numa_stat[item]);
>> -}
>> -
>> -static inline unsigned long global_numa_state(enum numa_stat_item item)
>> -{
>> - long x = atomic_long_read(&vm_numa_stat[item]);
>> -
>> - return x;
>> -}
>> -
>> -static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> - enum numa_stat_item item)
>> -{
>> - long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> - int cpu;
>> -
>> - for_each_online_cpu(cpu)
>> - x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>> -
>> - return x;
>> -}
>> -#endif /* CONFIG_NUMA */
>> -
>> static inline void zone_page_state_add(long x, struct zone *zone,
>> enum zone_stat_item item)
>> {
>> @@ -234,10 +205,8 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
>>
>>
>> #ifdef CONFIG_NUMA
>> -extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
>> extern unsigned long sum_zone_node_page_state(int node,
>> enum zone_stat_item item);
>> -extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
>> extern unsigned long node_page_state(struct pglist_data *pgdat,
>> enum node_stat_item item);
>> #else
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 4ce44d3..b2293e3 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -1920,7 +1920,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
>> return page;
>> if (page && page_to_nid(page) == nid) {
>> preempt_disable();
>> - __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
>> + inc_node_state(page_pgdat(page), NUMA_INTERLEAVE_HIT);
>> preempt_enable();
>> }
>> return page;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 7e5e775..81e8d8f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2793,22 +2793,24 @@ int __isolate_free_page(struct page *page, unsigned int order)
>> static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
>> {
>> #ifdef CONFIG_NUMA
>> - enum numa_stat_item local_stat = NUMA_LOCAL;
>> + int preferred_nid = preferred_zone->node;
>> + int nid = z->node;
>> + enum node_stat_item local_stat = NUMA_LOCAL;
>>
>> /* skip numa counters update if numa stats is disabled */
>> if (!static_branch_likely(&vm_numa_stat_key))
>> return;
>>
>> - if (z->node != numa_node_id())
>> + if (nid != numa_node_id())
>> local_stat = NUMA_OTHER;
>>
>> - if (z->node == preferred_zone->node)
>> - __inc_numa_state(z, NUMA_HIT);
>> + if (nid == preferred_nid)
>> + inc_node_state(NODE_DATA(nid), NUMA_HIT);
>> else {
>> - __inc_numa_state(z, NUMA_MISS);
>> - __inc_numa_state(preferred_zone, NUMA_FOREIGN);
>> + inc_node_state(NODE_DATA(nid), NUMA_MISS);
>> + inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN);
>> }
>> - __inc_numa_state(z, local_stat);
>> + inc_node_state(NODE_DATA(nid), local_stat);
>> #endif
>> }
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 40b2db6..1dd12ae 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,46 +30,44 @@
>>
>> #include "internal.h"
>>
>> -#define NUMA_STATS_THRESHOLD (U16_MAX - 2)
>> -
>> #ifdef CONFIG_NUMA
>> int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>>
>> -/* zero numa counters within a zone */
>> -static void zero_zone_numa_counters(struct zone *zone)
>> +/* zero numa stats within a node */
>> +static void zero_node_numa_stats(int node)
>> {
>> int item, cpu;
>>
>> for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
>> - atomic_long_set(&zone->vm_numa_stat[item], 0);
>> + atomic_long_set(&(NODE_DATA(node)->vm_stat[item]), 0);
>> for_each_online_cpu(cpu)
>> - per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]
>> - = 0;
>> + per_cpu_ptr(NODE_DATA(node)->per_cpu_nodestats,
>> + cpu)->vm_node_stat_diff[item] = 0;
>> }
>> }
>>
>> -/* zero numa counters of all the populated zones */
>> -static void zero_zones_numa_counters(void)
>> +/* zero numa stats of all the online nodes */
>> +static void zero_nodes_numa_stats(void)
>> {
>> - struct zone *zone;
>> + int node;
>>
>> - for_each_populated_zone(zone)
>> - zero_zone_numa_counters(zone);
>> + for_each_online_node(node)
>> + zero_node_numa_stats(node);
>> }
>>
>> -/* zero global numa counters */
>> -static void zero_global_numa_counters(void)
>> +/* zero global numa stats */
>> +static void zero_global_numa_stats(void)
>> {
>> int item;
>>
>> for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
>> - atomic_long_set(&vm_numa_stat[item], 0);
>> + atomic_long_set(&vm_node_stat[item], 0);
>> }
>>
>> static void invalid_numa_statistics(void)
>> {
>> - zero_zones_numa_counters();
>> - zero_global_numa_counters();
>> + zero_nodes_numa_stats();
>> + zero_global_numa_stats();
>> }
>>
>> static DEFINE_MUTEX(vm_numa_stat_lock);
>> @@ -160,10 +158,8 @@ void vm_events_fold_cpu(int cpu)
>> * vm_stat contains the global counters
>> */
>> atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
>> -atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
>> atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
>> EXPORT_SYMBOL(vm_zone_stat);
>> -EXPORT_SYMBOL(vm_numa_stat);
>> EXPORT_SYMBOL(vm_node_stat);
>>
>> #ifdef CONFIG_SMP
>> @@ -679,32 +675,6 @@ EXPORT_SYMBOL(dec_node_page_state);
>> * Fold a differential into the global counters.
>> * Returns the number of counters updated.
>> */
>> -#ifdef CONFIG_NUMA
>> -static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
>> -{
>> - int i;
>> - int changes = 0;
>> -
>> - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>> - if (zone_diff[i]) {
>> - atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
>> - changes++;
>> - }
>> -
>> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
>> - if (numa_diff[i]) {
>> - atomic_long_add(numa_diff[i], &vm_numa_stat[i]);
>> - changes++;
>> - }
>> -
>> - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> - if (node_diff[i]) {
>> - atomic_long_add(node_diff[i], &vm_node_stat[i]);
>> - changes++;
>> - }
>> - return changes;
>> -}
>> -#else
>> static int fold_diff(int *zone_diff, int *node_diff)
>> {
>> int i;
>> @@ -723,7 +693,6 @@ static int fold_diff(int *zone_diff, int *node_diff)
>> }
>> return changes;
>> }
>> -#endif /* CONFIG_NUMA */
>>
>> /*
>> * Update the zone counters for the current cpu.
>> @@ -747,9 +716,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>> struct zone *zone;
>> int i;
>> int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
>> -#ifdef CONFIG_NUMA
>> - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
>> -#endif
>> int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
>> int changes = 0;
>>
>> @@ -771,18 +737,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>> }
>> }
>> #ifdef CONFIG_NUMA
>> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
>> - int v;
>> -
>> - v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0);
>> - if (v) {
>> -
>> - atomic_long_add(v, &zone->vm_numa_stat[i]);
>> - global_numa_diff[i] += v;
>> - __this_cpu_write(p->expire, 3);
>> - }
>> - }
>> -
>> if (do_pagesets) {
>> cond_resched();
>> /*
>> @@ -829,12 +783,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>> }
>> }
>>
>> -#ifdef CONFIG_NUMA
>> - changes += fold_diff(global_zone_diff, global_numa_diff,
>> - global_node_diff);
>> -#else
>> changes += fold_diff(global_zone_diff, global_node_diff);
>> -#endif
>> return changes;
>> }
>>
>> @@ -849,9 +798,6 @@ void cpu_vm_stats_fold(int cpu)
>> struct zone *zone;
>> int i;
>> int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
>> -#ifdef CONFIG_NUMA
>> - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
>> -#endif
>> int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
>>
>> for_each_populated_zone(zone) {
>> @@ -868,18 +814,6 @@ void cpu_vm_stats_fold(int cpu)
>> atomic_long_add(v, &zone->vm_stat[i]);
>> global_zone_diff[i] += v;
>> }
>> -
>> -#ifdef CONFIG_NUMA
>> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
>> - if (p->vm_numa_stat_diff[i]) {
>> - int v;
>> -
>> - v = p->vm_numa_stat_diff[i];
>> - p->vm_numa_stat_diff[i] = 0;
>> - atomic_long_add(v, &zone->vm_numa_stat[i]);
>> - global_numa_diff[i] += v;
>> - }
>> -#endif
>> }
>>
>> for_each_online_pgdat(pgdat) {
>> @@ -898,11 +832,7 @@ void cpu_vm_stats_fold(int cpu)
>> }
>> }
>>
>> -#ifdef CONFIG_NUMA
>> - fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
>> -#else
>> fold_diff(global_zone_diff, global_node_diff);
>> -#endif
>> }
>>
>> /*
>> @@ -920,36 +850,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
>> atomic_long_add(v, &zone->vm_stat[i]);
>> atomic_long_add(v, &vm_zone_stat[i]);
>> }
>> -
>> -#ifdef CONFIG_NUMA
>> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
>> - if (pset->vm_numa_stat_diff[i]) {
>> - int v = pset->vm_numa_stat_diff[i];
>> -
>> - pset->vm_numa_stat_diff[i] = 0;
>> - atomic_long_add(v, &zone->vm_numa_stat[i]);
>> - atomic_long_add(v, &vm_numa_stat[i]);
>> - }
>> -#endif
>> }
>> #endif
>>
>> #ifdef CONFIG_NUMA
>> -void __inc_numa_state(struct zone *zone,
>> - enum numa_stat_item item)
>> -{
>> - struct per_cpu_pageset __percpu *pcp = zone->pageset;
>> - u16 __percpu *p = pcp->vm_numa_stat_diff + item;
>> - u16 v;
>> -
>> - v = __this_cpu_inc_return(*p);
>> -
>> - if (unlikely(v > NUMA_STATS_THRESHOLD)) {
>> - zone_numa_state_add(v, zone, item);
>> - __this_cpu_write(*p, 0);
>> - }
>> -}
>> -
>> /*
>> * Determine the per node value of a stat item. This function
>> * is called frequently in a NUMA machine, so try to be as
>> @@ -969,23 +873,6 @@ unsigned long sum_zone_node_page_state(int node,
>> }
>>
>> /*
>> - * Determine the per node value of a numa stat item. To avoid deviation,
>> - * the per cpu stat number in vm_numa_stat_diff[] is also included.
>> - */
>> -unsigned long sum_zone_numa_state(int node,
>> - enum numa_stat_item item)
>> -{
>> - struct zone *zones = NODE_DATA(node)->node_zones;
>> - int i;
>> - unsigned long count = 0;
>> -
>> - for (i = 0; i < MAX_NR_ZONES; i++)
>> - count += zone_numa_state_snapshot(zones + i, item);
>> -
>> - return count;
>> -}
>> -
>> -/*
>> * Determine the per node value of a stat item.
>> */
>> unsigned long node_page_state(struct pglist_data *pgdat,
>> @@ -1569,8 +1456,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>> seq_printf(m, "\n per-node stats");
>> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>> seq_printf(m, "\n %-12s %lu",
>> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
>> - NR_VM_NUMA_STAT_ITEMS],
>> + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> node_page_state(pgdat, i));
>> }
>> }
>> @@ -1607,13 +1493,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>> seq_printf(m, "\n %-12s %lu", vmstat_text[i],
>> zone_page_state(zone, i));
>>
>> -#ifdef CONFIG_NUMA
>> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
>> - seq_printf(m, "\n %-12s %lu",
>> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> - zone_numa_state_snapshot(zone, i));
>> -#endif
>> -
>> seq_printf(m, "\n pagesets");
>> for_each_online_cpu(i) {
>> struct per_cpu_pageset *pageset;
>> @@ -1688,7 +1567,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>> if (*pos >= ARRAY_SIZE(vmstat_text))
>> return NULL;
>> stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
>> - NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) +
>> NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
>> NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);
>>
>> @@ -1704,12 +1582,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>> v[i] = global_zone_page_state(i);
>> v += NR_VM_ZONE_STAT_ITEMS;
>>
>> -#ifdef CONFIG_NUMA
>> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
>> - v[i] = global_numa_state(i);
>> - v += NR_VM_NUMA_STAT_ITEMS;
>> -#endif
>> -
>> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> v[i] = global_node_page_state(i);
>> v += NR_VM_NODE_STAT_ITEMS;
>> @@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write,
>> err = -EINVAL;
>> }
>> }
>> -#ifdef CONFIG_NUMA
>> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
>> - val = atomic_long_read(&vm_numa_stat[i]);
>> - if (val < 0) {
>> - pr_warn("%s: %s %ld\n",
>> - __func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val);
>> - err = -EINVAL;
>> - }
>> - }
>> -#endif
>> if (err)
>> return err;
>> if (write)
>> @@ -1862,9 +1724,6 @@ static bool need_update(int cpu)
>> struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
>>
>> BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
>> -#ifdef CONFIG_NUMA
>> - BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2);
>> -#endif
>>
>> /*
>> * The fast way of checking if there are any vmstat diffs.
>> @@ -1872,10 +1731,6 @@ static bool need_update(int cpu)
>> */
>> if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
>> return true;
>> -#ifdef CONFIG_NUMA
>> - if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS))
>> - return true;
>> -#endif
>> }
>> return false;
>> }
>> --
>> 2.7.4
>>
>

2017-12-20 05:54:18

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size



On 2017年12月19日 20:40, Michal Hocko wrote:
> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>> We have seen significant overhead in cache bouncing caused by NUMA counters
>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>> update NUMA counter threshold size")' for more details.
>>
>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
>> with global counter update using different threshold size for node page
>> stats.
>
> Again, no numbers.

Compared to the vanilla kernel, I don't think it has a performance improvement,
so I didn't post performance data here.
But if you would like to see the performance gain from enlarging the threshold
size for NUMA stats (compared to the first patch), I will do that later.

> To be honest I do not really like the special casing
> here. Why are numa counters any different from PGALLOC which is
> incremented for _every_ single page allocation?
>

I guess you mean the PGALLOC event.
The count for this event is kept in the local cpu and summed up
(for_each_online_cpu) when needed. That is similar to the way I handled NUMA
stats in the V1 patch series. Good enough.
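
(For illustration only, here is a minimal userspace sketch of that
"keep per-cpu, fold on read" pattern; NR_CPUS and the names below are
assumptions for the sketch, not kernel identifiers:)

#include <stdio.h>

#define NR_CPUS 4				/* assumed cpu count for the sketch */

static unsigned long pgalloc_event[NR_CPUS];	/* one private counter per cpu */

static void count_event(int cpu)
{
	pgalloc_event[cpu]++;			/* fast path: purely local increment */
}

static unsigned long sum_event(void)
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)	/* slow path: fold all cpus on read */
		sum += pgalloc_event[cpu];
	return sum;
}

int main(void)
{
	count_event(0);
	count_event(1);
	count_event(1);
	printf("total events: %lu\n", sum_event());
	return 0;
}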

>> ---
>> mm/vmstat.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 9c681cc..64e08ae 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>
>> #include "internal.h"
>>
>> +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
>> +
>> #ifdef CONFIG_NUMA
>> int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>>
>> @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>> s16 v, t;
>>
>> v = __this_cpu_inc_return(*p);
>> - t = __this_cpu_read(pcp->stat_threshold);
>> + if (item >= NR_VM_NUMA_STAT_ITEMS)
>> + t = __this_cpu_read(pcp->stat_threshold);
>> + else
>> + t = VM_NUMA_STAT_THRESHOLD;
>> +
>> if (unlikely(v > t)) {
>> s16 overstep = t >> 1;
>>
>> @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>> * Most of the time the thresholds are the same anyways
>> * for all cpus in a node.
>> */
>> - t = this_cpu_read(pcp->stat_threshold);
>> + if (item >= NR_VM_NUMA_STAT_ITEMS)
>> + t = this_cpu_read(pcp->stat_threshold);
>> + else
>> + t = VM_NUMA_STAT_THRESHOLD;
>>
>> o = this_cpu_read(*p);
>> n = delta + o;
>> --
>> 2.7.4
>>
>

2017-12-20 06:09:40

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation



On 2017年12月19日 20:43, Michal Hocko wrote:
> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
>> To avoid deviation, this patch uses node_page_state_snapshot instead of
>> node_page_state for node page stats query.
>> e.g. cat /proc/zoneinfo
>> cat /sys/devices/system/node/node*/vmstat
>> cat /sys/devices/system/node/node*/numastat
>>
>> As it is a slow path and would not be read frequently, I would worry about
>> it.
>
> The changelog doesn't explain why these counters needs any special
> treatment. _snapshot variants where used only for internal handling
> where the precision really mattered. We do not have any in-tree user and
> Jack has removed this by http://lkml.kernel.org/r/[email protected]
> which is already sitting in the mmotm tree. We can re-add it but that
> would really require a _very good_ reason.
>

Assume we have *nr* cpus and the threshold size is *t*; the maximum deviation
is then nr*t. Skylake platforms already have hundreds of CPUs and the number is
still increasing. Even when the threshold size is capped at 125 (it is 32765
for NUMA counters now), the deviation is a little too big, as I mentioned in
the log. I tend to sum up the numbers in the local cpus when querying the
global stats.

Also, node_page_state_snapshot is only called in the slow path, so I don't
think the extra cost would be a big problem.

Anyway, it is a matter of taste. I just think it's better to have it.
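
As a back-of-the-envelope illustration of that nr*t worst case (the cpu count
here is only an assumption for the example):

#include <stdio.h>

int main(void)
{
	unsigned long nr_cpus = 112;	/* assumed: e.g. a 2-socket Skylake box */
	unsigned long t_small = 125;	/* typical per-cpu stat_threshold cap */
	unsigned long t_numa = 32765;	/* S16_MAX - 2 used for NUMA counters */

	printf("max deviation with small threshold: %lu events\n",
	       nr_cpus * t_small);
	printf("max deviation with NUMA threshold:  %lu events\n",
	       nr_cpus * t_numa);
	return 0;
}

With those assumptions the worst case grows from 14,000 to about 3.7 million
uncounted events, which is why the snapshot variant sums the per-cpu diffs.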

>> Signed-off-by: Kemi Wang <[email protected]>
>> ---
>> drivers/base/node.c | 17 ++++++++++-------
>> mm/vmstat.c | 2 +-
>> 2 files changed, 11 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index a045ea1..cf303f8 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>> "interleave_hit %lu\n"
>> "local_node %lu\n"
>> "other_node %lu\n",
>> - node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>> - node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>> - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>> - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>> - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>> - node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
>> + node_page_state_snapshot(NODE_DATA(dev->id),
>> + NUMA_FOREIGN),
>> + node_page_state_snapshot(NODE_DATA(dev->id),
>> + NUMA_INTERLEAVE_HIT),
>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
>> + node_page_state_snapshot(NODE_DATA(dev->id),
>> + NUMA_OTHER));
>> }
>>
>> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> n += sprintf(buf+n, "%s %lu\n",
>> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> - node_page_state(pgdat, i));
>> + node_page_state_snapshot(pgdat, i));
>>
>> return n;
>> }
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 64e08ae..d65f28d 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>> seq_printf(m, "\n %-12s %lu",
>> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> - node_page_state(pgdat, i));
>> + node_page_state_snapshot(pgdat, i));
>> }
>> }
>> seq_printf(m,
>> --
>> 2.7.4
>>
>

2017-12-20 06:47:56

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16



On 2017年12月20日 01:21, Christopher Lameter wrote:
> On Tue, 19 Dec 2017, Michal Hocko wrote:
>
>>> Well the reason for s8 was to keep the data structures small so that they
>>> fit in the higher level cpu caches. The large these structures become the
>>> more cachelines are used by the counters and the larger the performance
>>> influence on the code that should not be impacted by the overhead.
>>
>> I am not sure I understand. We usually do not access more counters in
>> the single code path (well, PGALLOC and NUMA counteres is more of an
>> exception). So it is rarely an advantage that the whole array is in the
>> same cache line. Besides that this is allocated by the percpu allocator
>> aligns to the type size rather than cache lines AFAICS.
>
> I thought we are talking about NUMA counters here?
>
> Regardless: A typical fault, system call or OS action will access multiple
> zone and node counters when allocating or freeing memory. Enlarging the
> fields will increase the number of cachelines touched.
>

Yes, theoretically we add one more cache line to the access footprint.
But I don't think it would be a problem.
1) Not all of the counters need to be accessed in the fast path of page
allocation; the counters covered by a single cache line are usually enough for
that, so we probably don't need to touch one more cache line. I tend to agree
with Michal's argument.
Besides, in some slow paths where the code is protected by the zone lock or lru
lock, touching one more cache line would not be a big problem, since many other
cache lines are accessed there anyway.

2) Enlarging vm_node_stat_diff from s8 to s16 gives us the opportunity to keep
a larger count in the local cpus, which makes it possible to reduce the global
counter update frequency. Thus, we gain the benefit of reduced (expensive)
cache bouncing.

Well, if you still have some concerns, I can post some data for
will-it-scale.page_fault1. What the benchmark does is: it forks nr_cpu
processes and then each process does the following in a loop (a minimal
sketch follows below):
1) mmap() 128M of anonymous space;
2) write to each page there to trigger actual page allocation;
3) munmap() it.
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Or you can suggest some other benchmarks for which you want to see the
performance impact.
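
Here is that minimal sketch of the benchmark loop, as a plain standalone C
program (the iteration count and single-process form are my simplifications;
this is not the actual will-it-scale harness):

#define _GNU_SOURCE			/* for MAP_ANONYMOUS on some libcs */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (128UL << 20)	/* 128M anonymous mapping, as described above */

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	int iter;

	for (iter = 0; iter < 100; iter++) {	/* arbitrary loop count for the sketch */
		char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* touch one byte per page so every page is actually allocated */
		for (unsigned long off = 0; off < REGION_SIZE; off += page_size)
			p[off] = 1;
		munmap(p, REGION_SIZE);
	}
	return 0;
}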

>> Maybe it used to be all different back then when the code has been added
>> but arguing about cache lines seems to be a bit problematic here. Maybe
>> you have some specific workloads which can prove me wrong?
>
> Run a workload that does some page faults? Heavy allocation and freeing of
> memory?
>
> Maybe that is no longer relevant since the number of the counters is
> large that the accesses are so sparse that each action pulls in a whole
> cacheline. That would be something we tried to avoid when implementing
> the differentials.
>
>

2017-12-20 10:07:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

On Wed 20-12-17 14:07:35, kemi wrote:
>
>
> On 2017年12月19日 20:43, Michal Hocko wrote:
> > On Tue 19-12-17 14:39:25, Kemi Wang wrote:
> >> To avoid deviation, this patch uses node_page_state_snapshot instead of
> >> node_page_state for node page stats query.
> >> e.g. cat /proc/zoneinfo
> >> cat /sys/devices/system/node/node*/vmstat
> >> cat /sys/devices/system/node/node*/numastat
> >>
> >> As it is a slow path and would not be read frequently, I would worry about
> >> it.
> >
> > The changelog doesn't explain why these counters needs any special
> > treatment. _snapshot variants where used only for internal handling
> > where the precision really mattered. We do not have any in-tree user and
> > Jack has removed this by http://lkml.kernel.org/r/[email protected]
> > which is already sitting in the mmotm tree. We can re-add it but that
> > would really require a _very good_ reason.
> >
>
> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum deviation is nr*t.
> Currently, Skylake platform has hundreds of CPUs numbers and the number is still
> increasing. Also, even the threshold size is kept to 125 at maximum (32765
> for NUMA counters now), the deviation is just a little too big as I have mentioned in
> the log. I tend to sum the number in local cpus up when query the global stats.

This is a general problem of pcp accounting. So if it needs to be
addressed then do it the same way for all stats.

> Also, node_page_state_snapshot is only called in slow path and I don't think that
> would be a big problem.
>
> Anyway, it is a matter of taste. I just think it's better to have.

You are making numastats special and I haven't yet heard any sound
arguments for that. But that should be discussed in the respective
patch.

> >> Signed-off-by: Kemi Wang <[email protected]>
> >> ---
> >> drivers/base/node.c | 17 ++++++++++-------
> >> mm/vmstat.c | 2 +-
> >> 2 files changed, 11 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/base/node.c b/drivers/base/node.c
> >> index a045ea1..cf303f8 100644
> >> --- a/drivers/base/node.c
> >> +++ b/drivers/base/node.c
> >> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
> >> "interleave_hit %lu\n"
> >> "local_node %lu\n"
> >> "other_node %lu\n",
> >> - node_page_state(NODE_DATA(dev->id), NUMA_HIT),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_MISS),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
> >> - node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_FOREIGN),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_INTERLEAVE_HIT),
> >> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
> >> + node_page_state_snapshot(NODE_DATA(dev->id),
> >> + NUMA_OTHER));
> >> }
> >>
> >> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
> >> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
> >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> >> n += sprintf(buf+n, "%s %lu\n",
> >> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> >> - node_page_state(pgdat, i));
> >> + node_page_state_snapshot(pgdat, i));
> >>
> >> return n;
> >> }
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 64e08ae..d65f28d 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> >> seq_printf(m, "\n %-12s %lu",
> >> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> >> - node_page_state(pgdat, i));
> >> + node_page_state_snapshot(pgdat, i));
> >> }
> >> }
> >> seq_printf(m,
> >> --
> >> 2.7.4
> >>
> >

--
Michal Hocko
SUSE Labs

2017-12-20 10:12:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

On Wed 20-12-17 13:52:14, kemi wrote:
>
>
> On 2017年12月19日 20:40, Michal Hocko wrote:
> > On Tue 19-12-17 14:39:24, Kemi Wang wrote:
> >> We have seen significant overhead in cache bouncing caused by NUMA counters
> >> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
> >> update NUMA counter threshold size")' for more details.
> >>
> >> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
> >> with global counter update using different threshold size for node page
> >> stats.
> >
> > Again, no numbers.
>
> Compare to vanilla kernel, I don't think it has performance improvement, so
> I didn't post performance data here.
> But, if you would like to see performance gain from enlarging threshold size
> for NUMA stats (compare to the first patch), I will do that later.

Please do. I would also like to hear _why_ all counters cannot simply
behave the same. In other words, why can we not simply increase
stat_threshold? Maybe calculate_normal_threshold needs better scaling
for larger machines.
--
Michal Hocko
SUSE Labs

2017-12-20 10:23:45

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size



On 2017年12月20日 18:12, Michal Hocko wrote:
> On Wed 20-12-17 13:52:14, kemi wrote:
>>
>>
>> On 2017年12月19日 20:40, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>> We have seen significant overhead in cache bouncing caused by NUMA counters
>>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>>>> update NUMA counter threshold size")' for more details.
>>>>
>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
>>>> with global counter update using different threshold size for node page
>>>> stats.
>>>
>>> Again, no numbers.
>>
>> Compare to vanilla kernel, I don't think it has performance improvement, so
>> I didn't post performance data here.
>> But, if you would like to see performance gain from enlarging threshold size
>> for NUMA stats (compare to the first patch), I will do that later.
>
> Please do. I would also like to hear _why_ all counters cannot simply
> behave same. In other words why we cannot simply increase
> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
> for larger machines.
>

Agree, we may consider that.
But unlike NUMA counters, which do not affect system decisions, we need to
consider very carefully before increasing stat_threshold for all the counters
on larger machines. BTW, this is another topic that we may discuss in a
different thread.

2017-12-20 10:26:32

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation



On 2017年12月20日 18:06, Michal Hocko wrote:
> On Wed 20-12-17 14:07:35, kemi wrote:
>>
>>
>> On 2017年12月19日 20:43, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
>>>> To avoid deviation, this patch uses node_page_state_snapshot instead of
>>>> node_page_state for node page stats query.
>>>> e.g. cat /proc/zoneinfo
>>>> cat /sys/devices/system/node/node*/vmstat
>>>> cat /sys/devices/system/node/node*/numastat
>>>>
>>>> As it is a slow path and would not be read frequently, I would worry about
>>>> it.
>>>
>>> The changelog doesn't explain why these counters needs any special
>>> treatment. _snapshot variants where used only for internal handling
>>> where the precision really mattered. We do not have any in-tree user and
>>> Jack has removed this by http://lkml.kernel.org/r/[email protected]
>>> which is already sitting in the mmotm tree. We can re-add it but that
>>> would really require a _very good_ reason.
>>>
>>
>> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum deviation is nr*t.
>> Currently, Skylake platform has hundreds of CPUs numbers and the number is still
>> increasing. Also, even the threshold size is kept to 125 at maximum (32765
>> for NUMA counters now), the deviation is just a little too big as I have mentioned in
>> the log. I tend to sum the number in local cpus up when query the global stats.
>
> This is a general problem of pcp accounting. So if it needs to be
> addressed then do it the same way for all stats.
>
>> Also, node_page_state_snapshot is only called in slow path and I don't think that
>> would be a big problem.
>>
>> Anyway, it is a matter of taste. I just think it's better to have.
>
> You are making numastats special and I yet haven't heard any sounds
> arguments for that. But that should be discussed in the respective
> patch.
>

That is because we have a much larger threshold size for NUMA counters, which
means a larger deviation. So the numbers in the local cpus may not simply be ignored.

>>>> Signed-off-by: Kemi Wang <[email protected]>
>>>> ---
>>>> drivers/base/node.c | 17 ++++++++++-------
>>>> mm/vmstat.c | 2 +-
>>>> 2 files changed, 11 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index a045ea1..cf303f8 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>>>> "interleave_hit %lu\n"
>>>> "local_node %lu\n"
>>>> "other_node %lu\n",
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>>>> - node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>>>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id),
>>>> + NUMA_FOREIGN),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id),
>>>> + NUMA_INTERLEAVE_HIT),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
>>>> + node_page_state_snapshot(NODE_DATA(dev->id),
>>>> + NUMA_OTHER));
>>>> }
>>>>
>>>> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>>>> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>>>> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>>>> n += sprintf(buf+n, "%s %lu\n",
>>>> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>>> - node_page_state(pgdat, i));
>>>> + node_page_state_snapshot(pgdat, i));
>>>>
>>>> return n;
>>>> }
>>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>>> index 64e08ae..d65f28d 100644
>>>> --- a/mm/vmstat.c
>>>> +++ b/mm/vmstat.c
>>>> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>>>> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>>>> seq_printf(m, "\n %-12s %lu",
>>>> vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>>> - node_page_state(pgdat, i));
>>>> + node_page_state_snapshot(pgdat, i));
>>>> }
>>>> }
>>>> seq_printf(m,
>>>> --
>>>> 2.7.4
>>>>
>>>
>

Subject: Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation

On Wed, 20 Dec 2017, kemi wrote:

> > You are making numastats special and I yet haven't heard any sounds
> > arguments for that. But that should be discussed in the respective
> > patch.
> >
>
> That is because we have much larger threshold size for NUMA counters, that means larger
> deviation. So, the number in local cpus may not be simply ignored.

Some numbers showing the effect of these changes would be helpful. You can
probably create some in-kernel synthetic tests to start with, which would
allow you to see any significant effects of those changes.

Then run the larger testsuites (e.g. those that Mel has published) and
benchmarks to figure out how the behavior of real apps *may* change?

2017-12-21 01:41:05

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation



On 2017年12月20日 23:58, Christopher Lameter wrote:
> On Wed, 20 Dec 2017, kemi wrote:
>
>>> You are making numastats special and I yet haven't heard any sounds
>>> arguments for that. But that should be discussed in the respective
>>> patch.
>>>
>>
>> That is because we have much larger threshold size for NUMA counters, that means larger
>> deviation. So, the number in local cpus may not be simply ignored.
>
> Some numbers showing the effect of these changes would be helpful. You can
> probably create some in kernel synthetic tests to start with which would
> allow you to see any significant effects of those changes.
>
> Then run the larger testsuites (f.e. those that Mel has published) and
> benchmarks to figure out how behavior of real apps *may* change?
>

OK.
I will do that when I have time.
Let's just drop this patch from the series and consider this issue
in a separate patch.

2017-12-21 08:08:54

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size



On 2017年12月20日 18:12, Michal Hocko wrote:
> On Wed 20-12-17 13:52:14, kemi wrote:
>>
>>
>> On 2017年12月19日 20:40, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>> We have seen significant overhead in cache bouncing caused by NUMA counters
>>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>>>> update NUMA counter threshold size")' for more details.
>>>>
>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
>>>> with global counter update using different threshold size for node page
>>>> stats.
>>>
>>> Again, no numbers.
>>
>> Compare to vanilla kernel, I don't think it has performance improvement, so
>> I didn't post performance data here.
>> But, if you would like to see performance gain from enlarging threshold size
>> for NUMA stats (compare to the first patch), I will do that later.
>
> Please do. I would also like to hear _why_ all counters cannot simply
> behave same. In other words why we cannot simply increase
> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
> for larger machines.
>

I will add this performance data to the changelog in the V3 patch series.

Test machine: 2-socket Skylake platform (112 CPUs, 62G RAM)
Benchmark: page_bench03
Description: 112 threads do single page allocation/deallocation in parallel.

                    before      after (enlarged threshold size)
 CPU cycles         722         379 (-47.5%)

Some thoughts on that:
a) the overhead due to cache bouncing caused by NUMA counter updates in the
fast path increases severely with more and more CPU cores
b) AFAIK, the typical usage scenario (or at least a similar one) that this
optimization can benefit is 10/40G NICs used in the high-speed data center
networks of cloud service providers.

2017-12-21 08:17:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

On Thu 21-12-17 16:06:50, kemi wrote:
>
>
> On 2017年12月20日 18:12, Michal Hocko wrote:
> > On Wed 20-12-17 13:52:14, kemi wrote:
> >>
> >>
> >> On 2017年12月19日 20:40, Michal Hocko wrote:
> >>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
> >>>> We have seen significant overhead in cache bouncing caused by NUMA counters
> >>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
> >>>> update NUMA counter threshold size")' for more details.
> >>>>
> >>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
> >>>> with global counter update using different threshold size for node page
> >>>> stats.
> >>>
> >>> Again, no numbers.
> >>
> >> Compare to vanilla kernel, I don't think it has performance improvement, so
> >> I didn't post performance data here.
> >> But, if you would like to see performance gain from enlarging threshold size
> >> for NUMA stats (compare to the first patch), I will do that later.
> >
> > Please do. I would also like to hear _why_ all counters cannot simply
> > behave same. In other words why we cannot simply increase
> > stat_threshold? Maybe calculate_normal_threshold needs a better scaling
> > for larger machines.
> >
>
> I will add this performance data to changelog in V3 patch series.
>
> Test machine: 2-sockets skylake platform (112 CPUs, 62G RAM)
> Benchmark: page_bench03
> Description: 112 threads do single page allocation/deallocation in parallel.
> before after
> (enlarge threshold size)
> CPU cycles 722 379(-47.5%)

Please describe the numbers some more. Is this an average? What is the
std? Can you see any difference with a more generic workload?

> Some thinking about that:
> a) the overhead due to cache bouncing caused by NUMA counter update in fast path
> severely increase with more and more CPUs cores

What is an effect on a smaller system with fewer CPUs?

> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can
> benefit is 10/40G NIC used in high-speed data center network of cloud service providers.

I would expect those would disable the numa accounting altogether.
--
Michal Hocko
SUSE Labs

2017-12-21 08:25:28

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size



On 2017年12月21日 16:17, Michal Hocko wrote:
> On Thu 21-12-17 16:06:50, kemi wrote:
>>
>>
>> On 2017年12月20日 18:12, Michal Hocko wrote:
>>> On Wed 20-12-17 13:52:14, kemi wrote:
>>>>
>>>>
>>>> On 2017年12月19日 20:40, Michal Hocko wrote:
>>>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>>>> We have seen significant overhead in cache bouncing caused by NUMA counters
>>>>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>>>>>> update NUMA counter threshold size")' for more details.
>>>>>>
>>>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
>>>>>> with global counter update using different threshold size for node page
>>>>>> stats.
>>>>>
>>>>> Again, no numbers.
>>>>
>>>> Compare to vanilla kernel, I don't think it has performance improvement, so
>>>> I didn't post performance data here.
>>>> But, if you would like to see performance gain from enlarging threshold size
>>>> for NUMA stats (compare to the first patch), I will do that later.
>>>
>>> Please do. I would also like to hear _why_ all counters cannot simply
>>> behave same. In other words why we cannot simply increase
>>> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
>>> for larger machines.
>>>
>>
>> I will add this performance data to changelog in V3 patch series.
>>
>> Test machine: 2-sockets skylake platform (112 CPUs, 62G RAM)
>> Benchmark: page_bench03
>> Description: 112 threads do single page allocation/deallocation in parallel.
>> before after
>> (enlarge threshold size)
>> CPU cycles 722 379(-47.5%)
>
> Please describe the numbers some more. Is this an average?

Yes

> What is the std?

I increased the loop count to 10m, so the std is quite low (repeated 3 times)

> Can you see any difference with a more generic workload?
>

I didn't see an obvious improvement for will-it-scale.page_fault1.
Two reasons for that:
1) the code path is too long
2) severe zone lock and lru lock contention (the buddy system is accessed frequently)

>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter update in fast path
>> severely increase with more and more CPUs cores
>
> What is an effect on a smaller system with fewer CPUs?
>

Several CPU cycles can be saved even when using a single thread for that.

>> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can
>> benefit is 10/40G NIC used in high-speed data center network of cloud service providers.
>
> I would expect those would disable the numa accounting altogether.
>

Yes, but it is still worth doing some optimization, isn't it?

2017-12-21 08:59:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

On Thu 21-12-17 16:23:23, kemi wrote:
>
>
> On 2017年12月21日 16:17, Michal Hocko wrote:
[...]
> > Can you see any difference with a more generic workload?
> >
>
> I didn't see obvious improvement for will-it-scale.page_fault1
> Two reasons for that:
> 1) too long code path
> 2) server zone lock and lru lock contention (access to buddy system frequently)

OK. So does the patch help with anything other than a microbenchmark?

> >> Some thinking about that:
> >> a) the overhead due to cache bouncing caused by NUMA counter update in fast path
> >> severely increase with more and more CPUs cores
> >
> > What is an effect on a smaller system with fewer CPUs?
> >
>
> Several CPU cycles can be saved using single thread for that.
>
> >> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can
> >> benefit is 10/40G NIC used in high-speed data center network of cloud service providers.
> >
> > I would expect those would disable the numa accounting altogether.
> >
>
> Yes, but it is still worthy to do some optimization, isn't?

Ohh, I am not opposing optimizations but you should make sure that they
are worth the additional code and special casing. As I've said I am not
convinced special casing numa counters is good. You can play with the
threshold scaling for larger CPU count but let's make sure that the
benefit is really measurable for normal workloads. Special ones will
disable the numa accounting anyway.

Thanks!
--
Michal Hocko
SUSE Labs

2017-12-21 10:33:22

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size



On 2017年12月21日 16:59, Michal Hocko wrote:
> On Thu 21-12-17 16:23:23, kemi wrote:
>>
>>
>> On 2017年12月21日 16:17, Michal Hocko wrote:
> [...]
>>> Can you see any difference with a more generic workload?
>>>
>>
>> I didn't see obvious improvement for will-it-scale.page_fault1
>> Two reasons for that:
>> 1) too long code path
>> 2) server zone lock and lru lock contention (access to buddy system frequently)
>
> OK. So does the patch helps for anything other than a microbenchmark?
>
>>>> Some thinking about that:
>>>> a) the overhead due to cache bouncing caused by NUMA counter update in fast path
>>>> severely increase with more and more CPUs cores
>>>
>>> What is an effect on a smaller system with fewer CPUs?
>>>
>>
>> Several CPU cycles can be saved using single thread for that.
>>
>>>> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can
>>>> benefit is 10/40G NIC used in high-speed data center network of cloud service providers.
>>>
>>> I would expect those would disable the numa accounting altogether.
>>>
>>
>> Yes, but it is still worthy to do some optimization, isn't?
>
> Ohh, I am not opposing optimizations but you should make sure that they
> are worth the additional code and special casing. As I've said I am not
> convinced special casing numa counters is good. You can play with the
> threshold scaling for larger CPU count but let's make sure that the
> benefit is really measurable for normal workloads. Special ones will
> disable the numa accounting anyway.
>

Understood. Could you give me some suggestions for such normal workloads? Thanks.
I will give it a try and post the data ASAP.

Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

On Thu, 21 Dec 2017, kemi wrote:

> Some thinking about that:
> a) the overhead due to cache bouncing caused by NUMA counter update in fast path
> severely increase with more and more CPUs cores
> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can
> benefit is 10/40G NIC used in high-speed data center network of cloud service providers.

I think you are fighting a lost battle there. As evident from the timing
constraints on packet processing at 10/40G, you will have a hard time
processing data if the packets are of regular Ethernet size. And we already
have 100G NICs in operation here.

We can try to get the performance as high as possible, but full-rate high-speed
networking invariably must use offload mechanisms, and thus the
statistics would only be available from the hardware devices that can do
wire-speed processing.

2017-12-22 02:08:51

by Kemi Wang

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size



On 2017-12-22 01:10, Christopher Lameter wrote:
> On Thu, 21 Dec 2017, kemi wrote:
>
>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter updates in the fast path
>> severely increases with more and more CPU cores
>> b) AFAIK, the typical usage scenario (or at least something similar) that this optimization can
>> benefit is 10/40G NICs used in the high-speed data center networks of cloud service providers.
>
> I think you are fighting a lost battle there. As evident from the timing
> constraints on packet processing at 10/40G, you will have a hard time
> processing data if the packets are of regular Ethernet size. And we already
> have 100G NICs in operation here.
>

Not really.
For 10/40G NICs, or even 100G, I admit that DPDK is widely used in data center
networks rather than the kernel driver in production environments.
That's due to the slow page allocator and the long processing pipeline in the
network protocol stack.
It's not easy to change this state in a short time, but if we can do something
here to change it a little, why not?

> We can try to get the performance as high as possible, but full-rate high-speed
> networking invariably must use offload mechanisms, and thus the
> statistics would only be available from the hardware devices that can do
> wire-speed processing.
>

I think you may be talking about something like a SmartNIC (e.g. Open vSwitch
offload + VF passthrough). That's usually used in virtualization environments to
eliminate the overhead of device emulation and of packet processing in a software
virtual switch (OVS or Linux bridge).

What I have done in this patch series is to improve page allocator performance;
that's also helpful in an offload environment (for the guest kernel at least), IMHO.

2017-12-22 12:31:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

On Thu 21-12-17 18:31:19, kemi wrote:
>
>
> On 2017-12-21 16:59, Michal Hocko wrote:
> > On Thu 21-12-17 16:23:23, kemi wrote:
> >>
> >>
> >> On 2017-12-21 16:17, Michal Hocko wrote:
> > [...]
> >>> Can you see any difference with a more generic workload?
> >>>
> >>
> >> I didn't see an obvious improvement for will-it-scale.page_fault1.
> >> Two reasons for that:
> >> 1) too long a code path
> >> 2) severe zone lock and lru lock contention (the buddy system is accessed frequently)
> >
> > OK. So does the patch help with anything other than a microbenchmark?
> >
> >>>> Some thinking about that:
> >>>> a) the overhead due to cache bouncing caused by NUMA counter updates in the fast path
> >>>> severely increases with more and more CPU cores
> >>>
> >>> What is the effect on a smaller system with fewer CPUs?
> >>>
> >>
> >> Several CPU cycles can be saved when using a single thread for that.
> >>
> >>>> b) AFAIK, the typical usage scenario (or at least something similar) that this optimization can
> >>>> benefit is 10/40G NICs used in the high-speed data center networks of cloud service providers.
> >>>
> >>> I would expect those would disable the numa accounting altogether.
> >>>
> >>
> >> Yes, but it is still worthwhile to do some optimization, isn't it?
> >
> > Ohh, I am not opposing optimizations, but you should make sure that they
> > are worth the additional code and special casing. As I've said, I am not
> > convinced that special-casing numa counters is good. You can play with the
> > threshold scaling for larger CPU counts, but let's make sure that the
> > benefit is really measurable for normal workloads. Special ones will
> > disable the numa accounting anyway.
> >
>
> Understood. Could you give me some suggestions for those normal workloads? Thanks.
> I will give it a try and post the data ASAP.

Well, to be honest, I am really confused about what your objective for
these optimizations is, then. I hope we have agreed that workloads which
really need to squeeze every single CPU cycle in the allocation path
will simply disable the whole numa stat thing. I haven't yet heard about
any use case which would really require numa stats and suffer from the
numa stats overhead.

I can see some arguments for better threshold scaling, but that
requires checking a wider range of tests to show there are no unintended
changes. I am not really confident you understand that when you are
asking about "those normal workloads".

So please, try to step back, rethink who you are optimizing for, and act
accordingly. If I were you, I would repost the first patch, which only
integrates the numa stats, because that removes a lot of pointless code and
that is a win of its own.

--
Michal Hocko
SUSE Labs
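
For reference, the runtime disable referred to above is presumably the
vm.numa_stat sysctl merged around this time. A minimal userspace sketch of
turning it off, assuming that interface (equivalent to
"echo 0 > /proc/sys/vm/numa_stat"; write "1" to re-enable):

/* Sketch only: disable NUMA stats collection at runtime via the
 * (assumed) vm.numa_stat sysctl; requires root. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/numa_stat", "w");

	if (!f) {
		perror("/proc/sys/vm/numa_stat");
		return 1;
	}
	fputs("0\n", f);
	return fclose(f) ? 1 : 0;
}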

Subject: Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

On Fri, 22 Dec 2017, kemi wrote:

> > I think you are fighting a lost battle there. As evident from the timing
> > constraints on packet processing at 10/40G, you will have a hard time
> > processing data if the packets are of regular Ethernet size. And we already
> > have 100G NICs in operation here.
> >
>
> Not really.
> For 10/40G NICs, or even 100G, I admit that DPDK is widely used in data center
> networks rather than the kernel driver in production environments.

Shudder. I would rather have a user-space API that is vendor-neutral and
that allows the use of multiple NICs. The Linux kernel has an RDMA
subsystem that does just that.

But the time budget is difficult to deal with even using RDMA or DPDK, where we
can avoid the OS overhead.

> That's due to the slow page allocator and the long processing pipeline in the
> network protocol stack.

Right, the timing budget for processing a single packet gets below a
microsecond at some point, and there it's going to be difficult to do much.
Some aggregation / offloading is required, and that increases as speeds
become higher.
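
To put rough numbers on that budget, a back-of-the-envelope sketch, assuming
line rate and the usual per-frame wire overhead (a 64-byte frame occupies 84
bytes on the wire with preamble and inter-frame gap, a 1500-byte frame 1538
bytes):

/* Rough per-packet time budget at line rate; frame sizes are
 * assumptions that include preamble + inter-frame gap. */
#include <stdio.h>

int main(void)
{
	const double rates_gbps[] = { 10.0, 40.0, 100.0 };
	const double wire_bytes[] = { 84.0, 1538.0 };

	for (int i = 0; i < 3; i++)
		for (int j = 0; j < 2; j++) {
			double pps = rates_gbps[i] * 1e9 / (wire_bytes[j] * 8.0);
			printf("%5.0fG, %6.0fB on wire: %10.0f pps, %8.1f ns/packet\n",
			       rates_gbps[i], wire_bytes[j], pps, 1e9 / pps);
		}
	return 0;
}

Even full-size frames leave only about 1.2 us at 10G and a few hundred
nanoseconds at 40/100G, and minimum-size frames a few tens of nanoseconds or
less, which is the point about needing aggregation and offload.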

> It's not easy to change this state in a short time, but if we can do something
> here to change it a little, why not?

How much of an improvement is this going to be? If it is significant, then
by all means let's do it.

> > We can try to get the performance as high as possible, but full-rate high-speed
> > networking invariably must use offload mechanisms, and thus the
> > statistics would only be available from the hardware devices that can do
> > wire-speed processing.
> >
>
> I think you may be talking about something like a SmartNIC (e.g. Open vSwitch
> offload + VF passthrough). That's usually used in virtualization environments to
> eliminate the overhead of device emulation and of packet processing in a software
> virtual switch (OVS or Linux bridge).

The switch offloads can also be used elsewhere. Also, the RDMA subsystem
has counters like that.