2005-12-06 18:28:56

by Christoph Lameter

Subject: [RFC 1/3] Framework for accurate node based statistics

[RFC] Framework for accurate node based statistics

Currently we have various vm counters that are split per cpu. This arrangement
does not allow access to the per node statistics that are important for
optimizing VM behavior on NUMA architectures. All the per_cpu differential
variables can tell us is how much a counter was changed by a given cpu; they
cannot tell us how many pages of a certain type each node holds.

This patch introduces a generic framework for accurate per node vm
statistics through a large per node and per cpu array. The numbers are
consolidated into global and per node counters when the slab drainer runs
(every 3 seconds or so). VM functions can then check these statistics by
simply accessing the node specific or global counter.
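
For illustration, users of the framework then do plain reads like this
(sketch only; NR_MAPPED is only introduced in the second patch, and the
values may be up to one refresh interval stale):

        /* System wide count, no locks or atomics on the read side */
        unsigned long total_mapped = vm_stat_global[NR_MAPPED];

        /* Count for a single node */
        unsigned long node_mapped = vm_stat_node[nid][NR_MAPPED];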

A significant problem with this approach is that the statistics are only
accumulated every 3 seconds or so. I have tried various other approaches,
but they typically end up adding atomic operations to critical VM paths.
I'd be glad if someone else had a bright idea on how to improve the
situation.

Two patches follow that convert two important counters to work per node;
many more may prove useful in the future.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc3/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc3.orig/mm/page_alloc.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/mm/page_alloc.c 2005-12-01 00:38:05.000000000 -0800
@@ -557,6 +557,33 @@ static int rmqueue_bulk(struct zone *zon
}

#ifdef CONFIG_NUMA
+static DEFINE_SPINLOCK(node_stat_lock);
+unsigned long vm_stat_global[NR_STAT_ITEMS];
+unsigned long vm_stat_node[MAX_NUMNODES][NR_STAT_ITEMS];
+int vm_stat_diff[NR_CPUS][MAX_NUMNODES][NR_STAT_ITEMS];
+
+void refresh_vm_stats(void) {
+ int cpu;
+ int node;
+ int i;
+
+ spin_lock(&node_stat_lock);
+
+ cpu = get_cpu();
+ for_each_online_node(node)
+ for(i = 0; i < NR_STAT_ITEMS; i++) {
+ int * p = vm_stat_diff[cpu][node]+i;
+ if (*p) {
+ vm_stat_node[node][i] += *p;
+ vm_stat_global[i] += *p;
+ *p = 0;
+ }
+ }
+ put_cpu();
+
+ spin_unlock(&node_stat_lock);
+}
+
/* Called from the slab reaper to drain remote pagesets */
void drain_remote_pages(void)
{
Index: linux-2.6.15-rc3/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc3.orig/include/linux/page-flags.h 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/include/linux/page-flags.h 2005-12-01 00:35:38.000000000 -0800
@@ -163,6 +163,27 @@ extern void __mod_page_state(unsigned lo
} while (0)

/*
+ * Node based accounting with per cpu differentials.
+ */
+enum node_stat_item { };
+#define NR_STAT_ITEMS 0
+
+extern unsigned long vm_stat_global[NR_STAT_ITEMS];
+extern unsigned long vm_stat_node[MAX_NUMNODES][NR_STAT_ITEMS];
+extern int vm_stat_diff[NR_CPUS][MAX_NUMNODES][NR_STAT_ITEMS];
+
+static inline void mod_node_page_state(int node, enum node_stat_item item, int delta)
+{
+ vm_stat_diff[get_cpu()][node][item] += delta;
+ put_cpu();
+}
+
+#define inc_node_page_state(node, item) mod_node_page_state(node, item, 1)
+#define dec_node_page_state(node, item) mod_node_page_state(node, item, -1)
+#define add_node_page_state(node, item, delta) mod_node_page_state(node, item, delta)
+#define sub_node_page_state(node, item, delta) mod_node_page_state(node, item, -(delta))
+
+/*
* Manipulation of page state flags
*/
#define PageLocked(page) \
Index: linux-2.6.15-rc3/include/linux/gfp.h
===================================================================
--- linux-2.6.15-rc3.orig/include/linux/gfp.h 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/include/linux/gfp.h 2005-12-01 00:34:02.000000000 -0800
@@ -153,8 +153,10 @@ extern void FASTCALL(free_cold_page(stru
void page_alloc_init(void);
#ifdef CONFIG_NUMA
void drain_remote_pages(void);
+void refresh_vm_stats(void);
#else
static inline void drain_remote_pages(void) { };
+static inline void refresh_vm_stats(void) { }
#endif

#endif /* __LINUX_GFP_H */
Index: linux-2.6.15-rc3/mm/slab.c
===================================================================
--- linux-2.6.15-rc3.orig/mm/slab.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/mm/slab.c 2005-12-01 00:34:02.000000000 -0800
@@ -3359,6 +3359,7 @@ next:
check_irq_on();
up(&cache_chain_sem);
drain_remote_pages();
+ refresh_vm_stats();
/* Setup the next iteration */
schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC);
}


2005-12-06 18:29:18

by Christoph Lameter

Subject: [RFC 3/3] Make nr_pagecache a per node counter

Make nr_pagecache a per node variable

The nr_pagecache atomic variable is a particularly ugly spot in the VM right
now. We ultimately need only a more or less accurate value. This patch makes
nr_pagecache conform to the other per node VM statistics.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc5/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/page-flags.h 2005-12-06 10:13:49.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/page-flags.h 2005-12-06 10:15:59.000000000 -0800
@@ -164,8 +164,8 @@ extern void __mod_page_state(unsigned lo
/*
* Node based accounting with per cpu differentials.
*/
-enum node_stat_item { NR_MAPPED };
-#define NR_STAT_ITEMS 1
+enum node_stat_item { NR_MAPPED, NR_PAGECACHE };
+#define NR_STAT_ITEMS 2

extern unsigned long vm_stat_global[NR_STAT_ITEMS];
extern unsigned long vm_stat_node[MAX_NUMNODES][NR_STAT_ITEMS];
Index: linux-2.6.15-rc5/include/linux/pagemap.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/pagemap.h 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/pagemap.h 2005-12-06 10:15:59.000000000 -0800
@@ -99,49 +99,9 @@ int add_to_page_cache_lru(struct page *p
extern void remove_from_page_cache(struct page *page);
extern void __remove_from_page_cache(struct page *page);

-extern atomic_t nr_pagecache;
-
-#ifdef CONFIG_SMP
-
-#define PAGECACHE_ACCT_THRESHOLD max(16, NR_CPUS * 2)
-DECLARE_PER_CPU(long, nr_pagecache_local);
-
-/*
- * pagecache_acct implements approximate accounting for pagecache.
- * vm_enough_memory() do not need high accuracy. Writers will keep
- * an offset in their per-cpu arena and will spill that into the
- * global count whenever the absolute value of the local count
- * exceeds the counter's threshold.
- *
- * MUST be protected from preemption.
- * current protection is mapping->page_lock.
- */
-static inline void pagecache_acct(int count)
-{
- long *local;
-
- local = &__get_cpu_var(nr_pagecache_local);
- *local += count;
- if (*local > PAGECACHE_ACCT_THRESHOLD || *local < -PAGECACHE_ACCT_THRESHOLD) {
- atomic_add(*local, &nr_pagecache);
- *local = 0;
- }
-}
-
-#else
-
-static inline void pagecache_acct(int count)
-{
- atomic_add(count, &nr_pagecache);
-}
-#endif
-
static inline unsigned long get_page_cache_size(void)
{
- int ret = atomic_read(&nr_pagecache);
- if (unlikely(ret < 0))
- ret = 0;
- return ret;
+ return vm_stat_global[NR_PAGECACHE];
}

/*
Index: linux-2.6.15-rc5/mm/swap_state.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/swap_state.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/swap_state.c 2005-12-06 10:15:59.000000000 -0800
@@ -84,7 +84,7 @@ static int __add_to_swap_cache(struct pa
SetPageSwapCache(page);
set_page_private(page, entry.val);
total_swapcache_pages++;
- pagecache_acct(1);
+ inc_node_page_state(page_to_nid(page), NR_PAGECACHE);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
@@ -129,7 +129,7 @@ void __delete_from_swap_cache(struct pag
set_page_private(page, 0);
ClearPageSwapCache(page);
total_swapcache_pages--;
- pagecache_acct(-1);
+ dec_node_page_state(page_to_nid(page), NR_PAGECACHE);
INC_CACHE_INFO(del_total);
}

Index: linux-2.6.15-rc5/mm/filemap.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/filemap.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/filemap.c 2005-12-06 10:15:59.000000000 -0800
@@ -115,7 +115,7 @@ void __remove_from_page_cache(struct pag
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
- pagecache_acct(-1);
+ dec_node_page_state(page_to_nid(page), NR_PAGECACHE);
}

void remove_from_page_cache(struct page *page)
@@ -390,7 +390,7 @@ int add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = offset;
mapping->nrpages++;
- pagecache_acct(1);
+ inc_node_page_state(page_to_nid(page), NR_PAGECACHE);
}
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();

2005-12-06 18:29:07

by Christoph Lameter

Subject: [RFC 2/3] Make nr_mapped a per node counter

Make nr_mapped a per node counter

A per node nr_mapped counter is important because it allows us to determine
how many pages of a node are not mapped, which in turn enables a more
efficient means of deciding when a node should reclaim memory.
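
For example, zone reclaim could then estimate a node's unmapped pages with a
plain read (sketch only; node_present_pages() is used here as shorthand for
NODE_DATA(nid)->node_present_pages):

        /* Pages on node nid that are not mapped into page tables */
        unsigned long unmapped = node_present_pages(nid) -
                                        vm_stat_node[nid][NR_MAPPED];

A node with many unmapped pages is a good candidate for local reclaim.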

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc3/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc3.orig/include/linux/page-flags.h 2005-12-01 00:35:38.000000000 -0800
+++ linux-2.6.15-rc3/include/linux/page-flags.h 2005-12-01 00:35:49.000000000 -0800
@@ -85,7 +85,6 @@ struct page_state {
unsigned long nr_writeback; /* Pages under writeback */
unsigned long nr_unstable; /* NFS unstable pages */
unsigned long nr_page_table_pages;/* Pages used for pagetables */
- unsigned long nr_mapped; /* mapped into pagetables */
unsigned long nr_slab; /* In slab */
#define GET_PAGE_STATE_LAST nr_slab

@@ -165,8 +164,8 @@ extern void __mod_page_state(unsigned lo
/*
* Node based accounting with per cpu differentials.
*/
-enum node_stat_item { };
-#define NR_STAT_ITEMS 0
+enum node_stat_item { NR_MAPPED };
+#define NR_STAT_ITEMS 1

extern unsigned long vm_stat_global[NR_STAT_ITEMS];
extern unsigned long vm_stat_node[MAX_NUMNODES][NR_STAT_ITEMS];
Index: linux-2.6.15-rc3/drivers/base/node.c
===================================================================
--- linux-2.6.15-rc3.orig/drivers/base/node.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/drivers/base/node.c 2005-12-01 00:35:49.000000000 -0800
@@ -53,8 +53,6 @@ static ssize_t node_read_meminfo(struct
ps.nr_dirty = 0;
if ((long)ps.nr_writeback < 0)
ps.nr_writeback = 0;
- if ((long)ps.nr_mapped < 0)
- ps.nr_mapped = 0;
if ((long)ps.nr_slab < 0)
ps.nr_slab = 0;

@@ -83,7 +81,7 @@ static ssize_t node_read_meminfo(struct
nid, K(i.freeram - i.freehigh),
nid, K(ps.nr_dirty),
nid, K(ps.nr_writeback),
- nid, K(ps.nr_mapped),
+ nid, K(vm_stat_node[nid][NR_MAPPED]),
nid, K(ps.nr_slab));
n += hugetlb_report_node_meminfo(nid, buf + n);
return n;
Index: linux-2.6.15-rc3/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.15-rc3.orig/fs/proc/proc_misc.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/fs/proc/proc_misc.c 2005-12-01 00:35:49.000000000 -0800
@@ -190,7 +190,7 @@ static int meminfo_read_proc(char *page,
K(i.freeswap),
K(ps.nr_dirty),
K(ps.nr_writeback),
- K(ps.nr_mapped),
+ K(vm_stat_global[NR_MAPPED]),
K(ps.nr_slab),
K(allowed),
K(committed),
Index: linux-2.6.15-rc3/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc3.orig/mm/vmscan.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/mm/vmscan.c 2005-12-01 00:35:49.000000000 -0800
@@ -967,7 +967,7 @@ int try_to_free_pages(struct zone **zone
}

for (priority = DEF_PRIORITY; priority >= 0; priority--) {
- sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_mapped = vm_stat_global[NR_MAPPED];
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
sc.priority = priority;
@@ -1056,7 +1056,7 @@ loop_again:
sc.gfp_mask = GFP_KERNEL;
sc.may_writepage = 0;
sc.may_swap = 1;
- sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_mapped = vm_stat_global[NR_MAPPED];

inc_page_state(pageoutrun);

@@ -1373,7 +1373,7 @@ int zone_reclaim(struct zone *zone, gfp_
sc.gfp_mask = gfp_mask;
sc.may_writepage = 0;
sc.may_swap = 0;
- sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_mapped = vm_stat_global[NR_MAPPED];
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
/* scan at the highest priority */
Index: linux-2.6.15-rc3/mm/page-writeback.c
===================================================================
--- linux-2.6.15-rc3.orig/mm/page-writeback.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/mm/page-writeback.c 2005-12-01 00:35:49.000000000 -0800
@@ -111,7 +111,7 @@ static void get_writeback_state(struct w
{
wbs->nr_dirty = read_page_state(nr_dirty);
wbs->nr_unstable = read_page_state(nr_unstable);
- wbs->nr_mapped = read_page_state(nr_mapped);
+ wbs->nr_mapped = vm_stat_global[NR_MAPPED];
wbs->nr_writeback = read_page_state(nr_writeback);
}

Index: linux-2.6.15-rc3/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc3.orig/mm/page_alloc.c 2005-12-01 00:34:02.000000000 -0800
+++ linux-2.6.15-rc3/mm/page_alloc.c 2005-12-01 00:35:49.000000000 -0800
@@ -1400,7 +1400,7 @@ void show_free_areas(void)
ps.nr_unstable,
nr_free_pages(),
ps.nr_slab,
- ps.nr_mapped,
+ vm_stat_global[NR_MAPPED],
ps.nr_page_table_pages);

for_each_zone(zone) {
Index: linux-2.6.15-rc3/mm/rmap.c
===================================================================
--- linux-2.6.15-rc3.orig/mm/rmap.c 2005-11-28 19:51:27.000000000 -0800
+++ linux-2.6.15-rc3/mm/rmap.c 2005-12-01 00:35:49.000000000 -0800
@@ -454,7 +454,7 @@ void page_add_anon_rmap(struct page *pag

page->index = linear_page_index(vma, address);

- inc_page_state(nr_mapped);
+ inc_node_page_state(page_to_nid(page), NR_MAPPED);
}
/* else checking page index and mapping is racy */
}
@@ -471,7 +471,7 @@ void page_add_file_rmap(struct page *pag
BUG_ON(!pfn_valid(page_to_pfn(page)));

if (atomic_inc_and_test(&page->_mapcount))
- inc_page_state(nr_mapped);
+ inc_node_page_state(page_to_nid(page), NR_MAPPED);
}

/**
@@ -495,7 +495,7 @@ void page_remove_rmap(struct page *page)
*/
if (page_test_and_clear_dirty(page))
set_page_dirty(page);
- dec_page_state(nr_mapped);
+ dec_node_page_state(page_to_nid(page), NR_MAPPED);
}
}

2005-12-06 18:35:33

by Andi Kleen

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

> +static inline void mod_node_page_state(int node, enum node_stat_item item, int delta)
> +{
> + vm_stat_diff[get_cpu()][node][item] += delta;
> + put_cpu();

Instead of get/put_cpu I would use a local_t. This would give much better code
on i386/x86-64. I have some plans to port all the MM statistics counters
over to local_t, still stuck, but for new code it should definitely be done.
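
Roughly, the increment would then collapse to a single operation with no
get_cpu()/put_cpu() pair (sketch, assuming the diff array becomes a per cpu
array of local_t):

static inline void mod_node_page_state(int node, enum node_stat_item item, int delta)
{
        /* cpu_local_add() resolves to a single local rmw on i386/x86-64 */
        cpu_local_add(delta, vm_stat_diff[node][item]);
}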

-Andi

2005-12-06 19:08:55

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Tue, 6 Dec 2005, Andi Kleen wrote:

> > +static inline void mod_node_page_state(int node, enum node_stat_item item, int delta)
> > +{
> > + vm_stat_diff[get_cpu()][node][item] += delta;
> > + put_cpu();
>
> Instead of get/put_cpu I would use a local_t. This would give much better code
> on i386/x86-64. I have some plans to port all the MM statistics counters
> over to local_t, still stuck, but for new code it should definitely be done.

Yuck. That code uses atomic operations and is not aware of atomic64_t.

2005-12-06 19:26:08

by Andi Kleen

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Tue, Dec 06, 2005 at 11:08:42AM -0800, Christoph Lameter wrote:
> On Tue, 6 Dec 2005, Andi Kleen wrote:
>
> > > +static inline void mod_node_page_state(int node, enum node_stat_item item, int delta)
> > > +{
> > > + vm_stat_diff[get_cpu()][node][item] += delta;
> > > + put_cpu();
> >
> > Instead of get/put_cpu I would use a local_t. This would give much better code
> > on i386/x86-64. I have some plans to port all the MM statistics counters
> > over to local_t, still stuck, but for new code it should definitely be done.
>
> Yuck. That code uses atomic operations and is not aware of atomic64_t.

Hmm? What code are you looking at?

At least i386/x86-64/generic don't use any atomic operations, just
normal non atomic on bus but atomic for interrupts local rmw.

Do you actually need 64bit?

-Andi

2005-12-06 19:36:57

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Tue, 6 Dec 2005, Andi Kleen wrote:

> > Yuck. That code uses atomic operations and is not aware of atomic64_t.
> Hmm? What code are you looking at?
include/asm-generic/local.h. This is the default, right? And
include/asm-ia64/local.h.

> At least i386/x86-64/generic don't use any bus locked atomic operations,
> just a normal local rmw that is non atomic on the bus but atomic with
> respect to interrupts.

inc/dec are atomic by default on x86_64?

> Do you actually need 64bit?

32 bit limits us in the worst case to 8 terabytes of RAM (assuming a very
small page size of 4k and 31 bits available for an atomic variable
[sparc]: 2^31 * 4k = 8TB). SGI already has installations with 15 terabytes
of RAM.

2005-12-06 20:06:18

by Andi Kleen

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Tue, Dec 06, 2005 at 11:36:43AM -0800, Christoph Lameter wrote:
> On Tue, 6 Dec 2005, Andi Kleen wrote:
>
> > > Yuck. That code uses atomic operations and is not aware of atomic64_t.
> > Hmm? What code are you looking at?
> include/asm-generic/local.h. this is the default right? And
> include/asm-ia64/local.h.
>
> > At least i386/x86-64/generic don't use any bus locked atomic operations,
> > just a normal local rmw that is non atomic on the bus but atomic with
> > respect to interrupts.
>
> inc/dec are atomic by default on x86_64?

They are atomic against interrupts on the same CPU. And on Linux
also atomic against preempt moving you to another CPU. And all that
without the cost of a bus lock. And that is what local_t is about.
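
For reference, the i386 version is essentially a plain rmw without the lock
prefix (quoted from include/asm-i386/local.h from memory, so treat as
approximate):

static __inline__ void local_inc(local_t *v)
{
        /* A single instruction: atomic wrt interrupts on this CPU,
           but no "lock" prefix, so no bus lock cost. */
        __asm__ __volatile__(
                "incl %0"
                :"=m" (v->counter)
                :"m" (v->counter));
}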

>
> > Do you actually need 64bit?
>
> 32 bit limits us in the worst case to 8 terabytes of RAM (assuming a very
> small page size of 4k and 31 bits available for an atomic variable
> [sparc]: 2^31 * 4k = 8TB). SGI already has installations with 15 terabytes
> of RAM.

Ok we'll need a local64_t then. No big deal - can be easily added.
Or perhaps better a long_local_t so that 32bit doesn't need to
pay the cost.

-Andi

2005-12-06 22:52:49

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Tue, 6 Dec 2005, Andi Kleen wrote:

> Ok we'll need a local64_t then. No big deal - can be easily added.
> Or perhaps better a long_local_t so that 32bit doesn't need to
> pay the cost.

I just saw that ia64 already has local_t as 64 bit, so that is no problem
for us. Here is a patch that would convert the framework to use local_t.
Is that okay?

The problem with this solution is that the use of local_t will lead to
atomic operations whenever the preemption status is unknown. It may be
better to use atomic operations outright and simply drop the per_cpu stuff.
That way the summing of the per cpu variables is avoided and the stats are
accurate in real time.
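
The atomic variant would reduce to something like this (sketch; the arrays
would have to become atomic64_t on 64 bit to cope with the large memory
configurations discussed above):

static inline void mod_node_page_state(int node, enum node_stat_item item, int delta)
{
        /* Two bus locked operations in the hot path, but no deferred sum
           and no staleness on the read side. */
        atomic64_add(delta, &vm_stat_node[node][item]);
        atomic64_add(delta, &vm_stat_global[item]);
}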

Seems that local.h is rarely used. There was an obvious mistake in there
for ia64.

Index: linux-2.6.15-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/page_alloc.c 2005-12-06 10:13:49.000000000 -0800
+++ linux-2.6.15-rc5/mm/page_alloc.c 2005-12-06 14:43:41.000000000 -0800
@@ -560,26 +560,25 @@ static int rmqueue_bulk(struct zone *zon
static DEFINE_SPINLOCK(node_stat_lock);
unsigned long vm_stat_global[NR_STAT_ITEMS];
unsigned long vm_stat_node[MAX_NUMNODES][NR_STAT_ITEMS];
-int vm_stat_diff[NR_CPUS][MAX_NUMNODES][NR_STAT_ITEMS];
+DEFINE_PER_CPU(local_t [MAX_NUMNODES][NR_STAT_ITEMS], vm_stat_diff);

void refresh_vm_stats(void) {
- int cpu;
int node;
int i;

spin_lock(&node_stat_lock);

- cpu = get_cpu();
for_each_online_node(node)
for(i = 0; i < NR_STAT_ITEMS; i++) {
- int * p = vm_stat_diff[cpu][node]+i;
- if (*p) {
- vm_stat_node[node][i] += *p;
- vm_stat_global[i] += *p;
- *p = 0;
+ long v;
+
+ v = cpu_local_read(vm_stat_diff[node][i]);
+ if (v) {
+ vm_stat_node[node][i] += v;
+ vm_stat_global[i] += v;
+ cpu_local_set(vm_stat_diff[node][i], 0);
}
}
- put_cpu();

spin_unlock(&node_stat_lock);
}
Index: linux-2.6.15-rc5/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/page-flags.h 2005-12-06 10:15:59.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/page-flags.h 2005-12-06 14:47:03.000000000 -0800
@@ -8,6 +8,7 @@
#include <linux/percpu.h>
#include <linux/cache.h>
#include <asm/pgtable.h>
+#include <asm/local.h>

/*
* Various page->flags bits:
@@ -169,12 +170,19 @@ enum node_stat_item { NR_MAPPED, NR_PAGE

extern unsigned long vm_stat_global[NR_STAT_ITEMS];
extern unsigned long vm_stat_node[MAX_NUMNODES][NR_STAT_ITEMS];
-extern int vm_stat_diff[NR_CPUS][MAX_NUMNODES][NR_STAT_ITEMS];
+DECLARE_PER_CPU(local_t [MAX_NUMNODES][NR_STAT_ITEMS], vm_stat_diff);

static inline void mod_node_page_state(int node, enum node_stat_item item, int delta)
{
- vm_stat_diff[get_cpu()][node][item] += delta;
- put_cpu();
+ cpu_local_add(delta, vm_stat_diff[node][item]);
+}
+
+/*
+ * For use when we know that preemption is disabled. Avoids atomic operations.
+ */
+static inline void __mod_node_page_state(int node, enum node_stat_item item, int delta)
+{
+ __local_add(delta, &__get_cpu_var(vm_stat_diff[node][item]));
}

#define inc_node_page_state(node, item) mod_node_page_state(node, item, 1)
Index: linux-2.6.15-rc5/include/asm-ia64/local.h
===================================================================
--- linux-2.6.15-rc5.orig/include/asm-ia64/local.h 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/include/asm-ia64/local.h 2005-12-06 14:39:47.000000000 -0800
@@ -17,7 +17,7 @@ typedef struct {
#define local_set(l, i) atomic64_set(&(l)->val, i)
#define local_inc(l) atomic64_inc(&(l)->val)
#define local_dec(l) atomic64_dec(&(l)->val)
-#define local_add(l) atomic64_add(&(l)->val)
+#define local_add(i, l) atomic64_add((i), &(l)->val)
#define local_sub(l) atomic64_sub(&(l)->val)

/* Non-atomic variants, i.e., preemption disabled and won't be touched in interrupt, etc. */

2005-12-06 23:05:50

by Nick Piggin

Subject: Re: [RFC 2/3] Make nr_mapped a per node counter

Christoph Lameter wrote:
> Make nr_mapped a per node counter
>
> A per node nr_mapped counter is important because it allows us to determine
> how many pages of a node are not mapped, which in turn enables a more
> efficient means of deciding when a node should reclaim memory.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> [...]
> Index: linux-2.6.15-rc3/mm/vmscan.c
> ===================================================================
> --- linux-2.6.15-rc3.orig/mm/vmscan.c 2005-11-28 19:51:27.000000000 -0800
> +++ linux-2.6.15-rc3/mm/vmscan.c 2005-12-01 00:35:49.000000000 -0800
> @@ -967,7 +967,7 @@ int try_to_free_pages(struct zone **zone
> }
>
> for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> - sc.nr_mapped = read_page_state(nr_mapped);
> + sc.nr_mapped = vm_stat_global[NR_MAPPED];
> sc.nr_scanned = 0;
> sc.nr_reclaimed = 0;
> sc.priority = priority;
> @@ -1056,7 +1056,7 @@ loop_again:
> sc.gfp_mask = GFP_KERNEL;
> sc.may_writepage = 0;
> sc.may_swap = 1;
> - sc.nr_mapped = read_page_state(nr_mapped);
> + sc.nr_mapped = vm_stat_global[NR_MAPPED];
>

Any chance you can wrap these in macros? (something like read_page_node_state())

I gather Andrew did this so that they can easily be defined out for things
that don't want them (maybe, embedded systems).

--
SUSE Labs, Novell Inc.


2005-12-06 23:08:46

by Nick Piggin

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

Christoph Lameter wrote:
> [RFC] Framework for accurate node based statistics
>
> Currently we have various vm counters that are split per cpu. This arrangement
> does not allow access to the per node statistics that are important for
> optimizing VM behavior on NUMA architectures. All the per_cpu differential
> variables can tell us is how much a counter was changed by a given cpu; they
> cannot tell us how many pages of a certain type each node holds.
>
> This patch introduces a generic framework for accurate per node vm
> statistics through a large per node and per cpu array. The numbers are
> consolidated into global and per node counters when the slab drainer runs
> (every 3 seconds or so). VM functions can then check these statistics by
> simply accessing the node specific or global counter.
>
> A significant problem with this approach is that the statistics are only
> accumulated every 3 seconds or so. I have tried various other approaches,
> but they typically end up adding atomic operations to critical VM paths.
> I'd be glad if someone else had a bright idea on how to improve the
> situation.
>

Why not have per-node * per-cpu counters?

Or even use the current per-zone * per-cpu counters, and work out your
node details from there?

--
SUSE Labs, Novell Inc.


2005-12-06 23:37:15

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Wed, 7 Dec 2005, Nick Piggin wrote:

> Why not have per-node * per-cpu counters?

Yes, that is exactly what this patch implements.

> Or even use the current per-zone * per-cpu counters, and work out your
> node details from there?

I am not aware of any per-zone per cpu counters.

2005-12-06 23:40:47

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Tue, 6 Dec 2005, Christoph Lameter wrote:

> I am not aware of any per-zone per cpu counters.

Argh. Wrong. Yes, there are counters in the per cpu structures for each
zone. The counters here could be folded into those, which would give us
zone based statistics; those may be better than per node statistics for
making decisions about memory in a zone.
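
For reference, these are the NUMA counters already carried in struct
per_cpu_pageset (quoted from 2.6.15 include/linux/mmzone.h from memory, so
treat as approximate):

struct per_cpu_pageset {
        struct per_cpu_pages pcp[2];    /* 0: hot.  1: cold */
#ifdef CONFIG_NUMA
        unsigned long numa_hit;         /* allocated in intended node */
        unsigned long numa_miss;        /* allocated in non intended node */
        unsigned long numa_foreign;     /* was intended here, hit elsewhere */
        unsigned long interleave_hit;   /* interleaver preferred this zone */
        unsigned long local_node;       /* allocation from local node */
        unsigned long other_node;       /* allocation from other node */
#endif
} ____cacheline_aligned_in_smp;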

2005-12-07 05:51:25

by Keith Owens

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Tue, 6 Dec 2005 14:52:33 -0800 (PST),
Christoph Lameter <[email protected]> wrote:
>+DEFINE_PER_CPU(local_t [MAX_NUMNODES][NR_STAT_ITEMS], vm_stat_diff);

How big is that array going to get? The total per cpu data area is
limited to 64K on IA64 and we already use at least 34K.

2005-12-07 06:44:24

by Nick Piggin

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

Christoph Lameter wrote:
> On Wed, 7 Dec 2005, Nick Piggin wrote:
>
>
>>Why not have per-node * per-cpu counters?
>
>
> Yes, that is exactly what this patch implements.
>

Sorry, I think I meant: why don't you just use the "add all counters
from all per-cpu of the node" in order to find the node-statistic?

I.e. like the node based page_state statistics that we already have.

--
SUSE Labs, Novell Inc.


2005-12-07 18:25:29

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Wed, 7 Dec 2005, Keith Owens wrote:

> On Tue, 6 Dec 2005 14:52:33 -0800 (PST),
> Christoph Lameter <[email protected]> wrote:
> >+DEFINE_PER_CPU(local_t [MAX_NUMNODES][NR_STAT_ITEMS], vm_stat_diff);
>
> How big is that array going to get? The total per cpu data area is
> limited to 64K on IA64 and we already use at least 34K.

Maximum is around 1k nodes, and I guess we may end up with 16 counters:

1024 * 16 * 8 = 131072 bytes, i.e. 128k per cpu?

2005-12-07 18:27:15

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Wed, 7 Dec 2005, Nick Piggin wrote:

> Sorry, I think I meant: why don't you just use the "add all counters
> from all per-cpu of the node" in order to find the node-statistic?

which function is that?

2005-12-07 18:40:17

by Tony Luck

Subject: RE: [RFC 1/3] Framework for accurate node based statistics

>> How big is that array going to get? The total per cpu data area is
>> limited to 64K on IA64 and we already use at least 34K.
>
> Maximum is around 1k nodes, and I guess we may end up with 16 counters:
>
> 1024 * 16 * 8 = 131072 bytes, i.e. 128k per cpu?

Ouch.

Can you live with a pointer to that monster block of space in the
per-cpu area?

Otherwise the next step up is a 256K per cpu area ... which I wouldn't
want to make the default (so we'll have another 2*X explosion in the
number of possible configs to test).

-Tony

2005-12-07 18:48:29

by Christoph Lameter

Subject: RE: [RFC 1/3] Framework for accurate node based statistics

On Wed, 7 Dec 2005, Luck, Tony wrote:

> Can you live with a pointer to that monster block of space in the
> per-cpu area?
>
> Otherwise the next step up is a 256K per cpu area ... which I wouldn't
> want to make the default (so we'll have another 2*X explosion in the
> number of possible configs to test).

Let's wait. I just did this to show how this could be implemented with
local_t. This is an RFC and the major problems (e.g. the 3 second delay)
have not been addressed, so this is all vaporware for now.

2005-12-07 22:59:46

by Nick Piggin

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

Christoph Lameter wrote:
> On Wed, 7 Dec 2005, Nick Piggin wrote:
>
>
>>Sorry, I think I meant: why don't you just use the "add all counters
>>from all per-cpu of the node" in order to find the node-statistic?
>
>
> which function is that?
>

I'm thinking of get_page_state_node... but that's not quite the same
thing. I guess sum all per-CPU counters from all zones in the node,
but that's going to be costly on big machines.

So I'm not sure, I guess I don't have any bright ideas... there is the
batching approach used by current pagecache_acct - is something like
that not sufficient either?

--
SUSE Labs, Novell Inc.


2005-12-08 00:02:53

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Thu, 8 Dec 2005, Nick Piggin wrote:

> Christoph Lameter wrote:
> > On Wed, 7 Dec 2005, Nick Piggin wrote:
> > > Sorry, I think I meant: why don't you just use the "add all counters
> > > from all per-cpu of the node" in order to find the node-statistic?
> > which function is that?
> >
>
> I'm thinking of get_page_state_node... but that's not quite the same
> thing. I guess sum all per-CPU counters from all zones in the node,
> but that's going to be costly on big machines.

The per cpu counters track when a cpu did an allocation. They do not track
on which node the allocation was done and are therefore not useful for
determining the memory use of one node.

> So I'm not sure, I guess I don't have any bright ideas... there is the
> batching approach used by current pagecache_acct - is something like
> that not sufficient either?

The framework provides a similar approach by keeping differential
counters for each processor.

2005-12-08 00:13:13

by Nick Piggin

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

Christoph Lameter wrote:
> On Thu, 8 Dec 2005, Nick Piggin wrote:
>
>
>>Christoph Lameter wrote:
>>
>>>On Wed, 7 Dec 2005, Nick Piggin wrote:
>>>
>>>>Sorry, I think I meant: why don't you just use the "add all counters
>>>>from all per-cpu of the node" in order to find the node-statistic?
>>>
>>>which function is that?
>>>
>>
>>I'm thinking of get_page_state_node... but that's not quite the same
>>thing. I guess sum all per-CPU counters from all zones in the node,
>>but that's going to be costly on big machines.
>
>
> The per cpu counters track when a cpu did an allocation. They do not track
> on which node the allocation was done and are therefore not useful for
> determining the memory use of one node.
>

Yes, not that exact function of course.

>
>>So I'm not sure, I guess I don't have any bright ideas... there is the
>>batching approach used by current pagecache_acct - is something like
>>that not sufficient either?
>
>
> The framework provides a similar approach by keeping differential
> counters for each processor.
>

But the accounting delay has an unbounded error problem that the batching
approach does not have.

--
SUSE Labs, Novell Inc.


2005-12-08 00:35:32

by Christoph Lameter

Subject: Re: [RFC 1/3] Framework for accurate node based statistics

On Thu, 8 Dec 2005, Nick Piggin wrote:

> > The framework provides a similar approach by keeping differential counters
> > for each processor.
> But the accounting delay has the unbounded error problem that the
> batching approach does not.

Ok. We could switch to batching in order to avoid using the
slab reaper.
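
One way to do that would be to spill the per cpu differential into the node
and global counters once it crosses a threshold, just as pagecache_acct did
(sketch only; STAT_THRESHOLD is a made-up name):

#define STAT_THRESHOLD 32

static inline void mod_node_page_state(int node, enum node_stat_item item, int delta)
{
        int *p = &vm_stat_diff[get_cpu()][node][item];

        *p += delta;
        if (*p > STAT_THRESHOLD || *p < -STAT_THRESHOLD) {
                /* Take the same lock as refresh_vm_stats() so a
                   concurrent fold cannot lose updates. */
                spin_lock(&node_stat_lock);
                vm_stat_node[node][item] += *p;
                vm_stat_global[item] += *p;
                *p = 0;
                spin_unlock(&node_stat_lock);
        }
        put_cpu();
}

That bounds the error of the global view to about NR_CPUS * STAT_THRESHOLD
per counter without touching shared cachelines in the common case.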