2005-12-10 00:54:59

by Christoph Lameter

Subject: [RFC 0/6] Zoned VM stats

Zone based VM statistics are necessary to determine the state of memory in a
single zone. On a NUMA system this is helpful for doing local reclaim and other
memory optimizations that shift VM load to optimize page allocation. It is also
helpful to know how the computing load affects the memory allocations in the
various zones.

The patchset introduces a framework for counters that is a cross between the
existing page_stats --which are simply global counters split per cpu-- and the
approach of deferred incremental updates implemented for nr_pagecache.

Small per cpu 8 bit counters are introduced in struct zone. If counting
exceeds a certain threshold then the counters are accumulated in an array in
the zone of the page and in a global array. This means that access to
VM counter information for a zone and for the whole machine is possible
by simply indexing an array. [Thanks to Nick Piggin for pointing me
at that approach].
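
In plain C the scheme boils down to roughly the following sketch. It is
illustrative only: the names counter_example, mod_counter and DIFF_THRESHOLD
are made up for this sketch, while the actual patches keep the s8 differentials
in struct per_cpu_pageset and fold them into the zone and global arrays with
atomic_long_add().

#include <stdio.h>

#define DIFF_THRESHOLD 32		/* the patches call this STAT_THRESHOLD */

struct counter_example {
	long zone_total;		/* accumulated per zone value */
	signed char cpu_diff;		/* small per cpu differential (s8) */
};

static long global_total;		/* accumulated machine wide value */

static void mod_counter(struct counter_example *c, int delta)
{
	long x = c->cpu_diff + delta;

	if (x > DIFF_THRESHOLD || x < -DIFF_THRESHOLD) {
		/* Fold the differential into the zone and global counters */
		c->zone_total += x;
		global_total += x;
		x = 0;
	}
	c->cpu_diff = x;
}

int main(void)
{
	struct counter_example zone = { 0, 0 };
	int i;

	/* e.g. account 100 page additions */
	for (i = 0; i < 100; i++)
		mod_counter(&zone, 1);

	printf("zone=%ld global=%ld diff=%d\n",
	       zone.zone_total, global_total, zone.cpu_diff);
	return 0;
}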

The patchset currently consists of 6 pieces.

1. Framework

Implements the counter functionality but does not define any counters. Contains the atomic_long_t hack.

2. nr_mapped

Make nr_mapped a zone based counter.

3. nr_pagecache

Make nr_pagecache a zone based counter.

4. Extend /proc output

Output the complete information contained in the global and zone statistics
array to /proc files.

5. nr_slab

Convert nr_slab to a zone based counter, NR_SLAB.

6. nr_page_table_pages

Convert nr_page_table_pages to a zone based counter, NR_PAGETABLE.


2005-12-10 00:55:33

by Christoph Lameter

Subject: [RFC 3/6] Make nr_pagecache a per zone counter

Make nr_pagecache a per node variable

Currently a single atomic variable is used to establish the size of the page cache
in the whole machine. The zoned VM counters have the same method of implementation
as the nr_pagecache code. Remove the special implementation for nr_pagecache and make
it a zoned counter. We will then be able to figure out how much of the memory in a
zone is used by the pagecache.

Updates of the page cache counters are always performed with interrupts off.
We can therefore use the __ variant here.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc5/include/linux/pagemap.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/pagemap.h 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/pagemap.h 2005-12-09 16:28:42.000000000 -0800
@@ -99,51 +99,6 @@ int add_to_page_cache_lru(struct page *p
extern void remove_from_page_cache(struct page *page);
extern void __remove_from_page_cache(struct page *page);

-extern atomic_t nr_pagecache;
-
-#ifdef CONFIG_SMP
-
-#define PAGECACHE_ACCT_THRESHOLD max(16, NR_CPUS * 2)
-DECLARE_PER_CPU(long, nr_pagecache_local);
-
-/*
- * pagecache_acct implements approximate accounting for pagecache.
- * vm_enough_memory() do not need high accuracy. Writers will keep
- * an offset in their per-cpu arena and will spill that into the
- * global count whenever the absolute value of the local count
- * exceeds the counter's threshold.
- *
- * MUST be protected from preemption.
- * current protection is mapping->page_lock.
- */
-static inline void pagecache_acct(int count)
-{
- long *local;
-
- local = &__get_cpu_var(nr_pagecache_local);
- *local += count;
- if (*local > PAGECACHE_ACCT_THRESHOLD || *local < -PAGECACHE_ACCT_THRESHOLD) {
- atomic_add(*local, &nr_pagecache);
- *local = 0;
- }
-}
-
-#else
-
-static inline void pagecache_acct(int count)
-{
- atomic_add(count, &nr_pagecache);
-}
-#endif
-
-static inline unsigned long get_page_cache_size(void)
-{
- int ret = atomic_read(&nr_pagecache);
- if (unlikely(ret < 0))
- ret = 0;
- return ret;
-}
-
/*
* Return byte-offset into filesystem object for page.
*/
Index: linux-2.6.15-rc5/mm/swap_state.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/swap_state.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/swap_state.c 2005-12-09 16:28:42.000000000 -0800
@@ -84,7 +84,7 @@ static int __add_to_swap_cache(struct pa
SetPageSwapCache(page);
set_page_private(page, entry.val);
total_swapcache_pages++;
- pagecache_acct(1);
+ __inc_zone_page_state(page_zone(page), NR_PAGECACHE);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
@@ -129,7 +129,7 @@ void __delete_from_swap_cache(struct pag
set_page_private(page, 0);
ClearPageSwapCache(page);
total_swapcache_pages--;
- pagecache_acct(-1);
+ __dec_zone_page_state(page_zone(page), NR_PAGECACHE);
INC_CACHE_INFO(del_total);
}

Index: linux-2.6.15-rc5/mm/filemap.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/filemap.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/filemap.c 2005-12-09 16:28:42.000000000 -0800
@@ -115,7 +115,7 @@ void __remove_from_page_cache(struct pag
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
- pagecache_acct(-1);
+ __dec_zone_page_state(page_zone(page), NR_PAGECACHE);
}

void remove_from_page_cache(struct page *page)
@@ -390,7 +390,7 @@ int add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = offset;
mapping->nrpages++;
- pagecache_acct(1);
+ __inc_zone_page_state(page_zone(page), NR_PAGECACHE);
}
write_unlock_irq(&mapping->tree_lock);
radix_tree_preload_end();
Index: linux-2.6.15-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/page_alloc.c 2005-12-09 16:28:38.000000000 -0800
+++ linux-2.6.15-rc5/mm/page_alloc.c 2005-12-09 16:28:42.000000000 -0800
@@ -1227,12 +1227,6 @@ static void show_node(struct zone *zone)
*/
static DEFINE_PER_CPU(struct page_state, page_states) = {0};

-atomic_t nr_pagecache = ATOMIC_INIT(0);
-EXPORT_SYMBOL(nr_pagecache);
-#ifdef CONFIG_SMP
-DEFINE_PER_CPU(long, nr_pagecache_local) = 0;
-#endif
-
void __get_page_state(struct page_state *ret, int nr, cpumask_t *cpumask)
{
int cpu = 0;
Index: linux-2.6.15-rc5/mm/mmap.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/mmap.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/mmap.c 2005-12-09 16:28:42.000000000 -0800
@@ -95,7 +95,7 @@ int __vm_enough_memory(long pages, int c
if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
unsigned long n;

- free = get_page_cache_size();
+ free = global_page_state(NR_PAGECACHE);
free += nr_swap_pages;

/*
Index: linux-2.6.15-rc5/mm/nommu.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/nommu.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/nommu.c 2005-12-09 16:28:42.000000000 -0800
@@ -1114,7 +1114,7 @@ int __vm_enough_memory(long pages, int c
if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
unsigned long n;

- free = get_page_cache_size();
+ free = global_page_state(NR_PAGECACHE);
free += nr_swap_pages;

/*
Index: linux-2.6.15-rc5/arch/sparc64/kernel/sys_sunos32.c
===================================================================
--- linux-2.6.15-rc5.orig/arch/sparc64/kernel/sys_sunos32.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/arch/sparc64/kernel/sys_sunos32.c 2005-12-09 16:28:42.000000000 -0800
@@ -154,7 +154,7 @@ asmlinkage int sunos_brk(u32 baddr)
* simple, it hopefully works in most obvious cases.. Easy to
* fool it, but this should catch most mistakes.
*/
- freepages = get_page_cache_size();
+ freepages = global_page_state(NR_PAGECACHE);
freepages >>= 1;
freepages += nr_free_pages();
freepages += nr_swap_pages;
Index: linux-2.6.15-rc5/arch/sparc/kernel/sys_sunos.c
===================================================================
--- linux-2.6.15-rc5.orig/arch/sparc/kernel/sys_sunos.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/arch/sparc/kernel/sys_sunos.c 2005-12-09 16:28:42.000000000 -0800
@@ -195,7 +195,7 @@ asmlinkage int sunos_brk(unsigned long b
* simple, it hopefully works in most obvious cases.. Easy to
* fool it, but this should catch most mistakes.
*/
- freepages = get_page_cache_size();
+ freepages = global_page_state(NR_PAGECACHE);
freepages >>= 1;
freepages += nr_free_pages();
freepages += nr_swap_pages;
Index: linux-2.6.15-rc5/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.15-rc5.orig/fs/proc/proc_misc.c 2005-12-09 16:28:37.000000000 -0800
+++ linux-2.6.15-rc5/fs/proc/proc_misc.c 2005-12-09 16:28:42.000000000 -0800
@@ -142,7 +142,7 @@ static int meminfo_read_proc(char *page,
allowed = ((totalram_pages - hugetlb_total_pages())
* sysctl_overcommit_ratio / 100) + total_swap_pages;

- cached = get_page_cache_size() - total_swapcache_pages - i.bufferram;
+ cached = global_page_state(NR_PAGECACHE) - total_swapcache_pages - i.bufferram;
if (cached < 0)
cached = 0;

2005-12-10 00:54:56

by Christoph Lameter

Subject: [RFC 2/6] Make nr_mapped a per zone counter

Make nr_mapped a per zone counter

The per zone nr_mapped counter is important because it allows us to determine
how many pages of a zone are not mapped, which in turn allows a more efficient
means of determining when we need to reclaim memory in a zone.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc5/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/page-flags.h 2005-12-09 16:27:13.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/page-flags.h 2005-12-09 16:31:45.000000000 -0800
@@ -85,7 +85,6 @@ struct page_state {
unsigned long nr_writeback; /* Pages under writeback */
unsigned long nr_unstable; /* NFS unstable pages */
unsigned long nr_page_table_pages;/* Pages used for pagetables */
- unsigned long nr_mapped; /* mapped into pagetables */
unsigned long nr_slab; /* In slab */
#define GET_PAGE_STATE_LAST nr_slab

Index: linux-2.6.15-rc5/drivers/base/node.c
===================================================================
--- linux-2.6.15-rc5.orig/drivers/base/node.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/drivers/base/node.c 2005-12-09 16:32:25.000000000 -0800
@@ -43,18 +43,18 @@ static ssize_t node_read_meminfo(struct
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long nr_mapped;

si_meminfo_node(&i, nid);
get_page_state_node(&ps, nid);
__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+ nr_mapped = node_page_state(nid, NR_MAPPED);

/* Check for negative values in these approximate counters */
if ((long)ps.nr_dirty < 0)
ps.nr_dirty = 0;
if ((long)ps.nr_writeback < 0)
ps.nr_writeback = 0;
- if ((long)ps.nr_mapped < 0)
- ps.nr_mapped = 0;
if ((long)ps.nr_slab < 0)
ps.nr_slab = 0;

@@ -83,7 +83,7 @@ static ssize_t node_read_meminfo(struct
nid, K(i.freeram - i.freehigh),
nid, K(ps.nr_dirty),
nid, K(ps.nr_writeback),
- nid, K(ps.nr_mapped),
+ nid, K(nr_mapped),
nid, K(ps.nr_slab));
n += hugetlb_report_node_meminfo(nid, buf + n);
return n;
Index: linux-2.6.15-rc5/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.15-rc5.orig/fs/proc/proc_misc.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/fs/proc/proc_misc.c 2005-12-09 16:31:45.000000000 -0800
@@ -190,7 +190,7 @@ static int meminfo_read_proc(char *page,
K(i.freeswap),
K(ps.nr_dirty),
K(ps.nr_writeback),
- K(ps.nr_mapped),
+ K(global_page_state(NR_MAPPED)),
K(ps.nr_slab),
K(allowed),
K(committed),
Index: linux-2.6.15-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/vmscan.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/vmscan.c 2005-12-09 16:31:45.000000000 -0800
@@ -967,7 +967,7 @@ int try_to_free_pages(struct zone **zone
}

for (priority = DEF_PRIORITY; priority >= 0; priority--) {
- sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_mapped = global_page_state(NR_MAPPED);
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
sc.priority = priority;
@@ -1056,7 +1056,7 @@ loop_again:
sc.gfp_mask = GFP_KERNEL;
sc.may_writepage = 0;
sc.may_swap = 1;
- sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_mapped = global_page_state(NR_MAPPED);

inc_page_state(pageoutrun);

@@ -1373,7 +1373,7 @@ int zone_reclaim(struct zone *zone, gfp_
sc.gfp_mask = gfp_mask;
sc.may_writepage = 0;
sc.may_swap = 0;
- sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_mapped = global_page_state(NR_MAPPED);
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
/* scan at the highest priority */
Index: linux-2.6.15-rc5/mm/page-writeback.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/page-writeback.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/page-writeback.c 2005-12-09 16:31:45.000000000 -0800
@@ -111,7 +111,7 @@ static void get_writeback_state(struct w
{
wbs->nr_dirty = read_page_state(nr_dirty);
wbs->nr_unstable = read_page_state(nr_unstable);
- wbs->nr_mapped = read_page_state(nr_mapped);
+ wbs->nr_mapped = global_page_state(NR_MAPPED);
wbs->nr_writeback = read_page_state(nr_writeback);
}

Index: linux-2.6.15-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/page_alloc.c 2005-12-09 16:28:30.000000000 -0800
+++ linux-2.6.15-rc5/mm/page_alloc.c 2005-12-09 16:31:45.000000000 -0800
@@ -1435,7 +1435,7 @@ void show_free_areas(void)
ps.nr_unstable,
nr_free_pages(),
ps.nr_slab,
- ps.nr_mapped,
+ global_page_state(NR_MAPPED),
ps.nr_page_table_pages);

for_each_zone(zone) {
Index: linux-2.6.15-rc5/mm/rmap.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/rmap.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/rmap.c 2005-12-09 16:31:45.000000000 -0800
@@ -454,7 +454,7 @@ void page_add_anon_rmap(struct page *pag

page->index = linear_page_index(vma, address);

- inc_page_state(nr_mapped);
+ inc_zone_page_state(page_zone(page), NR_MAPPED);
}
/* else checking page index and mapping is racy */
}
@@ -471,7 +471,7 @@ void page_add_file_rmap(struct page *pag
BUG_ON(!pfn_valid(page_to_pfn(page)));

if (atomic_inc_and_test(&page->_mapcount))
- inc_page_state(nr_mapped);
+ inc_zone_page_state(page_zone(page), NR_MAPPED);
}

/**
@@ -495,7 +495,7 @@ void page_remove_rmap(struct page *page)
*/
if (page_test_and_clear_dirty(page))
set_page_dirty(page);
- dec_page_state(nr_mapped);
+ dec_zone_page_state(page_zone(page), NR_MAPPED);
}
}

2005-12-10 00:56:06

by Christoph Lameter

Subject: [RFC 1/6] Framework

Currently we have various vm counters that are split per cpu. This arrangement
does not allow access to per zone statistics, which are important for optimizing
VM behavior on NUMA architectures. All one can tell from the per cpu
differential variables is how much a certain counter was changed by this cpu;
it is not possible to deduce how many pages in each zone are of a certain type.

This framework implements differential counters for each processor
in struct zone. The differential counters are consolidated when a threshold is
exceeded (like the current implementation for nr_pagecache), when slab reaping
occurs or when a consolidation function is called. The consolidation uses atomic
operations and accumulates counters per zone in the zone structure and globally
in the vm_stat array. VM functions can access the counts by simply indexing
a global or zone specific array.

The arrangement of counters in an array simplifies processing when output
has to be generated for /proc/*.

Counter updates can be done by calling *_zone_page_state or
__*_zone_page_state. The second function can be called if it is
known that interrupts are disabled.
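
For illustration, typical call sites (modelled on the conversions done in the
later patches of this series; page and zone are assumed to be in scope, and the
local variables are made up for this sketch) look like this:

	/* Caller already runs with interrupts off, f.e. under
	 * write_lock_irq(&mapping->tree_lock), so the cheaper
	 * __ variant is safe: */
	__inc_zone_page_state(page_zone(page), NR_PAGECACHE);

	/* Interrupt state unknown: the regular variant disables
	 * interrupts around the update itself: */
	inc_zone_page_state(page_zone(page), NR_MAPPED);

	/* Reading the consolidated counters is a simple array lookup: */
	unsigned long nr_mapped = global_page_state(NR_MAPPED);
	unsigned long nr_mapped_here = zone_page_state(zone, NR_MAPPED);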

Contains a hack to get atomic_long_t without really having it.
That hack can be removed when atomic_long_t becomes available.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/page_alloc.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/page_alloc.c 2005-12-09 16:34:52.000000000 -0800
@@ -556,7 +556,69 @@ static int rmqueue_bulk(struct zone *zon
return allocated;
}

+/*
+ * Manage combined zone based / global counters
+ */
+atomic_long_t vm_stat[NR_STAT_ITEMS];
+
+/*
+ * Update the zone counters for one cpu.
+ * Called from the slab reaper once in awhile.
+ */
+void refresh_cpu_vm_stats(void)
+{
+ struct zone *zone;
+ int i;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ for_each_zone(zone) {
+ struct per_cpu_pageset *pcp = zone_pcp(zone, raw_smp_processor_id());
+
+ for(i = 0; i < NR_STAT_ITEMS; i++) {
+ int v;
+
+ v = pcp->vm_stat_diff[i];
+ if (v) {
+ pcp->vm_stat_diff[i] = 0;
+ atomic_long_add(v, &zone->vm_stat[i]);
+ atomic_long_add(v, &vm_stat[i]);
+ }
+ }
+ }
+ local_irq_restore(flags);
+}
+
+static void __refresh_cpu_vm_stats(void *dummy)
+{
+ refresh_cpu_vm_stats();
+}
+
+/*
+ * Consolidate all counters.
+ *
+ * Note that the result is less inaccurate but still inaccurate
+ * since concurrent processes can increment/decrement counters
+ * while this function runs.
+ */
+void refresh_vm_stats(void)
+{
+ schedule_on_each_cpu(__refresh_cpu_vm_stats, NULL);
+}
+
+unsigned long node_page_state(int node, enum zone_stat_item item)
+{
+ struct zone *zones = NODE_DATA(node)->node_zones;
+ int i;
+ unsigned long v = 0;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ v += atomic_long_read(&zones[i].vm_stat[item]);
+ return v;
+}
+
#ifdef CONFIG_NUMA
+
/* Called from the slab reaper to drain remote pagesets */
void drain_remote_pages(void)
{
Index: linux-2.6.15-rc5/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/page-flags.h 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/page-flags.h 2005-12-09 16:33:35.000000000 -0800
@@ -163,6 +163,59 @@ extern void __mod_page_state(unsigned lo
} while (0)

/*
+ * Zone based accounting with per cpu differentials.
+ */
+#define STAT_THRESHOLD 32
+
+extern atomic_long_t vm_stat[NR_STAT_ITEMS];
+
+#define global_page_state(__x) atomic_long_read(&vm_stat[__x])
+#define zone_page_state(__z,__x) atomic_long_read(&(__z)->vm_stat[__x])
+extern unsigned long node_page_state(int node, enum zone_stat_item);
+
+/*
+ * For use when we know that interrupts are disabled.
+ */
+static inline void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, int delta)
+{
+ s8 *p;
+ long x;
+
+ p = &zone_pcp(zone, raw_smp_processor_id())->vm_stat_diff[item];
+ x = delta + *p;
+
+ if (unlikely(x > STAT_THRESHOLD || x < -STAT_THRESHOLD)) {
+ atomic_long_add(x, &zone->vm_stat[item]);
+ atomic_long_add(x, &vm_stat[item]);
+ x = 0;
+ }
+
+ *p = x;
+}
+
+#define __inc_zone_page_state(zone, item) __mod_zone_page_state(zone, item, 1)
+#define __dec_zone_page_state(zone, item) __mod_zone_page_state(zone, item, -1)
+#define __add_zone_page_state(zone, item, delta) __mod_zone_page_state(zone, item, delta)
+#define __sub_zone_page_state(zone, item, delta) __mod_zone_page_state(zone, item, -(delta))
+
+/*
+ * For an unknown interrupt state
+ */
+static inline void mod_zone_page_state(struct zone *zone, enum zone_stat_item item, int delta)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __mod_zone_page_state(zone, item, delta);
+ local_irq_restore(flags);
+}
+
+#define inc_zone_page_state(zone, item) mod_zone_page_state(zone, item, 1)
+#define dec_zone_page_state(zone, item) mod_zone_page_state(zone, item, -1)
+#define add_zone_page_state(zone, item, delta) mod_zone_page_state(zone, item, delta)
+#define sub_zone_page_state(zone, item, delta) mod_zone_page_state(zone, item, -(delta))
+
+/*
* Manipulation of page state flags
*/
#define PageLocked(page) \
Index: linux-2.6.15-rc5/include/linux/gfp.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/gfp.h 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/gfp.h 2005-12-09 16:27:13.000000000 -0800
@@ -153,8 +153,12 @@ extern void FASTCALL(free_cold_page(stru
void page_alloc_init(void);
#ifdef CONFIG_NUMA
void drain_remote_pages(void);
+void refresh_cpu_vm_stats(void);
+void refresh_vm_stats(void);
#else
static inline void drain_remote_pages(void) { };
+static inline void refresh_cpu_vm_stats(void) { };
+static inline void refresh_vm_stats(void) { };
#endif

#endif /* __LINUX_GFP_H */
Index: linux-2.6.15-rc5/mm/slab.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/slab.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/slab.c 2005-12-09 16:27:13.000000000 -0800
@@ -3359,6 +3359,7 @@ next:
check_irq_on();
up(&cache_chain_sem);
drain_remote_pages();
+ refresh_cpu_vm_stats();
/* Setup the next iteration */
schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC);
}
Index: linux-2.6.15-rc5/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/mmzone.h 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/mmzone.h 2005-12-09 16:27:13.000000000 -0800
@@ -44,6 +44,23 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif

+enum zone_stat_item { NR_MAPPED, NR_PAGECACHE };
+#define NR_STAT_ITEMS 2
+
+/*
+ * A hacky way of defining atomic long. Remove when
+ * atomic_long_t becomes available.
+ */
+#ifdef CONFIG_64BIT
+#define atomic_long_t atomic64_t
+#define atomic_long_add atomic64_add
+#define atomic_long_read atomic64_read
+#else
+#define atomic_long_t atomic_t
+#define atomic_long_add atomic_add
+#define atomic_long_read atomic_read
+#endif
+
struct per_cpu_pages {
int count; /* number of pages in the list */
int low; /* low watermark, refill needed */
@@ -54,6 +71,8 @@ struct per_cpu_pages {

struct per_cpu_pageset {
struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
+ s8 vm_stat_diff[NR_STAT_ITEMS];
+
#ifdef CONFIG_NUMA
unsigned long numa_hit; /* allocated in intended node */
unsigned long numa_miss; /* allocated in non intended node */
@@ -150,6 +169,8 @@ struct zone {
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */

+ /* Zone statistics */
+ atomic_long_t vm_stat[NR_STAT_ITEMS];
/*
* Does the allocator try to reclaim pages from the zone as soon
* as it fails a watermark_ok() in __alloc_pages?
@@ -236,7 +257,6 @@ struct zone {
char *name;
} ____cacheline_maxaligned_in_smp;

-
/*
* The "priority" of VM scanning is how much of the queues we will scan in one
* go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the

2005-12-10 00:55:38

by Christoph Lameter

Subject: [RFC 4/6] Expanded node and zone statistics

Extend zone, node and global statistics by printing all counters from the stats
array.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc5/drivers/base/node.c
===================================================================
--- linux-2.6.15-rc5.orig/drivers/base/node.c 2005-12-09 16:32:25.000000000 -0800
+++ linux-2.6.15-rc5/drivers/base/node.c 2005-12-09 16:32:53.000000000 -0800
@@ -43,12 +43,14 @@ static ssize_t node_read_meminfo(struct
unsigned long inactive;
unsigned long active;
unsigned long free;
- unsigned long nr_mapped;
+ int j;
+ unsigned long nr[NR_STAT_ITEMS];

si_meminfo_node(&i, nid);
get_page_state_node(&ps, nid);
__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
- nr_mapped = node_page_state(nid, NR_MAPPED);
+ for (j = 0; j < NR_STAT_ITEMS; j++)
+ nr[j] = node_page_state(nid, j);

/* Check for negative values in these approximate counters */
if ((long)ps.nr_dirty < 0)
@@ -71,6 +73,7 @@ static ssize_t node_read_meminfo(struct
"Node %d Dirty: %8lu kB\n"
"Node %d Writeback: %8lu kB\n"
"Node %d Mapped: %8lu kB\n"
+ "Node %d Pagecache: %8lu kB\n"
"Node %d Slab: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
@@ -83,7 +86,8 @@ static ssize_t node_read_meminfo(struct
nid, K(i.freeram - i.freehigh),
nid, K(ps.nr_dirty),
nid, K(ps.nr_writeback),
- nid, K(nr_mapped),
+ nid, K(nr[NR_MAPPED]),
+ nid, K(nr[NR_PAGECACHE]),
nid, K(ps.nr_slab));
n += hugetlb_report_node_meminfo(nid, buf + n);
return n;
Index: linux-2.6.15-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/page_alloc.c 2005-12-09 16:32:28.000000000 -0800
+++ linux-2.6.15-rc5/mm/page_alloc.c 2005-12-09 16:32:53.000000000 -0800
@@ -556,6 +556,8 @@ static int rmqueue_bulk(struct zone *zon
return allocated;
}

+char *stat_item_descr[NR_STAT_ITEMS] = { "mapped","pagecache" };
+
/*
* Manage combined zone based / global counters
*/
@@ -2230,6 +2232,11 @@ static int zoneinfo_show(struct seq_file
zone->nr_scan_active, zone->nr_scan_inactive,
zone->spanned_pages,
zone->present_pages);
+ for(i = 0; i < NR_STAT_ITEMS; i++)
+ seq_printf(m, "\n %-8s %lu",
+ stat_item_descr[i],
+ zone_page_state(zone, i));
+
seq_printf(m,
"\n protection: (%lu",
zone->lowmem_reserve[0]);

2005-12-10 00:56:07

by Christoph Lameter

Subject: [RFC 5/6] Make nr_slab a per zone counter

The number of slab pages in use is currently a counter split per cpu.
Make the number of slab pages a per zone counter so that we can see how
many slab pages have been allocated in each zone.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc5/drivers/base/node.c
===================================================================
--- linux-2.6.15-rc5.orig/drivers/base/node.c 2005-12-09 16:32:53.000000000 -0800
+++ linux-2.6.15-rc5/drivers/base/node.c 2005-12-09 16:32:57.000000000 -0800
@@ -88,7 +88,7 @@ static ssize_t node_read_meminfo(struct
nid, K(ps.nr_writeback),
nid, K(nr[NR_MAPPED]),
nid, K(nr[NR_PAGECACHE]),
- nid, K(ps.nr_slab));
+ nid, K(nr[NR_SLAB]));
n += hugetlb_report_node_meminfo(nid, buf + n);
return n;
}
Index: linux-2.6.15-rc5/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.15-rc5.orig/fs/proc/proc_misc.c 2005-12-09 16:32:28.000000000 -0800
+++ linux-2.6.15-rc5/fs/proc/proc_misc.c 2005-12-09 16:32:57.000000000 -0800
@@ -191,7 +191,7 @@ static int meminfo_read_proc(char *page,
K(ps.nr_dirty),
K(ps.nr_writeback),
K(global_page_state(NR_MAPPED)),
- K(ps.nr_slab),
+ K(global_page_state(NR_SLAB)),
K(allowed),
K(committed),
K(ps.nr_page_table_pages),
Index: linux-2.6.15-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/page_alloc.c 2005-12-09 16:32:53.000000000 -0800
+++ linux-2.6.15-rc5/mm/page_alloc.c 2005-12-09 16:32:57.000000000 -0800
@@ -556,7 +556,7 @@ static int rmqueue_bulk(struct zone *zon
return allocated;
}

-char *stat_item_descr[NR_STAT_ITEMS] = { "mapped","pagecache" };
+char *stat_item_descr[NR_STAT_ITEMS] = { "mapped","pagecache", "slab" };

/*
* Manage combined zone based / global counters
@@ -1430,7 +1430,7 @@ void show_free_areas(void)
ps.nr_writeback,
ps.nr_unstable,
nr_free_pages(),
- ps.nr_slab,
+ global_page_state(NR_SLAB),
global_page_state(NR_MAPPED),
ps.nr_page_table_pages);

Index: linux-2.6.15-rc5/mm/slab.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/slab.c 2005-12-09 16:27:13.000000000 -0800
+++ linux-2.6.15-rc5/mm/slab.c 2005-12-09 16:32:57.000000000 -0800
@@ -1213,7 +1213,7 @@ static void *kmem_getpages(kmem_cache_t
i = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
atomic_add(i, &slab_reclaim_pages);
- add_page_state(nr_slab, i);
+ add_zone_page_state(page_zone(page), NR_SLAB, i);
while (i--) {
SetPageSlab(page);
page++;
@@ -1235,7 +1235,7 @@ static void kmem_freepages(kmem_cache_t
BUG();
page++;
}
- sub_page_state(nr_slab, nr_freed);
+ sub_zone_page_state(page_zone(page), NR_SLAB, nr_freed);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += nr_freed;
free_pages((unsigned long)addr, cachep->gfporder);
Index: linux-2.6.15-rc5/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/page-flags.h 2005-12-09 16:31:45.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/page-flags.h 2005-12-09 16:32:57.000000000 -0800
@@ -85,8 +85,7 @@ struct page_state {
unsigned long nr_writeback; /* Pages under writeback */
unsigned long nr_unstable; /* NFS unstable pages */
unsigned long nr_page_table_pages;/* Pages used for pagetables */
- unsigned long nr_slab; /* In slab */
-#define GET_PAGE_STATE_LAST nr_slab
+#define GET_PAGE_STATE_LAST nr_page_table_pages

/*
* The below are zeroed by get_page_state(). Use get_full_page_state()
Index: linux-2.6.15-rc5/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/mmzone.h 2005-12-09 16:27:13.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/mmzone.h 2005-12-09 16:32:57.000000000 -0800
@@ -44,8 +44,8 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif

-enum zone_stat_item { NR_MAPPED, NR_PAGECACHE };
-#define NR_STAT_ITEMS 2
+enum zone_stat_item { NR_MAPPED, NR_PAGECACHE, NR_SLAB };
+#define NR_STAT_ITEMS 3

/*
* A hacky way of defining atomic long. Remove when

2005-12-10 00:56:07

by Christoph Lameter

Subject: [RFC 6/6] Make nr_page_table_pages a per zone counter

The nr_page_table_pages counter is currently implemented as a counter
split per cpu. nr_page_table_pages therefore only has meaning as a count
of the page table pages in the system as a whole.

This patch switches it to a zone based counter. It is then possible to
determine how many pages in a zone are used for page tables.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc5/mm/memory.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/memory.c 2005-12-03 21:10:42.000000000 -0800
+++ linux-2.6.15-rc5/mm/memory.c 2005-12-09 16:33:00.000000000 -0800
@@ -116,7 +116,7 @@ static void free_pte_range(struct mmu_ga
pmd_clear(pmd);
pte_lock_deinit(page);
pte_free_tlb(tlb, page);
- dec_page_state(nr_page_table_pages);
+ dec_zone_page_state(page_zone(page), NR_PAGETABLE);
tlb->mm->nr_ptes--;
}

@@ -302,7 +302,7 @@ int __pte_alloc(struct mm_struct *mm, pm
pte_free(new);
} else {
mm->nr_ptes++;
- inc_page_state(nr_page_table_pages);
+ inc_zone_page_state(page_zone(new), NR_PAGETABLE);
pmd_populate(mm, pmd, new);
}
spin_unlock(&mm->page_table_lock);
Index: linux-2.6.15-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc5.orig/mm/page_alloc.c 2005-12-09 16:32:57.000000000 -0800
+++ linux-2.6.15-rc5/mm/page_alloc.c 2005-12-09 16:33:00.000000000 -0800
@@ -556,7 +556,7 @@ static int rmqueue_bulk(struct zone *zon
return allocated;
}

-char *stat_item_descr[NR_STAT_ITEMS] = { "mapped","pagecache", "slab" };
+char *stat_item_descr[NR_STAT_ITEMS] = { "mapped","pagecache", "slab", "pagetable" };

/*
* Manage combined zone based / global counters
@@ -1432,7 +1432,7 @@ void show_free_areas(void)
nr_free_pages(),
global_page_state(NR_SLAB),
global_page_state(NR_MAPPED),
- ps.nr_page_table_pages);
+ global_page_state(NR_PAGETABLE));

for_each_zone(zone) {
int i;
Index: linux-2.6.15-rc5/include/linux/page-flags.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/page-flags.h 2005-12-09 16:32:57.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/page-flags.h 2005-12-09 16:33:00.000000000 -0800
@@ -84,8 +84,7 @@ struct page_state {
unsigned long nr_dirty; /* Dirty writeable pages */
unsigned long nr_writeback; /* Pages under writeback */
unsigned long nr_unstable; /* NFS unstable pages */
- unsigned long nr_page_table_pages;/* Pages used for pagetables */
-#define GET_PAGE_STATE_LAST nr_page_table_pages
+#define GET_PAGE_STATE_LAST nr_unstable

/*
* The below are zeroed by get_page_state(). Use get_full_page_state()
Index: linux-2.6.15-rc5/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc5.orig/include/linux/mmzone.h 2005-12-09 16:32:57.000000000 -0800
+++ linux-2.6.15-rc5/include/linux/mmzone.h 2005-12-09 16:33:00.000000000 -0800
@@ -44,8 +44,8 @@ struct zone_padding {
#define ZONE_PADDING(name)
#endif

-enum zone_stat_item { NR_MAPPED, NR_PAGECACHE, NR_SLAB };
-#define NR_STAT_ITEMS 3
+enum zone_stat_item { NR_MAPPED, NR_PAGECACHE, NR_SLAB, NR_PAGETABLE };
+#define NR_STAT_ITEMS 4

/*
* A hacky way of defining atomic long. Remove when
Index: linux-2.6.15-rc5/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.15-rc5.orig/fs/proc/proc_misc.c 2005-12-09 16:32:57.000000000 -0800
+++ linux-2.6.15-rc5/fs/proc/proc_misc.c 2005-12-09 16:33:00.000000000 -0800
@@ -194,7 +194,7 @@ static int meminfo_read_proc(char *page,
K(global_page_state(NR_SLAB)),
K(allowed),
K(committed),
- K(ps.nr_page_table_pages),
+ K(global_page_state(NR_PAGETABLE)),
(unsigned long)VMALLOC_TOTAL >> 10,
vmi.used >> 10,
vmi.largest_chunk >> 10
Index: linux-2.6.15-rc5/drivers/base/node.c
===================================================================
--- linux-2.6.15-rc5.orig/drivers/base/node.c 2005-12-09 16:32:57.000000000 -0800
+++ linux-2.6.15-rc5/drivers/base/node.c 2005-12-09 16:33:00.000000000 -0800
@@ -57,8 +57,6 @@ static ssize_t node_read_meminfo(struct
ps.nr_dirty = 0;
if ((long)ps.nr_writeback < 0)
ps.nr_writeback = 0;
- if ((long)ps.nr_slab < 0)
- ps.nr_slab = 0;

n = sprintf(buf, "\n"
"Node %d MemTotal: %8lu kB\n"

2005-12-10 03:32:40

by Andi Kleen

Subject: Re: [RFC 1/6] Framework

> +#define global_page_state(__x) atomic_long_read(&vm_stat[__x])
> +#define zone_page_state(__z,__x) atomic_long_read(&(__z)->vm_stat[__x])
> +extern unsigned long node_page_state(int node, enum zone_stat_item);
> +
> +/*
> + * For use when we know that interrupts are disabled.

Why do you need to disable interrupts for atomic_t?
If you just want to prevent switching CPUs that could be
done with get_cpu(), but alternatively you could just ignore
that race (it wouldn't be a big issue to still increment
the counter on the old CPU)

And why atomic and not just local_t? On x86/x86-64 local_t
would be much cheaper at least. It's not long, but that could
be as well added.

-Andi

2005-12-11 19:01:48

by Marcelo Tosatti

Subject: Re: [RFC 3/6] Make nr_pagecache a per zone counter

On Fri, Dec 09, 2005 at 04:54:56PM -0800, Christoph Lameter wrote:
> Make nr_pagecache a per node variable
>
> Currently a single atomic variable is used to establish the size of the page cache
> in the whole machine. The zoned VM counters have the same method of implementation
> as the nr_pagecache code. Remove the special implementation for nr_pagecache and make
> it a zoned counter. We will then be able to figure out how much of the memory in a
> zone is used by the pagecache.
>
> Updates of the page cache counters are always performed with interrupts off.
> We can therefore use the __ variant here.

By the way, why does nr_pagecache need to be an atomic variable on UP systems?

#ifdef CONFIG_SMP
...
#else

static inline void pagecache_acct(int count)
{
atomic_add(count, &nr_pagecache);
}
#endif

2005-12-11 19:48:44

by Andi Kleen

Subject: Re: [RFC 3/6] Make nr_pagecache a per zone counter

> By the way, why does nr_pagecache need to be an atomic variable on UP systems?

At least on X86 UP atomic doesn't use the LOCK prefix and is thus quite
cheap. I would expect other architectures who care about UP performance
(= not IA64) to be similar.

-Andi

2005-12-11 21:53:36

by Marcelo Tosatti

Subject: Re: [RFC 3/6] Make nr_pagecache a per zone counter

On Sun, Dec 11, 2005 at 08:48:40PM +0100, Andi Kleen wrote:
> > By the way, why does nr_pagecache need to be an atomic variable on UP systems?
>
> At least on X86 UP atomic doesn't use the LOCK prefix and is thus quite
> cheap. I would expect other architectures who care about UP performance
> (= not IA64) to be similar.

But in practice the variable does not need to be an atomic type for UP, but
simply a word, since stores are atomic on UP systems, no?

Several arches seem to use additional atomicity instructions on
atomic functions:

PPC:
static __inline__ void atomic_add(int a, atomic_t *v)
{
int t;

__asm__ __volatile__(
"1: lwarx %0,0,%3 # atomic_add\n\
add %0,%2,%0\n"
PPC405_ERR77(0,%3)
" stwcx. %0,0,%3 \n\
bne- 1b"
: "=&r" (t), "=m" (v->counter)
: "r" (a), "r" (&v->counter), "m" (v->counter)
: "cc");
}

"lwarx" and "stwcx." wouldnt be necessary for updating nr_pagecache
on UP.


SPARC:
int __atomic_add_return(int i, atomic_t *v)
{
int ret;
unsigned long flags;
spin_lock_irqsave(ATOMIC_HASH(v), flags);

ret = (v->counter += i);

spin_unlock_irqrestore(ATOMIC_HASH(v), flags);
return ret;
}



2005-12-12 03:46:49

by Nick Piggin

Subject: Re: [RFC 1/6] Framework

Christoph Lameter wrote:

> +/*
> + * For use when we know that interrupts are disabled.
> + */
> +static inline void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, int delta)
> +{

Before this goes through, I have a full patch to do similar for the
rest of the statistics, and which will make names consistent with what
you have (shouldn't be a lot of clashes though).

Nick

--
SUSE Labs, Novell Inc.


2005-12-12 03:51:20

by Nick Piggin

Subject: Re: [RFC 3/6] Make nr_pagecache a per zone counter

Marcelo Tosatti wrote:
> On Sun, Dec 11, 2005 at 08:48:40PM +0100, Andi Kleen wrote:
>
>>>By the way, why does nr_pagecache need to be an atomic variable on UP systems?
>>
>>At least on X86 UP atomic doesn't use the LOCK prefix and is thus quite
>>cheap. I would expect other architectures who care about UP performance
>>(= not IA64) to be similar.
>
>
> But in practice the variable does not need to be an atomic type for UP, but
> simply a word, since stores are atomic on UP systems, no?
>
> Several arches seem to use additional atomicity instructions on
> atomic functions:
>

Yeah, this is to protect from interrupts and is common to most
load store architectures. It is possible we could have
atomic_xxx_irq / atomic_xxx_irqsave functions for these, however
I think nobody has yet demonstrated the improvements outweigh the
complexity that would be added.
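
For a UP load-store architecture the idea would roughly be a variant like the
following sketch; atomic_add_irqsave is a made-up name, not an existing kernel
interface:

static inline void atomic_add_irqsave(int i, atomic_t *v)
{
	unsigned long flags;

	/* On UP only an interrupt can race with this read-modify-write,
	 * so briefly disabling interrupts replaces the lwarx/stwcx.
	 * (or hash-locked) sequence: */
	local_irq_save(flags);
	v->counter += i;
	local_irq_restore(flags);
}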

--
SUSE Labs, Novell Inc.


2005-12-12 03:56:43

by Andi Kleen

Subject: Re: [RFC 1/6] Framework

On Mon, Dec 12, 2005 at 02:46:42PM +1100, Nick Piggin wrote:
> Christoph Lameter wrote:
>
> >+/*
> >+ * For use when we know that interrupts are disabled.
> >+ */
> >+static inline void __mod_zone_page_state(struct zone *zone, enum
> >zone_stat_item item, int delta)
> >+{
>
> Before this goes through, I have a full patch to do similar for the
> rest of the statistics, and which will make names consistent with what
> you have (shouldn't be a lot of clashes though).

I also have a patch to change them all to local_t, greatly simplifying
it (e.g. the counters can be done inline then)

-Andi

2005-12-12 04:14:59

by Nick Piggin

Subject: Re: [RFC 1/6] Framework

Andi Kleen wrote:
> On Mon, Dec 12, 2005 at 02:46:42PM +1100, Nick Piggin wrote:
>
>>Christoph Lameter wrote:
>>
>>
>>>+/*
>>>+ * For use when we know that interrupts are disabled.
>>>+ */
>>>+static inline void __mod_zone_page_state(struct zone *zone, enum
>>>zone_stat_item item, int delta)
>>>+{
>>
>>Before this goes through, I have a full patch to do similar for the
>>rest of the statistics, and which will make names consistent with what
>>you have (shouldn't be a lot of clashes though).
>
>
> I also have a patch to change them all to local_t, greatly simplifying
> it (e.g. the counters can be done inline then)
>

Cool. That is a patch that should go on top of mine, because most of
my patch is aimed at moving modifications under interrupts-off sections,
so you would then be able to use __local_xxx operations very easily for
most of the counters here.

However I'm still worried about the use of locals tripling the cacheline
size of a hot-path structure on some 64-bit architectures. Probably we
should get them to try to move to the atomic64 scheme before using
local_t here.

Nick

--
SUSE Labs, Novell Inc.


2005-12-12 04:21:48

by Andi Kleen

Subject: Re: [RFC 1/6] Framework

On Mon, Dec 12, 2005 at 03:14:53PM +1100, Nick Piggin wrote:
> Andi Kleen wrote:
> >On Mon, Dec 12, 2005 at 02:46:42PM +1100, Nick Piggin wrote:
> >
> >>Christoph Lameter wrote:
> >>
> >>
> >>>+/*
> >>>+ * For use when we know that interrupts are disabled.
> >>>+ */
> >>>+static inline void __mod_zone_page_state(struct zone *zone, enum
> >>>zone_stat_item item, int delta)
> >>>+{
> >>
> >>Before this goes through, I have a full patch to do similar for the
> >>rest of the statistics, and which will make names consistent with what
> >>you have (shouldn't be a lot of clashes though).
> >
> >
> >I also have a patch to change them all to local_t, greatly simplifying
> >it (e.g. the counters can be done inline then)
> >
>
> Cool. That is a patch that should go on top of mine, because most of
> my patch is aimed at moving modifications under interrupts-off sections,

That's obsolete then. With local_t you don't need to turn off interrupts
anymore.

> However I'm still worried about the use of locals tripling the cacheline
> size of a hot-path structure on some 64-bit architectures. Probably we
> should get them to try to move to the atomic64 scheme before using
> local_t here.

I think the right fix for those is to just change the fallback local_t
to disable interrupts again - that should be a better tradeoff and
when they have a better alternative they can implement it in the arch.

(in fact i did a patch for that too, but considered throwing it away
again because I don't have a good way to test it)

-Andi

2005-12-12 04:28:29

by Nick Piggin

Subject: Re: [RFC 1/6] Framework

Andi Kleen wrote:
> On Mon, Dec 12, 2005 at 03:14:53PM +1100, Nick Piggin wrote:

>>Cool. That is a patch that should go on top of mine, because most of
>>my patch is aimed at moving modifications under interrupts-off sections,
>
>
> That's obsolete then.

No it isn't.

> With local_t you don't need to turn off interrupts
> anymore.
>

Then you can't use __local_xxx, and so many architectures will use
atomic instructions (the ones who don't are the ones with tripled
cacheline footprint of this structure).

Sure i386 and x86-64 are happy, but this would probably slow down
most other architectures.

>
>>However I'm still worried about the use of locals tripling the cacheline
>>size of a hot-path structure on some 64-bit architectures. Probably we
>>should get them to try to move to the atomic64 scheme before using
>>local_t here.
>
>
> I think the right fix for those is to just change the fallback local_t
> to disable interrupts again - that should be a better tradeoff and
> when they have a better alternative they can implement it in the arch.
>

Probably right.

> (in fact i did a patch for that too, but considered throwing it away
> again because I don't have a good way to test it)
>

Yep, it will be difficult to test.

--
SUSE Labs, Novell Inc.


2005-12-12 04:51:55

by Andi Kleen

Subject: Re: [RFC 1/6] Framework

> >With local_t you don't need to turn off interrupts
> >anymore.
> >
>
> Then you can't use __local_xxx, and so many architectures will use
> atomic instructions (the ones who don't are the ones with tripled
> cacheline footprint of this structure).

They are wrong then. atomic instructions is the wrong implementation
and they would be better off with asm-generic.

If anything they should use per_cpu counters for interrupts and
use seq locks. Or just turn off the interrupts for a short time
in the low level code.

>
> Sure i386 and x86-64 are happy, but this would probably slow down
> most other architectures.

I think it is better to fix the other architectures then - if they
are really using a full scale bus lock for this they're just wrong.

I don't think it is a good idea to do a large change in generic
code just for dumb low level code.

-Andi

2005-12-12 07:05:31

by Nick Piggin

Subject: Re: [RFC 1/6] Framework

Andi Kleen wrote:

>>Then you can't use __local_xxx, and so many architectures will use
>>atomic instructions (the ones who don't are the ones with tripled
>>cacheline footprint of this structure).
>
>
> They are wrong then. atomic instructions is the wrong implementation
> and they would be better off with asm-generic.
>

Yes I mean atomic and per-cpu. Same as asm-generic.

> If anything they should use per_cpu counters for interrupts and
> use seq locks.

How would seqlocks help?

> Or just turn off the interrupts for a short time
> in the low level code.
>

This is exactly what mod_page_state does, which is what my patches
eliminate. For a small but significant performance improvement.

>
>>Sure i386 and x86-64 are happy, but this would probably slow down
>>most other architectures.
>
>
> I think it is better to fix the other architectures then - if they
> are really using a full scale bus lock for this they're just wrong.
>
> I don't think it is a good idea to do a large change in generic
> code just for dumb low level code.
>

It is not a large change at all, just some shuffling of mod_page_state
and friends to go under pre-existing interrupts-off sections.

--
SUSE Labs, Novell Inc.


2005-12-12 12:51:48

by Marcelo Tosatti

Subject: Re: [RFC 3/6] Make nr_pagecache a per zone counter

On Mon, Dec 12, 2005 at 02:51:13PM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> >On Sun, Dec 11, 2005 at 08:48:40PM +0100, Andi Kleen wrote:
> >
> >>>By the way, why does nr_pagecache need to be an atomic variable on UP
> >>>systems?
> >>
> >>At least on X86 UP atomic doesn't use the LOCK prefix and is thus quite
> >>cheap. I would expect other architectures who care about UP performance
> >>(= not IA64) to be similar.
> >
> >
> >But in practice the variable does not need to be an atomic type for UP, but
> >simply a word, since stores are atomic on UP systems, no?
> >
> >Several arches seem to use additional atomicity instructions on
> >atomic functions:
> >
>
> Yeah, this is to protect from interrupts and is common to most
> load store architectures. It is possible we could have
> atomic_xxx_irq / atomic_xxx_irqsave functions for these, however
> I think nobody has yet demonstrated the improvements outweigh the
> complexity that would be added.

Hi Nick,

But nr_pagecache is not accessed from interrupt code, is it? It does
not need to be an atomic type.

2005-12-12 16:33:11

by Christoph Lameter

Subject: Re: [RFC 1/6] Framework

On Sat, 10 Dec 2005, Andi Kleen wrote:

> > +#define global_page_state(__x) atomic_long_read(&vm_stat[__x])
> > +#define zone_page_state(__z,__x) atomic_long_read(&(__z)->vm_stat[__x])
> > +extern unsigned long node_page_state(int node, enum zone_stat_item);
> > +
> > +/*
> > + * For use when we know that interrupts are disabled.
>
> Why do you need to disable interrupts for atomic_t?

Interrupts need to be disabled because the processing of the byte sized
differential could be interrupted.

> If you just want to prevent switching CPUs that could be
> done with get_cpu(), but alternatively you could just ignore
> that race (it wouldn't be a big issue to still increment
> the counter on the old CPU)

The updates are not simple increments or decrements right now: we add an
offset, and that offset could easily exceed the limits of a byte sized
differential. A check needs to happen before the differential is updated.
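
To make that concrete, this is the update sequence from __mod_zone_page_state()
in patch 1/6 again, annotated (the annotations are illustrative) with where an
interrupt could corrupt the byte sized differential if interrupts were enabled:

	p = &zone_pcp(zone, raw_smp_processor_id())->vm_stat_diff[item];
	x = delta + *p;			/* (1) read the s8 differential;
					 *     an interrupt here could also
					 *     modify *p ...               */
	if (unlikely(x > STAT_THRESHOLD || x < -STAT_THRESHOLD)) {
		atomic_long_add(x, &zone->vm_stat[item]);	/* (2) fold */
		atomic_long_add(x, &vm_stat[item]);
		x = 0;
	}
	*p = x;				/* (3) ... and that modification is
					 *     silently overwritten here   */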

> And why atomic and not just local_t? On x86/x86-64 local_t
> would be much cheaper at least. It's not long, but that could
> be as well added.

local_t is long on ia64.

The atomics are used for global updates of counters in struct zone and the
vm_stat array. local_t won't help there.

local_t could be used for the differentials. Special functions for
increment and decrement could use the non-interruptible nature of inc/decs
on i386 and x86_64.

There is no byte sized local_t though, so it's difficult to use local_t
here. I think this whole local_t stuff is not too useful after all.
Could we add an incp/decp macro that is like cmpxchg? That macro should
be able to operate on various sizes of counters.
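
As a rough illustration of that idea (hypothetical only; incp does not exist,
and in the kernel the per-size operations would have to be arch primitives such
as a single non-interruptible inc instruction on i386/x86_64, stubbed out here
as plain C so the sketch is self-contained), a cmpxchg-style macro could
dispatch on the operand size:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical size-generic increment in the style of cmpxchg() */
#define incp(ptr)						\
do {								\
	switch (sizeof(*(ptr))) {				\
	case 1: (*(int8_t *)(ptr))++;  break;			\
	case 4: (*(int32_t *)(ptr))++; break;			\
	case 8: (*(int64_t *)(ptr))++; break;			\
	}							\
} while (0)

int main(void)
{
	int8_t diff = 0;	/* like the s8 vm_stat_diff[] entries */
	int64_t total = 0;	/* like a zone or global counter */

	incp(&diff);
	incp(&total);
	printf("%d %lld\n", diff, (long long)total);
	return 0;
}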

2005-12-12 16:34:49

by Christoph Lameter

Subject: Re: [RFC 3/6] Make nr_pagecache a per zone counter

On Mon, 12 Dec 2005, Marcelo Tosatti wrote:

> But nr_pagecache is not accessed at interrupt code, is it? It does
> not need to be an atomic type.

nr_pagecache is only updated when interrupts are disabled. It could be
simply switched to unsigned long for UP.
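
A minimal sketch of that simplification, mirroring the #ifdef CONFIG_SMP split
of the old pagecache_acct() code (hypothetical, never posted as a patch; the
SMP side is abbreviated here):

#ifdef CONFIG_SMP

extern atomic_t nr_pagecache;

/* per cpu batching and folding as in the existing code, abbreviated */
static inline void pagecache_acct(int count)
{
	atomic_add(count, &nr_pagecache);
}

#else

extern unsigned long nr_pagecache;

static inline void pagecache_acct(int count)
{
	/* one CPU, updates done with interrupts off: a plain word suffices */
	nr_pagecache += count;
}

#endif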