2006-01-17 02:04:13

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch] mm: Convert global dirty_exceeded flag to per-node node_dirty_exceeded

This is a repost. I did not get any comments when I last posted this
about a month back. So I guess this patch is all good :)

Andrew, can this get some exposure in -mm?

Thanks,
Kiran


Convert global dirty_exceeded flag to per-node node_dirty_exceeded.

dirty_exceeded ping pongs between nodes in order to force all cpus in
the system to increase the frequency of calls to balance_dirty_pages.

Currently dirty_exceeded is used by balance_dirty_pages_ratelimited to
force all CPUs in the system call balance_dirty_pages often, in order to
reduce the amount of dirty pages in the entire system (based on
dirty_thresh and one CPU exceeding thee ratelimits). As dirty_exceeded
is a global variable, it will ping-pong between nodes of a NUMA system
which is not good.

We think it is OK to reduce the affect of dirty_exceeded to node basis.
That will give us on one hand having multiple CPUs (of that node)
calling balance_dirty_pages often (based on same logic as before) and on
the other hand not causing ping-pong of dirty_exceeded with the other
nodes.

Following patch changes the global dirty_exceeded flag to a per-node
flag. Hence, balance_dirty_pages is called more often only on the node
which exceeded the dirty_threshold. The SMP case is left as was before.

Comments?

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.15-mm4/mm/page-writeback.c
===================================================================
--- linux-2.6.15-mm4.orig/mm/page-writeback.c 2006-01-16 12:04:51.000000000 -0800
+++ linux-2.6.15-mm4/mm/page-writeback.c 2006-01-16 13:26:48.000000000 -0800
@@ -46,7 +46,20 @@
static long ratelimit_pages = 32;

static long total_pages; /* The total number of pages in the machine. */
-static int dirty_exceeded; /* Dirty mem may be over limit */
+
+/*
+ * node_dirty_exceeded is used to indicate that dirty mem in the current node
+ * maybe over the limit and we need to increase the frequency of calls to
+ * balance_dirty_pages
+ */
+#ifdef CONFIG_NUMA
+static DEFINE_PER_CPU(int, _dirty_exceeded) = { 0 };
+#define node_dirty_exceeded \
+ per_cpu(_dirty_exceeded, node_to_first_cpu(numa_node_id()))
+#else
+static int node_dirty_exceeded;
+#endif
+

/*
* When balance_dirty_pages decides that the caller needs to perform some
@@ -212,7 +225,7 @@ static void balance_dirty_pages(struct a
if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
break;

- dirty_exceeded = 1;
+ node_dirty_exceeded = 1;

/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
* Unstable writes are a feature of certain networked
@@ -235,7 +248,7 @@ static void balance_dirty_pages(struct a
}

if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
- dirty_exceeded = 0;
+ node_dirty_exceeded = 0;

if (writeback_in_progress(bdi))
return; /* pdflush is already working this queue */
@@ -272,7 +285,7 @@ void balance_dirty_pages_ratelimited(str
long ratelimit;

ratelimit = ratelimit_pages;
- if (dirty_exceeded)
+ if (node_dirty_exceeded)
ratelimit = 8;

/*


2006-01-17 02:13:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] mm: Convert global dirty_exceeded flag to per-node node_dirty_exceeded

Ravikiran G Thirumalai <[email protected]> wrote:
>
> This is a repost. I did not get any comments when I last posted this
> about a month back. So I guess this patch is all good :)
>

I thought it was fairly ghastly, sorry ;)

> Convert global dirty_exceeded flag to per-node node_dirty_exceeded.
>
> dirty_exceeded ping pongs between nodes in order to force all cpus in
> the system to increase the frequency of calls to balance_dirty_pages.
>
> Currently dirty_exceeded is used by balance_dirty_pages_ratelimited to
> force all CPUs in the system call balance_dirty_pages often, in order to
> reduce the amount of dirty pages in the entire system (based on
> dirty_thresh and one CPU exceeding thee ratelimits). As dirty_exceeded
> is a global variable, it will ping-pong between nodes of a NUMA system
> which is not good.

Did you not test this obvious little optimisation?

--- devel/mm/page-writeback.c~mm-dirty_exceeded-speedup 2006-01-16 18:11:36.000000000 -0800
+++ devel-akpm/mm/page-writeback.c 2006-01-16 18:11:56.000000000 -0800
@@ -212,7 +212,8 @@ static void balance_dirty_pages(struct a
if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
break;

- dirty_exceeded = 1;
+ if (!dirty_exceeded)
+ dirty_exceeded = 1;

/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
* Unstable writes are a feature of certain networked
@@ -234,7 +235,7 @@ static void balance_dirty_pages(struct a
blk_congestion_wait(WRITE, HZ/10);
}

- if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
+ if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh && dirty_exceeded)
dirty_exceeded = 0;

if (writeback_in_progress(bdi))
_

2006-01-18 01:30:05

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch] mm: Convert global dirty_exceeded flag to per-node node_dirty_exceeded

On Mon, Jan 16, 2006 at 06:13:23PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > Convert global dirty_exceeded flag to per-node node_dirty_exceeded.
> >
> > dirty_exceeded ping pongs between nodes in order to force all cpus in
> > the system to increase the frequency of calls to balance_dirty_pages.
> >
> > Currently dirty_exceeded is used by balance_dirty_pages_ratelimited to
> > force all CPUs in the system call balance_dirty_pages often, in order to
> > reduce the amount of dirty pages in the entire system (based on
> > dirty_thresh and one CPU exceeding thee ratelimits). As dirty_exceeded
> > is a global variable, it will ping-pong between nodes of a NUMA system
> > which is not good.
>
> Did you not test this obvious little optimisation?

We ran the test we encountered this problem on with your patch.
At first it looked like it did not help. But later we found that there was
false sharing on this variable. So padding dirty_exceeded to cacheline
along with your patch helps.

Thanks,
Kiran


Avoid false sharing on dirty_exceeded.

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: build-2.6.15/mm/page-writeback.c
===================================================================
--- build-2.6.15.orig/mm/page-writeback.c 2006-01-17 12:46:00.000000000 -0800
+++ build-2.6.15/mm/page-writeback.c 2006-01-17 16:21:47.000000000 -0800
@@ -46,7 +46,7 @@
static long ratelimit_pages = 32;

static long total_pages; /* The total number of pages in the machine. */
-static int dirty_exceeded; /* Dirty mem may be over limit */
+static int dirty_exceeded __cacheline_aligned_in_smp; /* Dirty mem may be over limit */

/*
* When balance_dirty_pages decides that the caller needs to perform some

2006-01-18 02:10:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] mm: Convert global dirty_exceeded flag to per-node node_dirty_exceeded

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Mon, Jan 16, 2006 at 06:13:23PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[email protected]> wrote:
> > >
> > > Convert global dirty_exceeded flag to per-node node_dirty_exceeded.
> > >
> > > dirty_exceeded ping pongs between nodes in order to force all cpus in
> > > the system to increase the frequency of calls to balance_dirty_pages.
> > >
> > > Currently dirty_exceeded is used by balance_dirty_pages_ratelimited to
> > > force all CPUs in the system call balance_dirty_pages often, in order to
> > > reduce the amount of dirty pages in the entire system (based on
> > > dirty_thresh and one CPU exceeding thee ratelimits). As dirty_exceeded
> > > is a global variable, it will ping-pong between nodes of a NUMA system
> > > which is not good.
> >
> > Did you not test this obvious little optimisation?
>
> We ran the test we encountered this problem on with your patch.
> At first it looked like it did not help. But later we found that there was
> false sharing on this variable.

OK. That's a bit nasty, isn't it? It can work well or poorly for
different people depending upon vagaries of .config and the linker.

We should find out what it was sharing _with_. Could you please run

nm -n vmlinux| grep -C5 dirty_exceeded


2006-01-18 03:10:12

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch] mm: Convert global dirty_exceeded flag to per-node node_dirty_exceeded

On Tue, Jan 17, 2006 at 06:09:56PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > On Mon, Jan 16, 2006 at 06:13:23PM -0800, Andrew Morton wrote:
> > > Ravikiran G Thirumalai <[email protected]> wrote:
> > > >
> > > > Convert global dirty_exceeded flag to per-node node_dirty_exceeded.
> > > >
> > > > dirty_exceeded ping pongs between nodes in order to force all cpus in
> > > > the system to increase the frequency of calls to balance_dirty_pages.
> > > >
> > > > Currently dirty_exceeded is used by balance_dirty_pages_ratelimited to
> > > > force all CPUs in the system call balance_dirty_pages often, in order to
> > > > reduce the amount of dirty pages in the entire system (based on
> > > > dirty_thresh and one CPU exceeding thee ratelimits). As dirty_exceeded
> > > > is a global variable, it will ping-pong between nodes of a NUMA system
> > > > which is not good.
> > >
> > > Did you not test this obvious little optimisation?
> >
> > We ran the test we encountered this problem on with your patch.
> > At first it looked like it did not help. But later we found that there was
> > false sharing on this variable.
>
> OK. That's a bit nasty, isn't it? It can work well or poorly for
> different people depending upon vagaries of .config and the linker.
>
> We should find out what it was sharing _with_. Could you please run
>
> nm -n vmlinux| grep -C5 dirty_exceeded

ffffffff805290c0 b irq_dir
ffffffff80529838 b root_irq_dir
ffffffff80529880 B max_pfn
ffffffff80529888 B min_low_pfn
ffffffff80529890 B max_low_pfn
ffffffff80529900 B nr_pagecache
ffffffff80529908 B nr_swap_pages
ffffffff80529980 b boot_pageset
ffffffff8052a980 B laptop_mode
ffffffff8052a984 B block_dump
ffffffff8052a988 b dirty_exceeded
ffffffff8052a990 b total_pages
ffffffff8052a998 B nr_pdflush_threads
ffffffff8052a9a0 b last_empty_jifs
ffffffff8052a9c0 B slab_reclaim_pages
ffffffff8052a9c4 b slab_break_gfp_order
ffffffff8052a9c8 b g_cpucache_up
ffffffff8052a9d0 b cache_chain
ffffffff8052a9e0 b cache_chain_sem
ffffffff8052aa00 b offslab_limit
ffffffff8052aa08 B page_cluster

Maybe slab_reclaim_pages is the culprit?

Thanks,
Kiran

2006-01-18 03:22:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] mm: Convert global dirty_exceeded flag to per-node node_dirty_exceeded

Ravikiran G Thirumalai <[email protected]> wrote:
>
> > OK. That's a bit nasty, isn't it? It can work well or poorly for
> > different people depending upon vagaries of .config and the linker.
> >
> > We should find out what it was sharing _with_. Could you please run
> >
> > nm -n vmlinux| grep -C5 dirty_exceeded
>
> ffffffff805290c0 b irq_dir
> ffffffff80529838 b root_irq_dir
> ffffffff80529880 B max_pfn
> ffffffff80529888 B min_low_pfn
> ffffffff80529890 B max_low_pfn
> ffffffff80529900 B nr_pagecache
> ffffffff80529908 B nr_swap_pages
> ffffffff80529980 b boot_pageset
> ffffffff8052a980 B laptop_mode
> ffffffff8052a984 B block_dump
> ffffffff8052a988 b dirty_exceeded
> ffffffff8052a990 b total_pages
> ffffffff8052a998 B nr_pdflush_threads
> ffffffff8052a9a0 b last_empty_jifs
> ffffffff8052a9c0 B slab_reclaim_pages
> ffffffff8052a9c4 b slab_break_gfp_order
> ffffffff8052a9c8 b g_cpucache_up
> ffffffff8052a9d0 b cache_chain
> ffffffff8052a9e0 b cache_chain_sem
> ffffffff8052aa00 b offslab_limit
> ffffffff8052aa08 B page_cluster
>
> Maybe slab_reclaim_pages is the culprit?
>

I guess so - pretty much everything else there should be in __read_mostly
anwyay, apart from nr_pagecache.

slab_reclaim_pages is just a silly beancounting thing for overcommit. We
could give it the approximate-counting treatment like pagecache_acct() I
guess.