2011-09-20 13:45:11

by Johannes Weiner

Subject: [patch 0/4] 50% faster writing to your USB drive!*

*if you use ntfs-3g to copy files larger than main memory

or: per-zone dirty limits

There have been several discussions and patches around the issue of
dirty pages being written from page reclaim, that is, they reach the
end of the LRU list before they are cleaned.

Proposed reasons for this are, on one hand, the divergence of dirtying
age from page cache age and, on the other, the unequal distribution of
the globally limited dirty memory across the LRU lists of different
zones.

Mel's recent patches to reduce writes from reclaim, by simply skipping
over dirty pages until a certain amount of memory pressure builds up,
do help quite a bit. But they can only deal with a limited length of
runs of dirty pages before kswapd goes to lower priority levels to
balance the zone and begins writing.

The unequal distribution of dirty memory between zones is easily
observable through the statistics in /proc/zoneinfo, but the test
results varied between filesystems. To get an overview of where and
how often different page cache pages are created and dirtied, I hacked
together an object tracker that remembers the instantiator of a page
cache page and associates with it the paths that dirty or activate the
page, together with counters that indicate how often those operations
occur.
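
For illustration only (the actual hack was not posted, so the names and
layout below are made up), the bookkeeping this describes amounts to
something like one record per page cache page:

  /*
   * Sketch of the tracker idea: record who instantiated a page cache
   * page and count how often it is subsequently dirtied or activated.
   */
  struct pc_track {
          unsigned long instantiator;   /* _RET_IP_ of the page cache insertion */
          atomic_t nr_dirtied;          /* bumped from account_page_dirtied() */
          atomic_t nr_activated;        /* bumped from mark_page_accessed() */
  };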

Btrfs, for example, appears to be activating a significant amount of
regularly written tree data with mark_page_accessed(), even with a
purely linear, page-aligned write load. So in addition to the already
unbounded dirty memory on smaller zones, this is a divergence between
page age and dirtying age and leads to a situation where the pages
reclaimed next are not the ones that are also flushed next:

              pgactivate
                 min|  median|     max
xfs:           5.000|   6.500|  20.000
fuse-ntfs:     5.000|  19.000| 275.000
ext4:          2.000|  67.000| 810.000
btrfs:      2915.000|3316.500|5786.000

ext4's delalloc, on the other hand, refuses regular write attempts from
kjournald, but the write index of the inode is still advanced for
cyclic write ranges and so the pages are not even immediately written
when the inode is selected again.

I cc'd the filesystem people because it is at least conceivable that
things could be improved on their side, but I do think the problem is
mainly with the VM and needs fixing there.

This patch series implements per-zone dirty limits, derived from the
configured global dirty limits and the individual zone size, that the
page allocator uses to distribute pages allocated for writing across
the allowable zones. Even with pages dirtied out of the inactive LRU
order this gives page reclaim a minimum number of clean pages on each
LRU so that balancing a zone should no longer require writeback in the
common case.
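
(As a rough worked example with the test machine below: at a 40% dirty
ratio, a zone contributing ~3246M of dirtyable memory gets a per-zone
limit of roughly 0.4 * 3246M ~ 1298M, the ~504M Normal zone roughly
202M, and the 15M DMA zone only a few megabytes; the exact numbers are
somewhat smaller because reserves are subtracted and only free plus
reclaimable pages count as dirtyable.)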

The previous version included code to wake the flushers and stall the
allocation on NUMA setups where the load is bound to a node that is in
itself not large enough to reach the global dirty limits. I am still
trying to get that to work reliably, so I dropped it for now; the
series has merit even without it.

Test results

15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
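
(count=$((10 << 15)) is 327,680 blocks of 32k, i.e. 10G written per
run, well beyond the 3765M of memory, so each run cycles the whole
page cache.)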

                    seconds               nr_vmscan_write
                   (stddev)               min|     median|         max
xfs
  vanilla:   549.747(  3.492)           0.000|      0.000|       0.000
  patched:   550.996(  3.802)           0.000|      0.000|       0.000

fuse-ntfs
  vanilla:  1183.094( 53.178)       54349.000|  59341.000|   65163.000
  patched:   558.049( 17.914)           0.000|      0.000|      43.000

btrfs
  vanilla:   573.679( 14.015)      156657.000| 460178.000|  606926.000
  patched:   563.365( 11.368)           0.000|      0.000|    1362.000

ext4
  vanilla:   561.197( 15.782)           0.000|2725438.000| 4143837.000
  patched:   568.806( 17.496)           0.000|      0.000|       0.000

Even though most filesystems already ignore the write request from
reclaim, we were reluctant in the past to remove it, as it was still
theoretically our only means to stay on top of the dirty pages on a
per-zone basis. This patchset should get us closer to removing the
dreaded writepage call from page reclaim altogether.

Hannes



2011-09-20 13:45:12

by Johannes Weiner

Subject: [patch 1/4] mm: exclude reserved pages from dirtyable memory

The amount of dirtyable pages should not include the total number of
free pages: there is a number of reserved pages that the page
allocator and kswapd always try to keep free.

The closer (reclaimable pages - dirty pages) is to the number of
reserved pages, the more likely it becomes for reclaim to run into
dirty pages:

+----------+ ---
|   anon   |  |
+----------+  |
|          |  |
|          |  -- dirty limit new -- flusher new
|   file   |  |                  |
|          |  |                  |
|          |  -- dirty limit old -- flusher old
|          |  |
+----------+ --- reclaim
| reserved |
+----------+
|  kernel  |
+----------+

Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.

But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages
is mostly much smaller in absolute numbers.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page-writeback.c | 8 +++++---
mm/page_alloc.c | 1 +
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1ed4116..e28f8e0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -316,6 +316,7 @@ struct zone {
* sysctl_lowmem_reserve_ratio sysctl changes.
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
+ unsigned long totalreserve_pages;

#ifdef CONFIG_NUMA
int node;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index da6d263..9f896db 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -169,8 +169,9 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];

- x += zone_page_state(z, NR_FREE_PAGES) +
- zone_reclaimable_pages(z);
+ x += zone_page_state(z, NR_FREE_PAGES) -
+ zone->totalreserve_pages;
+ x += zone_reclaimable_pages(z);
}
/*
* Make sure that the number of highmem pages is never larger
@@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
{
unsigned long x;

- x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+ x = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
+ x += global_reclaimable_pages();

if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1dba05e..7e8e2ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5075,6 +5075,7 @@ static void calculate_totalreserve_pages(void)

if (max > zone->present_pages)
max = zone->present_pages;
+ zone->totalreserve_pages = max;
reserve_pages += max;
}
}
--
1.7.6


2011-09-20 13:45:14

by Johannes Weiner

Subject: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

Tell the page allocator that pages allocated through
grab_cache_page_write_begin() are expected to become dirty soon.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/filemap.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 645a080..cf0352d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2349,8 +2349,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags)
{
int status;
+ gfp_t gfp_mask;
struct page *page;
gfp_t gfp_notmask = 0;
+
+ gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;
if (flags & AOP_FLAG_NOFS)
gfp_notmask = __GFP_FS;
repeat:
@@ -2358,7 +2361,7 @@ repeat:
if (page)
goto found;

- page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
+ page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
if (!page)
return NULL;
status = add_to_page_cache_lru(page, mapping, index,
--
1.7.6


2011-09-20 13:45:13

by Johannes Weiner

Subject: [patch 2/4] mm: writeback: distribute write pages across allowable zones

This patch allows allocators to pass __GFP_WRITE when they know in
advance that the allocated page will be written to and become dirty
soon. The page allocator will then attempt to distribute those
allocations across zones, such that no single zone will end up full of
dirty, and thus more or less, unreclaimable pages.

The global dirty limits are put in proportion to the respective zone's
amount of dirtyable memory and allocations diverted to other zones
when the limit is reached.

For now, the problem remains for NUMA configurations where the zones
allowed for allocation are in sum not big enough to trigger the global
dirty limits, but a future approach to solve this can reuse the
per-zone dirty limit infrastructure laid out in this patch to have
dirty throttling and the flusher threads consider individual zones.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/gfp.h | 4 ++-
include/linux/writeback.h | 1 +
mm/page-writeback.c | 66 +++++++++++++++++++++++++++++++++++++-------
mm/page_alloc.c | 22 ++++++++++++++-
4 files changed, 80 insertions(+), 13 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..50efc7e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -36,6 +36,7 @@ struct vm_area_struct;
#endif
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
+#define ___GFP_WRITE 0x1000000u

/*
* GFP bitmasks..
@@ -85,6 +86,7 @@ struct vm_area_struct;

#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
+#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */

/*
* This may seem redundant, but it's a way of annotating false positives vs.
@@ -92,7 +94,7 @@ struct vm_area_struct;
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a5f495f..c96ee0c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data);
static inline void laptop_sync_completion(void) { }
#endif
void throttle_vm_writeout(gfp_t gfp_mask);
+bool zone_dirty_ok(struct zone *zone);

extern unsigned long global_dirty_limit;

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9f896db..1fc714c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -142,6 +142,22 @@ unsigned long global_dirty_limit;
static struct prop_descriptor vm_completions;
static struct prop_descriptor vm_dirties;

+static unsigned long zone_dirtyable_memory(struct zone *zone)
+{
+ unsigned long x;
+ /*
+ * To keep a reasonable ratio between dirty memory and lowmem,
+ * highmem is not considered dirtyable on a global level.
+ *
+ * But we allow individual highmem zones to hold a potentially
+ * bigger share of that global amount of dirty pages as long
+ * as they have enough free or reclaimable pages around.
+ */
+ x = zone_page_state(zone, NR_FREE_PAGES) - zone->totalreserve_pages;
+ x += zone_reclaimable_pages(zone);
+ return x;
+}
+
/*
* Work out the current dirty-memory clamping and background writeout
* thresholds.
@@ -417,7 +433,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
}

/*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ * dirty_limits - background-writeback and dirty-throttling thresholds
*
* Calculate the dirty thresholds based on sysctl parameters
* - vm.dirty_background_ratio or vm.dirty_background_bytes
@@ -425,24 +441,35 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
* The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
* real-time tasks.
*/
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
+static void dirty_limits(struct zone *zone,
+ unsigned long *pbackground,
+ unsigned long *pdirty)
{
+ unsigned long uninitialized_var(zone_memory);
+ unsigned long available_memory;
+ unsigned long global_memory;
unsigned long background;
- unsigned long dirty;
- unsigned long uninitialized_var(available_memory);
struct task_struct *tsk;
+ unsigned long dirty;

- if (!vm_dirty_bytes || !dirty_background_bytes)
- available_memory = determine_dirtyable_memory();
+ global_memory = determine_dirtyable_memory();
+ if (zone)
+ available_memory = zone_memory = zone_dirtyable_memory(zone);
+ else
+ available_memory = global_memory;

- if (vm_dirty_bytes)
+ if (vm_dirty_bytes) {
dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
- else
+ if (zone)
+ dirty = dirty * zone_memory / global_memory;
+ } else
dirty = (vm_dirty_ratio * available_memory) / 100;

- if (dirty_background_bytes)
+ if (dirty_background_bytes) {
background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
- else
+ if (zone)
+ background = background * zone_memory / global_memory;
+ } else
background = (dirty_background_ratio * available_memory) / 100;

if (background >= dirty)
@@ -452,9 +479,15 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
background += background / 4;
dirty += dirty / 4;
}
+ if (!zone)
+ trace_global_dirty_state(background, dirty);
*pbackground = background;
*pdirty = dirty;
- trace_global_dirty_state(background, dirty);
+}
+
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
+{
+ dirty_limits(NULL, pbackground, pdirty);
}

/**
@@ -875,6 +908,17 @@ void throttle_vm_writeout(gfp_t gfp_mask)
}
}

+bool zone_dirty_ok(struct zone *zone)
+{
+ unsigned long background_thresh, dirty_thresh;
+
+ dirty_limits(zone, &background_thresh, &dirty_thresh);
+
+ return zone_page_state(zone, NR_FILE_DIRTY) +
+ zone_page_state(zone, NR_UNSTABLE_NFS) +
+ zone_page_state(zone, NR_WRITEBACK) <= dirty_thresh;
+}
+
/*
* sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e8e2ee..3cca043 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1368,6 +1368,7 @@ failed:
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+#define ALLOC_SLOWPATH 0x80 /* allocator retrying */

#ifdef CONFIG_FAIL_PAGE_ALLOC

@@ -1667,6 +1668,25 @@ zonelist_scan:
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
continue;
+ /*
+ * This may look like it would increase pressure on
+ * lower zones by failing allocations in higher zones
+ * before they are full. But once they are full, the
+ * allocations fall back to lower zones anyway, and
+ * then this check actually protects the lower zones
+ * from a flood of dirty page allocations.
+ *
+ * XXX: Allow allocations to potentially exceed the
+ * per-zone dirty limit in the slowpath before going
+ * into reclaim, which is important when NUMA nodes
+ * are not big enough to reach the global limit. The
+ * proper fix on these setups will require awareness
+ * of zones in the dirty-throttling and the flusher
+ * threads.
+ */
+ if (!(alloc_flags & ALLOC_SLOWPATH) &&
+ (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+ goto this_zone_full;

BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
@@ -2111,7 +2131,7 @@ restart:
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
*/
- alloc_flags = gfp_to_alloc_flags(gfp_mask);
+ alloc_flags = gfp_to_alloc_flags(gfp_mask) | ALLOC_SLOWPATH;

/*
* Find the true preferred zone if the allocation is unconstrained by
--
1.7.6


2011-09-20 13:45:15

by Johannes Weiner

Subject: [patch 4/4] Btrfs: pass __GFP_WRITE for buffered write page allocations

Tell the page allocator that pages allocated for a buffered write are
expected to become dirty soon.

Signed-off-by: Johannes Weiner <[email protected]>
---
fs/btrfs/file.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e7872e4..ea1b892 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1084,7 +1084,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
again:
for (i = 0; i < num_pages; i++) {
pages[i] = find_or_create_page(inode->i_mapping, index + i,
- GFP_NOFS);
+ GFP_NOFS | __GFP_WRITE);
if (!pages[i]) {
faili = i - 1;
err = -ENOMEM;
--
1.7.6


2011-09-20 13:56:31

by Johannes Weiner

Subject: Re: [patch 4/4] Btrfs: pass __GFP_WRITE for buffered write page allocations

On Tue, Sep 20, 2011 at 03:45:15PM +0200, Johannes Weiner wrote:
> Tell the page allocator that pages allocated for a buffered write are
> expected to become dirty soon.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> fs/btrfs/file.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e7872e4..ea1b892 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1084,7 +1084,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
> again:
> for (i = 0; i < num_pages; i++) {
> pages[i] = find_or_create_page(inode->i_mapping, index + i,
> - GFP_NOFS);
> + GFP_NOFS | __GFP_WRITE);

Btw and unrelated to this particular series, I think this should use
grab_cache_page_write_begin() in the first place.

Most grab_cache_page calls were replaced recently (a94733d "Btrfs: use
find_or_create_page instead of grab_cache_page") to be able to pass
GFP_NOFS, but the pages are now also no longer __GFP_HIGHMEM and
__GFP_MOVABLE, which irks both x86_32 and memory hotplug.

It might be better to change grab_cache_page instead to take a flags
argument that allows passing AOP_FLAG_NOFS and revert the sites back
to this helper?
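
For illustration, such a helper could look roughly like this (just a
sketch, the name is made up, not part of the series):

        static inline struct page *
        grab_cache_page_flags(struct address_space *mapping, pgoff_t index,
                              unsigned flags)
        {
                gfp_t gfp = mapping_gfp_mask(mapping);

                if (flags & AOP_FLAG_NOFS)
                        gfp &= ~__GFP_FS;
                /* keeps __GFP_HIGHMEM/__GFP_MOVABLE from the mapping */
                return find_or_create_page(mapping, index, gfp);
        }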


2011-09-20 14:09:38

by Josef Bacik

Subject: Re: [patch 4/4] Btrfs: pass __GFP_WRITE for buffered write page allocations

On 09/20/2011 09:56 AM, Johannes Weiner wrote:
> On Tue, Sep 20, 2011 at 03:45:15PM +0200, Johannes Weiner wrote:
>> Tell the page allocator that pages allocated for a buffered write are
>> expected to become dirty soon.
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
>> ---
>> fs/btrfs/file.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index e7872e4..ea1b892 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1084,7 +1084,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
>> again:
>> for (i = 0; i < num_pages; i++) {
>> pages[i] = find_or_create_page(inode->i_mapping, index + i,
>> - GFP_NOFS);
>> + GFP_NOFS | __GFP_WRITE);
>
> Btw and unrelated to this particular series, I think this should use
> grab_cache_page_write_begin() in the first place.
>
> Most grab_cache_page calls were replaced recently (a94733d "Btrfs: use
> find_or_create_page instead of grab_cache_page") to be able to pass
> GFP_NOFS, but the pages are now also no longer __GFP_HIGHMEM and
> __GFP_MOVABLE, which irks both x86_32 and memory hotplug.
>
> It might be better to change grab_cache_page instead to take a flags
> argument that allows passing AOP_FLAG_NOFS and revert the sites back
> to this helper?

So I can do

pages[i] = grab_cache_page_write_begin(inode->i_mapping, index + i,
                                       AOP_FLAG_NOFS);

right? All we need is nofs, so I can just go through and change
everybody to that. I'd rather not have to go through and change
grab_cache_page() to take a flags argument and change all the callers; I
have a bad habit of screwing stuff like that up :). Thanks,

Josef


2011-09-20 14:14:03

by Johannes Weiner

Subject: Re: [patch 4/4] Btrfs: pass __GFP_WRITE for buffered write page allocations

On Tue, Sep 20, 2011 at 10:09:38AM -0400, Josef Bacik wrote:
> On 09/20/2011 09:56 AM, Johannes Weiner wrote:
> > On Tue, Sep 20, 2011 at 03:45:15PM +0200, Johannes Weiner wrote:
> >> Tell the page allocator that pages allocated for a buffered write are
> >> expected to become dirty soon.
> >>
> >> Signed-off-by: Johannes Weiner <[email protected]>
> >> ---
> >> fs/btrfs/file.c | 2 +-
> >> 1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> >> index e7872e4..ea1b892 100644
> >> --- a/fs/btrfs/file.c
> >> +++ b/fs/btrfs/file.c
> >> @@ -1084,7 +1084,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
> >> again:
> >> for (i = 0; i < num_pages; i++) {
> >> pages[i] = find_or_create_page(inode->i_mapping, index + i,
> >> - GFP_NOFS);
> >> + GFP_NOFS | __GFP_WRITE);
> >
> > Btw and unrelated to this particular series, I think this should use
> > grab_cache_page_write_begin() in the first place.
> >
> > Most grab_cache_page calls were replaced recently (a94733d "Btrfs: use
> > find_or_create_page instead of grab_cache_page") to be able to pass
> > GFP_NOFS, but the pages are now also no longer __GFP_HIGHMEM and
> > __GFP_MOVABLE, which irks both x86_32 and memory hotplug.
> >
> > It might be better to change grab_cache_page instead to take a flags
> > argument that allows passing AOP_FLAG_NOFS and revert the sites back
> > to this helper?
>
> So I can do
>
> pages[i] = grab_cache_page_write_begin(inode->i_mapping, index + i,
> AOP_FLAG_NOFS);
>
> right? All we need is nofs, so I can just go through and change
> everybody to that.

It does wait_on_page_writeback() in addition, so it may not be
appropriate for every callsite; I haven't checked. But everything
that grabs a page for writing should be fine if you do it like this.

> I'd rather not have to go through and change grab_cache_page() to
> take a flags argument and change all the callers, I have a bad habit
> of screwing stuff like that up :).

Yeah, there are quite a few. If we can get around it, all the better.

Hannes


2011-09-20 14:25:53

by Christoph Hellwig

Subject: Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

In addition to regular writes, shouldn't __do_fault and do_wp_page also
do this if they are called on file-backed mappings?


2011-09-20 15:21:15

by Rik van Riel

Subject: Re: [patch 1/4] mm: exclude reserved pages from dirtyable memory

On 09/20/2011 09:45 AM, Johannes Weiner wrote:
> The amount of dirtyable pages should not include the total number of
> free pages: there is a number of reserved pages that the page
> allocator and kswapd always try to keep free.
>
> The closer (reclaimable pages - dirty pages) is to the number of
> reserved pages, the more likely it becomes for reclaim to run into
> dirty pages:

> Signed-off-by: Johannes Weiner<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>


2011-09-20 18:36:00

by Rik van Riel

Subject: Re: [patch 2/4] mm: writeback: distribute write pages across allowable zones

On 09/20/2011 09:45 AM, Johannes Weiner wrote:
> This patch allows allocators to pass __GFP_WRITE when they know in
> advance that the allocated page will be written to and become dirty
> soon. The page allocator will then attempt to distribute those
> allocations across zones, such that no single zone will end up full of
> dirty, and thus more or less, unreclaimable pages.
>
> The global dirty limits are put in proportion to the respective zone's
> amount of dirtyable memory and allocations diverted to other zones
> when the limit is reached.
>
> For now, the problem remains for NUMA configurations where the zones
> allowed for allocation are in sum not big enough to trigger the global
> dirty limits, but a future approach to solve this can reuse the
> per-zone dirty limit infrastructure laid out in this patch to have
> dirty throttling and the flusher threads consider individual zones.
>
> Signed-off-by: Johannes Weiner<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

The amount of work done in a __GFP_WRITE allocation looks
a little daunting, but doing that a million times is probably
still cheaper than waiting on the disk even once, so...


2011-09-20 18:38:03

by Rik van Riel

Subject: Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

On 09/20/2011 10:25 AM, Christoph Hellwig wrote:
> In addition to regular write shouldn't __do_fault and do_wp_page also
> calls this if they are called on file backed mappings?
>

Probably not do_wp_page, since it always creates an
anonymous page, which is not very relevant to the
dirty page cache accounting.


2011-09-20 18:40:39

by Rik van Riel

Subject: Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

On 09/20/2011 09:45 AM, Johannes Weiner wrote:
> Tell the page allocator that pages allocated through
> grab_cache_page_write_begin() are expected to become dirty soon.
>
> Signed-off-by: Johannes Weiner<[email protected]>

Reviewed-by: Rik van Riel<[email protected]>

The missing codepaths pointed out by Christoph either
create new anonymous pages, or mark ptes pointing at
existing page cache pages writeable.

Either way, those should not need __GFP_WRITE.


2011-09-20 18:40:34

by Christoph Hellwig

Subject: Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

On Tue, Sep 20, 2011 at 02:38:03PM -0400, Rik van Riel wrote:
> On 09/20/2011 10:25 AM, Christoph Hellwig wrote:
> >In addition to regular write shouldn't __do_fault and do_wp_page also
> >calls this if they are called on file backed mappings?
> >
>
> Probably not do_wp_page since it always creates an
> anonymous page, which are not very relevant to the
> dirty page cache accounting.

Well, it doesn't always - but for the case where it doesn't we
do not allocate a new page at all so you're right in the end :)


2011-09-20 18:41:16

by Rik van Riel

Subject: Re: [patch 4/4] Btrfs: pass __GFP_WRITE for buffered write page allocations

On 09/20/2011 09:45 AM, Johannes Weiner wrote:
> Tell the page allocator that pages allocated for a buffered write are
> expected to become dirty soon.
>
> Signed-off-by: Johannes Weiner<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>


2011-09-21 11:04:35

by Shaohua Li

Subject: Re: [patch 2/4] mm: writeback: distribute write pages across allowable zones

On Tue, 2011-09-20 at 21:45 +0800, Johannes Weiner wrote:
> This patch allows allocators to pass __GFP_WRITE when they know in
> advance that the allocated page will be written to and become dirty
> soon. The page allocator will then attempt to distribute those
> allocations across zones, such that no single zone will end up full of
> dirty, and thus more or less, unreclaimable pages.
>
> The global dirty limits are put in proportion to the respective zone's
> amount of dirtyable memory and allocations diverted to other zones
> when the limit is reached.
>
> For now, the problem remains for NUMA configurations where the zones
> allowed for allocation are in sum not big enough to trigger the global
> dirty limits, but a future approach to solve this can reuse the
> per-zone dirty limit infrastructure laid out in this patch to have
> dirty throttling and the flusher threads consider individual zones.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/gfp.h | 4 ++-
> include/linux/writeback.h | 1 +
> mm/page-writeback.c | 66 +++++++++++++++++++++++++++++++++++++-------
> mm/page_alloc.c | 22 ++++++++++++++-
> 4 files changed, 80 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 3a76faf..50efc7e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -36,6 +36,7 @@ struct vm_area_struct;
> #endif
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> +#define ___GFP_WRITE 0x1000000u
>
> /*
> * GFP bitmasks..
> @@ -85,6 +86,7 @@ struct vm_area_struct;
>
> #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
> #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> +#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
>
> /*
> * This may seem redundant, but it's a way of annotating false positives vs.
> @@ -92,7 +94,7 @@ struct vm_area_struct;
> */
> #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> /* This equals 0, but use constants in case they ever change */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index a5f495f..c96ee0c 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data);
> static inline void laptop_sync_completion(void) { }
> #endif
> void throttle_vm_writeout(gfp_t gfp_mask);
> +bool zone_dirty_ok(struct zone *zone);
>
> extern unsigned long global_dirty_limit;
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 9f896db..1fc714c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -142,6 +142,22 @@ unsigned long global_dirty_limit;
> static struct prop_descriptor vm_completions;
> static struct prop_descriptor vm_dirties;
>
> +static unsigned long zone_dirtyable_memory(struct zone *zone)
> +{
> + unsigned long x;
> + /*
> + * To keep a reasonable ratio between dirty memory and lowmem,
> + * highmem is not considered dirtyable on a global level.
> + *
> + * But we allow individual highmem zones to hold a potentially
> + * bigger share of that global amount of dirty pages as long
> + * as they have enough free or reclaimable pages around.
> + */
> + x = zone_page_state(zone, NR_FREE_PAGES) - zone->totalreserve_pages;
> + x += zone_reclaimable_pages(zone);
> + return x;
> +}
> +
> /*
> * Work out the current dirty-memory clamping and background writeout
> * thresholds.
> @@ -417,7 +433,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> }
>
> /*
> - * global_dirty_limits - background-writeback and dirty-throttling thresholds
> + * dirty_limits - background-writeback and dirty-throttling thresholds
> *
> * Calculate the dirty thresholds based on sysctl parameters
> * - vm.dirty_background_ratio or vm.dirty_background_bytes
> @@ -425,24 +441,35 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
> * real-time tasks.
> */
> -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> +static void dirty_limits(struct zone *zone,
> + unsigned long *pbackground,
> + unsigned long *pdirty)
> {
> + unsigned long uninitialized_var(zone_memory);
> + unsigned long available_memory;
> + unsigned long global_memory;
> unsigned long background;
> - unsigned long dirty;
> - unsigned long uninitialized_var(available_memory);
> struct task_struct *tsk;
> + unsigned long dirty;
>
> - if (!vm_dirty_bytes || !dirty_background_bytes)
> - available_memory = determine_dirtyable_memory();
> + global_memory = determine_dirtyable_memory();
> + if (zone)
> + available_memory = zone_memory = zone_dirtyable_memory(zone);
> + else
> + available_memory = global_memory;
>
> - if (vm_dirty_bytes)
> + if (vm_dirty_bytes) {
> dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> - else
> + if (zone)
> + dirty = dirty * zone_memory / global_memory;
> + } else
> dirty = (vm_dirty_ratio * available_memory) / 100;
>
> - if (dirty_background_bytes)
> + if (dirty_background_bytes) {
> background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> - else
> + if (zone)
> + background = background * zone_memory / global_memory;
> + } else
> background = (dirty_background_ratio * available_memory) / 100;
>
> if (background >= dirty)
> @@ -452,9 +479,15 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> background += background / 4;
> dirty += dirty / 4;
> }
> + if (!zone)
> + trace_global_dirty_state(background, dirty);
> *pbackground = background;
> *pdirty = dirty;
> - trace_global_dirty_state(background, dirty);
> +}
> +
> +void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> +{
> + dirty_limits(NULL, pbackground, pdirty);
> }
>
> /**
> @@ -875,6 +908,17 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> }
> }
>
> +bool zone_dirty_ok(struct zone *zone)
> +{
> + unsigned long background_thresh, dirty_thresh;
> +
> + dirty_limits(zone, &background_thresh, &dirty_thresh);
> +
> + return zone_page_state(zone, NR_FILE_DIRTY) +
> + zone_page_state(zone, NR_UNSTABLE_NFS) +
> + zone_page_state(zone, NR_WRITEBACK) <= dirty_thresh;
> +}
> +
> /*
> * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
> */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7e8e2ee..3cca043 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1368,6 +1368,7 @@ failed:
> #define ALLOC_HARDER 0x10 /* try to alloc harder */
> #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
> #define ALLOC_CPUSET 0x40 /* check for correct cpuset */
> +#define ALLOC_SLOWPATH 0x80 /* allocator retrying */
>
> #ifdef CONFIG_FAIL_PAGE_ALLOC
>
> @@ -1667,6 +1668,25 @@ zonelist_scan:
> if ((alloc_flags & ALLOC_CPUSET) &&
> !cpuset_zone_allowed_softwall(zone, gfp_mask))
> continue;
> + /*
> + * This may look like it would increase pressure on
> + * lower zones by failing allocations in higher zones
> + * before they are full. But once they are full, the
> + * allocations fall back to lower zones anyway, and
> + * then this check actually protects the lower zones
> + * from a flood of dirty page allocations.
If increasing pressure on the lower zones isn't a problem because the
higher zones will eventually be full, what about a workload without
many writes, where the higher zones never fill up? In such a case,
increasing pressure on the lower zones does not sound good.

Thanks,
Shaohua



2011-09-21 13:35:04

by Johannes Weiner

Subject: Re: [patch 2/4] mm: writeback: distribute write pages across allowable zones

On Wed, Sep 21, 2011 at 07:04:28PM +0800, Shaohua Li wrote:
> On Tue, 2011-09-20 at 21:45 +0800, Johannes Weiner wrote:
> > This patch allows allocators to pass __GFP_WRITE when they know in
> > advance that the allocated page will be written to and become dirty
> > soon. The page allocator will then attempt to distribute those
> > allocations across zones, such that no single zone will end up full of
> > dirty, and thus more or less, unreclaimable pages.
> >
> > The global dirty limits are put in proportion to the respective zone's
> > amount of dirtyable memory and allocations diverted to other zones
> > when the limit is reached.
> >
> > For now, the problem remains for NUMA configurations where the zones
> > allowed for allocation are in sum not big enough to trigger the global
> > dirty limits, but a future approach to solve this can reuse the
> > per-zone dirty limit infrastructure laid out in this patch to have
> > dirty throttling and the flusher threads consider individual zones.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > include/linux/gfp.h | 4 ++-
> > include/linux/writeback.h | 1 +
> > mm/page-writeback.c | 66 +++++++++++++++++++++++++++++++++++++-------
> > mm/page_alloc.c | 22 ++++++++++++++-
> > 4 files changed, 80 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 3a76faf..50efc7e 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -36,6 +36,7 @@ struct vm_area_struct;
> > #endif
> > #define ___GFP_NO_KSWAPD 0x400000u
> > #define ___GFP_OTHER_NODE 0x800000u
> > +#define ___GFP_WRITE 0x1000000u
> >
> > /*
> > * GFP bitmasks..
> > @@ -85,6 +86,7 @@ struct vm_area_struct;
> >
> > #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
> > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> > +#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
> >
> > /*
> > * This may seem redundant, but it's a way of annotating false positives vs.
> > @@ -92,7 +94,7 @@ struct vm_area_struct;
> > */
> > #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
> >
> > -#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */
> > +#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> >
> > /* This equals 0, but use constants in case they ever change */
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index a5f495f..c96ee0c 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data);
> > static inline void laptop_sync_completion(void) { }
> > #endif
> > void throttle_vm_writeout(gfp_t gfp_mask);
> > +bool zone_dirty_ok(struct zone *zone);
> >
> > extern unsigned long global_dirty_limit;
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 9f896db..1fc714c 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -142,6 +142,22 @@ unsigned long global_dirty_limit;
> > static struct prop_descriptor vm_completions;
> > static struct prop_descriptor vm_dirties;
> >
> > +static unsigned long zone_dirtyable_memory(struct zone *zone)
> > +{
> > + unsigned long x;
> > + /*
> > + * To keep a reasonable ratio between dirty memory and lowmem,
> > + * highmem is not considered dirtyable on a global level.
> > + *
> > + * But we allow individual highmem zones to hold a potentially
> > + * bigger share of that global amount of dirty pages as long
> > + * as they have enough free or reclaimable pages around.
> > + */
> > + x = zone_page_state(zone, NR_FREE_PAGES) - zone->totalreserve_pages;
> > + x += zone_reclaimable_pages(zone);
> > + return x;
> > +}
> > +
> > /*
> > * Work out the current dirty-memory clamping and background writeout
> > * thresholds.
> > @@ -417,7 +433,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> > }
> >
> > /*
> > - * global_dirty_limits - background-writeback and dirty-throttling thresholds
> > + * dirty_limits - background-writeback and dirty-throttling thresholds
> > *
> > * Calculate the dirty thresholds based on sysctl parameters
> > * - vm.dirty_background_ratio or vm.dirty_background_bytes
> > @@ -425,24 +441,35 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> > * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
> > * real-time tasks.
> > */
> > -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> > +static void dirty_limits(struct zone *zone,
> > + unsigned long *pbackground,
> > + unsigned long *pdirty)
> > {
> > + unsigned long uninitialized_var(zone_memory);
> > + unsigned long available_memory;
> > + unsigned long global_memory;
> > unsigned long background;
> > - unsigned long dirty;
> > - unsigned long uninitialized_var(available_memory);
> > struct task_struct *tsk;
> > + unsigned long dirty;
> >
> > - if (!vm_dirty_bytes || !dirty_background_bytes)
> > - available_memory = determine_dirtyable_memory();
> > + global_memory = determine_dirtyable_memory();
> > + if (zone)
> > + available_memory = zone_memory = zone_dirtyable_memory(zone);
> > + else
> > + available_memory = global_memory;
> >
> > - if (vm_dirty_bytes)
> > + if (vm_dirty_bytes) {
> > dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > - else
> > + if (zone)
> > + dirty = dirty * zone_memory / global_memory;
> > + } else
> > dirty = (vm_dirty_ratio * available_memory) / 100;
> >
> > - if (dirty_background_bytes)
> > + if (dirty_background_bytes) {
> > background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> > - else
> > + if (zone)
> > + background = background * zone_memory / global_memory;
> > + } else
> > background = (dirty_background_ratio * available_memory) / 100;
> >
> > if (background >= dirty)
> > @@ -452,9 +479,15 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> > background += background / 4;
> > dirty += dirty / 4;
> > }
> > + if (!zone)
> > + trace_global_dirty_state(background, dirty);
> > *pbackground = background;
> > *pdirty = dirty;
> > - trace_global_dirty_state(background, dirty);
> > +}
> > +
> > +void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> > +{
> > + dirty_limits(NULL, pbackground, pdirty);
> > }
> >
> > /**
> > @@ -875,6 +908,17 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > }
> > }
> >
> > +bool zone_dirty_ok(struct zone *zone)
> > +{
> > + unsigned long background_thresh, dirty_thresh;
> > +
> > + dirty_limits(zone, &background_thresh, &dirty_thresh);
> > +
> > + return zone_page_state(zone, NR_FILE_DIRTY) +
> > + zone_page_state(zone, NR_UNSTABLE_NFS) +
> > + zone_page_state(zone, NR_WRITEBACK) <= dirty_thresh;
> > +}
> > +
> > /*
> > * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
> > */
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7e8e2ee..3cca043 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1368,6 +1368,7 @@ failed:
> > #define ALLOC_HARDER 0x10 /* try to alloc harder */
> > #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
> > #define ALLOC_CPUSET 0x40 /* check for correct cpuset */
> > +#define ALLOC_SLOWPATH 0x80 /* allocator retrying */
> >
> > #ifdef CONFIG_FAIL_PAGE_ALLOC
> >
> > @@ -1667,6 +1668,25 @@ zonelist_scan:
> > if ((alloc_flags & ALLOC_CPUSET) &&
> > !cpuset_zone_allowed_softwall(zone, gfp_mask))
> > continue;
> > + /*
> > + * This may look like it would increase pressure on
> > + * lower zones by failing allocations in higher zones
> > + * before they are full. But once they are full, the
> > + * allocations fall back to lower zones anyway, and
> > + * then this check actually protects the lower zones
> > + * from a flood of dirty page allocations.
> if increasing pressure on lower zones isn't a problem since higher zones
> will eventually be full, how about a workload without too many writes,
> so higher zones will not be full. In such case, increasing low zone
> pressure sounds not good.

While there is a shift of dirty pages possible in workloads that were
able to completely fit those pages in the highest zone, the extent of
that shift is limited, which should prevent it from becoming a
practical burden for the lower zones.

Because the number of dirtyable pages includes neither a zone's
lowmem reserves, nor the watermarks, nor kernel allocations, a lower
zone does not receive a bigger share than it can afford when the
allocations are diverted from the higher zones.

To put it into perspective: with these patches there could be an
increased allocation latency for a workload with a writer of fixed
size fitting into the Normal zone and an allocator that suddenly
requires more than 3G of DMA32 memory (~ DMA32 size minus the 20%
allowable dirty pages). That sounds a bit artificial, to me at least.
And without these patches, we encounter exactly those allocation
latencies on a regular basis when writing files larger than memory.

Hannes


2011-09-21 14:04:30

by Mel Gorman

Subject: Re: [patch 1/4] mm: exclude reserved pages from dirtyable memory

On Tue, Sep 20, 2011 at 03:45:12PM +0200, Johannes Weiner wrote:
> The amount of dirtyable pages should not include the total number of
> free pages: there is a number of reserved pages that the page
> allocator and kswapd always try to keep free.
>
> The closer (reclaimable pages - dirty pages) is to the number of
> reserved pages, the more likely it becomes for reclaim to run into
> dirty pages:
>
> +----------+ ---
> |   anon   |  |
> +----------+  |
> |          |  |
> |          |  -- dirty limit new -- flusher new
> |   file   |  |                  |
> |          |  |                  |
> |          |  -- dirty limit old -- flusher old
> |          |  |
> +----------+ --- reclaim
> | reserved |
> +----------+
> |  kernel  |
> +----------+
>
> Not treating reserved pages as dirtyable on a global level is only a
> conceptual fix. In reality, dirty pages are not distributed equally
> across zones and reclaim runs into dirty pages on a regular basis.
>
> But it is important to get this right before tackling the problem on a
> per-zone level, where the distance between reclaim and the dirty pages
> is mostly much smaller in absolute numbers.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/page-writeback.c | 8 +++++---
> mm/page_alloc.c | 1 +
> 3 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1ed4116..e28f8e0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -316,6 +316,7 @@ struct zone {
> * sysctl_lowmem_reserve_ratio sysctl changes.
> */
> unsigned long lowmem_reserve[MAX_NR_ZONES];
> + unsigned long totalreserve_pages;
>

This is nit-picking, but totalreserve_pages is a poor name because it's
a per-zone value that is one of the lowmem_reserve[] fields instead
of a total. After this patch, we have zone->totalreserve_pages and
totalreserve_pages, but they are not related to the same thing.

It gets confusing once you consider what the values are
for. lowmem_reserve is part of a placement policy that limits the
number of pages placed in lower zones by allocations that could have
been satisfied from higher zones. totalreserve_pages is related to the
overcommit heuristic, where it assumes that the most interesting type
of allocation is GFP_HIGHUSER.

This begs the question - what is this new field, where does it come
from, what does it want from us? Should we take it to our Patch Leader?

This field ultimately affects what zone is used to allocate a new
page so it's related to placement policy. That implies the naming then
should indicate it is related to lowmem_reserve - largest_lowmem_reserve?

Alternatively, make it clear that it's one of the lowmem_reserve
values and store the index instead of the value - largest_reserve_idx?

> #ifdef CONFIG_NUMA
> int node;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index da6d263..9f896db 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -169,8 +169,9 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> struct zone *z =
> &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
>
> - x += zone_page_state(z, NR_FREE_PAGES) +
> - zone_reclaimable_pages(z);
> + x += zone_page_state(z, NR_FREE_PAGES) -
> + zone->totalreserve_pages;
> + x += zone_reclaimable_pages(z);
> }

This is highmem so zone->totalreserve_pages should always be 0.

Otherwise, the patch seems fine.

> /*
> * Make sure that the number of highmem pages is never larger
> @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
> {
> unsigned long x;
>
> - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> + x = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
> + x += global_reclaimable_pages();
>
> if (!vm_highmem_is_dirtyable)
> x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1dba05e..7e8e2ee 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5075,6 +5075,7 @@ static void calculate_totalreserve_pages(void)
>
> if (max > zone->present_pages)
> max = zone->present_pages;
> + zone->totalreserve_pages = max;
> reserve_pages += max;
> }
> }

--
Mel Gorman
SUSE Labs

2011-09-21 14:09:35

by Johannes Weiner

Subject: Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

On Tue, Sep 20, 2011 at 02:40:34PM -0400, Christoph Hellwig wrote:
> On Tue, Sep 20, 2011 at 02:38:03PM -0400, Rik van Riel wrote:
> > On 09/20/2011 10:25 AM, Christoph Hellwig wrote:
> > >In addition to regular write shouldn't __do_fault and do_wp_page also
> > >calls this if they are called on file backed mappings?
> >
> > Probably not do_wp_page since it always creates an
> > anonymous page, which are not very relevant to the
> > dirty page cache accounting.
>
> Well, it doesn't always - but for the case where it doesn't we
> do not allocate a new page at all so you're right in the end :)

I think it could be useful to annotate write-fault allocations in
filemap_fault(), but these pages are mostly allocated in the readahead
code, so this could turn into a more invasive project.

It can be done incrementally, however; the series as it stands does
not require it to be useful.


2011-09-21 14:30:26

by Mel Gorman

Subject: Re: [patch 2/4] mm: writeback: distribute write pages across allowable zones

On Tue, Sep 20, 2011 at 03:45:13PM +0200, Johannes Weiner wrote:
> This patch allows allocators to pass __GFP_WRITE when they know in
> advance that the allocated page will be written to and become dirty
> soon. The page allocator will then attempt to distribute those
> allocations across zones, such that no single zone will end up full of
> dirty, and thus more or less, unreclaimable pages.
>

I know this came up the last time but an explanation why lowmem
pressure is not expected to be a problem should be in the changelog.

"At first glance, it would appear that there is a lowmem pressure risk
but it is not the case. Highmem is not considered dirtyable memory.
Hence, if highmem is very large, the global amount of dirty memory
will fit in the highmem zone without falling back to the lower zones
and causing lowmem pressure. If highmem is small then the amount of
pages that 'spill over' to lower zones is limited and not likely to
significantly increase the risk of lowmem pressure due to things like
pagetable page allocations for example. In other words, the timing of
when lowmem pressure happens changes but overall the pressure is roughly
the same".

or something.

> The global dirty limits are put in proportion to the respective zone's
> amount of dirtyable memory and allocations diverted to other zones
> when the limit is reached.
>
> For now, the problem remains for NUMA configurations where the zones
> allowed for allocation are in sum not big enough to trigger the global
> dirty limits, but a future approach to solve this can reuse the
> per-zone dirty limit infrastructure laid out in this patch to have
> dirty throttling and the flusher threads consider individual zones.
>

While I think this particular point is important, I don't think it
should be a show stopper for the series.

I'm going to steal Andrew's line here as well - you explain what you are
doing in the patch leader but not why.

> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/gfp.h | 4 ++-
> include/linux/writeback.h | 1 +
> mm/page-writeback.c | 66 +++++++++++++++++++++++++++++++++++++-------
> mm/page_alloc.c | 22 ++++++++++++++-
> 4 files changed, 80 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 3a76faf..50efc7e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -36,6 +36,7 @@ struct vm_area_struct;
> #endif
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> +#define ___GFP_WRITE 0x1000000u
>
> /*
> * GFP bitmasks..
> @@ -85,6 +86,7 @@ struct vm_area_struct;
>
> #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
> #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> +#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
>
> /*
> * This may seem redundant, but it's a way of annotating false positives vs.
> @@ -92,7 +94,7 @@ struct vm_area_struct;
> */
> #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> /* This equals 0, but use constants in case they ever change */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index a5f495f..c96ee0c 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data);
> static inline void laptop_sync_completion(void) { }
> #endif
> void throttle_vm_writeout(gfp_t gfp_mask);
> +bool zone_dirty_ok(struct zone *zone);
>
> extern unsigned long global_dirty_limit;
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 9f896db..1fc714c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -142,6 +142,22 @@ unsigned long global_dirty_limit;
> static struct prop_descriptor vm_completions;
> static struct prop_descriptor vm_dirties;
>
> +static unsigned long zone_dirtyable_memory(struct zone *zone)
> +{
> + unsigned long x;
> + /*
> + * To keep a reasonable ratio between dirty memory and lowmem,
> + * highmem is not considered dirtyable on a global level.
> + *
> + * But we allow individual highmem zones to hold a potentially
> + * bigger share of that global amount of dirty pages as long
> + * as they have enough free or reclaimable pages around.
> + */
> + x = zone_page_state(zone, NR_FREE_PAGES) - zone->totalreserve_pages;
> + x += zone_reclaimable_pages(zone);
> + return x;
> +}
> +
> /*
> * Work out the current dirty-memory clamping and background writeout
> * thresholds.
> @@ -417,7 +433,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> }
>
> /*
> - * global_dirty_limits - background-writeback and dirty-throttling thresholds
> + * dirty_limits - background-writeback and dirty-throttling thresholds
> *
> * Calculate the dirty thresholds based on sysctl parameters
> * - vm.dirty_background_ratio or vm.dirty_background_bytes
> @@ -425,24 +441,35 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
> * real-time tasks.
> */
> -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> +static void dirty_limits(struct zone *zone,
> + unsigned long *pbackground,
> + unsigned long *pdirty)
> {
> + unsigned long uninitialized_var(zone_memory);
> + unsigned long available_memory;
> + unsigned long global_memory;
> unsigned long background;
> - unsigned long dirty;
> - unsigned long uninitialized_var(available_memory);
> struct task_struct *tsk;
> + unsigned long dirty;
>
> - if (!vm_dirty_bytes || !dirty_background_bytes)
> - available_memory = determine_dirtyable_memory();
> + global_memory = determine_dirtyable_memory();
> + if (zone)
> + available_memory = zone_memory = zone_dirtyable_memory(zone);
> + else
> + available_memory = global_memory;
>
> - if (vm_dirty_bytes)
> + if (vm_dirty_bytes) {
> dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> - else
> + if (zone)
> + dirty = dirty * zone_memory / global_memory;
> + } else
> dirty = (vm_dirty_ratio * available_memory) / 100;
>
> - if (dirty_background_bytes)
> + if (dirty_background_bytes) {
> background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> - else
> + if (zone)
> + background = background * zone_memory / global_memory;
> + } else
> background = (dirty_background_ratio * available_memory) / 100;
>
> if (background >= dirty)
> @@ -452,9 +479,15 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> background += background / 4;
> dirty += dirty / 4;
> }
> + if (!zone)
> + trace_global_dirty_state(background, dirty);
> *pbackground = background;
> *pdirty = dirty;
> - trace_global_dirty_state(background, dirty);
> +}
> +
> +void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> +{
> + dirty_limits(NULL, pbackground, pdirty);
> }
>
> /**
> @@ -875,6 +908,17 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> }
> }
>
> +bool zone_dirty_ok(struct zone *zone)
> +{
> + unsigned long background_thresh, dirty_thresh;
> +
> + dirty_limits(zone, &background_thresh, &dirty_thresh);
> +
> + return zone_page_state(zone, NR_FILE_DIRTY) +
> + zone_page_state(zone, NR_UNSTABLE_NFS) +
> + zone_page_state(zone, NR_WRITEBACK) <= dirty_thresh;
> +}
> +
> /*
> * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
> */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7e8e2ee..3cca043 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1368,6 +1368,7 @@ failed:
> #define ALLOC_HARDER 0x10 /* try to alloc harder */
> #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
> #define ALLOC_CPUSET 0x40 /* check for correct cpuset */
> +#define ALLOC_SLOWPATH 0x80 /* allocator retrying */
>
> #ifdef CONFIG_FAIL_PAGE_ALLOC
>
> @@ -1667,6 +1668,25 @@ zonelist_scan:
> if ((alloc_flags & ALLOC_CPUSET) &&
> !cpuset_zone_allowed_softwall(zone, gfp_mask))
> continue;
> + /*
> + * This may look like it would increase pressure on
> + * lower zones by failing allocations in higher zones
> + * before they are full. But once they are full, the
> + * allocations fall back to lower zones anyway, and
> + * then this check actually protects the lower zones
> + * from a flood of dirty page allocations.
> + *
> + * XXX: Allow allocations to potentially exceed the
> + * per-zone dirty limit in the slowpath before going
> + * into reclaim, which is important when NUMA nodes
> + * are not big enough to reach the global limit. The
> + * proper fix on these setups will require awareness
> + * of zones in the dirty-throttling and the flusher
> + * threads.
> + */

Here would be a good place to explain why we sometimes allow
__GFP_WRITE pages to fall back to lower zones. As it is, the reader
is required to remember that this affects LRU ordering and when/if
reclaim tries to write back the page.

> + if (!(alloc_flags & ALLOC_SLOWPATH) &&
> + (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
> + goto this_zone_full;
>
> BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
> if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> @@ -2111,7 +2131,7 @@ restart:
> * reclaim. Now things get more complex, so set up alloc_flags according
> * to how we want to proceed.
> */
> - alloc_flags = gfp_to_alloc_flags(gfp_mask);
> + alloc_flags = gfp_to_alloc_flags(gfp_mask) | ALLOC_SLOWPATH;
>

Instead of adding ALLOC_SLOWPATH, check for ALLOC_WMARK_LOW which is
only set in the fast path.
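
i.e. something like this (sketch only, reusing the check from the
hunk above):

	if ((alloc_flags & ALLOC_WMARK_LOW) &&
	    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
		goto this_zone_full;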

> /*
> * Find the true preferred zone if the allocation is unconstrained by

Functionally, I did not find a problem with the patch.

--
Mel Gorman
SUSE Labs

2011-09-21 14:34:00

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

On Tue, Sep 20, 2011 at 03:45:14PM +0200, Johannes Weiner wrote:
> Tell the page allocator that pages allocated through
> grab_cache_page_write_begin() are expected to become dirty soon.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs


2011-09-21 15:03:28

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 1/4] mm: exclude reserved pages from dirtyable memory

On Wed, Sep 21, 2011 at 03:04:23PM +0100, Mel Gorman wrote:
> On Tue, Sep 20, 2011 at 03:45:12PM +0200, Johannes Weiner wrote:
> > The amount of dirtyable pages should not include the total number of
> > free pages: there is a number of reserved pages that the page
> > allocator and kswapd always try to keep free.
> >
> > The closer (reclaimable pages - dirty pages) is to the number of
> > reserved pages, the more likely it becomes for reclaim to run into
> > dirty pages:
> >
> > +----------+ ---
> > | anon | |
> > +----------+ |
> > | | |
> > | | -- dirty limit new -- flusher new
> > | file | | |
> > | | | |
> > | | -- dirty limit old -- flusher old
> > | | |
> > +----------+ --- reclaim
> > | reserved |
> > +----------+
> > | kernel |
> > +----------+
> >
> > Not treating reserved pages as dirtyable on a global level is only a
> > conceptual fix. In reality, dirty pages are not distributed equally
> > across zones and reclaim runs into dirty pages on a regular basis.
> >
> > But it is important to get this right before tackling the problem on a
> > per-zone level, where the distance between reclaim and the dirty pages
> > is mostly much smaller in absolute numbers.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > include/linux/mmzone.h | 1 +
> > mm/page-writeback.c | 8 +++++---
> > mm/page_alloc.c | 1 +
> > 3 files changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 1ed4116..e28f8e0 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -316,6 +316,7 @@ struct zone {
> > * sysctl_lowmem_reserve_ratio sysctl changes.
> > */
> > unsigned long lowmem_reserve[MAX_NR_ZONES];
> > + unsigned long totalreserve_pages;
> >
>
> This is nit-picking but totalreserve_pages is a poor name because it's
> a per-zone value that is one of the lowmem_reserve[] fields instead
> of a total. After this patch, we have zone->totalreserve_pages and
> totalreserve_pages, but they are not related to the same thing.
>

As you correctly pointed out to me on IRC, zone->totalreserve_pages
is not the lowmem_reserve because it takes the high_wmark into
account. Sorry about that, I should have kept thinking. The name is
still poor though because it does not explain what the value is or
what it means.

zone->FOO value needs to be related to lowmem_reserve because this
is related to balancing zone usage.

zone->FOO value should also be related to the high_wmark because
this is avoiding writeback from page reclaim

err....... umm... this?

/*
* When allocating a new page that is expected to be
* dirtied soon, the number of free pages and the
* dirty_balance reserve are taken into account. The
* objective is that the globally allowed number of dirty
* pages should be distributed throughout the zones such
* that it is very unlikely that page reclaim will call
* ->writepage.
*
* dirty_balance_reserve takes both lowmem_reserve and
* the high watermark into account. The lowmem_reserve
* is taken into account because we don't want the
* distribution of dirty pages to unnecessarily increase
* lowmem pressure. The watermark is taken into account
* because it's correlated with when kswapd wakes up
* and how long it stays awake.
*/
unsigned long dirty_balance_reserve;

--
Mel Gorman
SUSE Labs


2011-09-21 23:02:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 2/4] mm: writeback: distribute write pages across allowable zones

On Tue, 20 Sep 2011 15:45:13 +0200
Johannes Weiner <[email protected]> wrote:

> This patch allows allocators to pass __GFP_WRITE when they know in
> advance that the allocated page will be written to and become dirty
> soon. The page allocator will then attempt to distribute those
> allocations across zones, such that no single zone will end up full of
> dirty, and thus more or less, unreclaimable pages.

Across all zones, or across the zones within the node or what? Some
more description of how all this plays with NUMA is needed, please.

> The global dirty limits are put in proportion to the respective zone's
> amount of dirtyable memory

I don't know what this means. How can a global limit be controlled by
what is happening within each single zone? Please describe this design
concept fully.

> and allocations diverted to other zones
> when the limit is reached.

hm.

> For now, the problem remains for NUMA configurations where the zones
> allowed for allocation are in sum not big enough to trigger the global
> dirty limits, but a future approach to solve this can reuse the
> per-zone dirty limit infrastructure laid out in this patch to have
> dirty throttling and the flusher threads consider individual zones.
>
> ...
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -36,6 +36,7 @@ struct vm_area_struct;
> #endif
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> +#define ___GFP_WRITE 0x1000000u
>
> /*
> * GFP bitmasks..
>
> ...
>
> +static unsigned long zone_dirtyable_memory(struct zone *zone)

Appears to return the number of pages in a particular zone which are
considered "dirtyable". Some discussion of how this decision is made
would be illuminating.

> +{
> + unsigned long x;
> + /*
> + * To keep a reasonable ratio between dirty memory and lowmem,
> + * highmem is not considered dirtyable on a global level.

Whereabouts in the kernel is this policy implemented?
determine_dirtyable_memory()? It does (or can) consider highmem
pages? Comment seems wrong?

Should we rename determine_dirtyable_memory() to
global_dirtyable_memory(), to get some sense of its relationship with
zone_dirtyable_memory()?

> + * But we allow individual highmem zones to hold a potentially
> + * bigger share of that global amount of dirty pages as long
> + * as they have enough free or reclaimable pages around.
> + */
> + x = zone_page_state(zone, NR_FREE_PAGES) - zone->totalreserve_pages;
> + x += zone_reclaimable_pages(zone);
> + return x;
> +}
> +
>
> ...
>
> -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> +static void dirty_limits(struct zone *zone,
> + unsigned long *pbackground,
> + unsigned long *pdirty)
> {
> + unsigned long uninitialized_var(zone_memory);
> + unsigned long available_memory;
> + unsigned long global_memory;
> unsigned long background;
> - unsigned long dirty;
> - unsigned long uninitialized_var(available_memory);
> struct task_struct *tsk;
> + unsigned long dirty;
>
> - if (!vm_dirty_bytes || !dirty_background_bytes)
> - available_memory = determine_dirtyable_memory();
> + global_memory = determine_dirtyable_memory();
> + if (zone)
> + available_memory = zone_memory = zone_dirtyable_memory(zone);
> + else
> + available_memory = global_memory;
>
> - if (vm_dirty_bytes)
> + if (vm_dirty_bytes) {
> dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> - else
> + if (zone)

So passing zone==NULL alters dirty_limits()'s behaviour. Seems that it
flips the function between global_dirty_limits and zone_dirty_limits?

Would it be better if we actually had separate global_dirty_limits()
and zone_dirty_limits() rather than a magical mode?

> + dirty = dirty * zone_memory / global_memory;
> + } else
> dirty = (vm_dirty_ratio * available_memory) / 100;
>
> - if (dirty_background_bytes)
> + if (dirty_background_bytes) {
> background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> - else
> + if (zone)
> + background = background * zone_memory / global_memory;
> + } else
> background = (dirty_background_ratio * available_memory) / 100;
>
> if (background >= dirty)
>
> ...
>
> +bool zone_dirty_ok(struct zone *zone)

Full description of the return value, please.

> +{
> + unsigned long background_thresh, dirty_thresh;
> +
> + dirty_limits(zone, &background_thresh, &dirty_thresh);
> +
> + return zone_page_state(zone, NR_FILE_DIRTY) +
> + zone_page_state(zone, NR_UNSTABLE_NFS) +
> + zone_page_state(zone, NR_WRITEBACK) <= dirty_thresh;
> +}

We never needed to calculate &background_thresh. I wonder if that
matters.

>
> ...
>


2011-09-22 08:52:42

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/4] mm: writeback: distribute write pages across allowable zones

On Wed, Sep 21, 2011 at 04:02:26PM -0700, Andrew Morton wrote:
> On Tue, 20 Sep 2011 15:45:13 +0200
> Johannes Weiner <[email protected]> wrote:
>
> > This patch allows allocators to pass __GFP_WRITE when they know in
> > advance that the allocated page will be written to and become dirty
> > soon. The page allocator will then attempt to distribute those
> > allocations across zones, such that no single zone will end up full of
> > dirty, and thus more or less, unreclaimable pages.
>
> Across all zones, or across the zones within the node or what? Some
> more description of how all this plays with NUMA is needed, please.

Across the zones the allocator considers for allocation, which on NUMA
is determined by the allocating task's NUMA memory policy.

> > The global dirty limits are put in proportion to the respective zone's
> > amount of dirtyable memory
>
> I don't know what this means. How can a global limit be controlled by
> what is happening within each single zone? Please describe this design
> concept fully.

Yikes, it's mein English.

A zone's dirty limit is to the zone's contribution of dirtyable memory
what the global dirty limit is to the global amount of dirtyable
memory.

As a result, the sum of the dirty limits of all existing zones equals
the global dirty limit, such that no single zone receives more than
its fair share of the globally allowable dirty pages.

When the allocator tries to allocate from the list of allowable zones,
it skips those that have reached their maximum share of dirty pages.
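
As a sketch, with zone_dirty_limit and global_dirty_thresh being
illustrative names only (the real calculation is the dirty_limits()
change in the patch):

	/* a zone's limit scales with its share of dirtyable memory */
	zone_dirty_limit = global_dirty_thresh *
			   zone_dirtyable_memory(zone) /
			   determine_dirtyable_memory();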

> > For now, the problem remains for NUMA configurations where the zones
> > allowed for allocation are in sum not big enough to trigger the global
> > dirty limits, but a future approach to solve this can reuse the
> > per-zone dirty limit infrastructure laid out in this patch to have
> > dirty throttling and the flusher threads consider individual zones.

> > +static unsigned long zone_dirtyable_memory(struct zone *zone)
>
> Appears to return the number of pages in a particular zone which are
> considered "dirtyable". Some discussion of how this decision is made
> would be illuminating.

Is the proportional relationship between zones and the global level a
satisfactory explanation?

Because I am looking for a central place to explain all this.

> > +{
> > + unsigned long x;
> > + /*
> > + * To keep a reasonable ratio between dirty memory and lowmem,
> > + * highmem is not considered dirtyable on a global level.
>
> Whereabouts in the kernel is this policy implemented?
> determine_dirtyable_memory()? It does (or can) consider highmem
> pages? Comment seems wrong?

Yes, in determine_dirtyable_memory().

It is possible to configure an unreasonable ratio between dirty memory
and lowmem with the vm_highmem_is_dirtyable sysctl. The point is that
even though highmem is subtracted from the effective amount of global
dirtyable memory again (which is strictly a big-picture measure), we
only care about the individual zone here and so highmem can very much
always hold dirty pages up to its dirty limit.
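
To put the two calculations side by side, condensed from the patches
in this series:

	/* global level: highmem may be excluded from the big picture */
	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
	if (!vm_highmem_is_dirtyable)
		x -= highmem_dirtyable_memory(x);

	/* zone level: a highmem zone counts its own pages in full */
	x = zone_page_state(zone, NR_FREE_PAGES) - zone->totalreserve_pages;
	x += zone_reclaimable_pages(zone);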

> Should we rename determine_dirtyable_memory() to
> global_dirtyable_memory(), to get some sense of its relationship with
> zone_dirtyable_memory()?

Sounds good.

> > + * But we allow individual highmem zones to hold a potentially
> > + * bigger share of that global amount of dirty pages as long
> > + * as they have enough free or reclaimable pages around.
> > + */
> > + x = zone_page_state(zone, NR_FREE_PAGES) - zone->totalreserve_pages;
> > + x += zone_reclaimable_pages(zone);
> > + return x;
> > +}
> > +
> >
> > ...
> >
> > -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> > +static void dirty_limits(struct zone *zone,
> > + unsigned long *pbackground,
> > + unsigned long *pdirty)
> > {
> > + unsigned long uninitialized_var(zone_memory);
> > + unsigned long available_memory;
> > + unsigned long global_memory;
> > unsigned long background;
> > - unsigned long dirty;
> > - unsigned long uninitialized_var(available_memory);
> > struct task_struct *tsk;
> > + unsigned long dirty;
> >
> > - if (!vm_dirty_bytes || !dirty_background_bytes)
> > - available_memory = determine_dirtyable_memory();
> > + global_memory = determine_dirtyable_memory();
> > + if (zone)
> > + available_memory = zone_memory = zone_dirtyable_memory(zone);
> > + else
> > + available_memory = global_memory;
> >
> > - if (vm_dirty_bytes)
> > + if (vm_dirty_bytes) {
> > dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > - else
> > + if (zone)
>
> So passing zone==NULL alters dirty_limits()'s behaviour. Seems that it
> flips the function between global_dirty_limits and zone_dirty_limits?

Yes.

> Would it be better if we actually had separate global_dirty_limits()
> and zone_dirty_limits() rather than a magical mode?

I did that the first time around, but Mel raised the valid point that
this will be bad for maintainability.

The global dirty limit and the per-zone dirty limit are not only
incidentally calculated the same way, they are intentionally similar
in the geometrical sense (modulo workarounds for not having fp
arithmetic), so it would be good to keep this stuff together.

But the same applies to determine_dirtyable_memory() and
zone_dirtyable_memory(), so they should be done the same way and I
don't care too much which that would be.

If no one complains, I would structure the code such that
global_dirtyable_memory() and zone_dirtyable_memory(), as well as
global_dirty_limits() and zone_dirty_limits() are separate functions
next to each other with a big fat comment above that block explaining
the per-zone dirty limits and the proportional relationship to the
global parameters.
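
Roughly this shape, as a structural sketch (the exact signatures may
still change):

	/* big fat comment explaining per-zone dirty limits goes here */

	static unsigned long global_dirtyable_memory(void);
	static unsigned long zone_dirtyable_memory(struct zone *zone);

	void global_dirty_limits(unsigned long *pbackground,
				 unsigned long *pdirty);
	static void zone_dirty_limits(struct zone *zone,
				      unsigned long *pbackground,
				      unsigned long *pdirty);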

> > + dirty = dirty * zone_memory / global_memory;
> > + } else
> > dirty = (vm_dirty_ratio * available_memory) / 100;
> >
> > - if (dirty_background_bytes)
> > + if (dirty_background_bytes) {
> > background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> > - else
> > + if (zone)
> > + background = background * zone_memory / global_memory;
> > + } else
> > background = (dirty_background_ratio * available_memory) / 100;
> >
> > if (background >= dirty)
> >
> > ...
> >
> > +bool zone_dirty_ok(struct zone *zone)
>
> Full description of the return value, please.

Returns false when the zone has reached its maximum share of the
global allowed dirty pages, true otherwise.

>
> > +{
> > + unsigned long background_thresh, dirty_thresh;
> > +
> > + dirty_limits(zone, &background_thresh, &dirty_thresh);
> > +
> > + return zone_page_state(zone, NR_FILE_DIRTY) +
> > + zone_page_state(zone, NR_UNSTABLE_NFS) +
> > + zone_page_state(zone, NR_WRITEBACK) <= dirty_thresh;
> > +}
>
> We never needed to calculate &background_thresh. I wonder if that
> matters.

I didn't think dirty_limits() could take another branch, but if I
split up the function I will drop it. It's not rocket science and can
be easily added on demand.


2011-09-22 09:03:26

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 1/4] mm: exclude reserved pages from dirtyable memory

On Wed, Sep 21, 2011 at 04:03:28PM +0100, Mel Gorman wrote:
> On Wed, Sep 21, 2011 at 03:04:23PM +0100, Mel Gorman wrote:
> > On Tue, Sep 20, 2011 at 03:45:12PM +0200, Johannes Weiner wrote:
> > > The amount of dirtyable pages should not include the total number of
> > > free pages: there is a number of reserved pages that the page
> > > allocator and kswapd always try to keep free.
> > >
> > > The closer (reclaimable pages - dirty pages) is to the number of
> > > reserved pages, the more likely it becomes for reclaim to run into
> > > dirty pages:
> > >
> > > +----------+ ---
> > > | anon | |
> > > +----------+ |
> > > | | |
> > > | | -- dirty limit new -- flusher new
> > > | file | | |
> > > | | | |
> > > | | -- dirty limit old -- flusher old
> > > | | |
> > > +----------+ --- reclaim
> > > | reserved |
> > > +----------+
> > > | kernel |
> > > +----------+
> > >
> > > Not treating reserved pages as dirtyable on a global level is only a
> > > conceptual fix. In reality, dirty pages are not distributed equally
> > > across zones and reclaim runs into dirty pages on a regular basis.
> > >
> > > But it is important to get this right before tackling the problem on a
> > > per-zone level, where the distance between reclaim and the dirty pages
> > > is mostly much smaller in absolute numbers.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> > > ---
> > > include/linux/mmzone.h | 1 +
> > > mm/page-writeback.c | 8 +++++---
> > > mm/page_alloc.c | 1 +
> > > 3 files changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 1ed4116..e28f8e0 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -316,6 +316,7 @@ struct zone {
> > > * sysctl_lowmem_reserve_ratio sysctl changes.
> > > */
> > > unsigned long lowmem_reserve[MAX_NR_ZONES];
> > > + unsigned long totalreserve_pages;
> > >
> >
> > This is nit-picking but totalreserve_pages is a poor name because it's
> > a per-zone value that is one of the lowmem_reserve[] fields instead
> > of a total. After this patch, we have zone->totalreserve_pages and
> > totalreserve_pages, but they are not related to the same thing.
>
> As you correctly pointed out to me on IRC, zone->totalreserve_pages
> is not the lowmem_reserve because it takes the high_wmark into
> account. Sorry about that, I should have kept thinking. The name is
> still poor though because it does not explain what the value is or
> what it means.
>
> zone->FOO value needs to be related to lowmem_reserve because this
> is related to balancing zone usage.
>
> zone->FOO value should also be related to the high_wmark because
> this is avoiding writeback from page reclaim
>
> err....... umm... this?
>
> /*
> * When allocating a new page that is expected to be
> * dirtied soon, the number of free pages and the
> * dirty_balance reserve are taken into account. The
> * objective is that the globally allowed number of dirty
> * pages should be distributed throughout the zones such
> * that it is very unlikely that page reclaim will call
> * ->writepage.
> *
> * dirty_balance_reserve takes both lowmem_reserve and
> * the high watermark into account. The lowmem_reserve
> * is taken into account because we don't want the
> * distribution of dirty pages to unnecessarily increase
> * lowmem pressure. The watermark is taken into account
> * because it's correlated with when kswapd wakes up
> * and how long it stays awake.
> */
> unsigned long dirty_balance_reserve.

Yes, that's much better, thanks.

I assume this is meant the same for both the zone and the global level
and we should not mess with totalreserve_pages in either case?


2011-09-22 10:54:00

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 1/4] mm: exclude reserved pages from dirtyable memory

On Thu, Sep 22, 2011 at 11:03:26AM +0200, Johannes Weiner wrote:
> On Wed, Sep 21, 2011 at 04:03:28PM +0100, Mel Gorman wrote:
> > On Wed, Sep 21, 2011 at 03:04:23PM +0100, Mel Gorman wrote:
> > > On Tue, Sep 20, 2011 at 03:45:12PM +0200, Johannes Weiner wrote:
> > > > The amount of dirtyable pages should not include the total number of
> > > > free pages: there is a number of reserved pages that the page
> > > > allocator and kswapd always try to keep free.
> > > >
> > > > The closer (reclaimable pages - dirty pages) is to the number of
> > > > reserved pages, the more likely it becomes for reclaim to run into
> > > > dirty pages:
> > > >
> > > > +----------+ ---
> > > > | anon | |
> > > > +----------+ |
> > > > | | |
> > > > | | -- dirty limit new -- flusher new
> > > > | file | | |
> > > > | | | |
> > > > | | -- dirty limit old -- flusher old
> > > > | | |
> > > > +----------+ --- reclaim
> > > > | reserved |
> > > > +----------+
> > > > | kernel |
> > > > +----------+
> > > >
> > > > Not treating reserved pages as dirtyable on a global level is only a
> > > > conceptual fix. In reality, dirty pages are not distributed equally
> > > > across zones and reclaim runs into dirty pages on a regular basis.
> > > >
> > > > But it is important to get this right before tackling the problem on a
> > > > per-zone level, where the distance between reclaim and the dirty pages
> > > > is mostly much smaller in absolute numbers.
> > > >
> > > > Signed-off-by: Johannes Weiner <[email protected]>
> > > > ---
> > > > include/linux/mmzone.h | 1 +
> > > > mm/page-writeback.c | 8 +++++---
> > > > mm/page_alloc.c | 1 +
> > > > 3 files changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 1ed4116..e28f8e0 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -316,6 +316,7 @@ struct zone {
> > > > * sysctl_lowmem_reserve_ratio sysctl changes.
> > > > */
> > > > unsigned long lowmem_reserve[MAX_NR_ZONES];
> > > > + unsigned long totalreserve_pages;
> > > >
> > >
> > > This is nit-picking but totalreserve_pages is a poor name because it's
> > > a per-zone value that is one of the lowmem_reserve[] fields instead
> > > of a total. After this patch, we have zone->totalreserve_pages and
> > > totalreserve_pages, but they are not related to the same thing.
> >
> > As you correctly pointed out to me on IRC, zone->totalreserve_pages
> > is not the lowmem_reserve because it takes the high_wmark into
> > account. Sorry about that, I should have kept thinking. The name is
> > still poor though because it does not explain what the value is or
> > what it means.
> >
> > zone->FOO value needs to be related to lowmem_reserve because this
> > is related to balancing zone usage.
> >
> > zone->FOO value should also be related to the high_wmark because
> > this is avoiding writeback from page reclaim
> >
> > err....... umm... this?
> >
> > /*
> > * When allocating a new page that is expected to be
> > * dirtied soon, the number of free pages and the
> > * dirty_balance reserve are taken into account. The
> > * objective is that the globally allowed number of dirty
> > * pages should be distributed throughout the zones such
> > * that it is very unlikely that page reclaim will call
> > * ->writepage.
> > *
> > * dirty_balance_reserve takes both lowmem_reserve and
> > * the high watermark into account. The lowmem_reserve
> > * is taken into account because we don't want the
> > * distribution of dirty pages to unnecessarily increase
> > * lowmem pressure. The watermark is taken into account
> > * because it's correlated with when kswapd wakes up
> > * and how long it stays awake.
> > */
> > unsigned long dirty_balance_reserve.
>
> Yes, that's much better, thanks.
>
> I assume this is meant the same for both the zone and the global level
> and we should not mess with totalreserve_pages in either case?

Yes. I'd even suggest changing the name of totalreserve_pages to make
it clear it is related to overcommit rather than pfmemalloc, dirty
or any other reserve. i.e. s/totalreserve_pages/overcommit_reserve/

--
Mel Gorman
SUSE Labs


2011-09-23 14:38:17

by Johannes Weiner

[permalink] [raw]
Subject: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

The amount of dirtyable pages should not include the full number of
free pages: there is a number of reserved pages that the page
allocator and kswapd always try to keep free.

The closer (reclaimable pages - dirty pages) is to the number of
reserved pages, the more likely it becomes for reclaim to run into
dirty pages:

+----------+ ---
| anon | |
+----------+ |
| | |
| | -- dirty limit new -- flusher new
| file | | |
| | | |
| | -- dirty limit old -- flusher old
| | |
+----------+ --- reclaim
| reserved |
+----------+
| kernel |
+----------+

This patch introduces a per-zone dirty reserve that takes both the
lowmem reserve as well as the high watermark of the zone into account,
and a global sum of those per-zone values that is subtracted from the
global amount of dirtyable pages. The lowmem reserve is unavailable
to page cache allocations and kswapd tries to keep the high watermark
free. We don't want to end up in a situation where reclaim has to
clean pages in order to balance zones.

Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.

But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages
is mostly much smaller in absolute numbers.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/mmzone.h | 6 ++++++
include/linux/swap.h | 1 +
mm/page-writeback.c | 6 ++++--
mm/page_alloc.c | 19 +++++++++++++++++++
4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1ed4116..37a61e7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -317,6 +317,12 @@ struct zone {
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];

+ /*
+ * This is a per-zone reserve of pages that should not be
+ * considered dirtyable memory.
+ */
+ unsigned long dirty_balance_reserve;
+
#ifdef CONFIG_NUMA
int node;
/*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b156e80..9021453 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -209,6 +209,7 @@ struct swap_list_t {
/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
extern unsigned long totalreserve_pages;
+extern unsigned long dirty_balance_reserve;
extern unsigned int nr_free_buffer_pages(void);
extern unsigned int nr_free_pagecache_pages(void);

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index da6d263..c8acf8a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];

x += zone_page_state(z, NR_FREE_PAGES) +
- zone_reclaimable_pages(z);
+ zone_reclaimable_pages(z) -
+ z->dirty_balance_reserve;
}
/*
* Make sure that the number of highmem pages is never larger
@@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
{
unsigned long x;

- x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+ x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
+ dirty_balance_reserve;

if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1dba05e..f8cba89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);

unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
+/*
+ * When calculating the number of globally allowed dirty pages, there
+ * is a certain number of per-zone reserves that should not be
+ * considered dirtyable memory. This is the sum of those reserves
+ * over all existing zones that contribute dirtyable memory.
+ */
+unsigned long dirty_balance_reserve __read_mostly;
+
int percpu_pagelist_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

@@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages(void)
if (max > zone->present_pages)
max = zone->present_pages;
reserve_pages += max;
+ /*
+ * Lowmem reserves are not available to
+ * GFP_HIGHUSER page cache allocations and
+ * kswapd tries to balance zones to their high
+ * watermark. As a result, neither should be
+ * regarded as dirtyable memory, to prevent a
+ * situation where reclaim has to clean pages
+ * in order to balance the zones.
+ */
+ zone->dirty_balance_reserve = max;
}
}
+ dirty_balance_reserve = reserve_pages;
totalreserve_pages = reserve_pages;
}

--
1.7.6.2


2011-09-23 14:41:07

by Johannes Weiner

[permalink] [raw]
Subject: [patch 1/2/4] mm: writeback: cleanups in preparation for per-zone dirty limits

On Thu, Sep 22, 2011 at 10:52:42AM +0200, Johannes Weiner wrote:
> On Wed, Sep 21, 2011 at 04:02:26PM -0700, Andrew Morton wrote:
> > Should we rename determine_dirtyable_memory() to
> > global_dirtyable_memory(), to get some sense of its relationship with
> > zone_dirtyable_memory()?
>
> Sounds good.

---

The next patch will introduce per-zone dirty limiting functions in
addition to the traditional global dirty limiting.

Rename determine_dirtyable_memory() to global_dirtyable_memory()
before adding the zone-specific version, and fix up its documentation.

Also, move the functions to determine the dirtyable memory and the
function to calculate the dirty limit based on that together so that
their relationship is more apparent and that they can be commented on
as a group.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page-writeback.c | 92 +++++++++++++++++++++++++-------------------------
1 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c8acf8a..78604a6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -186,12 +186,12 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
}

/**
- * determine_dirtyable_memory - amount of memory that may be used
+ * global_dirtyable_memory - number of globally dirtyable pages
*
- * Returns the numebr of pages that can currently be freed and used
- * by the kernel for direct mappings.
+ * Returns the global number of pages potentially available for dirty
+ * page cache. This is the base value for the global dirty limits.
*/
-static unsigned long determine_dirtyable_memory(void)
+static unsigned long global_dirtyable_memory(void)
{
unsigned long x;

@@ -205,6 +205,47 @@ static unsigned long determine_dirtyable_memory(void)
}

/*
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ *
+ * Calculate the dirty thresholds based on sysctl parameters
+ * - vm.dirty_background_ratio or vm.dirty_background_bytes
+ * - vm.dirty_ratio or vm.dirty_bytes
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
+{
+ unsigned long background;
+ unsigned long dirty;
+ unsigned long uninitialized_var(available_memory);
+ struct task_struct *tsk;
+
+ if (!vm_dirty_bytes || !dirty_background_bytes)
+ available_memory = global_dirtyable_memory();
+
+ if (vm_dirty_bytes)
+ dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+ else
+ dirty = (vm_dirty_ratio * available_memory) / 100;
+
+ if (dirty_background_bytes)
+ background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+ else
+ background = (dirty_background_ratio * available_memory) / 100;
+
+ if (background >= dirty)
+ background = dirty / 2;
+ tsk = current;
+ if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+ background += background / 4;
+ dirty += dirty / 4;
+ }
+ *pbackground = background;
+ *pdirty = dirty;
+ trace_global_dirty_state(background, dirty);
+}
+
+/*
* couple the period to the dirty_ratio:
*
* period/2 ~ roundup_pow_of_two(dirty limit)
@@ -216,7 +257,7 @@ static int calc_period_shift(void)
if (vm_dirty_bytes)
dirty_total = vm_dirty_bytes / PAGE_SIZE;
else
- dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
+ dirty_total = (vm_dirty_ratio * global_dirtyable_memory()) /
100;
return 2 + ilog2(dirty_total - 1);
}
@@ -416,47 +457,6 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
return max(thresh, global_dirty_limit);
}

-/*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
- *
- * Calculate the dirty thresholds based on sysctl parameters
- * - vm.dirty_background_ratio or vm.dirty_background_bytes
- * - vm.dirty_ratio or vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
- */
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
-{
- unsigned long background;
- unsigned long dirty;
- unsigned long uninitialized_var(available_memory);
- struct task_struct *tsk;
-
- if (!vm_dirty_bytes || !dirty_background_bytes)
- available_memory = determine_dirtyable_memory();
-
- if (vm_dirty_bytes)
- dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
- else
- dirty = (vm_dirty_ratio * available_memory) / 100;
-
- if (dirty_background_bytes)
- background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
- else
- background = (dirty_background_ratio * available_memory) / 100;
-
- if (background >= dirty)
- background = dirty / 2;
- tsk = current;
- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
- background += background / 4;
- dirty += dirty / 4;
- }
- *pbackground = background;
- *pdirty = dirty;
- trace_global_dirty_state(background, dirty);
-}
-
/**
* bdi_dirty_limit - @bdi's share of dirty throttling threshold
* @bdi: the backing_dev_info to query
--
1.7.6.2


2011-09-23 14:42:48

by Johannes Weiner

[permalink] [raw]
Subject: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

The maximum number of dirty pages that exist in the system at any time
is determined by a number of pages considered dirtyable and a
user-configured percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all
the zones in the system minus their lowmem reserves and high
watermarks, so that the system can retain a healthy number of free
pages without having to reclaim dirty pages.

But there is a flaw in that we have a zoned page allocator which does
not care about the global state but rather the state of individual
memory zones. And right now there is nothing that prevents one zone
from filling up with dirty pages while other zones are spared, which
frequently leads to situations where kswapd, in order to restore the
watermark of free pages, does indeed have to write pages from that
zone's LRU list. This can interfere so badly with IO from the flusher
threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
requests from reclaim already, taking away the VM's only possibility
to keep such a zone balanced, aside from hoping the flushers will soon
clean pages from that zone.

Enter per-zone dirty limits. They are to a zone's dirtyable memory
what the global limit is to the global amount of dirtyable memory, and
try to make sure that no single zone receives more than its fair share
of the globally allowed dirty pages in the first place. As the number
of pages considered dirtyable exclude the zones' lowmem reserves and
high watermarks, the maximum number of dirty pages in a zone is such
that the zone can always be balanced without requiring page cleaning.

As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to
pass __GFP_WRITE when they know in advance that the page will be
written to and become dirty soon. The page allocator will then
attempt to allocate from the first zone of the zonelist - which on
NUMA is determined by the task's NUMA memory policy - that has not
exceeded its dirty limit.

At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case. With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations. Workloads
that previously could fit their dirty pages completely in the higher
zone may be forced to allocate from lower zones, but the amount of
pages that 'spill over' are limited themselves by the lower zones'
dirty constraints, and thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for
NUMA configurations where the zones allowed for allocation are in sum
not big enough to trigger the global dirty limits, wake up the flusher
threads and remedy the situation. Because of this, an allocation that
could not succeed on any of the considered zones is allowed to ignore
the dirty limits before going into direct reclaim or even failing the
allocation, until a future patch changes the global dirty throttling
and flusher thread activation so that they take individual zone states
into account.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/gfp.h | 4 ++-
include/linux/writeback.h | 1 +
mm/page-writeback.c | 83 +++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 29 ++++++++++++++++
4 files changed, 116 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..50efc7e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -36,6 +36,7 @@ struct vm_area_struct;
#endif
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
+#define ___GFP_WRITE 0x1000000u

/*
* GFP bitmasks..
@@ -85,6 +86,7 @@ struct vm_area_struct;

#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
+#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */

/*
* This may seem redundant, but it's a way of annotating false positives vs.
@@ -92,7 +94,7 @@ struct vm_area_struct;
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a5f495f..c96ee0c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data);
static inline void laptop_sync_completion(void) { }
#endif
void throttle_vm_writeout(gfp_t gfp_mask);
+bool zone_dirty_ok(struct zone *zone);

extern unsigned long global_dirty_limit;

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 78604a6..f60fd57 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -159,6 +159,25 @@ static struct prop_descriptor vm_dirties;
* We make sure that the background writeout level is below the adjusted
* clamping level.
*/
+
+/*
+ * In a memory zone, there is a certain amount of pages we consider
+ * available for the page cache, which is essentially the number of
+ * free and reclaimable pages, minus some zone reserves to protect
+ * lowmem and the ability to uphold the zone's watermarks without
+ * requiring writeback.
+ *
+ * This number of dirtyable pages is the base value of which the
+ * user-configurable dirty ratio is the effective number of pages that
+ * are allowed to be actually dirtied. Per individual zone, or
+ * globally by using the sum of dirtyable pages over all zones.
+ *
+ * Because the user is allowed to specify the dirty limit globally as
+ * absolute number of bytes, calculating the per-zone dirty limit can
+ * require translating the configured limit into a percentage of
+ * global dirtyable memory first.
+ */
+
static unsigned long highmem_dirtyable_memory(unsigned long total)
{
#ifdef CONFIG_HIGHMEM
@@ -245,6 +264,70 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
trace_global_dirty_state(background, dirty);
}

+/**
+ * zone_dirtyable_memory - number of dirtyable pages in a zone
+ * @zone: the zone
+ *
+ * Returns the zone's number of pages potentially available for dirty
+ * page cache. This is the base value for the per-zone dirty limits.
+ */
+static unsigned long zone_dirtyable_memory(struct zone *zone)
+{
+ /*
+ * The effective global number of dirtyable pages may exclude
+ * highmem as a big-picture measure to keep the ratio between
+ * dirty memory and lowmem reasonable.
+ *
+ * But this function is purely about the individual zone and a
+ * highmem zone can hold its share of dirty pages, so we don't
+ * care about vm_highmem_is_dirtyable here.
+ */
+ return zone_page_state(zone, NR_FREE_PAGES) +
+ zone_reclaimable_pages(zone) -
+ zone->dirty_balance_reserve;
+}
+
+/**
+ * zone_dirty_limit - maximum number of dirty pages allowed in a zone
+ * @zone: the zone
+ *
+ * Returns the maximum number of dirty pages allowed in a zone, based
+ * on the zone's dirtyable memory.
+ */
+static unsigned long zone_dirty_limit(struct zone *zone)
+{
+ unsigned long zone_memory = zone_dirtyable_memory(zone);
+ struct task_struct *tsk = current;
+ unsigned long dirty;
+
+ if (vm_dirty_bytes)
+ dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
+ zone_memory / global_dirtyable_memory();
+ else
+ dirty = vm_dirty_ratio * zone_memory / 100;
+
+ if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+ dirty += dirty / 4;
+
+ return dirty;
+}
+
+/**
+ * zone_dirty_ok - tells whether a zone is within its dirty limits
+ * @zone: the zone to check
+ *
+ * Returns %true when the dirty pages in @zone are within the zone's
+ * dirty limit, %false if the limit is exceeded.
+ */
+bool zone_dirty_ok(struct zone *zone)
+{
+ unsigned long limit = zone_dirty_limit(zone);
+
+ return zone_page_state(zone, NR_FILE_DIRTY) +
+ zone_page_state(zone, NR_UNSTABLE_NFS) +
+ zone_page_state(zone, NR_WRITEBACK) <= limit;
+}
+
/*
* couple the period to the dirty_ratio:
*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8cba89..afaf59e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1675,6 +1675,35 @@ zonelist_scan:
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
continue;
+ /*
+ * When allocating a page cache page for writing, we
+ * want to get it from a zone that is within its dirty
+ * limit, such that no single zone holds more than its
+ * proportional share of globally allowed dirty pages.
+ * The dirty limits take into account the zone's
+ * lowmem reserves and high watermark so that kswapd
+ * should be able to balance it without having to
+ * write pages from its LRU list.
+ *
+ * This may look like it could increase pressure on
+ * lower zones by failing allocations in higher zones
+ * before they are full. But the pages that do spill
+ * over are limited as the lower zones are protected
+ * by this very same mechanism. It should not become
+ * a practical burden to them.
+ *
+ * XXX: For now, allow allocations to potentially
+ * exceed the per-zone dirty limit in the slowpath
+ * (ALLOC_WMARK_LOW unset) before going into reclaim,
+ * which is important when on a NUMA setup the allowed
+ * zones are together not big enough to reach the
+ * global limit. The proper fix for these situations
+ * will require awareness of zones in the
+ * dirty-throttling and the flusher threads.
+ */
+ if ((alloc_flags & ALLOC_WMARK_LOW) &&
+ (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+ goto this_zone_full;

BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
--
1.7.6.2


2011-09-28 04:55:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

Hi Hannes,

On Fri, Sep 23, 2011 at 04:38:17PM +0200, Johannes Weiner wrote:
> The amount of dirtyable pages should not include the full number of
> free pages: there is a number of reserved pages that the page
> allocator and kswapd always try to keep free.
>
> The closer (reclaimable pages - dirty pages) is to the number of
> reserved pages, the more likely it becomes for reclaim to run into
> dirty pages:
>
> +----------+ ---
> | anon | |
> +----------+ |
> | | |
> | | -- dirty limit new -- flusher new
> | file | | |
> | | | |
> | | -- dirty limit old -- flusher old
> | | |
> +----------+ --- reclaim
> | reserved |
> +----------+
> | kernel |
> +----------+
>
> This patch introduces a per-zone dirty reserve that takes both the
> lowmem reserve as well as the high watermark of the zone into account,
> and a global sum of those per-zone values that is subtracted from the
> global amount of dirtyable pages. The lowmem reserve is unavailable
> to page cache allocations and kswapd tries to keep the high watermark
> free. We don't want to end up in a situation where reclaim has to
> clean pages in order to balance zones.
>
> Not treating reserved pages as dirtyable on a global level is only a
> conceptual fix. In reality, dirty pages are not distributed equally
> across zones and reclaim runs into dirty pages on a regular basis.
>
> But it is important to get this right before tackling the problem on a
> per-zone level, where the distance between reclaim and the dirty pages
> is mostly much smaller in absolute numbers.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/mmzone.h | 6 ++++++
> include/linux/swap.h | 1 +
> mm/page-writeback.c | 6 ++++--
> mm/page_alloc.c | 19 +++++++++++++++++++
> 4 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1ed4116..37a61e7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -317,6 +317,12 @@ struct zone {
> */
> unsigned long lowmem_reserve[MAX_NR_ZONES];
>
> + /*
> + * This is a per-zone reserve of pages that should not be
> + * considered dirtyable memory.
> + */
> + unsigned long dirty_balance_reserve;
> +
> #ifdef CONFIG_NUMA
> int node;
> /*
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b156e80..9021453 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -209,6 +209,7 @@ struct swap_list_t {
> /* linux/mm/page_alloc.c */
> extern unsigned long totalram_pages;
> extern unsigned long totalreserve_pages;
> +extern unsigned long dirty_balance_reserve;
> extern unsigned int nr_free_buffer_pages(void);
> extern unsigned int nr_free_pagecache_pages(void);
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index da6d263..c8acf8a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
>
> x += zone_page_state(z, NR_FREE_PAGES) +
> - zone_reclaimable_pages(z);
> + zone_reclaimable_pages(z) -
> + z->dirty_balance_reserve;
> }
> /*
> * Make sure that the number of highmem pages is never larger
> @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
> {
> unsigned long x;
>
> - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> + x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
> + dirty_balance_reserve;
>
> if (!vm_highmem_is_dirtyable)
> x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1dba05e..f8cba89 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);
>
> unsigned long totalram_pages __read_mostly;
> unsigned long totalreserve_pages __read_mostly;
> +/*
> + * When calculating the number of globally allowed dirty pages, there
> + * is a certain number of per-zone reserves that should not be
> + * considered dirtyable memory. This is the sum of those reserves
> + * over all existing zones that contribute dirtyable memory.
> + */
> +unsigned long dirty_balance_reserve __read_mostly;
> +
> int percpu_pagelist_fraction;
> gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>
> @@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages(void)
> if (max > zone->present_pages)
> max = zone->present_pages;
> reserve_pages += max;
> + /*
> + * Lowmem reserves are not available to
> + * GFP_HIGHUSER page cache allocations and
> + * kswapd tries to balance zones to their high
> + * watermark. As a result, neither should be
> + * regarded as dirtyable memory, to prevent a
> + * situation where reclaim has to clean pages
> + * in order to balance the zones.
> + */

Could you put Mel's description here instead, if you don't mind?
If I hadn't seen Mel's version, maybe I wouldn't suggest it, but his
looks easier to understand.

> + zone->dirty_balance_reserve = max;
> }
> }
> + dirty_balance_reserve = reserve_pages;
> totalreserve_pages = reserve_pages;
> }
>
> --
> 1.7.6.2
>

--
Kind regards,
Minchan Kim


2011-09-28 05:56:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> The maximum number of dirty pages that exist in the system at any time
> is determined by a number of pages considered dirtyable and a
> user-configured percentage of those, or an absolute number in bytes.

It's an explanation of the old approach.

>
> This number of dirtyable pages is the sum of memory provided by all
> the zones in the system minus their lowmem reserves and high
> watermarks, so that the system can retain a healthy number of free
> pages without having to reclaim dirty pages.

It's an explanation of the new approach.

>
> But there is a flaw in that we have a zoned page allocator which does
> not care about the global state but rather the state of individual
> memory zones. And right now there is nothing that prevents one zone
> from filling up with dirty pages while other zones are spared, which
> frequently leads to situations where kswapd, in order to restore the
> watermark of free pages, does indeed have to write pages from that
> zone's LRU list. This can interfere so badly with IO from the flusher
> threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> requests from reclaim already, taking away the VM's only possibility
> to keep such a zone balanced, aside from hoping the flushers will soon
> clean pages from that zone.

It's an explanation of the old approach, again!
Shouldn't the new-approach paragraph above be moved below this one?

>
> Enter per-zone dirty limits. They are to a zone's dirtyable memory
> what the global limit is to the global amount of dirtyable memory, and
> try to make sure that no single zone receives more than its fair share
> of the globally allowed dirty pages in the first place. As the number
> of pages considered dirtyable exclude the zones' lowmem reserves and
> high watermarks, the maximum number of dirty pages in a zone is such
> that the zone can always be balanced without requiring page cleaning.
>
> As this is a placement decision in the page allocator and pages are
> dirtied only after the allocation, this patch allows allocators to
> pass __GFP_WRITE when they know in advance that the page will be
> written to and become dirty soon. The page allocator will then
> attempt to allocate from the first zone of the zonelist - which on
> NUMA is determined by the task's NUMA memory policy - that has not
> exceeded its dirty limit.
>
> At first glance, it would appear that the diversion to lower zones can
> increase pressure on them, but this is not the case. With a full high
> zone, allocations will be diverted to lower zones eventually, so it is
> more of a shift in timing of the lower zone allocations. Workloads
> that previously could fit their dirty pages completely in the higher
> zone may be forced to allocate from lower zones, but the amount of
> pages that 'spill over' are limited themselves by the lower zones'
> dirty constraints, and thus unlikely to become a problem.

That's a good justification.

>
> For now, the problem of unfair dirty page distribution remains for
> NUMA configurations where the zones allowed for allocation are in sum
> not big enough to trigger the global dirty limits, wake up the flusher
> threads and remedy the situation. Because of this, an allocation that
> could not succeed on any of the considered zones is allowed to ignore
> the dirty limits before going into direct reclaim or even failing the
> allocation, until a future patch changes the global dirty throttling
> and flusher thread activation so that they take individual zone states
> into account.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Otherwise, looks good to me.
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-09-28 05:57:14

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 1/2/4] mm: writeback: cleanups in preparation for per-zone dirty limits

On Fri, Sep 23, 2011 at 04:41:07PM +0200, Johannes Weiner wrote:
> On Thu, Sep 22, 2011 at 10:52:42AM +0200, Johannes Weiner wrote:
> > On Wed, Sep 21, 2011 at 04:02:26PM -0700, Andrew Morton wrote:
> > > Should we rename determine_dirtyable_memory() to
> > > global_dirtyable_memory(), to get some sense of its relationship with
> > > zone_dirtyable_memory()?
> >
> > Sounds good.
>
> ---
>
> The next patch will introduce per-zone dirty limiting functions in
> addition to the traditional global dirty limiting.
>
> Rename determine_dirtyable_memory() to global_dirtyable_memory()
> before adding the zone-specific version, and fix up its documentation.
>
> Also, move the functions to determine the dirtyable memory and the
> function to calculate the dirty limit based on that together so that
> their relationship is more apparent and that they can be commented on
> as a group.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim


2011-09-28 06:02:38

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

On Tue, Sep 20, 2011 at 03:45:14PM +0200, Johannes Weiner wrote:
> Tell the page allocator that pages allocated through
> grab_cache_page_write_begin() are expected to become dirty soon.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-09-28 07:11:54

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

On Wed, Sep 28, 2011 at 02:56:40PM +0900, Minchan Kim wrote:
> On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> > The maximum number of dirty pages that exist in the system at any time
> > is determined by a number of pages considered dirtyable and a
> > user-configured percentage of those, or an absolute number in bytes.
>
> It's an explanation of the old approach.

What do you mean? This does not change with this patch. We still
have a number of dirtyable pages and a limit that is applied
relative to this number.
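
Purely for illustration, that relationship is roughly the following
userspace sketch; the helper name and the example numbers are made up
and this is not the kernel code:

#include <stdio.h>

/*
 * Toy version of "limit = absolute bytes, or a percentage of the
 * dirtyable pages" -- names and numbers are invented for the sketch.
 */
static unsigned long dirty_limit_pages(unsigned long dirtyable_pages,
                                       unsigned int dirty_ratio,
                                       unsigned long dirty_bytes,
                                       unsigned long page_size)
{
        if (dirty_bytes)        /* absolute byte limit configured */
                return (dirty_bytes + page_size - 1) / page_size;
        return dirtyable_pages * dirty_ratio / 100;     /* else a percentage */
}

int main(void)
{
        /* example values: 1 GiB of dirtyable memory, 20% ratio, 4 KiB pages */
        unsigned long dirtyable = (1UL << 30) / 4096;

        printf("dirty limit: %lu pages\n",
               dirty_limit_pages(dirtyable, 20, 0, 4096));
        return 0;
}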

> > This number of dirtyable pages is the sum of memory provided by all
> > the zones in the system minus their lowmem reserves and high
> > watermarks, so that the system can retain a healthy number of free
> > pages without having to reclaim dirty pages.
>
> It's an explanation of the new approach.

Same here, this aspect is also not changed with this patch!

> > But there is a flaw in that we have a zoned page allocator which does
> > not care about the global state but rather the state of individual
> > memory zones. And right now there is nothing that prevents one zone
> > from filling up with dirty pages while other zones are spared, which
> > frequently leads to situations where kswapd, in order to restore the
> > watermark of free pages, does indeed have to write pages from that
> > zone's LRU list. This can interfere so badly with IO from the flusher
> > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> > requests from reclaim already, taking away the VM's only possibility
> > to keep such a zone balanced, aside from hoping the flushers will soon
> > clean pages from that zone.
>
> It's an explanation of the old approach, again!
> Shouldn't the new-approach paragraph above be moved below this one?

Everything above describes the current behaviour (at the point of this
patch, so respecting lowmem_reserve e.g. is part of the current
behaviour by now) and its problems. And below follows a description
of how the patch tries to fix it.

> > Enter per-zone dirty limits. They are to a zone's dirtyable memory
> > what the global limit is to the global amount of dirtyable memory, and
> > try to make sure that no single zone receives more than its fair share
> > of the globally allowed dirty pages in the first place. As the number
> > of pages considered dirtyable exclude the zones' lowmem reserves and
> > high watermarks, the maximum number of dirty pages in a zone is such
> > that the zone can always be balanced without requiring page cleaning.
> >
> > As this is a placement decision in the page allocator and pages are
> > dirtied only after the allocation, this patch allows allocators to
> > pass __GFP_WRITE when they know in advance that the page will be
> > written to and become dirty soon. The page allocator will then
> > attempt to allocate from the first zone of the zonelist - which on
> > NUMA is determined by the task's NUMA memory policy - that has not
> > exceeded its dirty limit.
> >
> > At first glance, it would appear that the diversion to lower zones can
> > increase pressure on them, but this is not the case. With a full high
> > zone, allocations will be diverted to lower zones eventually, so it is
> > more of a shift in timing of the lower zone allocations. Workloads
> > that previously could fit their dirty pages completely in the higher
> > zone may be forced to allocate from lower zones, but the amount of
> > pages that 'spill over' are limited themselves by the lower zones'
> > dirty constraints, and thus unlikely to become a problem.
>
> That's a good justification.
>
> > For now, the problem of unfair dirty page distribution remains for
> > NUMA configurations where the zones allowed for allocation are in sum
> > not big enough to trigger the global dirty limits, wake up the flusher
> > threads and remedy the situation. Because of this, an allocation that
> > could not succeed on any of the considered zones is allowed to ignore
> > the dirty limits before going into direct reclaim or even failing the
> > allocation, until a future patch changes the global dirty throttling
> > and flusher thread activation so that they take individual zone states
> > into account.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Otherwise, looks good to me.
> Reviewed-by: Minchan Kim <[email protected]>

Thanks!
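
P.S. In case it helps, the __GFP_WRITE placement quoted above boils
down to roughly the standalone sketch below.  The struct and function
names are invented for illustration, and the real allocator of course
also checks watermarks and works on real zonelists:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified zone: only the fields the sketch needs. */
struct zone_sketch {
        const char *name;
        unsigned long nr_dirty;         /* dirty file pages in this zone */
        unsigned long dirty_limit;      /* its share of the global limit */
};

/* Pick the first allowable zone; skip full ones for __GFP_WRITE. */
static struct zone_sketch *pick_zone(struct zone_sketch **zonelist,
                                     size_t nr, bool gfp_write)
{
        size_t i;

        for (i = 0; i < nr; i++) {
                if (gfp_write && zonelist[i]->nr_dirty >= zonelist[i]->dirty_limit)
                        continue;       /* over its dirty limit, spill lower */
                return zonelist[i];
        }
        /*
         * No zone below its dirty limit: ignore the limits rather than
         * entering direct reclaim or failing the allocation.
         */
        return nr ? zonelist[0] : NULL;
}

int main(void)
{
        struct zone_sketch high = { "HighMem", 900, 800 };
        struct zone_sketch normal = { "Normal", 100, 600 };
        struct zone_sketch *zonelist[] = { &high, &normal };
        struct zone_sketch *z = pick_zone(zonelist, 2, true);

        printf("__GFP_WRITE allocation placed in %s\n", z ? z->name : "none");
        return 0;
}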


2011-09-28 07:50:54

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

On Wed, Sep 28, 2011 at 01:55:51PM +0900, Minchan Kim wrote:
> Hi Hannes,
>
> On Fri, Sep 23, 2011 at 04:38:17PM +0200, Johannes Weiner wrote:
> > The amount of dirtyable pages should not include the full number of
> > free pages: there is a number of reserved pages that the page
> > allocator and kswapd always try to keep free.
> >
> > The closer (reclaimable pages - dirty pages) is to the number of
> > reserved pages, the more likely it becomes for reclaim to run into
> > dirty pages:
> >
> > +----------+ ---
> > | anon | |
> > +----------+ |
> > | | |
> > | | -- dirty limit new -- flusher new
> > | file | | |
> > | | | |
> > | | -- dirty limit old -- flusher old
> > | | |
> > +----------+ --- reclaim
> > | reserved |
> > +----------+
> > | kernel |
> > +----------+
> >
> > This patch introduces a per-zone dirty reserve that takes both the
> > lowmem reserve as well as the high watermark of the zone into account,
> > and a global sum of those per-zone values that is subtracted from the
> > global amount of dirtyable pages. The lowmem reserve is unavailable
> > to page cache allocations and kswapd tries to keep the high watermark
> > free. We don't want to end up in a situation where reclaim has to
> > clean pages in order to balance zones.
> >
> > Not treating reserved pages as dirtyable on a global level is only a
> > conceptual fix. In reality, dirty pages are not distributed equally
> > across zones and reclaim runs into dirty pages on a regular basis.
> >
> > But it is important to get this right before tackling the problem on a
> > per-zone level, where the distance between reclaim and the dirty pages
> > is mostly much smaller in absolute numbers.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > include/linux/mmzone.h | 6 ++++++
> > include/linux/swap.h | 1 +
> > mm/page-writeback.c | 6 ++++--
> > mm/page_alloc.c | 19 +++++++++++++++++++
> > 4 files changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 1ed4116..37a61e7 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -317,6 +317,12 @@ struct zone {
> > */
> > unsigned long lowmem_reserve[MAX_NR_ZONES];
> >
> > + /*
> > + * This is a per-zone reserve of pages that should not be
> > + * considered dirtyable memory.
> > + */
> > + unsigned long dirty_balance_reserve;
> > +
> > #ifdef CONFIG_NUMA
> > int node;
> > /*
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index b156e80..9021453 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -209,6 +209,7 @@ struct swap_list_t {
> > /* linux/mm/page_alloc.c */
> > extern unsigned long totalram_pages;
> > extern unsigned long totalreserve_pages;
> > +extern unsigned long dirty_balance_reserve;
> > extern unsigned int nr_free_buffer_pages(void);
> > extern unsigned int nr_free_pagecache_pages(void);
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index da6d263..c8acf8a 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> > &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> >
> > x += zone_page_state(z, NR_FREE_PAGES) +
> > - zone_reclaimable_pages(z);
> > + zone_reclaimable_pages(z) -
> > + z->dirty_balance_reserve;
> > }
> > /*
> > * Make sure that the number of highmem pages is never larger
> > @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
> > {
> > unsigned long x;
> >
> > - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > + x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
> > + dirty_balance_reserve;
> >
> > if (!vm_highmem_is_dirtyable)
> > x -= highmem_dirtyable_memory(x);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1dba05e..f8cba89 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);
> >
> > unsigned long totalram_pages __read_mostly;
> > unsigned long totalreserve_pages __read_mostly;
> > +/*
> > + * When calculating the number of globally allowed dirty pages, there
> > + * is a certain number of per-zone reserves that should not be
> > + * considered dirtyable memory. This is the sum of those reserves
> > + * over all existing zones that contribute dirtyable memory.
> > + */
> > +unsigned long dirty_balance_reserve __read_mostly;
> > +
> > int percpu_pagelist_fraction;
> > gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> >
> > @@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages(void)
> > if (max > zone->present_pages)
> > max = zone->present_pages;
> > reserve_pages += max;
> > + /*
> > + * Lowmem reserves are not available to
> > + * GFP_HIGHUSER page cache allocations and
> > + * kswapd tries to balance zones to their high
> > + * watermark. As a result, neither should be
> > + * regarded as dirtyable memory, to prevent a
> > + * situation where reclaim has to clean pages
> > + * in order to balance the zones.
> > + */
>
> Could you put Mel's description here instead, if you don't mind?
> If I hadn't seen Mel's version, maybe I wouldn't suggest it, but his
> looks easier to understand.

I changed it because it was already referring to allocation placement,
but at the point in time where this comment is introduced there is no
allocation placement based on dirty pages yet.

The other thing is that it said lowmem_reserves were respected to
prevent increasing lowmem pressure, but lowmem is protected by the
watermark checks during the allocation. I took it into account to not
end up with a number of dirtyable pages that is bigger than the amount
of technically available page cache pages. Otherwise, you could end
up with all page cache pages in a zone dirtied at the time reclaim
kicks in and we are back to square one.

Maybe you can point me to the part that is harder to understand so
I can fix it?
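
In case a concrete example helps, the accounting I have in mind is
roughly the standalone sketch below; the names and numbers are made up
and these are not the kernel functions:

#include <stdio.h>

/* Simplified per-zone numbers, invented for the sketch. */
struct zone_sketch {
        unsigned long free;
        unsigned long reclaimable;
        unsigned long high_wmark;
        unsigned long lowmem_reserve_max;       /* largest lowmem_reserve[] entry */
        unsigned long present;
};

/* Pages that should never count as dirtyable in this zone. */
static unsigned long dirty_balance_reserve(const struct zone_sketch *z)
{
        unsigned long reserve = z->lowmem_reserve_max + z->high_wmark;

        return reserve > z->present ? z->present : reserve;
}

/* Dirtyable memory: free plus reclaimable, minus that reserve. */
static unsigned long zone_dirtyable(const struct zone_sketch *z)
{
        return z->free + z->reclaimable - dirty_balance_reserve(z);
}

int main(void)
{
        struct zone_sketch z = {
                .free = 20000, .reclaimable = 100000,
                .high_wmark = 3000, .lowmem_reserve_max = 8000,
                .present = 131072,
        };

        printf("reserve %lu, dirtyable %lu pages\n",
               dirty_balance_reserve(&z), zone_dirtyable(&z));
        return 0;
}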


2011-09-28 09:27:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 1/2/4] mm: writeback: cleanups in preparation for per-zone dirty limits

On Fri, Sep 23, 2011 at 04:41:07PM +0200, Johannes Weiner wrote:
> On Thu, Sep 22, 2011 at 10:52:42AM +0200, Johannes Weiner wrote:
> > On Wed, Sep 21, 2011 at 04:02:26PM -0700, Andrew Morton wrote:
> > > Should we rename determine_dirtyable_memory() to
> > > global_dirtyable_memory(), to get some sense of its relationship with
> > > zone_dirtyable_memory()?
> >
> > Sounds good.
>
> ---
>
> The next patch will introduce per-zone dirty limiting functions in
> addition to the traditional global dirty limiting.
>
> Rename determine_dirtyable_memory() to global_dirtyable_memory()
> before adding the zone-specific version, and fix up its documentation.
>
> Also, move the functions to determine the dirtyable memory and the
> function to calculate the dirty limit based on that together so that
> their relationship is more apparent and that they can be commented on
> as a group.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs


2011-09-28 09:36:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> The maximum number of dirty pages that exist in the system at any time
> is determined by a number of pages considered dirtyable and a
> user-configured percentage of those, or an absolute number in bytes.
>
> This number of dirtyable pages is the sum of memory provided by all
> the zones in the system minus their lowmem reserves and high
> watermarks, so that the system can retain a healthy number of free
> pages without having to reclaim dirty pages.
>
> But there is a flaw in that we have a zoned page allocator which does
> not care about the global state but rather the state of individual
> memory zones. And right now there is nothing that prevents one zone
> from filling up with dirty pages while other zones are spared, which
> frequently leads to situations where kswapd, in order to restore the
> watermark of free pages, does indeed have to write pages from that
> zone's LRU list. This can interfere so badly with IO from the flusher
> threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> requests from reclaim already, taking away the VM's only possibility
> to keep such a zone balanced, aside from hoping the flushers will soon
> clean pages from that zone.
>
> Enter per-zone dirty limits. They are to a zone's dirtyable memory
> what the global limit is to the global amount of dirtyable memory, and
> try to make sure that no single zone receives more than its fair share
> of the globally allowed dirty pages in the first place. As the number
> of pages considered dirtyable exclude the zones' lowmem reserves and
> high watermarks, the maximum number of dirty pages in a zone is such
> that the zone can always be balanced without requiring page cleaning.
>
> As this is a placement decision in the page allocator and pages are
> dirtied only after the allocation, this patch allows allocators to
> pass __GFP_WRITE when they know in advance that the page will be
> written to and become dirty soon. The page allocator will then
> attempt to allocate from the first zone of the zonelist - which on
> NUMA is determined by the task's NUMA memory policy - that has not
> exceeded its dirty limit.
>
> At first glance, it would appear that the diversion to lower zones can
> increase pressure on them, but this is not the case. With a full high
> zone, allocations will be diverted to lower zones eventually, so it is
> more of a shift in timing of the lower zone allocations. Workloads
> that previously could fit their dirty pages completely in the higher
> zone may be forced to allocate from lower zones, but the amount of
> pages that 'spill over' are limited themselves by the lower zones'
> dirty constraints, and thus unlikely to become a problem.
>
> For now, the problem of unfair dirty page distribution remains for
> NUMA configurations where the zones allowed for allocation are in sum
> not big enough to trigger the global dirty limits, wake up the flusher
> threads and remedy the situation. Because of this, an allocation that
> could not succeed on any of the considered zones is allowed to ignore
> the dirty limits before going into direct reclaim or even failing the
> allocation, until a future patch changes the global dirty throttling
> and flusher thread activation so that they take individual zone states
> into account.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2011-09-28 18:09:07

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

On Wed, Sep 28, 2011 at 09:11:54AM +0200, Johannes Weiner wrote:
> On Wed, Sep 28, 2011 at 02:56:40PM +0900, Minchan Kim wrote:
> > On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> > > The maximum number of dirty pages that exist in the system at any time
> > > is determined by a number of pages considered dirtyable and a
> > > user-configured percentage of those, or an absolute number in bytes.
> >
> > It's an explanation of the old approach.
>
> What do you mean? This does not change with this patch. We still
> have a number of dirtyable pages and a limit that is applied
> relative to this number.
>
> > > This number of dirtyable pages is the sum of memory provided by all
> > > the zones in the system minus their lowmem reserves and high
> > > watermarks, so that the system can retain a healthy number of free
> > > pages without having to reclaim dirty pages.
> >
> > It's an explanation of the new approach.
>
> Same here, this aspect is also not changed with this patch!
>
> > > But there is a flaw in that we have a zoned page allocator which does
> > > not care about the global state but rather the state of individual
> > > memory zones. And right now there is nothing that prevents one zone
> > > from filling up with dirty pages while other zones are spared, which
> > > frequently leads to situations where kswapd, in order to restore the
> > > watermark of free pages, does indeed have to write pages from that
> > > zone's LRU list. This can interfere so badly with IO from the flusher
> > > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> > > requests from reclaim already, taking away the VM's only possibility
> > > to keep such a zone balanced, aside from hoping the flushers will soon
> > > clean pages from that zone.
> >
> > It's an explanation of the old approach, again!
> > Shouldn't the new-approach paragraph above be moved below this one?
>
> Everything above describes the current behaviour (at the point of this
> patch, so respecting lowmem_reserve e.g. is part of the current
> behaviour by now) and its problems. And below follows a description
> of how the patch tries to fix it.

It seems it wasn't a good choice for me to use the terms "old" and "new".
Hannes, please ignore it; it's not a biggie.

--
Kind regards,
Minchan Kim

2011-09-28 18:35:30

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

On Wed, Sep 28, 2011 at 09:50:54AM +0200, Johannes Weiner wrote:
> On Wed, Sep 28, 2011 at 01:55:51PM +0900, Minchan Kim wrote:
> > Hi Hannes,
> >
> > On Fri, Sep 23, 2011 at 04:38:17PM +0200, Johannes Weiner wrote:
> > > The amount of dirtyable pages should not include the full number of
> > > free pages: there is a number of reserved pages that the page
> > > allocator and kswapd always try to keep free.
> > >
> > > The closer (reclaimable pages - dirty pages) is to the number of
> > > reserved pages, the more likely it becomes for reclaim to run into
> > > dirty pages:
> > >
> > > +----------+ ---
> > > | anon | |
> > > +----------+ |
> > > | | |
> > > | | -- dirty limit new -- flusher new
> > > | file | | |
> > > | | | |
> > > | | -- dirty limit old -- flusher old
> > > | | |
> > > +----------+ --- reclaim
> > > | reserved |
> > > +----------+
> > > | kernel |
> > > +----------+
> > >
> > > This patch introduces a per-zone dirty reserve that takes both the
> > > lowmem reserve as well as the high watermark of the zone into account,
> > > and a global sum of those per-zone values that is subtracted from the
> > > global amount of dirtyable pages. The lowmem reserve is unavailable
> > > to page cache allocations and kswapd tries to keep the high watermark
> > > free. We don't want to end up in a situation where reclaim has to
> > > clean pages in order to balance zones.
> > >
> > > Not treating reserved pages as dirtyable on a global level is only a
> > > conceptual fix. In reality, dirty pages are not distributed equally
> > > across zones and reclaim runs into dirty pages on a regular basis.
> > >
> > > But it is important to get this right before tackling the problem on a
> > > per-zone level, where the distance between reclaim and the dirty pages
> > > is mostly much smaller in absolute numbers.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> > > ---
> > > include/linux/mmzone.h | 6 ++++++
> > > include/linux/swap.h | 1 +
> > > mm/page-writeback.c | 6 ++++--
> > > mm/page_alloc.c | 19 +++++++++++++++++++
> > > 4 files changed, 30 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 1ed4116..37a61e7 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -317,6 +317,12 @@ struct zone {
> > > */
> > > unsigned long lowmem_reserve[MAX_NR_ZONES];
> > >
> > > + /*
> > > + * This is a per-zone reserve of pages that should not be
> > > + * considered dirtyable memory.
> > > + */
> > > + unsigned long dirty_balance_reserve;
> > > +
> > > #ifdef CONFIG_NUMA
> > > int node;
> > > /*
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index b156e80..9021453 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -209,6 +209,7 @@ struct swap_list_t {
> > > /* linux/mm/page_alloc.c */
> > > extern unsigned long totalram_pages;
> > > extern unsigned long totalreserve_pages;
> > > +extern unsigned long dirty_balance_reserve;
> > > extern unsigned int nr_free_buffer_pages(void);
> > > extern unsigned int nr_free_pagecache_pages(void);
> > >
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index da6d263..c8acf8a 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> > > &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> > >
> > > x += zone_page_state(z, NR_FREE_PAGES) +
> > > - zone_reclaimable_pages(z);
> > > + zone_reclaimable_pages(z) -
> > > + z->dirty_balance_reserve;
> > > }
> > > /*
> > > * Make sure that the number of highmem pages is never larger
> > > @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
> > > {
> > > unsigned long x;
> > >
> > > - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > > + x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
> > > + dirty_balance_reserve;
> > >
> > > if (!vm_highmem_is_dirtyable)
> > > x -= highmem_dirtyable_memory(x);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 1dba05e..f8cba89 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);
> > >
> > > unsigned long totalram_pages __read_mostly;
> > > unsigned long totalreserve_pages __read_mostly;
> > > +/*
> > > + * When calculating the number of globally allowed dirty pages, there
> > > + * is a certain number of per-zone reserves that should not be
> > > + * considered dirtyable memory. This is the sum of those reserves
> > > + * over all existing zones that contribute dirtyable memory.
> > > + */
> > > +unsigned long dirty_balance_reserve __read_mostly;
> > > +
> > > int percpu_pagelist_fraction;
> > > gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> > >
> > > @@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages(void)
> > > if (max > zone->present_pages)
> > > max = zone->present_pages;
> > > reserve_pages += max;
> > > + /*
> > > + * Lowmem reserves are not available to
> > > + * GFP_HIGHUSER page cache allocations and
> > > + * kswapd tries to balance zones to their high
> > > + * watermark. As a result, neither should be
> > > + * regarded as dirtyable memory, to prevent a
> > > + * situation where reclaim has to clean pages
> > > + * in order to balance the zones.
> > > + */
> >
> > Could you put Mel's description here instead, if you don't mind?
> > If I hadn't seen Mel's version, maybe I wouldn't suggest it, but his
> > looks easier to understand.
>
> I changed it because it was already referring to allocation placement,
> but at the point in time where this comment is introduced there is no
> allocation placement based on dirty pages yet.

Right. At this point you don't introduce allocation placement yet; I knew
that, but it seems I was too hasty. Still, I hope you add a comment about
allocation placement when you do introduce it. Of course, you added one in
page_alloc.c, but I like putting a short summary comment on the field
itself, as Mel does. A short summary comment on the field helps explain
why it was introduced without having to jump in and out of the code.
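
Something along these lines on the field itself would already help, I
think -- the wording below is only a sketch, not Mel's actual text:

struct zone_example {   /* illustrative stand-in for struct zone */
        /*
         * Pages the allocator keeps free: the high watermark plus the
         * lowmem protection.  They are excluded from dirtyable memory
         * so that dirty-page placement never fills a zone with dirty
         * pages that reclaim would have to clean in order to restore
         * the watermark.
         */
        unsigned long dirty_balance_reserve;
};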

>
> The other thing is that it said lowmem_reserves were respected to
> prevent increasing lowmem pressure, but lowmem is protected by the
> watermark checks during the allocation. I took it into account to not
> end up with a number of dirtyable pages that is bigger than the amount
> of technically available page cache pages. Otherwise, you could end
> up with all page cache pages in a zone dirtied at the time reclaim
> kicks in and we are back to square one.
>
> Maybe you can point me to the part that is harder to understand so
> I can fix it?

I don't mean yours is hard to understand; it seems to be a matter of
preference. Mel's explanation is more straightforward, I think. He briefly
explained the objective, the method, and why that method was chosen in his
introduction. I like having such a summary on the field.

But as I said, it might just be a preference, so if you mind, I don't
insist on it.

--
Kind regards,
Minchan Kim