From: Dave Chinner <[email protected]>
When we enter direct reclaim we may have used an arbitrary amount of stack
space, and hence entering the filesystem to do writeback can then lead to
stack overruns. This problem was recently encountered on x86_64 systems with
8k stacks running XFS with simple storage configurations.
Writeback from direct reclaim also adversely affects background writeback. The
background flusher threads should already be taking care of cleaning dirty
pages, and direct reclaim will kick them if they aren't already doing work. If
direct reclaim is also calling ->writepage, it will cause the IO patterns from
the background flusher threads to be upset by LRU-order writeback from
pageout() which can be effectively random IO. Having competing sources of IO
trying to clean pages on the same backing device reduces throughput by
increasing the amount of seeks that the backing device has to do to write back
the pages.
Hence for direct reclaim we should not allow ->writepage to be entered at all.
Set up the relevant scan_control structures to enforce this, and prevent
sc->may_writepage from being set in other places in the direct reclaim path in
response to other events.
Reported-by: John Berthels <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
---
mm/vmscan.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0e5f15..5321ac4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* writeout. So in laptop mode, write out the whole world.
*/
writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
- if (total_scanned > writeback_threshold) {
+ if (total_scanned > writeback_threshold)
wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
- sc->may_writepage = 1;
- }
/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
{
struct scan_control sc = {
.gfp_mask = gfp_mask,
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
@@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
struct zone *zone, int nid)
{
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.swappiness = swappiness,
@@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
{
struct zonelist *zonelist;
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
struct reclaim_state reclaim_state;
int priority;
struct scan_control sc = {
- .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+ .may_writepage = (current_is_kswapd() &&
+ (zone_reclaim_mode & RECLAIM_WRITE)),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.nr_to_reclaim = max_t(unsigned long, nr_pages,
--
1.6.5
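To make the mechanics of the change concrete, here is a small standalone C model - not kernel code, just a sketch of the decision controlled by the scan_control fields touched above - of which reclaim contexts still hand dirty pages to pageout()/->writepage with this patch applied. The context names mirror the functions in the diff; kswapd's own balancing path (balance_pgdat) is not touched by the patch and keeps writing pages as before.

/*
 * Standalone model (not kernel code) of the sc->may_writepage gate.
 * With this patch, every direct reclaim scan_control starts with
 * may_writepage = 0, so dirty pages found on the LRU are skipped and
 * left for the flusher threads instead of being fed to ->writepage.
 */
#include <stdbool.h>
#include <stdio.h>

struct sc_model {
	const char *context;
	bool may_writepage;
};

static const char *dirty_page_action(const struct sc_model *sc)
{
	if (!sc->may_writepage)
		return "skip pageout(), leave for the flusher threads";
	return "pageout() -> mapping->a_ops->writepage()";
}

int main(void)
{
	/* Contexts and values taken from the scan_control hunks above;
	 * __zone_reclaim also requires RECLAIM_WRITE in zone_reclaim_mode. */
	const struct sc_model contexts[] = {
		{ "try_to_free_pages() (direct reclaim)", false },
		{ "mem_cgroup_shrink_node_zone()",        false },
		{ "try_to_free_mem_cgroup_pages()",       false },
		{ "__zone_reclaim() from kswapd",         true  },
		{ "__zone_reclaim() from the allocator",  false },
	};
	unsigned int i;

	for (i = 0; i < sizeof(contexts) / sizeof(contexts[0]); i++)
		printf("%-40s -> %s\n", contexts[i].context,
		       dirty_page_action(&contexts[i]));
	return 0;
}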
Hi
> From: Dave Chinner <[email protected]>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
Ummm..
This patch is harder to ack. This patch's pros and cons seem to be:
Pros:
1) prevent XFS stack overflow
2) improve io workload performance
Cons:
3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
So, if we only needed to consider IO workloads there would be no downside,
but we can't.
I think (1) is an XFS issue; XFS should take care of it itself. But (2) is
really a VM issue. Right now our VM calls pageout() too aggressively and
decreases IO throughput. I've heard about this issue from Chris (cc'd). I'd
like to fix this, but we can never kill pageout() completely because we
can't assume users don't run high-order allocation workloads.
(Perhaps Mel's memory compaction code will improve things enough that we
can kill lumpy reclaim in the future, but that's another story.)
Thanks.
>
> Reported-by: John Berthels <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/vmscan.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * writeout. So in laptop mode, write out the whole world.
> */
> writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> - if (total_scanned > writeback_threshold) {
> + if (total_scanned > writeback_threshold)
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> - sc->may_writepage = 1;
> - }
>
> /* Take a nap, wait for some writeback to complete */
> if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> {
> struct scan_control sc = {
> .gfp_mask = gfp_mask,
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .may_unmap = 1,
> .may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> struct zone *zone, int nid)
> {
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> {
> struct zonelist *zonelist;
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> struct reclaim_state reclaim_state;
> int priority;
> struct scan_control sc = {
> - .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> + .may_writepage = (current_is_kswapd() &&
> + (zone_reclaim_mode & RECLAIM_WRITE)),
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = 1,
> .nr_to_reclaim = max_t(unsigned long, nr_pages,
> --
> 1.6.5
>
On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
It's already known that the VM requesting specific pages be cleaned and
reclaimed is a bad IO pattern but unfortunately it is still required by
lumpy reclaim. This change would appear to break that although I haven't
tested it to be 100% sure.
Even without high-order considerations, this patch would appear to make
fairly large changes to how direct reclaim behaves. It would no longer
wait on page writeback, for example, so direct reclaim will return sooner
than it did, potentially going OOM if there were a lot of dirty pages and
it made no progress during direct reclaim.
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
>
If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
instead of GFP_KERNEL.
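For context, GFP_NOFS differs from GFP_KERNEL only in that it clears __GFP_FS, which is exactly the bit reclaim checks (via may_enter_fs; see the shrink_page_list() excerpt quoted later in the thread) before entering filesystem writeback. A minimal standalone illustration follows; the flag values match the kernels of this era, but treat it as a sketch rather than a copy of the headers.

/*
 * Standalone illustration (not the kernel headers) of what GFP_NOFS
 * changes for reclaim: it clears __GFP_FS, and shrink_page_list()
 * derives may_enter_fs from that bit, so reclaim on behalf of a
 * GFP_NOFS allocation never calls back into filesystem writeback.
 */
#include <stdbool.h>
#include <stdio.h>

#define __GFP_WAIT 0x10u
#define __GFP_IO   0x40u
#define __GFP_FS   0x80u

#define GFP_NOFS   (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)

static bool may_enter_fs(unsigned int gfp_mask)
{
	return gfp_mask & __GFP_FS;
}

int main(void)
{
	printf("GFP_KERNEL: reclaim may enter the FS: %s\n",
	       may_enter_fs(GFP_KERNEL) ? "yes" : "no");
	printf("GFP_NOFS:   reclaim may enter the FS: %s\n",
	       may_enter_fs(GFP_NOFS) ? "yes" : "no");
	return 0;
}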
> Reported-by: John Berthels <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/vmscan.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * writeout. So in laptop mode, write out the whole world.
> */
> writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> - if (total_scanned > writeback_threshold) {
> + if (total_scanned > writeback_threshold)
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> - sc->may_writepage = 1;
> - }
>
> /* Take a nap, wait for some writeback to complete */
> if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> {
> struct scan_control sc = {
> .gfp_mask = gfp_mask,
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .may_unmap = 1,
> .may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> struct zone *zone, int nid)
> {
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> {
> struct zonelist *zonelist;
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> struct reclaim_state reclaim_state;
> int priority;
> struct scan_control sc = {
> - .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> + .may_writepage = (current_is_kswapd() &&
> + (zone_reclaim_mode & RECLAIM_WRITE)),
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = 1,
> .nr_to_reclaim = max_t(unsigned long, nr_pages,
> --
> 1.6.5
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, Apr 13, 2010 at 05:31:25PM +0900, KOSAKI Motohiro wrote:
> > From: Dave Chinner <[email protected]>
> >
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence enterring the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> >
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the amount of seeks that the backing device has to do to write back
> > the pages.
> >
> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
>
> Ummm..
> This patch is harder to ack. This patch's pros/cons seems
>
> Pros:
> 1) prevent XFS stack overflow
> 2) improve io workload performance
>
> Cons:
> 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
>
> So, If we only need to consider io workload this is no downside. but
> it can't.
>
> I think (1) is XFS issue. XFS should care it itself.
The filesystem is irrelevant, IMO.
The traces from the reporter showed that we've got close to a 2k
stack footprint for memory allocation to direct reclaim and then we
can put the entire writeback path on top of that. This is roughly
3.5k for XFS, and then depending on the storage subsystem
configuration and transport can be another 2k of stack needed below
XFS.
IOWs, if we completely ignore the filesystem stack usage, there's
still up to 4k of stack needed in the direct reclaim path. Given
that one of the stack traces supplied show direct reclaim being
entered with over 3k of stack already used, pretty much any
filesystem is capable of blowing an 8k stack.
So, this is not an XFS issue, even though XFS is the first to
uncover it. Don't shoot the messenger....
> but (2) is really
> VM issue. Now our VM makes too agressive pageout() and decrease io
> throughput. I've heard this issue from Chris (cc to him). I'd like to
> fix this.
I didn't expect this to be easy. ;)
I had a good look at what the code was doing before I wrote the
patch, and IMO, there is no good reason for issuing IO from direct
reclaim.
My reasoning is as follows - consider a system with a typical
sata disk and the machine is low on memory and in direct reclaim.
Direct reclaim is taking pages off the end of the LRU and writing
them one at a time from there. It is scanning thousands of pages
and it triggers IO on the dirty ones it comes across.
This is done with no regard to the IO patterns it generates - it can
(and frequently does) result in completely random single page IO
patterns hitting the disk, and as a result cleaning pages happens
really, really slowly. If we are in an OOM situation, the machine
will grind to a halt as it struggles to clean maybe 1MB of RAM per
second.
On the other hand, if the IO is well formed then the disk might be
capable of 100MB/s. The background flusher threads and filesystems
try very hard to issue well formed IOs, so the difference in the
rate that memory can be cleaned may be a couple of orders of
magnitude.
(Of course, the difference will typically be somewhere in between
these two extremes, but I'm simply trying to illustrate how big
the difference in performance can be.)
IOWs, the background flusher threads are there to clean memory by
issuing IO as efficiently as possible. Direct reclaim is very
efficient at reclaiming clean memory, but it really, really sucks at
cleaning dirty memory in a predictable and deterministic manner. It
is also much more likely to hit worst case IO patterns than the
background flusher threads.
Hence I think that direct reclaim should be deferring to the
background flusher threads for cleaning memory and not trying to be
doing it itself.
> but we never kill pageout() completely because we can't
> assume users don't run high order allocation workload.
I think that lumpy reclaim will still work just fine.
Lumpy reclaim appears to be using IO as a method of slowing
down the reclaim cycle - the congestion_wait() call will still
function as it does now if the background flusher threads are active
and causing congestion. I don't see why lumpy reclaim specifically
needs to be issuing IO to make it work - if the congestion_wait() is
not waiting long enough then wait longer - don't issue IO to extend
the wait time.
Also, there doesn't appear to be anything special about the chunks of
pages it's issuing IO on and waiting for, either. They are simply
the last N pages on the LRU that could be grabbed so they have no
guarantee of contiguity, so the IO it issues does nothing specific
to help higher order allocations to succeed.
Hence it really seems to me that the effectiveness of lumpy reclaim
is determined mostly by the effectiveness of the IO subsystem - the
faster the IO subsystem cleans pages, the less time lumpy reclaim
will block and the faster it will free pages. From this observation
and the fact that issuing IO only from the bdi flusher threads will
have the same effect (improves IO subsystem effectiveness), it seems
to me that lumpy reclaim should not be adversely affected by this
change.
Of course, the code is a maze of twisty passages, so I probably
missed something important. Hopefully someone can tell me what. ;)
FWIW, the biggest problem here is that I have absolutely no clue on
how to test what the impact on lumpy reclaim really is. Does anyone
have a relatively simple test that can be run to determine what the
impact is?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Apr 13, 2010 at 10:58:15AM +0100, Mel Gorman wrote:
> On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence enterring the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> >
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the amount of seeks that the backing device has to do to write back
> > the pages.
> >
>
> It's already known that the VM requesting specific pages be cleaned and
> reclaimed is a bad IO pattern but unfortunately it is still required by
> lumpy reclaim. This change would appear to break that although I haven't
> tested it to be 100% sure.
How do you test it? I'd really like to be able to test this myself....
> Even without high-order considerations, this patch would appear to make
> fairly large changes to how direct reclaim behaves. It would no longer
> wait on page writeback for example so direct reclaim will return sooner
AFAICT it still waits for pages under writeback in exactly the same manner
it does now. shrink_page_list() does the following completely
separately to the sc->may_writepage flag:
666 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
667 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
668
669 if (PageWriteback(page)) {
670 /*
671 * Synchronous reclaim is performed in two passes,
672 * first an asynchronous pass over the list to
673 * start parallel writeback, and a second synchronous
674 * pass to wait for the IO to complete. Wait here
675 * for any page for which writeback has already
676 * started.
677 */
678 if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
679 wait_on_page_writeback(page);
680 else
681 goto keep_locked;
682 }
So if the page is under writeback, PAGEOUT_IO_SYNC is set and
we can enter the fs, it will still wait for writeback to complete
just like it does now.
However, the current code only uses PAGEOUT_IO_SYNC in lumpy
reclaim, so for most typical workloads direct reclaim does not wait
on page writeback, either. Hence, this patch doesn't appear to
change the actions taken on a page under writeback in direct
reclaim....
> than it did potentially going OOM if there were a lot of dirty pages and
> it made no progress during direct reclaim.
I did a fair bit of low/small memory testing. This is a subjective
observation, but I definitely seemed to get less severe OOM
situations and better overall responsiveness with this patch
compared to when direct reclaim was doing writeback.
> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
> >
>
> If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
> instead of GFP_KERNEL.
This problem is not a filesystem recursion problem which is, as I
understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
code that uses significant stack before trying to allocate memory
that is the problem, e.g. a select() system call:
Depth Size Location (47 entries)
----- ---- --------
0) 7568 16 mempool_alloc_slab+0x16/0x20
1) 7552 144 mempool_alloc+0x65/0x140
2) 7408 96 get_request+0x124/0x370
3) 7312 144 get_request_wait+0x29/0x1b0
4) 7168 96 __make_request+0x9b/0x490
5) 7072 208 generic_make_request+0x3df/0x4d0
6) 6864 80 submit_bio+0x7c/0x100
7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
....
32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
33) 3120 384 shrink_page_list+0x65e/0x840
34) 2736 528 shrink_zone+0x63f/0xe10
35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
36) 2096 128 try_to_free_pages+0x77/0x80
37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
38) 1728 48 alloc_pages_current+0x8c/0xe0
39) 1680 16 __get_free_pages+0xe/0x50
40) 1664 48 __pollwait+0xca/0x110
41) 1616 32 unix_poll+0x28/0xc0
42) 1584 16 sock_poll+0x1d/0x20
43) 1568 912 do_select+0x3d6/0x700
44) 656 416 core_sys_select+0x18c/0x2c0
45) 240 112 sys_select+0x4f/0x110
46) 128 128 system_call_fastpath+0x16/0x1b
There's 1.6k of stack used before memory allocation is called, 3.1k
used there before ->writepage is entered, XFS used 3.5k, and
if the mempool needed to allocate a page it would have blown the
stack. If there was any significant storage subsystem (add dm, md
and/or scsi of some kind), it would have blown the stack.
Basically, there is not enough stack space available to allow direct
reclaim to enter ->writepage _anywhere_ according to the stack usage
profiles we are seeing here....
Cheers,
Dave.
--
Dave Chinner
[email protected]
Hi
> > Pros:
> > 1) prevent XFS stack overflow
> > 2) improve io workload performance
> >
> > Cons:
> > 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
> >
> > So, If we only need to consider io workload this is no downside. but
> > it can't.
> >
> > I think (1) is XFS issue. XFS should care it itself.
>
> The filesystem is irrelevant, IMO.
>
> The traces from the reporter showed that we've got close to a 2k
> stack footprint for memory allocation to direct reclaim and then we
> can put the entire writeback path on top of that. This is roughly
> 3.5k for XFS, and then depending on the storage subsystem
> configuration and transport can be another 2k of stack needed below
> XFS.
>
> IOWs, if we completely ignore the filesystem stack usage, there's
> still up to 4k of stack needed in the direct reclaim path. Given
> that one of the stack traces supplied show direct reclaim being
> entered with over 3k of stack already used, pretty much any
> filesystem is capable of blowing an 8k stack.
>
> So, this is not an XFS issue, even though XFS is the first to
> uncover it. Don't shoot the messenger....
Thanks for the explanation. I hadn't noticed that direct reclaim consumes
2k of stack. I'll investigate it and try to put it on a diet.
But XFS's 3.5k stack consumption is too large as well; please put it on a diet too.
> > but (2) is really
> > VM issue. Now our VM makes too agressive pageout() and decrease io
> > throughput. I've heard this issue from Chris (cc to him). I'd like to
> > fix this.
>
> I didn't expect this to be easy. ;)
>
> I had a good look at what the code was doing before I wrote the
> patch, and IMO, there is no good reason for issuing IO from direct
> reclaim.
>
> My reasoning is as follows - consider a system with a typical
> sata disk and the machine is low on memory and in direct reclaim.
>
> direct reclaim is taking pages of the end of the LRU and writing
> them one at a time from there. It is scanning thousands of pages
> pages and it triggers IO on on the dirty ones it comes across.
> This is done with no regard to the IO patterns it generates - it can
> (and frequently does) result in completely random single page IO
> patterns hitting the disk, and as a result cleaning pages happens
> really, really slowly. If we are in a OOM situation, the machine
> will grind to a halt as it struggles to clean maybe 1MB of RAM per
> second.
>
> On the other hand, if the IO is well formed then the disk might be
> capable of 100MB/s. The background flusher threads and filesystems
> try very hard to issue well formed IOs, so the difference in the
> rate that memory can be cleaned may be a couple of orders of
> magnitude.
>
> (Of course, the difference will typically be somewhere in between
> these two extremes, but I'm simply trying to illustrate how big
> the difference in performance can be.)
>
> IOWs, the background flusher threads are there to clean memory by
> issuing IO as efficiently as possible. Direct reclaim is very
> efficient at reclaiming clean memory, but it really, really sucks at
> cleaning dirty memory in a predictable and deterministic manner. It
> is also much more likely to hit worst case IO patterns than the
> background flusher threads.
>
> Hence I think that direct reclaim should be deferring to the
> background flusher threads for cleaning memory and not trying to be
> doing it itself.
Well, you seem to keep discussing the IO workload. I don't disagree
with that point.
For example, if only order-0 reclaim skipped pageout(), we would get the
above benefit too.
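One possible shape of that narrower alternative - purely an illustrative sketch, not a tested patch - is to keep pageout() available to kswapd and to high-order (lumpy) direct reclaim, and only skip it for order-0 direct reclaim:

/*
 * Illustrative sketch only: the decision table for "only order-0 direct
 * reclaim skips pageout()". kswapd and high-order (lumpy) reclaim keep
 * writing pages as they do today.
 */
#include <stdbool.h>
#include <stdio.h>

static bool may_pageout(bool is_kswapd, int order)
{
	if (is_kswapd)
		return true;    /* kswapd keeps cleaning pages */
	return order > 0;       /* lumpy reclaim still writes, order-0 does not */
}

int main(void)
{
	printf("kswapd, order 0         -> %s\n", may_pageout(true, 0)  ? "pageout" : "skip");
	printf("direct reclaim, order 0 -> %s\n", may_pageout(false, 0) ? "pageout" : "skip");
	printf("direct reclaim, order 3 -> %s\n", may_pageout(false, 3) ? "pageout" : "skip");
	return 0;
}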
> > but we never kill pageout() completely because we can't
> > assume users don't run high order allocation workload.
>
> I think that lumpy reclaim will still work just fine.
>
> Lumpy reclaim appears to be using IO as a method of slowing
> down the reclaim cycle - the congestion_wait() call will still
> function as it does now if the background flusher threads are active
> and causing congestion. I don't see why lumpy reclaim specifically
> needs to be issuing IO to make it work - if the congestion_wait() is
> not waiting long enough then wait longer - don't issue IO to extend
> the wait time.
Lumpy reclaim is for allocating high-order pages. It reclaims not only the
page at the head of the LRU, but also its PFN neighborhood. The PFN
neighborhood is often made up of newly dirtied pages, so we force pageout()
to clean them and then discard them.
When a high-order allocation occurs, we don't just need to free a
sufficient amount of memory, we also need to free a large enough
contiguous memory block.
If we only needed to consider IO throughput, waiting for the flusher thread
might well be faster, but we also need to consider reclaim latency. I'm
worried about that point too.
> Also, there doesn't appear to be anything special about the chunks of
> pages it's issuing IO on and waiting for, either. They are simply
> the last N pages on the LRU that could be grabbed so they have no
> guarantee of contiguity, so the IO it issues does nothing specific
> to help higher order allocations to succeed.
It does. Lumpy reclaim doesn't just grab the last N pages; instead it grabs
a contiguous memory chunk. Please see isolate_lru_pages().
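For reference, the contiguous chunk is chosen by simple PFN arithmetic: the cursor page's PFN is rounded down to the 2^order boundary and the whole aligned block is scanned for isolation. Below is a standalone sketch of that calculation, modelled on isolate_lru_pages() from memory rather than copied from the kernel; the example PFN and order are made up.

/*
 * Standalone sketch (not the kernel code) of the PFN arithmetic lumpy
 * reclaim uses: round the cursor page's PFN down to the 2^order boundary
 * and scan the whole aligned block for pages that can be isolated.
 */
#include <stdio.h>

static void lumpy_scan_range(unsigned long page_pfn, unsigned int order,
			     unsigned long *start_pfn, unsigned long *end_pfn)
{
	*start_pfn = page_pfn & ~((1UL << order) - 1);	/* align down */
	*end_pfn = *start_pfn + (1UL << order);		/* one past the block */
}

int main(void)
{
	unsigned long start, end;

	/* e.g. the LRU hands us PFN 123457 and we want an order-9 (2MB) block */
	lumpy_scan_range(123457UL, 9, &start, &end);
	printf("scan PFNs [%lu, %lu) around PFN 123457\n", start, end);
	return 0;
}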
>
> Hence it really seems to me that the effectiveness of lumpy reclaim
> is determined mostly by the effectiveness of the IO subsystem - the
> faster the IO subsystem cleans pages, the less time lumpy reclaim
> will block and the faster it will free pages. From this observation
> and the fact that issuing IO only from the bdi flusher threads will
> have the same effect (improves IO subsystem effectiveness), it seems
> to me that lumpy reclaim should not be adversely affected by this
> change.
>
> Of course, the code is a maze of twisty passages, so I probably
> missed something important. Hopefully someone can tell me what. ;)
>
> FWIW, the biggest problem here is that I have absolutely no clue on
> how to test what the impact on lumpy reclaim really is. Does anyone
> have a relatively simple test that can be run to determine what the
> impact is?
So, can you please run two workloads concurrently?
- Normal IO workload (fio, iozone, etc..)
- echo $NUM > /proc/sys/vm/nr_hugepages
The most typical high-order allocations are caused by brutal wireless LAN
drivers (or some cheap LAN cards).
But sadly, if the test depends on specific hardware, our discussion could
easily turn into a mess, so I'd prefer to use the hugepage feature instead.
Thanks.
On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > > Pros:
> > > 1) prevent XFS stack overflow
> > > 2) improve io workload performance
> > >
> > > Cons:
> > > 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
> > >
> > > So, If we only need to consider io workload this is no downside. but
> > > it can't.
> > >
> > > I think (1) is XFS issue. XFS should care it itself.
> >
> > The filesystem is irrelevant, IMO.
> >
> > The traces from the reporter showed that we've got close to a 2k
> > stack footprint for memory allocation to direct reclaim and then we
> > can put the entire writeback path on top of that. This is roughly
> > 3.5k for XFS, and then depending on the storage subsystem
> > configuration and transport can be another 2k of stack needed below
> > XFS.
> >
> > IOWs, if we completely ignore the filesystem stack usage, there's
> > still up to 4k of stack needed in the direct reclaim path. Given
> > that one of the stack traces supplied show direct reclaim being
> > entered with over 3k of stack already used, pretty much any
> > filesystem is capable of blowing an 8k stack.
> >
> > So, this is not an XFS issue, even though XFS is the first to
> > uncover it. Don't shoot the messenger....
>
> Thanks explanation. I haven't noticed direct reclaim consume
> 2k stack. I'll investigate it and try diet it.
> But XFS 3.5K stack consumption is too large too. please diet too.
It hasn't grown in the last 2 years, since the major diet where all the
fat was trimmed from it in the last round of the i386 4k stack vs XFS
saga. It seems that everything else around XFS has grown in that time,
and now we are blowing stacks again....
> > Hence I think that direct reclaim should be deferring to the
> > background flusher threads for cleaning memory and not trying to be
> > doing it itself.
>
> Well, you seems continue to discuss io workload. I don't disagree
> such point.
>
> example, If only order-0 reclaim skip pageout(), we will get the above
> benefit too.
But it won't prevent stack blowups...
> > > but we never kill pageout() completely because we can't
> > > assume users don't run high order allocation workload.
> >
> > I think that lumpy reclaim will still work just fine.
> >
> > Lumpy reclaim appears to be using IO as a method of slowing
> > down the reclaim cycle - the congestion_wait() call will still
> > function as it does now if the background flusher threads are active
> > and causing congestion. I don't see why lumpy reclaim specifically
> > needs to be issuing IO to make it work - if the congestion_wait() is
> > not waiting long enough then wait longer - don't issue IO to extend
> > the wait time.
>
> lumpy reclaim is for allocation high order page. then, it not only
> reclaim LRU head page, but also its PFN neighborhood. PFN neighborhood
> is often newly page and still dirty. then we enfoce pageout cleaning
> and discard it.
Ok, I see that now - I missed the second call to __isolate_lru_page()
in isolate_lru_pages().
> When high order allocation occur, we don't only need free enough amount
> memory, but also need free enough contenious memory block.
Agreed, that was why I was kind of surprised not to find it was
doing that. But, as you have pointed out, that was my mistake.
> If we need to consider _only_ io throughput, waiting flusher thread
> might faster perhaps, but actually we also need to consider reclaim
> latency. I'm worry about such point too.
True, but without knowing how to test and measure such things I can't
really comment...
> > Of course, the code is a maze of twisty passages, so I probably
> > missed something important. Hopefully someone can tell me what. ;)
> >
> > FWIW, the biggest problem here is that I have absolutely no clue on
> > how to test what the impact on lumpy reclaim really is. Does anyone
> > have a relatively simple test that can be run to determine what the
> > impact is?
>
> So, can you please run two workloads concurrently?
> - Normal IO workload (fio, iozone, etc..)
> - echo $NUM > /proc/sys/vm/nr_hugepages
What do I measure/observe/record that is meaningful?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Apr 13, 2010 at 09:19:02PM +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2010 at 10:58:15AM +0100, Mel Gorman wrote:
> > On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <[email protected]>
> > >
> > > When we enter direct reclaim we may have used an arbitrary amount of stack
> > > space, and hence enterring the filesystem to do writeback can then lead to
> > > stack overruns. This problem was recently encountered x86_64 systems with
> > > 8k stacks running XFS with simple storage configurations.
> > >
> > > Writeback from direct reclaim also adversely affects background writeback. The
> > > background flusher threads should already be taking care of cleaning dirty
> > > pages, and direct reclaim will kick them if they aren't already doing work. If
> > > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > > the background flusher threads to be upset by LRU-order writeback from
> > > pageout() which can be effectively random IO. Having competing sources of IO
> > > trying to clean pages on the same backing device reduces throughput by
> > > increasing the amount of seeks that the backing device has to do to write back
> > > the pages.
> > >
> >
> > It's already known that the VM requesting specific pages be cleaned and
> > reclaimed is a bad IO pattern but unfortunately it is still required by
> > lumpy reclaim. This change would appear to break that although I haven't
> > tested it to be 100% sure.
>
> How do you test it? I'd really like to be able to test this myself....
>
Depends. For raw effectiveness, I run a series of performance-related
benchmarks with a final test that
o Starts a number of parallel compiles that in combination are 1.25 times
the size of physical memory
o Sleep three minutes
o Start allocating huge pages recording the latency required for each one
o Record overall success rate and graph latency over time
Lumpy reclaim both increases the success rate and reduces the latency.
> > Even without high-order considerations, this patch would appear to make
> > fairly large changes to how direct reclaim behaves. It would no longer
> > wait on page writeback for example so direct reclaim will return sooner
>
> AFAICT it still waits for pages under writeback in exactly the same manner
> it does now. shrink_page_list() does the following completely
> separately to the sc->may_writepage flag:
>
> 666 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> 667 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
> 668
> 669 if (PageWriteback(page)) {
> 670 /*
> 671 * Synchronous reclaim is performed in two passes,
> 672 * first an asynchronous pass over the list to
> 673 * start parallel writeback, and a second synchronous
> 674 * pass to wait for the IO to complete. Wait here
> 675 * for any page for which writeback has already
> 676 * started.
> 677 */
> 678 if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
> 679 wait_on_page_writeback(page);
> 680 else
> 681 goto keep_locked;
> 682 }
>
Right, so it'll still wait on writeback but won't kick it off. That
would still be a fairly significant change in behaviour though. Think of
synchronous lumpy reclaim, for example, where it queues up a contiguous
batch of pages and then waits for them to write back..
> So if the page is under writeback, PAGEOUT_IO_SYNC is set and
> we can enter the fs, it will still wait for writeback to complete
> just like it does now.
>
But it would no longer be queueing them for writeback so it'd be
depending heavily on kswapd or a background cleaning daemon to clean
them.
> However, the current code only uses PAGEOUT_IO_SYNC in lumpy
> reclaim, so for most typical workloads direct reclaim does not wait
> on page writeback, either.
No, but it does queue them back on the LRU where they might be clean the
next time they are found on the list. How significant a problem this is
I couldn't tell you but it could show a corner case where a large number
of direct reclaimers are encountering dirty pages frequently and
recycling them around the LRU list instead of cleaning them.
> Hence, this patch doesn't appear to
> change the actions taken on a page under writeback in direct
> reclaim....
>
It does, but indirectly. The impact is very direct for lumpy reclaim
obviously. For other direct reclaim, pages that were at the end of the
LRU list are no longer getting cleaned before doing another lap through
the LRU list.
The consequences of the latter are harder to predict.
> > than it did potentially going OOM if there were a lot of dirty pages and
> > it made no progress during direct reclaim.
>
> I did a fair bit of low/small memory testing. This is a subjective
> observation, but I definitely seemed to get less severe OOM
> situations and better overall responisveness with this patch than
> compared to when direct reclaim was doing writeback.
>
And it is possible that it is best overall if only kswapd and the
background cleaner are queueing pages for IO. All I can say for sure is
that this does appear to hurt lumpy reclaim and does affect normal
direct reclaim where I have no predictions.
> > > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > > Set up the relevant scan_control structures to enforce this, and prevent
> > > sc->may_writepage from being set in other places in the direct reclaim path in
> > > response to other events.
> > >
> >
> > If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
> > instead of GFP_KERNEL.
>
> This problem is not a filesystem recursion problem which is, as I
> understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> code that uses signficant stack before trying to allocate memory
> that is the problem. e.g a select() system call:
>
> Depth Size Location (47 entries)
> ----- ---- --------
> 0) 7568 16 mempool_alloc_slab+0x16/0x20
> 1) 7552 144 mempool_alloc+0x65/0x140
> 2) 7408 96 get_request+0x124/0x370
> 3) 7312 144 get_request_wait+0x29/0x1b0
> 4) 7168 96 __make_request+0x9b/0x490
> 5) 7072 208 generic_make_request+0x3df/0x4d0
> 6) 6864 80 submit_bio+0x7c/0x100
> 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> ....
> 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 33) 3120 384 shrink_page_list+0x65e/0x840
> 34) 2736 528 shrink_zone+0x63f/0xe10
> 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> 36) 2096 128 try_to_free_pages+0x77/0x80
> 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> 38) 1728 48 alloc_pages_current+0x8c/0xe0
> 39) 1680 16 __get_free_pages+0xe/0x50
> 40) 1664 48 __pollwait+0xca/0x110
> 41) 1616 32 unix_poll+0x28/0xc0
> 42) 1584 16 sock_poll+0x1d/0x20
> 43) 1568 912 do_select+0x3d6/0x700
> 44) 656 416 core_sys_select+0x18c/0x2c0
> 45) 240 112 sys_select+0x4f/0x110
> 46) 128 128 system_call_fastpath+0x16/0x1b
>
> There's 1.6k of stack used before memory allocation is called, 3.1k
> used there before ->writepage is entered, XFS used 3.5k, and
> if the mempool needed to allocate a page it would have blown the
> stack. If there was any significant storage subsystem (add dm, md
> and/or scsi of some kind), it would have blown the stack.
>
> Basically, there is not enough stack space available to allow direct
> reclaim to enter ->writepage _anywhere_ according to the stack usage
> profiles we are seeing here....
>
I'm not denying the evidence but how has it been gotten away with for years
then? Prevention of writeback isn't the answer without figuring out how
direct reclaimers can queue pages for IO and in the case of lumpy reclaim
doing sync IO, then waiting on those pages.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > This problem is not a filesystem recursion problem which is, as I
> > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > code that uses signficant stack before trying to allocate memory
> > that is the problem. e.g a select() system call:
> >
> > Depth Size Location (47 entries)
> > ----- ---- --------
> > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > 1) 7552 144 mempool_alloc+0x65/0x140
> > 2) 7408 96 get_request+0x124/0x370
> > 3) 7312 144 get_request_wait+0x29/0x1b0
> > 4) 7168 96 __make_request+0x9b/0x490
> > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > 6) 6864 80 submit_bio+0x7c/0x100
> > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > ....
> > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > 33) 3120 384 shrink_page_list+0x65e/0x840
> > 34) 2736 528 shrink_zone+0x63f/0xe10
> > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > 36) 2096 128 try_to_free_pages+0x77/0x80
> > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > 39) 1680 16 __get_free_pages+0xe/0x50
> > 40) 1664 48 __pollwait+0xca/0x110
> > 41) 1616 32 unix_poll+0x28/0xc0
> > 42) 1584 16 sock_poll+0x1d/0x20
> > 43) 1568 912 do_select+0x3d6/0x700
> > 44) 656 416 core_sys_select+0x18c/0x2c0
> > 45) 240 112 sys_select+0x4f/0x110
> > 46) 128 128 system_call_fastpath+0x16/0x1b
> >
> > There's 1.6k of stack used before memory allocation is called, 3.1k
> > used there before ->writepage is entered, XFS used 3.5k, and
> > if the mempool needed to allocate a page it would have blown the
> > stack. If there was any significant storage subsystem (add dm, md
> > and/or scsi of some kind), it would have blown the stack.
> >
> > Basically, there is not enough stack space available to allow direct
> > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > profiles we are seeing here....
> >
>
> I'm not denying the evidence but how has it been gotten away with for years
> then? Prevention of writeback isn't the answer without figuring out how
> direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> doing sync IO, then waiting on those pages.
So, I've been reading along, nodding my head to Dave's side of things
because seeks are evil and direct reclaim makes seeks. I'd really love
for direct reclaim to somehow trigger writepages on large chunks instead
of doing page by page spatters of IO to the drive.
But, somewhere along the line I overlooked the part of Dave's stack trace
that said:
43) 1568 912 do_select+0x3d6/0x700
Huh, 912 bytes...for select, really? From poll.h:
/* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
additional memory. */
#define MAX_STACK_ALLOC 832
#define FRONTEND_STACK_ALLOC 256
#define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
#define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
#define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
#define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
So, select is intentionally trying to use that much stack. It should be using
GFP_NOFS if it really wants to suck down that much stack...if only the
kernel had some sort of way to dynamically allocate ram, it could try
that too.
-chris
Hi, Dave.
On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <[email protected]> wrote:
> From: Dave Chinner <[email protected]>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
I think your solution is a rather aggressive change, as Mel and Kosaki
already pointed out.
Is the flush thread aware of the system-level LRU recency of dirty pages,
or only of dirty inode recency?
Of course the flush thread can clean dirty pages faster than a direct
reclaimer. But if it isn't aware of LRU ordering, hot-page thrashing can
happen in corner cases.
It could also lose write merging.
And on non-rotating storage the seek cost isn't that big.
I think we have to consider those cases if we decide to change direct reclaim I/O.
How about separating the problem?
1. the stack hogging problem
2. direct reclaim doing random writes
and trying to solve them one by one instead of all at once.
--
Kind regards,
Minchan Kim
On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > This problem is not a filesystem recursion problem which is, as I
> > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > code that uses signficant stack before trying to allocate memory
> > > that is the problem. e.g a select() system call:
> > >
> > > Depth Size Location (47 entries)
> > > ----- ---- --------
> > > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > > 1) 7552 144 mempool_alloc+0x65/0x140
> > > 2) 7408 96 get_request+0x124/0x370
> > > 3) 7312 144 get_request_wait+0x29/0x1b0
> > > 4) 7168 96 __make_request+0x9b/0x490
> > > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > > 6) 6864 80 submit_bio+0x7c/0x100
> > > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > ....
> > > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > 33) 3120 384 shrink_page_list+0x65e/0x840
> > > 34) 2736 528 shrink_zone+0x63f/0xe10
> > > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > > 36) 2096 128 try_to_free_pages+0x77/0x80
> > > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > > 39) 1680 16 __get_free_pages+0xe/0x50
> > > 40) 1664 48 __pollwait+0xca/0x110
> > > 41) 1616 32 unix_poll+0x28/0xc0
> > > 42) 1584 16 sock_poll+0x1d/0x20
> > > 43) 1568 912 do_select+0x3d6/0x700
> > > 44) 656 416 core_sys_select+0x18c/0x2c0
> > > 45) 240 112 sys_select+0x4f/0x110
> > > 46) 128 128 system_call_fastpath+0x16/0x1b
> > >
> > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > used there before ->writepage is entered, XFS used 3.5k, and
> > > if the mempool needed to allocate a page it would have blown the
> > > stack. If there was any significant storage subsystem (add dm, md
> > > and/or scsi of some kind), it would have blown the stack.
> > >
> > > Basically, there is not enough stack space available to allow direct
> > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > profiles we are seeing here....
> > >
> >
> > I'm not denying the evidence but how has it been gotten away with for years
> > then? Prevention of writeback isn't the answer without figuring out how
> > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > doing sync IO, then waiting on those pages.
>
> So, I've been reading along, nodding my head to Dave's side of things
> because seeks are evil and direct reclaim makes seeks. I'd really loev
> for direct reclaim to somehow trigger writepages on large chunks instead
> of doing page by page spatters of IO to the drive.
Perhaps drop the lock on the page if it is held and call one of the
helpers that filesystems use to do this, like:
filemap_write_and_wait(page->mapping);
> But, somewhere along the line I overlooked the part of Dave's stack trace
> that said:
>
> 43) 1568 912 do_select+0x3d6/0x700
>
> Huh, 912 bytes...for select, really? From poll.h:
Sure, it's bad, but focussing on the specific case misses the
point that even code that is using minimal stack can enter direct
reclaim after consuming 1.5k of stack. e.g.:
50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
51) 3104 384 shrink_page_list+0x65e/0x840
52) 2720 528 shrink_zone+0x63f/0xe10
53) 2192 112 do_try_to_free_pages+0xc2/0x3c0
54) 2080 128 try_to_free_pages+0x77/0x80
55) 1952 240 __alloc_pages_nodemask+0x3e4/0x710
56) 1712 48 alloc_pages_current+0x8c/0xe0
57) 1664 32 __page_cache_alloc+0x67/0x70
58) 1632 144 __do_page_cache_readahead+0xd3/0x220
59) 1488 16 ra_submit+0x21/0x30
60) 1472 80 ondemand_readahead+0x11d/0x250
61) 1392 64 page_cache_async_readahead+0xa9/0xe0
62) 1328 592 __generic_file_splice_read+0x48a/0x530
63) 736 48 generic_file_splice_read+0x4f/0x90
64) 688 96 xfs_splice_read+0xf2/0x130 [xfs]
65) 592 32 xfs_file_splice_read+0x4b/0x50 [xfs]
66) 560 64 do_splice_to+0x77/0xb0
67) 496 112 splice_direct_to_actor+0xcc/0x1c0
68) 384 80 do_splice_direct+0x57/0x80
69) 304 96 do_sendfile+0x16c/0x1e0
70) 208 80 sys_sendfile64+0x8d/0xb0
71) 128 128 system_call_fastpath+0x16/0x1b
Yes, __generic_file_splice_read() is a hog, but they seem to be
_everywhere_ today...
> So, select is intentionally trying to use that much stack. It should be using
> GFP_NOFS if it really wants to suck down that much stack...
The code that did the allocation is called from multiple different
contexts - how is it supposed to know that in some of those contexts
it is supposed to treat memory allocation differently?
This is my point - if you introduce a new semantic to memory allocation
that is "use GFP_NOFS when you are using too much stack" and too much
stack is more than 15% of the stack, then pretty much every code path
will need to set that flag...
> if only the
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.
Sure, but to play the devil's advocate: if memory allocation blows
the stack, then surely avoiding allocation by using stack variables
is safer? ;)
FWIW, even if we use GFP_NOFS, allocation+reclaim can still use 2k
of stack; stuff like the radix tree code appears to be a significant
user of stack now:
Depth Size Location (56 entries)
----- ---- --------
0) 7904 48 __call_rcu+0x67/0x190
1) 7856 16 call_rcu_sched+0x15/0x20
2) 7840 16 call_rcu+0xe/0x10
3) 7824 272 radix_tree_delete+0x159/0x2e0
4) 7552 32 __remove_from_page_cache+0x21/0x110
5) 7520 64 __remove_mapping+0xe8/0x130
6) 7456 384 shrink_page_list+0x400/0x860
7) 7072 528 shrink_zone+0x636/0xdc0
8) 6544 112 do_try_to_free_pages+0xc2/0x3c0
9) 6432 112 try_to_free_pages+0x64/0x70
10) 6320 256 __alloc_pages_nodemask+0x3d2/0x710
11) 6064 48 alloc_pages_current+0x8c/0xe0
12) 6016 32 __page_cache_alloc+0x67/0x70
13) 5984 80 find_or_create_page+0x50/0xb0
14) 5904 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
or even just calling ->releasepage and freeing bufferheads:
Depth Size Location (55 entries)
----- ---- --------
0) 7440 48 add_partial+0x26/0x90
1) 7392 64 __slab_free+0x1a9/0x380
2) 7328 64 kmem_cache_free+0xb9/0x160
3) 7264 16 free_buffer_head+0x25/0x50
4) 7248 64 try_to_free_buffers+0x79/0xc0
5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
6) 7024 16 try_to_release_page+0x33/0x60
7) 7008 384 shrink_page_list+0x585/0x860
8) 6624 528 shrink_zone+0x636/0xdc0
9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
10) 5984 112 try_to_free_pages+0x64/0x70
11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
12) 5616 48 alloc_pages_current+0x8c/0xe0
13) 5568 32 __page_cache_alloc+0x67/0x70
14) 5536 80 find_or_create_page+0x50/0xb0
15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
And another eye-opening example, this time deep in the sata driver
layer:
Depth Size Location (72 entries)
----- ---- --------
0) 8336 304 select_task_rq_fair+0x235/0xad0
1) 8032 96 try_to_wake_up+0x189/0x3f0
2) 7936 16 default_wake_function+0x12/0x20
3) 7920 32 autoremove_wake_function+0x16/0x40
4) 7888 64 __wake_up_common+0x5a/0x90
5) 7824 64 __wake_up+0x48/0x70
6) 7760 64 insert_work+0x9f/0xb0
7) 7696 48 __queue_work+0x36/0x50
8) 7648 16 queue_work_on+0x4d/0x60
9) 7632 16 queue_work+0x1f/0x30
10) 7616 16 queue_delayed_work+0x2d/0x40
11) 7600 32 ata_pio_queue_task+0x35/0x40
12) 7568 48 ata_sff_qc_issue+0x146/0x2f0
13) 7520 96 mv_qc_issue+0x12d/0x540 [sata_mv]
14) 7424 96 ata_qc_issue+0x1fe/0x320
15) 7328 64 ata_scsi_translate+0xae/0x1a0
16) 7264 64 ata_scsi_queuecmd+0xbf/0x2f0
17) 7200 48 scsi_dispatch_cmd+0x114/0x2b0
18) 7152 96 scsi_request_fn+0x419/0x590
19) 7056 32 __blk_run_queue+0x82/0x150
20) 7024 48 elv_insert+0x1aa/0x2d0
21) 6976 48 __elv_add_request+0x83/0xd0
22) 6928 96 __make_request+0x139/0x490
23) 6832 208 generic_make_request+0x3df/0x4d0
24) 6624 80 submit_bio+0x7c/0x100
25) 6544 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
We need at least _700_ bytes of stack free just to call queue_work(),
and that now happens deep in the guts of the driver subsystem below XFS.
This trace shows 1.8k of stack usage on a simple, single sata disk
storage subsystem, so my estimate of 2k of stack for the storage system
below XFS is too small - a worst case of 2.5-3k of stack space is probably
closer to the mark.
This is the sort of thing I'm pointing at when I say that stack
usage outside XFS has grown significantly over the
past couple of years. Given XFS has remained pretty much the same or
even reduced slightly over the same time period, blaming XFS or
saying "callers should use GFP_NOFS" seems like a cop-out to me.
Regardless of the IO pattern performance issues, writeback via
direct reclaim just uses too much stack to be safe these days...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > have a relatively simple test that can be run to determine what the
> > > impact is?
> >
> > So, can you please run two workloads concurrently?
> > - Normal IO workload (fio, iozone, etc..)
> > - echo $NUM > /proc/sys/vm/nr_hugepages
>
> What do I measure/observe/record that is meaningful?
So, a rough as guts first pass - just run a large dd (8 times the
size of memory - an 8GB file vs 1GB RAM) and repeatedly try to allocate
all of memory in huge pages (500 of them) every 5 seconds. The IO
rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
The script:
$ cat t.sh
#!/bin/bash
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &
(
	for i in `seq 1 1 20`; do
		sleep 5
		/usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
		grep HugePages_Total /proc/meminfo
	done
) | awk '
	/wall/ { wall += $2; cnt += 1 }
	/Pages/ { pages[cnt] = $2 }
	END { printf "average wall time %f\nPages step: ", wall / cnt ;
		for (i = 1; i <= cnt; i++) {
			printf "%d ", pages[i];
		}
	}'
----
And the output looks like:
$ sudo ./t.sh
average wall time 0.954500
Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
$
Run 50 times in a loop, and the outputs averaged, the existing lumpy
reclaim resulted in:
dave@test-1:~$ cat current.txt | awk -f av.awk
av. wall = 0.519385 secs
av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420
And with my patch that disables ->writepage:
dave@test-1:~$ cat no-direct.txt | awk -f av.awk
av. wall = 0.554163 secs
av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439
Basically, with my patch lumpy reclaim was *substantially* more
effective with only a slight increase in average allocation latency
with this test case.
I need to add a marker to the output that records when the dd
completes, but from monitoring the writeback rates via PCP, they
were in the ballpark of 85-100MB/s for the existing code, and
95-110MB/s with my patch. Hence it improved both IO throughput and
the effectiveness of lumpy reclaim.
On the down side, I did have an OOM killer invocation with my patch
after about 150 iterations - dd failed an order zero allocation
because there were 455 huge pages allocated and there were only
_320_ available pages for IO, all of which were under IO. i.e. lumpy
reclaim worked so well that the machine got into order-0 page
starvation.
I know this is a simple test case, but it shows much better results
than I think anyone (even me) is expecting...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Apr 14, 2010 at 09:24:33AM +0900, Minchan Kim wrote:
> Hi, Dave.
>
> On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <[email protected]> wrote:
> > From: Dave Chinner <[email protected]>
> >
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence enterring the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> >
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the amount of seeks that the backing device has to do to write back
> > the pages.
> >
> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
>
> I think your solution is rather aggressive change as Mel and Kosaki
> already pointed out.
It may be aggressive, but writeback from direct reclaim is, IMO, one
of the worst aspects of the current VM design because of its
adverse effect on the IO subsystem.
I'd prefer to remove it completely rather than continue to try and patch
around it, especially given that everyone seems to agree that it
does have an adverse effect on IO...
> Does the flusher thread write back dirty pages according to system-level
> LRU recency, or only per-inode dirty page recency?
It writes back in the order inodes were dirtied. i.e. the LRU is a
coarser measure, but it is still definitely there. It also takes
into account fairness of IO between dirty inodes, so no one dirty
inode prevents IO being issued on the other dirty inodes on the
LRU...
> Of course the flusher thread can clean dirty pages faster than a direct reclaimer.
> But if it doesn't respect LRU ordering, hot page thrashing can happen in
> corner cases.
> It could also lose write merging.
>
> And on non-rotational storage the seek cost might not be that big.
Non-rotational storage still goes faster when it is fed large, well
formed IOs.
> I think we have to consider that case if we decide to change direct reclaim I/O.
>
> How do we separate the problem?
>
> 1. stack hogging problem.
> 2. direct reclaim random write.
AFAICT, the only way to _reliably_ avoid the stack usage problem is
to avoid writeback in direct reclaim. That has the side effect of
fixing #2 as well, so do they really need separating?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, 14 Apr 2010 11:40:41 +1000
Dave Chinner <[email protected]> wrote:
> 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 51) 3104 384 shrink_page_list+0x65e/0x840
> 52) 2720 528 shrink_zone+0x63f/0xe10
A bit OFF TOPIC.
Could you share a disassembly of shrink_zone()?
In my environment:
00000000000115a0 <shrink_zone>:
115a0: 55 push %rbp
115a1: 48 89 e5 mov %rsp,%rbp
115a4: 41 57 push %r15
115a6: 41 56 push %r14
115a8: 41 55 push %r13
115aa: 41 54 push %r12
115ac: 53 push %rbx
115ad: 48 83 ec 78 sub $0x78,%rsp
115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
until return.
I may misunderstand something...
Thanks,
-Kame
On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 14 Apr 2010 11:40:41 +1000
> Dave Chinner <[email protected]> wrote:
>
> > 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > 51) 3104 384 shrink_page_list+0x65e/0x840
> > 52) 2720 528 shrink_zone+0x63f/0xe10
>
> A bit OFF TOPIC.
>
> Could you share disassemble of shrink_zone() ?
>
> In my environ.
> 00000000000115a0 <shrink_zone>:
> 115a0: 55 push %rbp
> 115a1: 48 89 e5 mov %rsp,%rbp
> 115a4: 41 57 push %r15
> 115a6: 41 56 push %r14
> 115a8: 41 55 push %r13
> 115aa: 41 54 push %r12
> 115ac: 53 push %rbx
> 115ad: 48 83 ec 78 sub $0x78,%rsp
> 115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
> 115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
>
> disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> until return.
I see the same. I didn't compile those kernels, though. IIUC,
they were built through the Ubuntu build infrastructure, so there is
something different in terms of compiler, compiler options or config
to what we are both using. Most likely it is the compiler inlining,
though Chris's patches to prevent that didn't seem to change the
stack usage.
I'm trying to get a stack trace from the kernel that has shrink_zone
in it, but I haven't succeeded yet....
Cheers,
Dave.
--
Dave Chinner
[email protected]
> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 14 Apr 2010 11:40:41 +1000
> > Dave Chinner <[email protected]> wrote:
> >
> > > 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > 51) 3104 384 shrink_page_list+0x65e/0x840
> > > 52) 2720 528 shrink_zone+0x63f/0xe10
> >
> > A bit OFF TOPIC.
> >
> > Could you share disassemble of shrink_zone() ?
> >
> > In my environ.
> > 00000000000115a0 <shrink_zone>:
> > 115a0: 55 push %rbp
> > 115a1: 48 89 e5 mov %rsp,%rbp
> > 115a4: 41 57 push %r15
> > 115a6: 41 56 push %r14
> > 115a8: 41 55 push %r13
> > 115aa: 41 54 push %r12
> > 115ac: 53 push %rbx
> > 115ad: 48 83 ec 78 sub $0x78,%rsp
> > 115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
> > 115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
> >
> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> > until return.
>
> I see the same. I didn't compile those kernels, though. IIUC,
> they were built through the Ubuntu build infrastructure, so there is
> something different in terms of compiler, compiler options or config
> to what we are both using. Most likely it is the compiler inlining,
> though Chris's patches to prevent that didn't seem to change the
> stack usage.
>
> I'm trying to get a stack trace from the kernel that has shrink_zone
> in it, but I haven't succeeded yet....
I also got 0x78 bytes of stack usage. Umm.. are we discussing the real issue now?
On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>> > On Wed, 14 Apr 2010 11:40:41 +1000
>> > Dave Chinner <[email protected]> wrote:
>> >
>> > > 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
>> > > 51) 3104 384 shrink_page_list+0x65e/0x840
>> > > 52) 2720 528 shrink_zone+0x63f/0xe10
>> >
>> > A bit OFF TOPIC.
>> >
>> > Could you share disassemble of shrink_zone() ?
>> >
>> > In my environ.
>> > 00000000000115a0 <shrink_zone>:
>> > 115a0: 55 push %rbp
>> > 115a1: 48 89 e5 mov %rsp,%rbp
>> > 115a4: 41 57 push %r15
>> > 115a6: 41 56 push %r14
>> > 115a8: 41 55 push %r13
>> > 115aa: 41 54 push %r12
>> > 115ac: 53 push %rbx
>> > 115ad: 48 83 ec 78 sub $0x78,%rsp
>> > 115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
>> > 115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
>> >
>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
>> > until return.
>>
>> I see the same. I didn't compile those kernels, though. IIUC,
>> they were built through the Ubuntu build infrastructure, so there is
>> something different in terms of compiler, compiler options or config
>> to what we are both using. Most likely it is the compiler inlining,
>> though Chris's patches to prevent that didn't seem to change the
>> stack usage.
>>
>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> in it, but I haven't succeeded yet....
>
> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
>
In my case, it's 0x110 bytes on a 32-bit machine.
I think the same is possible on a 64-bit machine.
00001830 <shrink_zone>:
1830: 55 push %ebp
1831: 89 e5 mov %esp,%ebp
1833: 57 push %edi
1834: 56 push %esi
1835: 53 push %ebx
1836: 81 ec 10 01 00 00 sub $0x110,%esp
183c: 89 85 24 ff ff ff mov %eax,-0xdc(%ebp)
1842: 89 95 20 ff ff ff mov %edx,-0xe0(%ebp)
1848: 89 8d 1c ff ff ff mov %ecx,-0xe4(%ebp)
184e: 8b 41 04 mov 0x4(%ecx)
My gcc is as follows.
barrios@barriostarget:~/mmotm$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
4.3.3-5ubuntu4'
--with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-targets=all --with-tune=generic
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
--target=i486-linux-gnu
Thread model: posix
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
Does it depend on the config?
I've attached my config.
--
Kind regards,
Minchan Kim
> On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > > > Pros:
> > > > 1) prevent XFS stack overflow
> > > > 2) improve io workload performance
> > > >
> > > > Cons:
> > > > 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
> > > >
> > > > So, If we only need to consider io workload this is no downside. but
> > > > it can't.
> > > >
> > > > I think (1) is XFS issue. XFS should care it itself.
> > >
> > > The filesystem is irrelevant, IMO.
> > >
> > > The traces from the reporter showed that we've got close to a 2k
> > > stack footprint for memory allocation to direct reclaim and then we
> > > can put the entire writeback path on top of that. This is roughly
> > > 3.5k for XFS, and then depending on the storage subsystem
> > > configuration and transport can be another 2k of stack needed below
> > > XFS.
> > >
> > > IOWs, if we completely ignore the filesystem stack usage, there's
> > > still up to 4k of stack needed in the direct reclaim path. Given
> > > that one of the stack traces supplied show direct reclaim being
> > > entered with over 3k of stack already used, pretty much any
> > > filesystem is capable of blowing an 8k stack.
> > >
> > > So, this is not an XFS issue, even though XFS is the first to
> > > uncover it. Don't shoot the messenger....
> >
> > Thanks explanation. I haven't noticed direct reclaim consume
> > 2k stack. I'll investigate it and try diet it.
> > But XFS 3.5K stack consumption is too large too. please diet too.
>
> It hasn't grown in the last 2 years after the last major diet where
> all the fat was trimmed from it in the last round of the i386 4k
> stack vs XFS saga. it seems that everything else around XFS has
> grown in that time, and now we are blowing stacks again....
I have a dumb question: if XFS hasn't bloated its stack usage, why does 3.5k of
stack usage work fine on a 4k stack kernel? It seems impossible.
Please don't think I'm blaming you. I don't know what the "4k stack vs XFS saga" is.
I merely want to understand what you said.
> > > Hence I think that direct reclaim should be deferring to the
> > > background flusher threads for cleaning memory and not trying to be
> > > doing it itself.
> >
> > Well, you seems continue to discuss io workload. I don't disagree
> > such point.
> >
> > example, If only order-0 reclaim skip pageout(), we will get the above
> > benefit too.
>
> But it won't prevent stack blowups...
>
> > > > but we never kill pageout() completely because we can't
> > > > assume users don't run high order allocation workload.
> > >
> > > I think that lumpy reclaim will still work just fine.
> > >
> > > Lumpy reclaim appears to be using IO as a method of slowing
> > > down the reclaim cycle - the congestion_wait() call will still
> > > function as it does now if the background flusher threads are active
> > > and causing congestion. I don't see why lumpy reclaim specifically
> > > needs to be issuing IO to make it work - if the congestion_wait() is
> > > not waiting long enough then wait longer - don't issue IO to extend
> > > the wait time.
> >
> > lumpy reclaim is for allocating high order pages. So it not only
> > reclaims the LRU head page, but also its PFN neighborhood. The PFN neighborhood
> > is often newly dirtied pages, so we enforce pageout cleaning
> > and discard them.
>
> Ok, I see that now - I missed the second call to __isolate_lru_pages()
> in isolate_lru_pages().
No problem. It's one of VM mess. Usual developers don't know it :-)
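(Aside for readers following the lumpy reclaim discussion: a hedged, userspace-only
worked example of the PFN-neighbourhood arithmetic described above. The pfn and
order values are made up and this is not kernel code.)
#include <stdio.h>
/*
 * For an order-N allocation, lumpy reclaim considers the naturally
 * aligned block of 2^N pages that the LRU page's pfn falls into,
 * not just the LRU page itself.
 */
int main(void)
{
	unsigned long pfn = 123457;	/* hypothetical page taken off the LRU */
	unsigned int order = 4;		/* looking for a 2^4 = 16 page block */
	unsigned long start = pfn & ~((1UL << order) - 1);
	unsigned long end = start + (1UL << order);

	printf("scan pfns [%lu, %lu) around pfn %lu\n", start, end, pfn);
	return 0;
}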
> > When a high order allocation occurs, we don't only need to free enough
> > memory, we also need to free a large enough contiguous memory block.
>
> Agreed, that was why I was kind of surprised not to find it was
> doing that. But, as you have pointed out, that was my mistake.
>
> > If we needed to consider _only_ IO throughput, waiting for the flusher thread
> > might be faster, but we also need to consider reclaim
> > latency. I worry about that point too.
>
> True, but without knowing how to test and measure such things I can't
> really comment...
Agreed. I know making a VM measurement benchmark is very difficult, but
it is probably necessary....
I'm sorry, I can't give you a good, convenient benchmark right now.
>
> > > Of course, the code is a maze of twisty passages, so I probably
> > > missed something important. Hopefully someone can tell me what. ;)
> > >
> > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > have a relatively simple test that can be run to determine what the
> > > impact is?
> >
> > So, can you please run two workloads concurrently?
> > - Normal IO workload (fio, iozone, etc..)
> > - echo $NUM > /proc/sys/vm/nr_hugepages
>
> What do I measure/observe/record that is meaningful?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
> On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > This problem is not a filesystem recursion problem which is, as I
> > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > code that uses signficant stack before trying to allocate memory
> > > that is the problem. e.g a select() system call:
> > >
> > > Depth Size Location (47 entries)
> > > ----- ---- --------
> > > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > > 1) 7552 144 mempool_alloc+0x65/0x140
> > > 2) 7408 96 get_request+0x124/0x370
> > > 3) 7312 144 get_request_wait+0x29/0x1b0
> > > 4) 7168 96 __make_request+0x9b/0x490
> > > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > > 6) 6864 80 submit_bio+0x7c/0x100
> > > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > ....
> > > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > 33) 3120 384 shrink_page_list+0x65e/0x840
> > > 34) 2736 528 shrink_zone+0x63f/0xe10
> > > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > > 36) 2096 128 try_to_free_pages+0x77/0x80
> > > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > > 39) 1680 16 __get_free_pages+0xe/0x50
> > > 40) 1664 48 __pollwait+0xca/0x110
> > > 41) 1616 32 unix_poll+0x28/0xc0
> > > 42) 1584 16 sock_poll+0x1d/0x20
> > > 43) 1568 912 do_select+0x3d6/0x700
> > > 44) 656 416 core_sys_select+0x18c/0x2c0
> > > 45) 240 112 sys_select+0x4f/0x110
> > > 46) 128 128 system_call_fastpath+0x16/0x1b
> > >
> > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > used there before ->writepage is entered, XFS used 3.5k, and
> > > if the mempool needed to allocate a page it would have blown the
> > > stack. If there was any significant storage subsystem (add dm, md
> > > and/or scsi of some kind), it would have blown the stack.
> > >
> > > Basically, there is not enough stack space available to allow direct
> > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > profiles we are seeing here....
> > >
> >
> > I'm not denying the evidence but how has it been gotten away with for years
> > then? Prevention of writeback isn't the answer without figuring out how
> > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > doing sync IO, then waiting on those pages.
>
> So, I've been reading along, nodding my head to Dave's side of things
> because seeks are evil and direct reclaim makes seeks. I'd really loev
> for direct reclaim to somehow trigger writepages on large chunks instead
> of doing page by page spatters of IO to the drive.
>
> But, somewhere along the line I overlooked the part of Dave's stack trace
> that said:
>
> 43) 1568 912 do_select+0x3d6/0x700
>
> Huh, 912 bytes...for select, really? From poll.h:
>
> /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> additional memory. */
> #define MAX_STACK_ALLOC 832
> #define FRONTEND_STACK_ALLOC 256
> #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
>
> So, select is intentionally trying to use that much stack. It should be using
> GFP_NOFS if it really wants to suck down that much stack...if only the
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.
Yeah, of course it's too much. I would propose to revert 70674f95c0.
But I doubt GFP_NOFS solves our issue.
> On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > > have a relatively simple test that can be run to determine what the
> > > > impact is?
> > >
> > > So, can you please run two workloads concurrently?
> > > - Normal IO workload (fio, iozone, etc..)
> > > - echo $NUM > /proc/sys/vm/nr_hugepages
> >
> > What do I measure/observe/record that is meaningful?
>
> So, a rough as guts first pass - just run a large dd (8 times the
> size of memory - 8GB file vs 1GB RAM) and repeated try to allocate
> the entire of memory in huge pages (500) every 5 seconds. The IO
> rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
>
> The script:
>
> $ cat t.sh
> #!/bin/bash
>
> echo 0 > /proc/sys/vm/nr_hugepages
> echo 3 > /proc/sys/vm/drop_caches
>
> dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &
>
> (
> for i in `seq 1 1 20`; do
> sleep 5
> /usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
> grep HugePages_Total /proc/meminfo
> done
> ) | awk '
> /wall/ { wall += $2; cnt += 1 }
> /Pages/ { pages[cnt] = $2 }
> END { printf "average wall time %f\nPages step: ", wall / cnt ;
> for (i = 1; i <= cnt; i++) {
> printf "%d ", pages[i];
> }
> }'
> ----
>
> And the output looks like:
>
> $ sudo ./t.sh
> average wall time 0.954500
> Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
> $
>
> Run 50 times in a loop, and the outputs averaged, the existing lumpy
> reclaim resulted in:
>
> dave@test-1:~$ cat current.txt | awk -f av.awk
> av. wall = 0.519385 secs
> av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420
>
> And with my patch that disables ->writepage:
>
> dave@test-1:~$ cat no-direct.txt | awk -f av.awk
> av. wall = 0.554163 secs
> av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439
>
> Basically, with my patch lumpy reclaim was *substantially* more
> effective with only a slight increase in average allocation latency
> with this test case.
>
> I need to add a marker to the output that records when the dd
> completes, but from monitoring the writeback rates via PCP, they
> were in the ballpark of 85-100MB/s for the existing code, and
> 95-110MB/s with my patch. Hence it improved both IO throughput and
> the effectiveness of lumpy reclaim.
>
> On the down side, I did have an OOM killer invocation with my patch
> after about 150 iterations - dd failed an order zero allocation
> because there were 455 huge pages allocated and there were only
> _320_ available pages for IO, all of which were under IO. i.e. lumpy
> reclaim worked so well that the machine got into order-0 page
> starvation.
>
> I know this is a simple test case, but it shows much better results
> than I think anyone (even me) is expecting...
Ummm...
Probably I have to say I'm sorry. I guess my last mail gave you the wrong
impression.
To be honest, I'm not interested in this artificial non-fragmentation case.
The above test case 1) discards all cache and 2) fills pages by streaming
IO. That creates an artificial "file offset neighbor == block neighbor == PFN neighbor"
situation, and then file-offset-order writeout by the flusher thread can create
PFN-contiguous pages effectively.
Why am I not interested in it? Because lumpy reclaim is a technique for
avoiding the external fragmentation mess. IOW, it is for avoiding the worst
case, but your test case seems to measure the best one.
> On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > This problem is not a filesystem recursion problem which is, as I
> > > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > > code that uses signficant stack before trying to allocate memory
> > > > that is the problem. e.g a select() system call:
> > > >
> > > > Depth Size Location (47 entries)
> > > > ----- ---- --------
> > > > 0) 7568 16 mempool_alloc_slab+0x16/0x20
> > > > 1) 7552 144 mempool_alloc+0x65/0x140
> > > > 2) 7408 96 get_request+0x124/0x370
> > > > 3) 7312 144 get_request_wait+0x29/0x1b0
> > > > 4) 7168 96 __make_request+0x9b/0x490
> > > > 5) 7072 208 generic_make_request+0x3df/0x4d0
> > > > 6) 6864 80 submit_bio+0x7c/0x100
> > > > 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > > ....
> > > > 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > > 33) 3120 384 shrink_page_list+0x65e/0x840
> > > > 34) 2736 528 shrink_zone+0x63f/0xe10
> > > > 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> > > > 36) 2096 128 try_to_free_pages+0x77/0x80
> > > > 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> > > > 38) 1728 48 alloc_pages_current+0x8c/0xe0
> > > > 39) 1680 16 __get_free_pages+0xe/0x50
> > > > 40) 1664 48 __pollwait+0xca/0x110
> > > > 41) 1616 32 unix_poll+0x28/0xc0
> > > > 42) 1584 16 sock_poll+0x1d/0x20
> > > > 43) 1568 912 do_select+0x3d6/0x700
> > > > 44) 656 416 core_sys_select+0x18c/0x2c0
> > > > 45) 240 112 sys_select+0x4f/0x110
> > > > 46) 128 128 system_call_fastpath+0x16/0x1b
> > > >
> > > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > > used there before ->writepage is entered, XFS used 3.5k, and
> > > > if the mempool needed to allocate a page it would have blown the
> > > > stack. If there was any significant storage subsystem (add dm, md
> > > > and/or scsi of some kind), it would have blown the stack.
> > > >
> > > > Basically, there is not enough stack space available to allow direct
> > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > profiles we are seeing here....
> > > >
> > >
> > > I'm not denying the evidence but how has it been gotten away with for years
> > > then? Prevention of writeback isn't the answer without figuring out how
> > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > doing sync IO, then waiting on those pages.
> >
> > So, I've been reading along, nodding my head to Dave's side of things
> > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > for direct reclaim to somehow trigger writepages on large chunks instead
> > of doing page by page spatters of IO to the drive.
I agree that "seeks are evil and direct reclaim makes seeks". Actually,
making 4k IOs is not a must for pageout, so we can probably improve it.
> Perhaps drop the lock on the page if it is held and call one of the
> helpers that filesystems use to do this, like:
>
> filemap_write_and_wait(page->mapping);
Sorry, I'm lost as to what you're talking about. Why do we need per-file waiting?
If the file is a 1GB file, do we need to wait for 1GB of writeout?
>
> > But, somewhere along the line I overlooked the part of Dave's stack trace
> > that said:
> >
> > 43) 1568 912 do_select+0x3d6/0x700
> >
> > Huh, 912 bytes...for select, really? From poll.h:
>
> Sure, it's bad, but focussing on the specific case misses the
> point that even code that is using minimal stack can enter direct
> reclaim after consuming 1.5k of stack. e.g.:
checkstack.pl says do_select() and __generic_file_splice_read() are among
the worst stack consumers. Both should be fixed.
Also, checkstack.pl says there aren't that many such stack eaters.
>
> 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 51) 3104 384 shrink_page_list+0x65e/0x840
> 52) 2720 528 shrink_zone+0x63f/0xe10
> 53) 2192 112 do_try_to_free_pages+0xc2/0x3c0
> 54) 2080 128 try_to_free_pages+0x77/0x80
> 55) 1952 240 __alloc_pages_nodemask+0x3e4/0x710
> 56) 1712 48 alloc_pages_current+0x8c/0xe0
> 57) 1664 32 __page_cache_alloc+0x67/0x70
> 58) 1632 144 __do_page_cache_readahead+0xd3/0x220
> 59) 1488 16 ra_submit+0x21/0x30
> 60) 1472 80 ondemand_readahead+0x11d/0x250
> 61) 1392 64 page_cache_async_readahead+0xa9/0xe0
> 62) 1328 592 __generic_file_splice_read+0x48a/0x530
> 63) 736 48 generic_file_splice_read+0x4f/0x90
> 64) 688 96 xfs_splice_read+0xf2/0x130 [xfs]
> 65) 592 32 xfs_file_splice_read+0x4b/0x50 [xfs]
> 66) 560 64 do_splice_to+0x77/0xb0
> 67) 496 112 splice_direct_to_actor+0xcc/0x1c0
> 68) 384 80 do_splice_direct+0x57/0x80
> 69) 304 96 do_sendfile+0x16c/0x1e0
> 70) 208 80 sys_sendfile64+0x8d/0xb0
> 71) 128 128 system_call_fastpath+0x16/0x1b
>
> Yes, __generic_file_splice_read() is a hog, but they seem to be
> _everywhere_ today...
>
> > So, select is intentionally trying to use that much stack. It should be using
> > GFP_NOFS if it really wants to suck down that much stack...
>
> The code that did the allocation is called from multiple different
> contexts - how is it supposed to know that in some of those contexts
> it is supposed to treat memory allocation differently?
>
> This is my point - if you introduce a new semantic to memory allocation
> that is "use GFP_NOFS when you are using too much stack" and too much
> stack is more than 15% of the stack, then pretty much every code path
> will need to set that flag...
Nodding my head to Dave's side. Changing the callers' arguments doesn't seem like a
good solution. I mean:
- do_select() should use GFP_KERNEL instead of the stack (i.e. revert 70674f95c0)
- reclaim and xfs (and some other things) need to go on a diet.
Also, I believe stack eater functions should generate a build warning. Patch attached.
> > if only the
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
>
> Sure, but to play the devil's advocate: if memory allocation blows
> the stack, then surely avoiding allocation by using stack variables
> is safer? ;)
>
> FWIW, even if we use GFP_NOFS, allocation+reclaim can still use 2k
> of stack; stuff like the radix tree code appears to be a significant
> user of stack now:
>
> Depth Size Location (56 entries)
> ----- ---- --------
> 0) 7904 48 __call_rcu+0x67/0x190
> 1) 7856 16 call_rcu_sched+0x15/0x20
> 2) 7840 16 call_rcu+0xe/0x10
> 3) 7824 272 radix_tree_delete+0x159/0x2e0
> 4) 7552 32 __remove_from_page_cache+0x21/0x110
> 5) 7520 64 __remove_mapping+0xe8/0x130
> 6) 7456 384 shrink_page_list+0x400/0x860
> 7) 7072 528 shrink_zone+0x636/0xdc0
> 8) 6544 112 do_try_to_free_pages+0xc2/0x3c0
> 9) 6432 112 try_to_free_pages+0x64/0x70
> 10) 6320 256 __alloc_pages_nodemask+0x3d2/0x710
> 11) 6064 48 alloc_pages_current+0x8c/0xe0
> 12) 6016 32 __page_cache_alloc+0x67/0x70
> 13) 5984 80 find_or_create_page+0x50/0xb0
> 14) 5904 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
>
> or even just calling ->releasepage and freeing bufferheads:
>
> Depth Size Location (55 entries)
> ----- ---- --------
> 0) 7440 48 add_partial+0x26/0x90
> 1) 7392 64 __slab_free+0x1a9/0x380
> 2) 7328 64 kmem_cache_free+0xb9/0x160
> 3) 7264 16 free_buffer_head+0x25/0x50
> 4) 7248 64 try_to_free_buffers+0x79/0xc0
> 5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
> 6) 7024 16 try_to_release_page+0x33/0x60
> 7) 7008 384 shrink_page_list+0x585/0x860
> 8) 6624 528 shrink_zone+0x636/0xdc0
> 9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
> 10) 5984 112 try_to_free_pages+0x64/0x70
> 11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
> 12) 5616 48 alloc_pages_current+0x8c/0xe0
> 13) 5568 32 __page_cache_alloc+0x67/0x70
> 14) 5536 80 find_or_create_page+0x50/0xb0
> 15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
>
> And another eye-opening example, this time deep in the sata driver
> layer:
>
> Depth Size Location (72 entries)
> ----- ---- --------
> 0) 8336 304 select_task_rq_fair+0x235/0xad0
> 1) 8032 96 try_to_wake_up+0x189/0x3f0
> 2) 7936 16 default_wake_function+0x12/0x20
> 3) 7920 32 autoremove_wake_function+0x16/0x40
> 4) 7888 64 __wake_up_common+0x5a/0x90
> 5) 7824 64 __wake_up+0x48/0x70
> 6) 7760 64 insert_work+0x9f/0xb0
> 7) 7696 48 __queue_work+0x36/0x50
> 8) 7648 16 queue_work_on+0x4d/0x60
> 9) 7632 16 queue_work+0x1f/0x30
> 10) 7616 16 queue_delayed_work+0x2d/0x40
> 11) 7600 32 ata_pio_queue_task+0x35/0x40
> 12) 7568 48 ata_sff_qc_issue+0x146/0x2f0
> 13) 7520 96 mv_qc_issue+0x12d/0x540 [sata_mv]
> 14) 7424 96 ata_qc_issue+0x1fe/0x320
> 15) 7328 64 ata_scsi_translate+0xae/0x1a0
> 16) 7264 64 ata_scsi_queuecmd+0xbf/0x2f0
> 17) 7200 48 scsi_dispatch_cmd+0x114/0x2b0
> 18) 7152 96 scsi_request_fn+0x419/0x590
> 19) 7056 32 __blk_run_queue+0x82/0x150
> 20) 7024 48 elv_insert+0x1aa/0x2d0
> 21) 6976 48 __elv_add_request+0x83/0xd0
> 22) 6928 96 __make_request+0x139/0x490
> 23) 6832 208 generic_make_request+0x3df/0x4d0
> 24) 6624 80 submit_bio+0x7c/0x100
> 25) 6544 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
>
> We need at least _700_ bytes of stack free just to call queue_work(),
> and that now happens deep in the guts of the driver subsystem below XFS.
> This trace shows 1.8k of stack usage on a simple, single sata disk
> storage subsystem, so my estimate of 2k of stack for the storage system
> below XFS is too small - a worst case of 2.5-3k of stack space is probably
> closer to the mark.
Your explanation is very interesting. I have a (probably dumb) question:
why has nobody faced stack overflow issues in the past? Now I think every user
could easily get a stack overflow if your explanation is correct.
>
> This is the sort of thing I'm pointing at when I say that stack
> usage outside XFS has grown significantly over the
> past couple of years. Given XFS has remained pretty much the same or
> even reduced slightly over the same time period, blaming XFS or
> saying "callers should use GFP_NOFS" seems like a cop-out to me.
> Regardless of the IO pattern performance issues, writeback via
> direct reclaim just uses too much stack to be safe these days...
Yeah, my answer is simple: all stack eaters should be fixed.
But XFS doesn't seem innocent either. 3.5K is quite big, although
XFS has used that much for a very long time.
===========================================================
Subject: [PATCH] kconfig: reduce FRAME_WARN default value to 512
Surprisingly, several functions now use a lot of stack.
% objdump -d vmlinux | ./scripts/checkstack.pl
0xffffffff81e3db07 get_next_block [vmlinux]: 1976
0xffffffff8130b9bd node_read_meminfo [vmlinux]: 1240
0xffffffff811553fd do_sys_poll [vmlinux]: 1000
0xffffffff8122b49d test_aead [vmlinux]: 904
0xffffffff81154c9d do_select [vmlinux]: 888
0xffffffff81168d9d default_file_splice_read [vmlinux]: 760
Oh well, every developer has to pay attention to stack usage!
Thus, this patch reduces the FRAME_WARN default value to 512.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
lib/Kconfig.debug | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ff01710..44ebba6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -28,8 +28,7 @@ config ENABLE_MUST_CHECK
config FRAME_WARN
int "Warn for stack frames larger than (needs gcc 4.4)"
range 0 8192
- default 1024 if !64BIT
- default 2048 if 64BIT
+ default 512
help
Tell gcc to warn at build time for stack frames larger than this.
Setting this too low will cause a lot of warnings.
--
1.6.5.2
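(A hedged illustration of what the lowered default would flag: CONFIG_FRAME_WARN is
passed to gcc as -Wframe-larger-than=, so with a 512 byte limit a made-up function
like the one below would trigger a build-time warning.)
#include <string.h>
/* Hypothetical example: roughly 1k of locals in a single stack frame. */
int big_frame_example(void)
{
	char buf[1024];		/* well over the proposed 512 byte limit */

	memset(buf, 42, sizeof(buf));
	return buf[0];
}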
On Wed, Apr 14, 2010 at 02:54:14PM +0900, KOSAKI Motohiro wrote:
> > On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 14 Apr 2010 11:40:41 +1000
> > > Dave Chinner <[email protected]> wrote:
> > >
> > > > 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> > > > 51) 3104 384 shrink_page_list+0x65e/0x840
> > > > 52) 2720 528 shrink_zone+0x63f/0xe10
> > >
> > > A bit OFF TOPIC.
> > >
> > > Could you share disassemble of shrink_zone() ?
> > >
> > > In my environ.
> > > 00000000000115a0 <shrink_zone>:
> > > 115a0: 55 push %rbp
> > > 115a1: 48 89 e5 mov %rsp,%rbp
> > > 115a4: 41 57 push %r15
> > > 115a6: 41 56 push %r14
> > > 115a8: 41 55 push %r13
> > > 115aa: 41 54 push %r12
> > > 115ac: 53 push %rbx
> > > 115ad: 48 83 ec 78 sub $0x78,%rsp
> > > 115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
> > > 115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
> > >
> > > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> > > until return.
> >
> > I see the same. I didn't compile those kernels, though. IIUC,
> > they were built through the Ubuntu build infrastructure, so there is
> > something different in terms of compiler, compiler options or config
> > to what we are both using. Most likely it is the compiler inlining,
> > though Chris's patches to prevent that didn't seem to change the
> > stack usage.
> >
> > I'm trying to get a stack trace from the kernel that has shrink_zone
> > in it, but I haven't succeeded yet....
>
> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
Ok, so here's a trace at the top of the stack from a kernel with
the above shrink_zone disassembly:
$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (49 entries)
----- ---- --------
0) 6152 112 force_qs_rnp+0x58/0x150
1) 6040 48 force_quiescent_state+0x1a7/0x1f0
2) 5992 48 __call_rcu+0x13d/0x190
3) 5944 16 call_rcu_sched+0x15/0x20
4) 5928 16 call_rcu+0xe/0x10
5) 5912 240 radix_tree_delete+0x14a/0x2d0
6) 5672 32 __remove_from_page_cache+0x21/0x110
7) 5640 64 __remove_mapping+0x86/0x100
8) 5576 272 shrink_page_list+0x2fd/0x5a0
9) 5304 400 shrink_inactive_list+0x313/0x730
10) 4904 176 shrink_zone+0x3d1/0x490
11) 4728 128 do_try_to_free_pages+0x2b6/0x380
12) 4600 112 try_to_free_pages+0x5e/0x60
13) 4488 272 __alloc_pages_nodemask+0x3fb/0x730
14) 4216 48 alloc_pages_current+0x87/0xd0
15) 4168 32 __page_cache_alloc+0x67/0x70
16) 4136 80 find_or_create_page+0x4f/0xb0
17) 4056 160 _xfs_buf_lookup_pages+0x150/0x390
.....
So the differences are most likely from the compiler doing
automatic inlining of static functions...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <[email protected]> wrote:
> On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>>> > On Wed, 14 Apr 2010 11:40:41 +1000
>>> > Dave Chinner <[email protected]> wrote:
>>> >
>>> > > 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
>>> > > 51) 3104 384 shrink_page_list+0x65e/0x840
>>> > > 52) 2720 528 shrink_zone+0x63f/0xe10
>>> >
>>> > A bit OFF TOPIC.
>>> >
>>> > Could you share disassemble of shrink_zone() ?
>>> >
>>> > In my environ.
>>> > 00000000000115a0 <shrink_zone>:
>>> > 115a0: 55 push %rbp
>>> > 115a1: 48 89 e5 mov %rsp,%rbp
>>> > 115a4: 41 57 push %r15
>>> > 115a6: 41 56 push %r14
>>> > 115a8: 41 55 push %r13
>>> > 115aa: 41 54 push %r12
>>> > 115ac: 53 push %rbx
>>> > 115ad: 48 83 ec 78 sub $0x78,%rsp
>>> > 115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
>>> > 115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
>>> >
>>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
>>> > until return.
>>>
>>> I see the same. I didn't compile those kernels, though. IIUC,
>>> they were built through the Ubuntu build infrastructure, so there is
>>> something different in terms of compiler, compiler options or config
>>> to what we are both using. Most likely it is the compiler inlining,
>>> though Chris's patches to prevent that didn't seem to change the
>>> stack usage.
>>>
>>> I'm trying to get a stack trace from the kernel that has shrink_zone
>>> in it, but I haven't succeeded yet....
>>
>> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
>>
>
> In my case, 0x110 byte in 32 bit machine.
> I think it's possible in 64 bit machine.
>
> 00001830 <shrink_zone>:
> 1830: 55 push %ebp
> 1831: 89 e5 mov %esp,%ebp
> 1833: 57 push %edi
> 1834: 56 push %esi
> 1835: 53 push %ebx
> 1836: 81 ec 10 01 00 00 sub $0x110,%esp
> 183c: 89 85 24 ff ff ff mov %eax,-0xdc(%ebp)
> 1842: 89 95 20 ff ff ff mov %edx,-0xe0(%ebp)
> 1848: 89 8d 1c ff ff ff mov %ecx,-0xe4(%ebp)
> 184e: 8b 41 04 mov 0x4(%ecx)
>
> my gcc is following as.
>
> barrios@barriostarget:~/mmotm$ gcc -v
> Using built-in specs.
> Target: i486-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> 4.3.3-5ubuntu4'
> --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
> --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> --enable-shared --with-system-zlib --libexecdir=/usr/lib
> --without-included-gettext --enable-threads=posix --enable-nls
> --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
> --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> --enable-mpfr --enable-targets=all --with-tune=generic
> --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
> --target=i486-linux-gnu
> Thread model: posix
> gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>
>
> Is it depends on config?
> I attach my config.
I changed the shrink list functions to noinline_for_stack.
The result is as follows.
00001fe0 <shrink_zone>:
1fe0: 55 push %ebp
1fe1: 89 e5 mov %esp,%ebp
1fe3: 57 push %edi
1fe4: 56 push %esi
1fe5: 53 push %ebx
1fe6: 83 ec 4c sub $0x4c,%esp
1fe9: 89 45 c0 mov %eax,-0x40(%ebp)
1fec: 89 55 bc mov %edx,-0x44(%ebp)
1fef: 89 4d b8 mov %ecx,-0x48(%ebp)
0x110 -> 0x4c.
Should we add noinline_for_stack to the shrink_list functions?
--
Kind regards,
Minchan Kim
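(A hedged, generic illustration of the noinline_for_stack effect measured above; this
is not the actual mm/vmscan.c change and the function names are invented. The
attribute simply stops gcc inlining the helper, so the helper's large locals exist
only while it runs instead of being folded into the caller's frame.)
#include <linux/compiler.h>
#include <linux/string.h>
/* Invented helper: 256 bytes of scratch stays out of the caller's frame. */
static noinline_for_stack int scan_one_batch(char *out)
{
	char scratch[256];

	memset(scratch, 0, sizeof(scratch));
	return out[0] + scratch[0];
}
/* The caller keeps a small frame even though its helper has big locals. */
int scan_all_batches(char *out)
{
	return scan_one_batch(out);
}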
On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > Basically, there is not enough stack space available to allow direct
> > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > profiles we are seeing here....
> > > > >
> > > >
> > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > doing sync IO, then waiting on those pages.
> > >
> > > So, I've been reading along, nodding my head to Dave's side of things
> > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > of doing page by page spatters of IO to the drive.
>
> I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> making 4k io is not must for pageout. So, probably we can improve it.
>
>
> > Perhaps drop the lock on the page if it is held and call one of the
> > helpers that filesystems use to do this, like:
> >
> > filemap_write_and_wait(page->mapping);
>
> Sorry, I'm lost what you talk about. Why do we need per-file
> waiting? If file is 1GB file, do we need to wait 1GB writeout?
So use filemap_fdatawrite(page->mapping), or if it's better only
to start IO on a segment of the file, use
filemap_fdatawrite_range(page->mapping, start, end)....
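(A hedged sketch of what that suggestion could look like if reclaim kicked off
writeback on a range of the file around the target page instead of calling
->writepage on the single page. This is not a proposed patch: the helper name and
the 1MB window are made up, and locking and error handling are ignored.)
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
static int writeout_range_around_page(struct page *page)
{
	struct address_space *mapping = page->mapping;
	loff_t pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
	loff_t window = 1024 * 1024;	/* arbitrary 1MB of file around the page */

	if (!mapping)
		return -EINVAL;

	/* start async writeback on just this chunk of the file */
	return filemap_fdatawrite_range(mapping, pos, pos + window - 1);
}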
> > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > that said:
> > >
> > > 43) 1568 912 do_select+0x3d6/0x700
> > >
> > > Huh, 912 bytes...for select, really? From poll.h:
> >
> > Sure, it's bad, but we focussing on the specific case misses the
> > point that even code that is using minimal stack can enter direct
> > reclaim after consuming 1.5k of stack. e.g.:
>
> checkstack.pl says do_select() and __generic_file_splice_read() are one
> of the worst stack consumers. Both should be fixed.
the deepest call chain in queue_work() needs 700 bytes of stack
to complete, wait_for_completion() requires almost 2k of stack space
at its deepest, the scheduler has some heavy stack users, etc,
and these are all functions that appear at the top of the stack.
> also, checkstack.pl says such stack eater aren't so much.
Yeah, but when we have a callchain 70 or more functions deep,
even 100 bytes of stack is a lot....
> > > So, select is intentionally trying to use that much stack. It should be using
> > > GFP_NOFS if it really wants to suck down that much stack...
> >
> > The code that did the allocation is called from multiple different
> > contexts - how is it supposed to know that in some of those contexts
> > it is supposed to treat memory allocation differently?
> >
> > This is my point - if you introduce a new semantic to memory allocation
> > that is "use GFP_NOFS when you are using too much stack" and too much
> > stack is more than 15% of the stack, then pretty much every code path
> > will need to set that flag...
>
> Nodding my head to Dave's side. changing caller argument seems not good
> solution. I mean
> - do_select() should use GFP_KERNEL instead stack (as revert 70674f95c0)
> - reclaim and xfs (and other something else) need to diet.
The list I'm seeing so far includes:
- scheduler
- completion interfaces
- radix tree
- memory allocation, memory reclaim
- anything that implements ->writepage
- select
- splice read
> Also, I believe stack eater function should be created waring. patch attached.
Good start, but 512 bytes will only catch select and splice read,
and there are 300-400 byte functions in the above list that sit near
the top of the stack....
> > We need at least _700_ bytes of stack free just to call queue_work(),
> > and that now happens deep in the guts of the driver subsystem below XFS.
> > This trace shows 1.8k of stack usage on a simple, single sata disk
> > storage subsystem, so my estimate of 2k of stack for the storage system
> > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > closer to the mark.
>
> your explanation is very interesting. I have a (probably dumb) question.
> Why nobody faced stack overflow issue in past? now I think every users
> easily get stack overflow if your explanation is correct.
It's always a problem, but the focus on minimising stack usage has
gone away since i386 has mostly disappeared from server rooms.
XFS has always been the thing that triggered stack usage problems
first - the first reports of problems on x86_64 with 8k stacks in low
memory situations have only just come in, and this is the first time
in a couple of years I've paid close attention to stack usage
outside XFS. What I'm seeing is not pretty....
> > This is the sort of thing I'm pointing at when I say that stack
> > usage outside XFS has grown significantly over the
> > past couple of years. Given XFS has remained pretty much the same or
> > even reduced slightly over the same time period, blaming XFS or
> > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > Regardless of the IO pattern performance issues, writeback via
> > direct reclaim just uses too much stack to be safe these days...
>
> Yeah, My answer is simple, All stack eater should be fixed.
> but XFS seems not innocence too. 3.5K is enough big although
> xfs have use such amount since very ago.
XFS used to use much more than that - significant effort has been
put into reducing the stack footprint over many years. There's not
much left to trim without rewriting half the filesystem...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Apr 14, 2010 at 03:52:10PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > Thanks explanation. I haven't noticed direct reclaim consume
> > > 2k stack. I'll investigate it and try diet it.
> > > But XFS 3.5K stack consumption is too large too. please diet too.
> >
> > It hasn't grown in the last 2 years after the last major diet where
> > all the fat was trimmed from it in the last round of the i386 4k
> > stack vs XFS saga. it seems that everything else around XFS has
> > grown in that time, and now we are blowing stacks again....
>
> I have dumb question, If xfs haven't bloat stack usage, why 3.5
> stack usage works fine on 4k stack kernel? It seems impossible.
Because on a 32 bit kernel it's somewhere between 2-2.5k of stack
space. That being said, XFS _will_ blow a 4k stack on anything other
than the most basic storage configurations, and if you run out of
memory it is almost guaranteed to do so.
> Please don't think I blame you. I don't know what is "4k stack vs XFS saga".
> I merely want to understand what you said.
Over a period of years there were repeated attempts to make the
default stack size on i386 4k, despite it being known to cause
problems on relatively common configurations. Every time it was
brought up it was rejected, but every few months somebody else made
an attempt to make it the default. There was a lot of flamage
directed at XFS because it was seen as the reason that 4k stacks
were not made the default....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Apr 14, 2010 at 1:44 PM, Dave Chinner <[email protected]> wrote:
> On Wed, Apr 14, 2010 at 09:24:33AM +0900, Minchan Kim wrote:
>> Hi, Dave.
>>
>> On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <[email protected]> wrote:
>> > From: Dave Chinner <[email protected]>
>> >
>> > When we enter direct reclaim we may have used an arbitrary amount of stack
>> > space, and hence enterring the filesystem to do writeback can then lead to
>> > stack overruns. This problem was recently encountered x86_64 systems with
>> > 8k stacks running XFS with simple storage configurations.
>> >
>> > Writeback from direct reclaim also adversely affects background writeback. The
>> > background flusher threads should already be taking care of cleaning dirty
>> > pages, and direct reclaim will kick them if they aren't already doing work. If
>> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
>> > the background flusher threads to be upset by LRU-order writeback from
>> > pageout() which can be effectively random IO. Having competing sources of IO
>> > trying to clean pages on the same backing device reduces throughput by
>> > increasing the amount of seeks that the backing device has to do to write back
>> > the pages.
>> >
>> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
>> > Set up the relevant scan_control structures to enforce this, and prevent
>> > sc->may_writepage from being set in other places in the direct reclaim path in
>> > response to other events.
>>
>> I think your solution is rather aggressive change as Mel and Kosaki
>> already pointed out.
>
> It may be aggressive, but writeback from direct reclaim is, IMO, one
> of the worst aspects of the current VM design because of its
> adverse effect on the IO subsystem.
Tend to agree. But do we need it as a last resort if the flusher thread
can't keep up with the write stream?
Or, in my opinion, could the I/O layer have better throttling logic than it has now?
>
> I'd prefer to remove it completely rather than continue to try and patch
> around it, especially given that everyone seems to agree that it
> does have an adverse effect on IO...
Of course, if everybody agrees, we can do it.
For that, we need a lot of benchmark results, which is very hard.
Maybe I can help with that on embedded systems.
>
>> Does the flusher thread write back dirty pages according to system-level
>> LRU recency, or only per-inode dirty page recency?
>
> It writes back in the order inodes were dirtied. i.e. the LRU is a
> coarser measure, but it is still definitely there. It also takes
> into account fairness of IO between dirty inodes, so no one dirty
> inode prevents IO being issued on the other dirty inodes on the
> LRU...
Thanks.
It seems recency is lost, then.
I am not sure how much it affects system performance.
>
>> Of course the flusher thread can clean dirty pages faster than a direct reclaimer.
>> But if it doesn't respect LRU ordering, hot page thrashing can happen in
>> corner cases.
>> It could also lose write merging.
>>
>> And on non-rotational storage the seek cost might not be that big.
>
> Non-rotational storage still goes faster when it is fed large, well
> formed IOs.
Agreed, I missed that. NAND devices are stronger than HDDs for random reads,
but random writes are very weak in both performance and wear-leveling.
>
>> I think we have to consider that case if we decide to change direct reclaim I/O.
>>
>> How do we separate the problem?
>>
>> 1. stack hogging problem.
>> 2. direct reclaim random write.
>
> AFAICT, the only way to _reliably_ avoid the stack usage problem is
> to avoid writeback in direct reclaim. That has the side effect of
> fixing #2 as well, so do they really need separating?
If we can do it, that's good,
but problem 2 is not easy to fix, I think.
Compared to 2, 1 is rather easy,
so I thought we could solve 1 first and then focus on 2.
If your suggestion is right, then we can apply your idea.
Then we don't need to revert the patch for 1, since small stack usage is
always good as long as we don't lose much performance.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
--
Kind regards,
Minchan Kim
On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > profiles we are seeing here....
> > > > > >
> > > > >
> > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > doing sync IO, then waiting on those pages.
> > > >
> > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > of doing page by page spatters of IO to the drive.
> >
> > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > making 4k io is not must for pageout. So, probably we can improve it.
> >
> >
> > > Perhaps drop the lock on the page if it is held and call one of the
> > > helpers that filesystems use to do this, like:
> > >
> > > filemap_write_and_wait(page->mapping);
> >
> > Sorry, I'm lost what you talk about. Why do we need per-file
> > waiting? If file is 1GB file, do we need to wait 1GB writeout?
>
> So use filemap_fdatawrite(page->mapping), or if it's better only
> to start IO on a segment of the file, use
> filemap_fdatawrite_range(page->mapping, start, end)....
>
That does not help the stack usage issue, the caller ends up in
->writepages. From an IO perspective, it'll be better from a seek point of
view but from a VM perspective, it may or may not be cleaning the right pages.
So I think this is a red herring.
> > > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > > that said:
> > > >
> > > > 43) 1568 912 do_select+0x3d6/0x700
> > > >
> > > > Huh, 912 bytes...for select, really? From poll.h:
> > >
> > > Sure, it's bad, but we focussing on the specific case misses the
> > > point that even code that is using minimal stack can enter direct
> > > reclaim after consuming 1.5k of stack. e.g.:
> >
> > checkstack.pl says do_select() and __generic_file_splice_read() are one
> > of the worst stack consumers. Both should be fixed.
>
> the deepest call chain in queue_work() needs 700 bytes of stack
> to complete, wait_for_completion() requires almost 2k of stack space
> at its deepest, the scheduler has some heavy stack users, etc,
> and these are all functions that appear at the top of the stack.
>
The real issue here then is that stack usage has gone out of control.
Disabling ->writepage in direct reclaim does not guarantee that stack
usage will not be a problem again. From your traces, page reclaim itself
seems to be a big dirty hog.
Differences in what people see on their machines may be down to architecture or
compiler, but most likely inlining. Changing inlining will not fix the problem,
it'll just move the stack usage around.
> > also, checkstack.pl says such stack eater aren't so much.
>
> Yeah, but when we have a callchain 70 or more functions deep,
> even 100 bytes of stack is a lot....
>
> > > > So, select is intentionally trying to use that much stack. It should be using
> > > > GFP_NOFS if it really wants to suck down that much stack...
> > >
> > > The code that did the allocation is called from multiple different
> > > contexts - how is it supposed to know that in some of those contexts
> > > it is supposed to treat memory allocation differently?
> > >
> > > This is my point - if you introduce a new semantic to memory allocation
> > > that is "use GFP_NOFS when you are using too much stack" and too much
> > > stack is more than 15% of the stack, then pretty much every code path
> > > will need to set that flag...
> >
> > Nodding my head to Dave's side. changing caller argument seems not good
> > solution. I mean
> > - do_select() should use GFP_KERNEL instead stack (as revert 70674f95c0)
> > - reclaim and xfs (and other something else) need to diet.
>
> The list I'm seeing so far includes:
> - scheduler
> - completion interfaces
> - radix tree
> - memory allocation, memory reclaim
> - anything that implements ->writepage
> - select
> - splice read
>
> > Also, I believe stack-eating functions should generate a warning. Patch attached.
>
> Good start, but 512 bytes will only catch select and splice read,
> and there are 300-400 byte functions in the above list that sit near
> the top of the stack....
>
They will need to be tackled in turn then but obviously there should be
a focus on the common paths. The reclaim paths do seem particularly
heavy and it's down to a lot of temporary variables. I might not get the
time today, but what I'm going to try to do some time this week is:
o Look at what temporary variables are copies of other pieces of information
o See what variables live for the duration of reclaim but are not needed
for all of it (i.e. uninline parts of it so variables do not persist)
o See if it's possible to dynamically allocate scan_control
The last one is the trickiest. Basically, the idea would be to move as much
into scan_control as possible. Then, instead of allocating it on the stack,
allocate a fixed number of them at boot-time (NR_CPU probably) protected by
a semaphore. Limit the number of direct reclaimers that can be active at a
time to the number of scan_control variables. kswapd could still allocate
its own on the stack or with kmalloc.
If it works out, it would have two main benefits. It limits the number of
processes in direct reclaim - if there are NR_CPU-worth of processes in direct
reclaim, there is too much going on. It would also shrink the stack usage,
particularly if some of the stack variables are moved into scan_control.
Maybe someone will beat me to looking at the feasibility of this.
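For illustration only, here is a rough sketch of what such a boot-time pool
might look like. This is not a real patch: the names (sc_pool, sc_pool_sem,
get_scan_control() and friends) are made up, it assumes it lives in
mm/vmscan.c where struct scan_control is visible, and a real version would
have to think about NUMA placement and lock ordering.

#include <linux/init.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <linux/bitops.h>
#include <linux/bitmap.h>
#include <linux/threads.h>
#include <linux/string.h>

/* hypothetical pool of preallocated scan_control structures */
static struct scan_control sc_pool[NR_CPUS];
static DECLARE_BITMAP(sc_pool_used, NR_CPUS);
static DEFINE_SPINLOCK(sc_pool_lock);
static struct semaphore sc_pool_sem;

static int __init sc_pool_init(void)
{
	/* at most NR_CPUS direct reclaimers at once; the rest sleep */
	sema_init(&sc_pool_sem, NR_CPUS);
	return 0;
}
core_initcall(sc_pool_init);

static struct scan_control *get_scan_control(void)
{
	int slot;

	down(&sc_pool_sem);

	/* the semaphore guarantees a free slot exists at this point */
	spin_lock(&sc_pool_lock);
	slot = find_first_zero_bit(sc_pool_used, NR_CPUS);
	__set_bit(slot, sc_pool_used);
	spin_unlock(&sc_pool_lock);

	memset(&sc_pool[slot], 0, sizeof(sc_pool[slot]));
	return &sc_pool[slot];
}

static void put_scan_control(struct scan_control *sc)
{
	spin_lock(&sc_pool_lock);
	__clear_bit(sc - sc_pool, sc_pool_used);
	spin_unlock(&sc_pool_lock);

	up(&sc_pool_sem);
}

try_to_free_pages() would then bracket its work with get_scan_control() /
put_scan_control() instead of declaring sc on the stack, while kswapd keeps
its own copy as described above.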
> > > We need at least _700_ bytes of stack free just to call queue_work(),
> > > and that now happens deep in the guts of the driver subsystem below XFS.
> > > This trace shows 1.8k of stack usage on a simple, single sata disk
> > > storage subsystem, so my estimate of 2k of stack for the storage system
> > > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > > closer to the mark.
> >
> > your explanation is very interesting. I have a (probably dumb) question.
> > Why has nobody hit stack overflow issues in the past? Now I think every user
> > could easily get a stack overflow if your explanation is correct.
>
> It's always a problem, but the focus on minimising stack usage has
> gone away since i386 has mostly disappeared from server rooms.
>
> XFS has always been the thing that triggered stack usage problems
> first - the first reports of problems on x86_64 with 8k stacks in low
> memory situations have only just come in, and this is the first time
> in a couple of years I've paid close attention to stack usage
> outside XFS. What I'm seeing is not pretty....
>
> > > This is the sort of thing I'm pointing at when I say that stack
> > > > usage outside XFS has grown significantly over the
> > > past couple of years. Given XFS has remained pretty much the same or
> > > even reduced slightly over the same time period, blaming XFS or
> > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > Regardless of the IO pattern performance issues, writeback via
> > > direct reclaim just uses too much stack to be safe these days...
> >
> > Yeah, my answer is simple: all stack eaters should be fixed,
> > but XFS doesn't seem innocent either. 3.5K is quite big, although
> > xfs has used that much for a very long time.
>
> XFS used to use much more than that - significant effort has been
> put into reducing the stack footprint over many years. There's not
> much left to trim without rewriting half the filesystem...
>
I don't think he is levelling a complaint at XFS in particular - just pointing
out that it's heavy too. Still, we should be grateful that XFS is sort of
a "Stack Canary". If it dies, everyone else could be in trouble soon :)
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, 14 Apr 2010 16:19:02 +0900
Minchan Kim <[email protected]> wrote:
> On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <[email protected]> wrote:
> > On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
> > <[email protected]> wrote:
> >>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> >>> > On Wed, 14 Apr 2010 11:40:41 +1000
> >>> > Dave Chinner <[email protected]> wrote:
> >>> >
> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
> >>> >
> >>> > A bit OFF TOPIC.
> >>> >
> >>> > Could you share disassemble of shrink_zone() ?
> >>> >
> >>> > In my environ.
> >>> > 00000000000115a0 <shrink_zone>:
> >>> >   115a0:    55            push  %rbp
> >>> >   115a1:    48 89 e5         mov   %rsp,%rbp
> >>> >   115a4:    41 57          push  %r15
> >>> >   115a6:    41 56          push  %r14
> >>> >   115a8:    41 55          push  %r13
> >>> >   115aa:    41 54          push  %r12
> >>> >   115ac:    53            push  %rbx
> >>> >   115ad:    48 83 ec 78       sub   $0x78,%rsp
> >>> >   115b1:    e8 00 00 00 00      callq  115b6 <shrink_zone+0x16>
> >>> >   115b6:    48 89 75 80       mov   %rsi,-0x80(%rbp)
> >>> >
> >>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> >>> > until retrun.
> >>>
> >>> I see the same. I didn't compile those kernels, though. IIUC,
> >>> they were built through the Ubuntu build infrastructure, so there is
> >>> something different in terms of compiler, compiler options or config
> >>> to what we are both using. Most likely it is the compiler inlining,
> >>> though Chris's patches to prevent that didn't seem to change the
> >>> stack usage.
> >>>
> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
> >>> in it, but I haven't succeeded yet....
> >>
> >> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
> >>
> >
> > In my case, 0x110 byte in 32 bit machine.
> > I think it's possible in 64 bit machine.
> >
> > 00001830 <shrink_zone>:
> >   1830:    55            push  %ebp
> >   1831:    89 e5          mov   %esp,%ebp
> >   1833:    57            push  %edi
> >   1834:    56            push  %esi
> >   1835:    53            push  %ebx
> >   1836:    81 ec 10 01 00 00    sub   $0x110,%esp
> >   183c:    89 85 24 ff ff ff    mov   %eax,-0xdc(%ebp)
> >   1842:    89 95 20 ff ff ff    mov   %edx,-0xe0(%ebp)
> >   1848:    89 8d 1c ff ff ff    mov   %ecx,-0xe4(%ebp)
> >   184e:    8b 41 04         mov   0x4(%ecx)
> >
> > my gcc is following as.
> >
> > barrios@barriostarget:~/mmotm$ gcc -v
> > Using built-in specs.
> > Target: i486-linux-gnu
> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> > 4.3.3-5ubuntu4'
> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
> > --without-included-gettext --enable-threads=posix --enable-nls
> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> > --enable-mpfr --enable-targets=all --with-tune=generic
> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
> > --target=i486-linux-gnu
> > Thread model: posix
> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
> >
> >
> > Is it depends on config?
> > I attach my config.
>
> I changed shrink list by noinline_for_stack.
> The result is following as.
>
>
> 00001fe0 <shrink_zone>:
> 1fe0: 55 push %ebp
> 1fe1: 89 e5 mov %esp,%ebp
> 1fe3: 57 push %edi
> 1fe4: 56 push %esi
> 1fe5: 53 push %ebx
> 1fe6: 83 ec 4c sub $0x4c,%esp
> 1fe9: 89 45 c0 mov %eax,-0x40(%ebp)
> 1fec: 89 55 bc mov %edx,-0x44(%ebp)
> 1fef: 89 4d b8 mov %ecx,-0x48(%ebp)
>
> 0x110 -> 0x4c.
>
> Should we have to add noinline_for_stack for shrink_list?
>
Hmm, about shrink_zone(): I don't think uninlining functions directly called
by shrink_zone() will help much.
The total stack size of the call-chain will still be big.
Thanks,
-Kame
On Wed, Apr 14, 2010 at 6:42 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 14 Apr 2010 16:19:02 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <[email protected]> wrote:
>> > On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
>> > <[email protected]> wrote:
>> >>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>> >>> > On Wed, 14 Apr 2010 11:40:41 +1000
>> >>> > Dave Chinner <[email protected]> wrote:
>> >>> >
>> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>> >>> >
>> >>> > A bit OFF TOPIC.
>> >>> >
>> >>> > Could you share disassemble of shrink_zone() ?
>> >>> >
>> >>> > In my environ.
>> >>> > 00000000000115a0 <shrink_zone>:
>> >>> >   115a0:    55            push  %rbp
>> >>> >   115a1:    48 89 e5         mov   %rsp,%rbp
>> >>> >   115a4:    41 57          push  %r15
>> >>> >   115a6:    41 56          push  %r14
>> >>> >   115a8:    41 55          push  %r13
>> >>> >   115aa:    41 54          push  %r12
>> >>> >   115ac:    53            push  %rbx
>> >>> >   115ad:    48 83 ec 78       sub   $0x78,%rsp
>> >>> >   115b1:    e8 00 00 00 00      callq  115b6 <shrink_zone+0x16>
>> >>> >   115b6:    48 89 75 80       mov   %rsi,-0x80(%rbp)
>> >>> >
>> >>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
>> >>> > until retrun.
>> >>>
>> >>> I see the same. I didn't compile those kernels, though. IIUC,
>> >>> they were built through the Ubuntu build infrastructure, so there is
>> >>> something different in terms of compiler, compiler options or config
>> >>> to what we are both using. Most likely it is the compiler inlining,
>> >>> though Chris's patches to prevent that didn't seem to change the
>> >>> stack usage.
>> >>>
>> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> >>> in it, but I haven't succeeded yet....
>> >>
>> >> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
>> >>
>> >
>> > In my case, 0x110 byte in 32 bit machine.
>> > I think it's possible in 64 bit machine.
>> >
>> > 00001830 <shrink_zone>:
>> >   1830:    55            push  %ebp
>> >   1831:    89 e5          mov   %esp,%ebp
>> >   1833:    57            push  %edi
>> >   1834:    56            push  %esi
>> >   1835:    53            push  %ebx
>> >   1836:    81 ec 10 01 00 00    sub   $0x110,%esp
>> >   183c:    89 85 24 ff ff ff    mov   %eax,-0xdc(%ebp)
>> >   1842:    89 95 20 ff ff ff    mov   %edx,-0xe0(%ebp)
>> >   1848:    89 8d 1c ff ff ff    mov   %ecx,-0xe4(%ebp)
>> >   184e:    8b 41 04         mov   0x4(%ecx)
>> >
>> > my gcc is following as.
>> >
>> > barrios@barriostarget:~/mmotm$ gcc -v
>> > Using built-in specs.
>> > Target: i486-linux-gnu
>> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
>> > 4.3.3-5ubuntu4'
>> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
>> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
>> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
>> > --without-included-gettext --enable-threads=posix --enable-nls
>> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
>> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
>> > --enable-mpfr --enable-targets=all --with-tune=generic
>> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
>> > --target=i486-linux-gnu
>> > Thread model: posix
>> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>> >
>> >
>> > Is it depends on config?
>> > I attach my config.
>>
>> I changed shrink list by noinline_for_stack.
>> The result is following as.
>>
>>
>> 00001fe0 <shrink_zone>:
>>   1fe0:    55            push  %ebp
>>   1fe1:    89 e5          mov   %esp,%ebp
>>   1fe3:    57            push  %edi
>>   1fe4:    56            push  %esi
>>   1fe5:    53            push  %ebx
>>   1fe6:    83 ec 4c         sub   $0x4c,%esp
>>   1fe9:    89 45 c0         mov   %eax,-0x40(%ebp)
>>   1fec:    89 55 bc         mov   %edx,-0x44(%ebp)
>>   1fef:    89 4d b8         mov   %ecx,-0x48(%ebp)
>>
>> 0x110 -> 0x4c.
>>
>> Should we have to add noinline_for_stack for shrink_list?
>>
>
> Hmm. about shirnk_zone(), I don't think uninlining functions directly called
> by shrink_zone() can be a help.
> Total stack size of call-chain will be still big.
Absolutely.
But the above 500-byte usage is one of the hogs, and uninlining is not
critical for reclaim performance, so I think we lose nothing compared to
what we gain.
But I'm not in a hurry; an ad-hoc approach is not good.
I hope that when Mel tackles stack consumption in the reclaim path, he
modifies this part too.
Thanks.
> Thanks,
> -Kame
>
>
>
--
Kind regards,
Minchan Kim
Chris Mason <[email protected]> writes:
>
> Huh, 912 bytes...for select, really? From poll.h:
>
> /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> additional memory. */
> #define MAX_STACK_ALLOC 832
> #define FRONTEND_STACK_ALLOC 256
> #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
>
> So, select is intentionally trying to use that much stack. It should be using
> GFP_NOFS if it really wants to suck down that much stack...
There are lots of other call chains which use multiple KB by themselves,
so why not give select() that measly 832 bytes?
You think only file systems are allowed to use stack? :)
Basically if you cannot tolerate 1K (or more likely more) of stack
used before your fs is called you're toast in lots of other situations
anyways.
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.
It does this for large inputs, but the whole point of the stack fast
path is to avoid it for the common case when only a small number of fds
is needed.
It's significantly slower to go to any external allocator.
-Andi
--
[email protected] -- Speaking for myself only.
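As an aside, the pattern being defended here is the usual "small on-stack
buffer, fall back to the allocator for big inputs" one. A purely
illustrative user-space rendition follows; the function name and the
832-byte figure simply mirror the poll.h excerpt above and are not from any
real code.

#include <stdlib.h>
#include <string.h>

#define STACK_ALLOC_BYTES 832	/* mirrors MAX_STACK_ALLOC above */

int process_fds(const int *fds, unsigned long nfds)
{
	long stack_buf[STACK_ALLOC_BYTES / sizeof(long)];
	void *buf = stack_buf;
	unsigned long need = nfds * sizeof(long);
	int ret = 0;

	/* common case: a handful of fds, the allocator is never involved */
	if (need > sizeof(stack_buf)) {
		buf = malloc(need);
		if (!buf)
			return -1;
	}
	memset(buf, 0, need);

	/* ... per-fd work using buf would go here ... */
	(void)fds;

	if (buf != stack_buf)
		free(buf);
	return ret;
}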
On Wed, Apr 14, 2010 at 07:01:47PM +0900, Minchan Kim wrote:
> >> >>> > Dave Chinner <[email protected]> wrote:
> >> >>> >
> >> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
> >> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
> >> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
> >> >>> >
> >> >>> > A bit OFF TOPIC.
> >> >>> >
> >> >>> > Could you share disassemble of shrink_zone() ?
> >> >>> >
> >> >>> > In my environ.
> >> >>> > 00000000000115a0 <shrink_zone>:
> >> >>> >    115a0:       55                      push   %rbp
> >> >>> >    115a1:       48 89 e5                mov    %rsp,%rbp
> >> >>> >    115a4:       41 57                   push   %r15
> >> >>> >    115a6:       41 56                   push   %r14
> >> >>> >    115a8:       41 55                   push   %r13
> >> >>> >    115aa:       41 54                   push   %r12
> >> >>> >    115ac:       53                      push   %rbx
> >> >>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
> >> >>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
> >> >>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> >> >>> >
> >> >>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
> >> >>> > until retrun.
> >> >>>
> >> >>> I see the same. I didn't compile those kernels, though. IIUC,
> >> >>> they were built through the Ubuntu build infrastructure, so there is
> >> >>> something different in terms of compiler, compiler options or config
> >> >>> to what we are both using. Most likely it is the compiler inlining,
> >> >>> though Chris's patches to prevent that didn't seem to change the
> >> >>> stack usage.
> >> >>>
> >> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
> >> >>> in it, but I haven't succeeded yet....
> >> >>
> >> >> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
> >> >>
> >> >
> >> > In my case, 0x110 byte in 32 bit machine.
> >> > I think it's possible in 64 bit machine.
> >> >
> >> > 00001830 <shrink_zone>:
> >> >    1830:       55                      push   %ebp
> >> >    1831:       89 e5                   mov    %esp,%ebp
> >> >    1833:       57                      push   %edi
> >> >    1834:       56                      push   %esi
> >> >    1835:       53                      push   %ebx
> >> >    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
> >> >    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
> >> >    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
> >> >    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
> >> >    184e:       8b 41 04                mov    0x4(%ecx)
> >> >
> >> > my gcc is following as.
> >> >
> >> > barrios@barriostarget:~/mmotm$ gcc -v
> >> > Using built-in specs.
> >> > Target: i486-linux-gnu
> >> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> >> > 4.3.3-5ubuntu4'
> >> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
> >> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> >> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
> >> > --without-included-gettext --enable-threads=posix --enable-nls
> >> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
> >> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> >> > --enable-mpfr --enable-targets=all --with-tune=generic
> >> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
> >> > --target=i486-linux-gnu
> >> > Thread model: posix
> >> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
> >> >
> >> >
> >> > Is it depends on config?
> >> > I attach my config.
> >>
> >> I changed shrink list by noinline_for_stack.
> >> The result is following as.
> >>
> >>
> >> 00001fe0 <shrink_zone>:
> >>     1fe0:       55                      push   %ebp
> >>     1fe1:       89 e5                   mov    %esp,%ebp
> >>     1fe3:       57                      push   %edi
> >>     1fe4:       56                      push   %esi
> >>     1fe5:       53                      push   %ebx
> >>     1fe6:       83 ec 4c                sub    $0x4c,%esp
> >>     1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
> >>     1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
> >>     1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)
> >>
> >> 0x110 -> 0x4c.
> >>
> >> Should we have to add noinline_for_stack for shrink_list?
> >>
> >
> > Hmm. about shirnk_zone(), I don't think uninlining functions directly called
> > by shrink_zone() can be a help.
> > Total stack size of call-chain will be still big.
>
> Absolutely.
> But above 500 byte usage is one of hogger and uninlining is not
> critical about reclaim performance. So I think we don't get any lost
> than gain.
>
Bear in mind that uninlining can slightly increase the stack usage in some
cases because arguments, return addresses and the like have to be pushed
onto the stack. Inlining or uninlining is only the answer when it reduces the
number of stack variables that exist at any given time.
> But I don't get in a hurry. adhoc approach is not good.
> I hope when Mel tackles down consumption of stack in reclaim path, he
> modifies this part, too.
>
It'll be at least two days before I get the chance to try. A lot of the
temporary variables used in the reclaim path have existed for some time so
it will take a while.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, Apr 14, 2010 at 7:07 PM, Mel Gorman <[email protected]> wrote:
> On Wed, Apr 14, 2010 at 07:01:47PM +0900, Minchan Kim wrote:
>> >> >>> > Dave Chinner <[email protected]> wrote:
>> >> >>> >
>> >> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>> >> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>> >> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>> >> >>> >
>> >> >>> > A bit OFF TOPIC.
>> >> >>> >
>> >> >>> > Could you share disassemble of shrink_zone() ?
>> >> >>> >
>> >> >>> > In my environ.
>> >> >>> > 00000000000115a0 <shrink_zone>:
>> >> >>> >   115a0:    55            push  %rbp
>> >> >>> >   115a1:    48 89 e5         mov   %rsp,%rbp
>> >> >>> >   115a4:    41 57          push  %r15
>> >> >>> >   115a6:    41 56          push  %r14
>> >> >>> >   115a8:    41 55          push  %r13
>> >> >>> >   115aa:    41 54          push  %r12
>> >> >>> >   115ac:    53            push  %rbx
>> >> >>> >   115ad:    48 83 ec 78       sub   $0x78,%rsp
>> >> >>> >   115b1:    e8 00 00 00 00      callq  115b6 <shrink_zone+0x16>
>> >> >>> >   115b6:    48 89 75 80       mov   %rsi,-0x80(%rbp)
>> >> >>> >
>> >> >>> > disassemble seems to show 0x78 bytes for stack. And no changes to %rsp
>> >> >>> > until retrun.
>> >> >>>
>> >> >>> I see the same. I didn't compile those kernels, though. IIUC,
>> >> >>> they were built through the Ubuntu build infrastructure, so there is
>> >> >>> something different in terms of compiler, compiler options or config
>> >> >>> to what we are both using. Most likely it is the compiler inlining,
>> >> >>> though Chris's patches to prevent that didn't seem to change the
>> >> >>> stack usage.
>> >> >>>
>> >> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> >> >>> in it, but I haven't succeeded yet....
>> >> >>
>> >> >> I also got 0x78 byte stack usage. Umm.. Do we discussed real issue now?
>> >> >>
>> >> >
>> >> > In my case, 0x110 byte in 32 bit machine.
>> >> > I think it's possible in 64 bit machine.
>> >> >
>> >> > 00001830 <shrink_zone>:
>> >> >   1830:    55            push  %ebp
>> >> >   1831:    89 e5          mov   %esp,%ebp
>> >> >   1833:    57            push  %edi
>> >> >   1834:    56            push  %esi
>> >> >   1835:    53            push  %ebx
>> >> >   1836:    81 ec 10 01 00 00    sub   $0x110,%esp
>> >> >   183c:    89 85 24 ff ff ff    mov   %eax,-0xdc(%ebp)
>> >> >   1842:    89 95 20 ff ff ff    mov   %edx,-0xe0(%ebp)
>> >> >   1848:    89 8d 1c ff ff ff    mov   %ecx,-0xe4(%ebp)
>> >> >   184e:    8b 41 04         mov   0x4(%ecx)
>> >> >
>> >> > my gcc is following as.
>> >> >
>> >> > barrios@barriostarget:~/mmotm$ gcc -v
>> >> > Using built-in specs.
>> >> > Target: i486-linux-gnu
>> >> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
>> >> > 4.3.3-5ubuntu4'
>> >> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
>> >> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
>> >> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
>> >> > --without-included-gettext --enable-threads=posix --enable-nls
>> >> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
>> >> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
>> >> > --enable-mpfr --enable-targets=all --with-tune=generic
>> >> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
>> >> > --target=i486-linux-gnu
>> >> > Thread model: posix
>> >> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>> >> >
>> >> >
>> >> > Is it depends on config?
>> >> > I attach my config.
>> >>
>> >> I changed shrink list by noinline_for_stack.
>> >> The result is following as.
>> >>
>> >>
>> >> 00001fe0 <shrink_zone>:
>> >>   1fe0:    55            push  %ebp
>> >>   1fe1:    89 e5          mov   %esp,%ebp
>> >>   1fe3:    57            push  %edi
>> >>   1fe4:    56            push  %esi
>> >>   1fe5:    53            push  %ebx
>> >>   1fe6:    83 ec 4c         sub   $0x4c,%esp
>> >>   1fe9:    89 45 c0         mov   %eax,-0x40(%ebp)
>> >>   1fec:    89 55 bc         mov   %edx,-0x44(%ebp)
>> >>   1fef:    89 4d b8         mov   %ecx,-0x48(%ebp)
>> >>
>> >> 0x110 -> 0x4c.
>> >>
>> >> Should we have to add noinline_for_stack for shrink_list?
>> >>
>> >
>> > Hmm. about shirnk_zone(), I don't think uninlining functions directly called
>> > by shrink_zone() can be a help.
>> > Total stack size of call-chain will be still big.
>>
>> Absolutely.
>> But above 500 byte usage is one of hogger and uninlining is not
>> critical about reclaim performance. So I think we don't get any lost
>> than gain.
>>
>
> Bear in mind that uninlining can slightly increase the stack usage in some
> cases because arguments, return addresses and the like have to be pushed
> onto the stack. Inlining or uninlining is only the answer when it reduces the
> number of stack variables that exist at any given time.
Yes, I had totally missed that.
Thanks, Mel.
--
Kind regards,
Minchan Kim
On Wed, Apr 14, 2010 at 12:06:36PM +0200, Andi Kleen wrote:
> Chris Mason <[email protected]> writes:
> >
> > Huh, 912 bytes...for select, really? From poll.h:
> >
> > /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> > additional memory. */
> > #define MAX_STACK_ALLOC 832
> > #define FRONTEND_STACK_ALLOC 256
> > #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> > #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> > #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> > #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> >
> > So, select is intentionally trying to use that much stack. It should be using
> > GFP_NOFS if it really wants to suck down that much stack...
>
> There are lots of other call chains which use multiple KB bytes by itself,
> so why not give select() that measly 832 bytes?
>
> You think only file systems are allowed to use stack? :)
Grin, most definitely.
>
> Basically if you cannot tolerate 1K (or more likely more) of stack
> used before your fs is called you're toast in lots of other situations
> anyways.
Well, on a 4K stack kernel, 832 bytes is a very large percentage for
just one function.
Direct reclaim is a problem because it splices parts of the kernel that
normally aren't connected together. The people that code in select see
832 bytes and say that's teeny, I should have taken 3832 bytes.
But they don't realize their function can dive down into ecryptfs then
the filesystem then maybe loop and then perhaps raid6 on top of a
network block device.
>
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
>
> It does this for large inputs, but the whole point of the stack fast
> path is to avoid it for common cases when a small number of fds is
> only needed.
>
> It's significantly slower to go to any external allocator.
Yeah, but since the call chain does eventually go into the allocator,
this function needs to be more stack friendly.
I do agree that we can't really solve this with noinline_for_stack pixie
dust, the long call chains are going to be a problem no matter what.
Reading through all the comments so far, I think the short summary is:
Cleaning pages in direct reclaim helps the VM because it is able to make
sure that lumpy reclaim finds adjacent pages. This isn't a fast
operation, it has to wait for IO (infinitely slow compared to the CPU).
Will it be good enough for the VM if we add a hint to the bdi writeback
threads to work on a general area of the file? The filesystem will get
writepages(), the VM will get the IO it needs started.
I know Mel mentioned before he wasn't interested in waiting for helper
threads, but I don't see how we can work without it.
-chris
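To sketch what such a hint from the VM to the per-bdi flusher might look
like: the following is entirely hypothetical - none of these structures or
functions exist - and a real version would at least have to pin the inode,
bound the list, and wake the right flusher thread.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/pagemap.h>

/* hypothetical "please clean around here" request from the VM */
struct wb_hint {
	struct list_head	link;
	struct address_space	*mapping;
	pgoff_t			index;
};

static LIST_HEAD(wb_hint_list);
static DEFINE_SPINLOCK(wb_hint_lock);

/* vmscan side: called instead of ->writepage; GFP_ATOMIC because we are
 * already deep in reclaim.  The flusher thread would later pop hints off
 * the list and call ->writepages on a range around hint->index. */
static void queue_writeback_hint(struct page *page)
{
	struct wb_hint *hint = kmalloc(sizeof(*hint), GFP_ATOMIC);

	if (!hint)
		return;		/* best effort; the flusher runs anyway */

	hint->mapping = page->mapping;
	hint->index = page->index;

	spin_lock(&wb_hint_lock);
	list_add_tail(&hint->link, &wb_hint_list);
	spin_unlock(&wb_hint_lock);

	/* ...and wake the per-bdi flusher thread here... */
}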
Chris Mason <[email protected]> writes:
>>
>> Basically if you cannot tolerate 1K (or more likely more) of stack
>> used before your fs is called you're toast in lots of other situations
>> anyways.
>
> Well, on a 4K stack kernel, 832 bytes is a very large percentage for
> just one function.
To be honest I think 4K stack simply has to go. I tend to call
it "russian roulette" mode.
It was just an old workaround for a very old buggy VM that couldn't free 8K
pages and the VM is a lot better at that now. And the general trend is
to more complex code everywhere, so 4K stacks become more and more hazardous.
It was a bad idea back then and is still a bad idea, getting
worse and worse with each MLOC being added to the kernel each year.
We don't have any good ways to verify that obscure paths through
more and more subsystems won't exceed it (in fact I'm pretty
sure there are plenty of problems in exotic configurations).
And even if you can make a specific load work there's basically
no safety net.
The only part of the 4K stack code that's good is the separate
interrupt stack, but that one should be just combined with a sane 8K
process stack.
But yes on a 4K kernel you probably don't want to do any direct reclaim.
Maybe for GFP_NOFS everywhere except user allocations when it's set?
Or simply drop it?
> But they don't realize their function can dive down into ecryptfs then
> the filesystem then maybe loop and then perhaps raid6 on top of a
> network block device.
Those stackings need to use separate threads anyways. A lot of them
do in fact. Block avoided this problem by iterating instead of
recursing. Those that still recurse on the same stack simply
need to be fixed.
> Yeah, but since the call chain does eventually go into the allocator,
> this function needs to be more stack friendly.
For common fast paths it doesn't go into the allocator.
-Andi
--
[email protected] -- Speaking for myself only.
> The only part of the 4K stack code that's good is the separate
> interrupt stack, but that one should be just combined with a sane 8K
> process stack.
The reality is that if you are blowing a 4K process stack you are
probably playing russian roulette on the current 8K x86-32 stack as well
because of the non IRQ split. So it needs fixing either way
On Wed, Apr 14, 2010 at 01:32:29PM +0100, Alan Cox wrote:
> > The only part of the 4K stack code that's good is the separate
> > interrupt stack, but that one should be just combined with a sane 8K
> > process stack.
>
> The reality is that if you are blowing a 4K process stack you are
> probably playing russian roulette on the current 8K x86-32 stack as well
> because of the non IRQ split. So it needs fixing either way
Yes, I think the 8K stack on 32bit should be combined with an interrupt
stack too. There's no reason not to have an interrupt stack ever.
Again the problem with fixing it is that you won't have any safety net
for a slightly different stacking etc. path that you didn't cover.
That said extreme examples (like some of those Chris listed) definitely
need fixing by moving them to different threads. But even after that
you still want a safety net. 4K is just too near the edge.
Maybe it would work if we never used any indirect calls, but that's
clearly not the case.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Apr 14, 2010 at 07:20:15AM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 12:06:36PM +0200, Andi Kleen wrote:
> > Chris Mason <[email protected]> writes:
> > >
> > > Huh, 912 bytes...for select, really? From poll.h:
> > >
> > > /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> > > additional memory. */
> > > #define MAX_STACK_ALLOC 832
> > > #define FRONTEND_STACK_ALLOC 256
> > > #define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
> > > #define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
> > > #define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> > > #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> > >
> > > So, select is intentionally trying to use that much stack. It should be using
> > > GFP_NOFS if it really wants to suck down that much stack...
> >
> > There are lots of other call chains which use multiple KB bytes by itself,
> > so why not give select() that measly 832 bytes?
> >
> > You think only file systems are allowed to use stack? :)
>
> Grin, most definitely.
>
> >
> > Basically if you cannot tolerate 1K (or more likely more) of stack
> > used before your fs is called you're toast in lots of other situations
> > anyways.
>
> Well, on a 4K stack kernel, 832 bytes is a very large percentage for
> just one function.
>
> Direct reclaim is a problem because it splices parts of the kernel that
> normally aren't connected together. The people that code in select see
> 832 bytes and say that's teeny, I should have taken 3832 bytes.
>
Even without direct reclaim, I doubt stack usage is often at the top of
people's minds, except for truly criminally large usages of it. Direct
reclaim splicing is somewhat of a problem, but it's separate from stack
consumption overall.
> But they don't realize their function can dive down into ecryptfs then
> the filesystem then maybe loop and then perhaps raid6 on top of a
> network block device.
>
> >
> > > kernel had some sort of way to dynamically allocate ram, it could try
> > > that too.
> >
> > It does this for large inputs, but the whole point of the stack fast
> > path is to avoid it for common cases when a small number of fds is
> > only needed.
> >
> > It's significantly slower to go to any external allocator.
>
> Yeah, but since the call chain does eventually go into the allocator,
> this function needs to be more stack friendly.
>
> I do agree that we can't really solve this with noinline_for_stack pixie
> dust, the long call chains are going to be a problem no matter what.
>
> Reading through all the comments so far, I think the short summary is:
>
> Cleaning pages in direct reclaim helps the VM because it is able to make
> sure that lumpy reclaim finds adjacent pages. This isn't a fast
> operation, it has to wait for IO (infinitely slow compared to the CPU).
>
> Will it be good enough for the VM if we add a hint to the bdi writeback
> threads to work on a general area of the file? The filesystem will get
> writepages(), the VM will get the IO it needs started.
>
Bear in mind that in the context of lumpy reclaim, the VM doesn't care
where the data is in the file or the filesystem. It's only concerned
about where the data is located in memory. There *may* be a correlation
between location-of-data-in-file and location-of-data-in-memory but only
if readahead was a factor and readahead happened to hit at a time the page
allocator broke up a contiguous block of memory.
> I know Mel mentioned before he wasn't interested in waiting for helper
> threads, but I don't see how we can work without it.
>
I'm not against the idea as such. It would have advantages in that the
thread could reorder the IO for better seeks for example and lumpy
reclaim is already potentially waiting a long time so another delay
won't hurt. I would worry that it's just hiding the stack usage by
moving it to another thread and that there would be communication cost
between a direct reclaimer and this writeback thread. The main gain
would be in hiding the "splicing" effect between subsystems that direct
reclaim can have.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, Apr 14, 2010 at 02:23:50PM +0100, Mel Gorman wrote:
> On Wed, Apr 14, 2010 at 07:20:15AM -0400, Chris Mason wrote:
[ nods ]
>
> Bear in mind that the context of lumpy reclaim that the VM doesn't care
> about where the data is on the file or filesystem. It's only concerned
> about where the data is located in memory. There *may* be a correlation
> between location-of-data-in-file and location-of-data-in-memory but only
> if readahead was a factor and readahead happened to hit at a time the page
> allocator broke up a contiguous block of memory.
>
> > I know Mel mentioned before he wasn't interested in waiting for helper
> > threads, but I don't see how we can work without it.
> >
>
> I'm not against the idea as such. It would have advantages in that the
> thread could reorder the IO for better seeks for example and lumpy
> reclaim is already potentially waiting a long time so another delay
> won't hurt. I would worry that it's just hiding the stack usage by
> moving it to another thread and that there would be communication cost
> between a direct reclaimer and this writeback thread. The main gain
> would be in hiding the "splicing" effect between subsystems that direct
> reclaim can have.
The big gain from the helper threads is that storage operates at a
roughly fixed iop rate. This is true for ssd as well, it's just a much
higher rate. So the threads can send down 64KB ios at essentially the same
iop cost as 4K ios, and recover far more clean pages per seek.
I know that for lumpy purposes it might not be the best 64KB, but the
other side of it is that we have to write those pages eventually anyway.
We might as well write them when it is more or less free.
The per-bdi writeback threads are a pretty good base for changing the
ordering for writeback, it seems like a good place to integrate requests
from the VM about which files (and which offsets in those files) to
write back first.
-chris
On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > profiles we are seeing here....
> > > > > > >
> > > > > >
> > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > doing sync IO, then waiting on those pages.
> > > > >
> > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > because seeks are evil and direct reclaim makes seeks. I'd really love
> > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > of doing page by page spatters of IO to the drive.
> > >
> > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > making 4k io is not must for pageout. So, probably we can improve it.
> > >
> > >
> > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > helpers that filesystems use to do this, like:
> > > >
> > > > filemap_write_and_wait(page->mapping);
> > >
> > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> >
> > So use filemap_fdatawrite(page->mapping), or if it's better only
> > to start IO on a segment of the file, use
> > filemap_fdatawrite_range(page->mapping, start, end)....
>
> That does not help the stack usage issue, the caller ends up in
> ->writepages. From an IO perspective, it'll be better from a seek point of
> view but from a VM perspective, it may or may not be cleaning the right pages.
> So I think this is a red herring.
If you ask it to clean a bunch of pages around the one you want to
reclaim on the LRU, there is a good chance it will also be cleaning
pages that are near the end of the LRU or physically close by as
well. It's not a guarantee, but for the additional IO cost of about
10% wall time on that IO to clean the page you need, you also get
1-2 orders of magnitude more pages cleaned. That sounds like a
win any way you look at it...
I agree that it doesn't solve the stack problem (Chris' suggestion
that we enable the bdi flusher interface would fix this); what I'm
pointing out is that the arguments that it is too hard or there are
no interfaces available to issue larger IO from reclaim are not at
all valid.
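To make the point concrete, something along the following lines is all it
would take to kick off writeback of a chunk of the file around the page
being reclaimed. This is a sketch only, not a patch: the helper name and
the 1MB window are arbitrary illustrative choices, and it does nothing
about the stack-depth objection raised above.

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

#define RECLAIM_CLUSTER_BYTES	(1024 * 1024)	/* arbitrary 1MB window */

static void start_writeback_around(struct page *page)
{
	struct address_space *mapping = page->mapping;
	loff_t pos = (loff_t)page->index << PAGE_SHIFT;
	loff_t start = pos & ~((loff_t)RECLAIM_CLUSTER_BYTES - 1);

	/* asynchronously start IO on ~1MB around this page; don't wait */
	filemap_fdatawrite_range(mapping, start,
				 start + RECLAIM_CLUSTER_BYTES - 1);
}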
> > the deepest call chain in queue_work() needs 700 bytes of stack
> > to complete, wait_for_completion() requires almost 2k of stack space
> > at it's deepest, the scheduler has some heavy stack users, etc,
> > and these are all functions that appear at the top of the stack.
> >
>
> The real issue here then is that stack usage has gone out of control.
That's definitely true, but it shouldn't cloud the fact that most
people want to kill writeback from direct reclaim, too, so killing two
birds with one stone seems like a good idea.
How about this? For now, we stop direct reclaim from doing writeback
only on order zero allocations, but allow it for higher order
allocations. That will prevent the majority of situations where
direct reclaim blows the stack and interferes with background
writeout, but won't cause lumpy reclaim to change behaviour.
This reduces the scope of the impact, and hence the testing and validation
that needs to be done.
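In code, the compromise amounts to a check like the following somewhere in
the pageout() path. This is a sketch rather than the actual patch; the
helper name is made up, while current_is_kswapd() and the order field in
scan_control already exist.

#include <linux/swap.h>		/* current_is_kswapd() */

/* hypothetical helper consulted before calling ->writepage */
static bool reclaim_may_writepage(struct scan_control *sc)
{
	if (current_is_kswapd())
		return true;		/* kswapd writeback unchanged */

	/* direct reclaim: only write back for lumpy (order > 0) reclaim */
	return sc->order > 0;
}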
Then we can work towards allowing lumpy reclaim to use background
threads as Chris suggested for doing specific writeback operations
to solve the remaining problems being seen. Does this seem like a
reasonable compromise and approach to dealing with the problem?
> Disabling ->writepage in direct reclaim does not guarantee that stack
> usage will not be a problem again. From your traces, page reclaim itself
> seems to be a big dirty hog.
I couldn't agree more - the kernel still needs to be put on a stack
usage diet, but the above would give us some breathing space to attack the
problem before more people start to hit these problems.
> > Good start, but 512 bytes will only catch select and splice read,
> > and there are 300-400 byte functions in the above list that sit near
> > the top of the stack....
> >
>
> They will need to be tackled in turn then but obviously there should be
> a focus on the common paths. The reclaim paths do seem particularly
> heavy and it's down to a lot of temporary variables. I might not get the
> time today but what I'm going to try do some time this week is
>
> o Look at what temporary variables are copies of other pieces of information
> o See what variables live for the duration of reclaim but are not needed
> for all of it (i.e. uninline parts of it so variables do not persist)
> o See if it's possible to dynamically allocate scan_control
Welcome to my world ;)
> The last one is the trickiest. Basically, the idea would be to move as much
> into scan_control as possible. Then, instead of allocating it on the stack,
> allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> a semaphore. Limit the number of direct reclaimers that can be active at a
> time to the number of scan_control variables. kswapd could still allocate
> its on the stack or with kmalloc.
>
> If it works out, it would have two main benefits. Limits the number of
> processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> reclaim, there is too much going on. It would also shrink the stack usage
> particularly if some of the stack variables are moved into scan_control.
>
> Maybe someone will beat me to looking at the feasibility of this.
I like the idea - it really sounds like you want a fixed size,
preallocated mempool that can't be enlarged. In fact, I can probably
use something like this in XFS to save a couple of hundred bytes of
stack space in the worst hogs....
> > > > This is the sort of thing I'm pointing at when I say that stack
> > > > usage outside XFS has grown significantly over the
> > > > past couple of years. Given XFS has remained pretty much the same or
> > > > even reduced slightly over the same time period, blaming XFS or
> > > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > > Regardless of the IO pattern performance issues, writeback via
> > > > direct reclaim just uses too much stack to be safe these days...
> > >
> > > Yeah, My answer is simple, All stack eater should be fixed.
> > > but XFS seems not innocence too. 3.5K is enough big although
> > > xfs have use such amount since very ago.
> >
> > XFS used to use much more than that - significant effort has been
> > put into reduce the stack footprint over many years. There's not
> > much left to trim without rewriting half the filesystem...
>
> I don't think he is levelling a complain at XFS in particular - just pointing
> out that it's heavy too. Still, we should be gratful that XFS is sort of
> a "Stack Canary". If it dies, everyone else could be in trouble soon :)
Yeah, true. Sorry if I'm being a bit too defensive here - the scars
from previous discussions like this are showing through....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Apr 14, 2010 at 03:52:32PM +0900, KOSAKI Motohiro wrote:
> > On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > > > have a relatively simple test that can be run to determine what the
> > > > > impact is?
> > > >
> > > > So, can you please run two workloads concurrently?
> > > > - Normal IO workload (fio, iozone, etc..)
> > > > - echo $NUM > /proc/sys/vm/nr_hugepages
> > >
> > > What do I measure/observe/record that is meaningful?
> >
> > So, a rough as guts first pass - just run a large dd (8 times the
> > size of memory - 8GB file vs 1GB RAM) and repeated try to allocate
> > the entire of memory in huge pages (500) every 5 seconds. The IO
> > rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
.....
> > Basically, with my patch lumpy reclaim was *substantially* more
> > effective with only a slight increase in average allocation latency
> > with this test case.
....
> > I know this is a simple test case, but it shows much better results
> > than I think anyone (even me) is expecting...
>
> Ummm...
>
> Probably I have to say I'm sorry; I guess my last mail gave you the
> wrong impression.
> To be honest, I'm not interested in this artificial non-fragmentation case.
And to be brutally honest, I'm not interested in wasting my time
trying to come up with a test case that you are interested in.
Instead, can you please provide me with your test cases
(scripts, preferably) that you use to measure the effectiveness of
reclaim changes and I'll run them.
> The above test-case does 1) discard all cache 2) fill pages by streaming
> io. Then it creates an artificial "file offset neighbor == block neighbor == PFN neighbor"
> situation, so file-offset-order writeout by the flusher thread can create
> PFN-contiguous pages effectively.
Yes, that's true, but it does indicate that in that situation, it is
more effective than the current code. FWIW, in the case of HPC
applications (which often use huge pages and clear the cache before
starting a new job), large streaming IO is a pretty common IO
pattern, so I don't think this situation is as artificial as you are
indicating.
> Why am I not interested in it? Because lumpy reclaim is a technique for
> avoiding the external fragmentation mess. IOW, it is for avoiding the
> worst case, but your test case seems to measure the best one.
Then please provide test cases that you consider valid.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> They will need to be tackled in turn then but obviously there should be
> a focus on the common paths. The reclaim paths do seem particularly
> heavy and it's down to a lot of temporary variables. I might not get the
> time today but what I'm going to try do some time this week is
>
> o Look at what temporary variables are copies of other pieces of information
> o See what variables live for the duration of reclaim but are not needed
> for all of it (i.e. uninline parts of it so variables do not persist)
> o See if it's possible to dynamically allocate scan_control
>
> The last one is the trickiest. Basically, the idea would be to move as much
> into scan_control as possible. Then, instead of allocating it on the stack,
> allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> a semaphore. Limit the number of direct reclaimers that can be active at a
> time to the number of scan_control variables. kswapd could still allocate
> its on the stack or with kmalloc.
>
> If it works out, it would have two main benefits. Limits the number of
> processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> reclaim, there is too much going on. It would also shrink the stack usage
> particularly if some of the stack variables are moved into scan_control.
>
> Maybe someone will beat me to looking at the feasibility of this.
I already have some patches to remove trivial parts of struct scan_control,
namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest
needs a deeper look.
A rather big offender in there is the combination of shrink_active_list (360
bytes here) and shrink_page_list (200 bytes). I am currently looking at
breaking out all the accounting stuff from shrink_active_list into a separate
leaf function so that the stack footprint does not add up.
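Roughly what that could look like, as a sketch of the direction rather than
the actual patches: the field names follow the existing zone_reclaim_stat
counters, but the helper itself is invented here.

/* pulled out of shrink_active_list() so the accounting does not sit in
 * the caller's stack frame for the whole scan */
static noinline_for_stack void
shrink_active_accounting(struct zone *zone, int file,
			 unsigned long nr_taken, unsigned long nr_rotated)
{
	struct zone_reclaim_stat *reclaim_stat = &zone->reclaim_stat;

	reclaim_stat->recent_scanned[file] += nr_taken;
	reclaim_stat->recent_rotated[file] += nr_rotated;
}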
Your idea of per-cpu allocated scan controls reminds me of an idea I have
had for some time now: moving reclaim into its own threads (per cpu?).
Not only would it separate the allocator's stack from the writeback stack,
we could also get rid of that too_many_isolated() workaround and coordinate
reclaim work better to prevent overreclaim.
But that is not a quick fix either...
Hi
> On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > They will need to be tackled in turn then but obviously there should be
> > a focus on the common paths. The reclaim paths do seem particularly
> > heavy and it's down to a lot of temporary variables. I might not get the
> > time today but what I'm going to try do some time this week is
> >
> > o Look at what temporary variables are copies of other pieces of information
> > o See what variables live for the duration of reclaim but are not needed
> > for all of it (i.e. uninline parts of it so variables do not persist)
> > o See if it's possible to dynamically allocate scan_control
> >
> > The last one is the trickiest. Basically, the idea would be to move as much
> > into scan_control as possible. Then, instead of allocating it on the stack,
> > allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> > a semaphore. Limit the number of direct reclaimers that can be active at a
> > time to the number of scan_control variables. kswapd could still allocate
> > its on the stack or with kmalloc.
> >
> > If it works out, it would have two main benefits. Limits the number of
> > processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> > reclaim, there is too much going on. It would also shrink the stack usage
> > particularly if some of the stack variables are moved into scan_control.
> >
> > Maybe someone will beat me to looking at the feasibility of this.
>
> I already have some patches to remove trivial parts of struct scan_control,
> namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest
> needs a deeper look.
Seems interesting, but a scan_control diet is not so effective. How many
bytes can we save with it?
> A rather big offender in there is the combination of shrink_active_list (360
> bytes here) and shrink_page_list (200 bytes). I am currently looking at
> breaking out all the accounting stuff from shrink_active_list into a separate
> leaf function so that the stack footprint does not add up.
pagevec: it consumes 128 bytes per struct. I have a patch removing it.
> Your idea of per-cpu allocated scan controls reminds me of an idea I have
> had for some time now: moving reclaim into its own threads (per cpu?).
>
> Not only would it separate the allocator's stack from the writeback stack,
> we could also get rid of that too_many_isolated() workaround and coordinate
> reclaim work better to prevent overreclaim.
>
> But that is not a quick fix either...
So, I hadn't thought of it that way. It probably seems good, but I'd like to do
the simple diet first.
Hi
> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.
Tend to agree, but I would propose a slightly different algorithm for
avoiding an incorrect OOM.
for high order allocations:
  allow lumpy reclaim and pageout() for both kswapd and direct reclaim
for low order allocations:
  - kswapd: always delegate IO to the flusher thread
  - direct reclaim: delegate IO to the flusher thread only if vm pressure is low
This seems safer. I mean, who wants to see an incorrect OOM regression?
I've made some patches for this. I'll post them as another mail.
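For reference, the policy described above boils down to something like the
following. This is a sketch, not the actual patches; the priority threshold
is an arbitrary stand-in for "vm pressure is high".

static bool should_pageout(struct scan_control *sc, int priority)
{
	/* high order: both kswapd and direct reclaim may use pageout() */
	if (sc->order > 0)
		return true;

	/* low order, kswapd: always delegate the IO to the flusher */
	if (current_is_kswapd())
		return false;

	/* low order, direct reclaim: write back only under real pressure,
	 * to avoid a spurious OOM (threshold is illustrative only) */
	return priority < DEF_PRIORITY / 2;
}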
> Then we can work towards allowing lumpy reclaim to use background
> threads as Chris suggested for doing specific writeback operations
> to solve the remaining problems being seen. Does this seem like a
> reasonable compromise and approach to dealing with the problem?
Tend to agree. We are probably discussing the right approach now, but
this definitely needs deep thinking, so I can't give an exact
answer yet.
Now, vmscan's pageout() is one source of IO throughput degradation.
Some IO workloads cause a great deal of order-0 allocation and reclaim,
and pageout's 4K IOs cause an annoying number of seeks.
At least kswapd can avoid such pageout() calls, because kswapd doesn't
need to consider the OOM-killer situation, so there's no risk there.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..d392a50 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
if (referenced_page)
return PAGEREF_RECLAIM_CLEAN;
+ /*
+ * Delegate pageout IO to flusher thread. They can make more
+ * effective IO pattern.
+ */
+ if (current_is_kswapd())
+ return PAGEREF_RECLAIM_CLEAN;
+
return PAGEREF_RECLAIM;
}
--
1.6.5.2
This patch is not directly related to the patch series,
but [4/4] depends on scan_control having a `priority' member,
so I'm including it.
=============================================
Since 2.6.28 zone->prev_priority has been unused, so it can be removed
safely. It reduces stack usage slightly.
Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
haven't produced good performance numbers, thus I'm giving up on that approach.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/mmzone.h | 15 -------------
mm/page_alloc.c | 2 -
mm/vmscan.c | 54 ++---------------------------------------------
mm/vmstat.c | 2 -
4 files changed, 3 insertions(+), 70 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..ad76962 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -339,21 +339,6 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
- * prev_priority holds the scanning priority for this zone. It is
- * defined as the scanning priority at which we achieved our reclaim
- * target at the previous try_to_free_pages() or balance_pgdat()
- * invocation.
- *
- * We use prev_priority as a measure of how much stress page reclaim is
- * under - it drives the swappiness decision: whether to unmap mapped
- * pages.
- *
- * Access to both this field is quite racy even on uniprocessor. But
- * it is expected to average out OK.
- */
- int prev_priority;
-
- /*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d03c946..88513c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3862,8 +3862,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;
- zone->prev_priority = DEF_PRIORITY;
-
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d392a50..dadb461 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1284,20 +1284,6 @@ done:
}
/*
- * We are about to scan this zone at a certain priority level. If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone. This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
- if (priority < zone->prev_priority)
- zone->prev_priority = priority;
-}
-
-/*
* This moves pages from the active list to the inactive list.
*
* We move them the other way if the page is referenced by one or more
@@ -1733,20 +1719,15 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
if (scanning_global_lru(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- note_zone_scanning_priority(zone, priority);
-
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
sc->all_unreclaimable = 0;
- } else {
+ } else
/*
* Ignore cpuset limitation here. We just want to reduce
* # of used pages by us regardless of memory shortage.
*/
sc->all_unreclaimable = 0;
- mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
- priority);
- }
shrink_zone(priority, zone, sc);
}
@@ -1852,17 +1833,11 @@ out:
if (priority < 0)
priority = 0;
- if (scanning_global_lru(sc)) {
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
+ if (scanning_global_lru(sc))
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- zone->prev_priority = priority;
- }
- } else
- mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
delayacct_freepages_end();
return ret;
@@ -2015,22 +1990,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
};
- /*
- * temp_priority is used to remember the scanning priority at which
- * this zone was successfully refilled to
- * free_pages == high_wmark_pages(zone).
- */
- int temp_priority[MAX_NR_ZONES];
-
loop_again:
total_scanned = 0;
sc.nr_reclaimed = 0;
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);
- for (i = 0; i < pgdat->nr_zones; i++)
- temp_priority[i] = DEF_PRIORITY;
-
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
@@ -2098,9 +2063,7 @@ loop_again:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
- temp_priority[i] = priority;
sc.nr_scanned = 0;
- note_zone_scanning_priority(zone, priority);
nid = pgdat->node_id;
zid = zone_idx(zone);
@@ -2173,16 +2136,6 @@ loop_again:
break;
}
out:
- /*
- * Note within each zone the priority level at which this zone was
- * brought into a happy state. So that the next thread which scans this
- * zone will start out at that priority level.
- */
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- zone->prev_priority = temp_priority[i];
- }
if (!all_zones_ok) {
cond_resched();
@@ -2600,7 +2553,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
priority = ZONE_RECLAIM_PRIORITY;
do {
- note_zone_scanning_priority(zone, priority);
shrink_zone(priority, zone, &sc);
priority--;
} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa12ea3..2db0a0f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -761,11 +761,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
}
seq_printf(m,
"\n all_unreclaimable: %u"
- "\n prev_priority: %i"
"\n start_pfn: %lu"
"\n inactive_ratio: %u",
zone->all_unreclaimable,
- zone->prev_priority,
zone->zone_start_pfn,
zone->inactive_ratio);
seq_putc(m, '\n');
--
1.6.5.2
Ditto: this patch is not directly related to this patch series, but [4/4]
depends on scan_control having a `priority' member, so I'm including
it here.
=========================================
Many functions in vmscan currently take a `priority' argument, which
consumes a little stack in every frame. Moving it into struct
scan_control reduces stack usage.
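As a rough illustration (a userspace sketch with simplified, illustrative
versions of the call chain, not the real kernel functions), a field in the
scan_control that is already passed by pointer does not have to be
forwarded as an extra argument through every level:
/* Userspace sketch: the priority lives in the context struct that every
 * level already receives, so no per-call argument is needed. */
#include <stdio.h>

#define DEF_PRIORITY 12

struct scan_control {
	int priority;
};

static void shrink_list(struct scan_control *sc)
{
	printf("scanning at priority %d\n", sc->priority);
}

static void shrink_zone(struct scan_control *sc)
{
	shrink_list(sc);		/* no priority argument to forward */
}

int main(void)
{
	struct scan_control sc = { .priority = DEF_PRIORITY };

	for (; sc.priority >= 0; sc.priority--)	/* mirrors the DEF_PRIORITY..0 loop */
		shrink_zone(&sc);
	return 0;
}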
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 83 ++++++++++++++++++++++++++--------------------------------
1 files changed, 37 insertions(+), 46 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dadb461..8b78b49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,8 @@ struct scan_control {
int order;
+ int priority;
+
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
@@ -1130,7 +1132,7 @@ static int too_many_isolated(struct zone *zone, int file,
*/
static unsigned long shrink_inactive_list(unsigned long max_scan,
struct zone *zone, struct scan_control *sc,
- int priority, int file)
+ int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
@@ -1156,7 +1158,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
lumpy_reclaim = 1;
- else if (sc->order && priority < DEF_PRIORITY - 2)
+ else if (sc->order && sc->priority < DEF_PRIORITY - 2)
lumpy_reclaim = 1;
pagevec_init(&pvec, 1);
@@ -1335,7 +1337,7 @@ static void move_active_pages_to_lru(struct zone *zone,
}
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
- struct scan_control *sc, int priority, int file)
+ struct scan_control *sc, int file)
{
unsigned long nr_taken;
unsigned long pgscanned;
@@ -1498,17 +1500,17 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
}
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
- struct zone *zone, struct scan_control *sc, int priority)
+ struct zone *zone, struct scan_control *sc)
{
int file = is_file_lru(lru);
if (is_active_lru(lru)) {
if (inactive_list_is_low(zone, sc, file))
- shrink_active_list(nr_to_scan, zone, sc, priority, file);
+ shrink_active_list(nr_to_scan, zone, sc, file);
return 0;
}
- return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
+ return shrink_inactive_list(nr_to_scan, zone, sc, file);
}
/*
@@ -1615,8 +1617,7 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
-static void shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -1640,8 +1641,8 @@ static void shrink_zone(int priority, struct zone *zone,
unsigned long scan;
scan = zone_nr_lru_pages(zone, sc, l);
- if (priority || noswap) {
- scan >>= priority;
+ if (sc->priority || noswap) {
+ scan >>= sc->priority;
scan = (scan * percent[file]) / 100;
}
nr[l] = nr_scan_try_batch(scan,
@@ -1657,7 +1658,7 @@ static void shrink_zone(int priority, struct zone *zone,
nr[l] -= nr_to_scan;
nr_reclaimed += shrink_list(l, nr_to_scan,
- zone, sc, priority);
+ zone, sc);
}
}
/*
@@ -1668,7 +1669,8 @@ static void shrink_zone(int priority, struct zone *zone,
* with multiple processes reclaiming pages, the total
* freeing target can get unreasonably large.
*/
- if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
+ if (nr_reclaimed >= nr_to_reclaim &&
+ sc->priority < DEF_PRIORITY)
break;
}
@@ -1679,7 +1681,7 @@ static void shrink_zone(int priority, struct zone *zone,
* rebalance the anon lru active/inactive ratio.
*/
if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
- shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
+ shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, 0);
throttle_vm_writeout(sc->gfp_mask);
}
@@ -1700,8 +1702,7 @@ static void shrink_zone(int priority, struct zone *zone,
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
-static void shrink_zones(int priority, struct zonelist *zonelist,
- struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
struct zoneref *z;
@@ -1719,7 +1720,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
if (scanning_global_lru(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable &&
+ sc->priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
sc->all_unreclaimable = 0;
} else
@@ -1729,7 +1731,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
*/
sc->all_unreclaimable = 0;
- shrink_zone(priority, zone, sc);
+ shrink_zone(zone, sc);
}
}
@@ -1752,7 +1754,6 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct scan_control *sc)
{
- int priority;
unsigned long ret = 0;
unsigned long total_scanned = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1779,11 +1780,11 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
}
}
- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (sc->priority = DEF_PRIORITY; sc->priority >= 0; sc->priority--) {
sc->nr_scanned = 0;
- if (!priority)
+ if (!sc->priority)
disable_swap_token();
- shrink_zones(priority, zonelist, sc);
+ shrink_zones(zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
@@ -1816,23 +1817,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
- priority < DEF_PRIORITY - 2)
+ sc->priority < DEF_PRIORITY - 2)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
ret = sc->nr_reclaimed;
-out:
- /*
- * Now that we've scanned all the zones at this priority level, note
- * that level within the zone so that the next thread which performs
- * scanning of this zone will immediately start out at this priority
- * level. This affects only the decision whether or not to bring
- * mapped pages onto the inactive list.
- */
- if (priority < 0)
- priority = 0;
+out:
if (scanning_global_lru(sc))
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
@@ -1892,7 +1884,8 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_zone(0, zone, &sc);
+ sc.priority = 0;
+ shrink_zone(zone, &sc);
return sc.nr_reclaimed;
}
@@ -1972,7 +1965,6 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
int all_zones_ok;
- int priority;
int i;
unsigned long total_scanned;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1996,13 +1988,13 @@ loop_again:
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);
- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
int has_under_min_watermark_zone = 0;
/* The swap token gets in the way of swapout... */
- if (!priority)
+ if (!sc.priority)
disable_swap_token();
all_zones_ok = 1;
@@ -2017,7 +2009,7 @@ loop_again:
if (!populated_zone(zone))
continue;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable && sc.priority != DEF_PRIORITY)
continue;
/*
@@ -2026,7 +2018,7 @@ loop_again:
*/
if (inactive_anon_is_low(zone, &sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone,
- &sc, priority, 0);
+ &sc, 0);
if (!zone_watermark_ok(zone, order,
high_wmark_pages(zone), 0, 0)) {
@@ -2060,7 +2052,7 @@ loop_again:
if (!populated_zone(zone))
continue;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable && sc.priority != DEF_PRIORITY)
continue;
sc.nr_scanned = 0;
@@ -2079,7 +2071,7 @@ loop_again:
*/
if (!zone_watermark_ok(zone, order,
8*high_wmark_pages(zone), end_zone, 0))
- shrink_zone(priority, zone, &sc);
+ shrink_zone(zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
lru_pages);
@@ -2119,7 +2111,7 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+ if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
if (has_under_min_watermark_zone)
count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
else
@@ -2520,7 +2512,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
- int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2551,11 +2542,11 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* Free memory by calling shrink zone with increasing
* priorities until we have enough memory freed.
*/
- priority = ZONE_RECLAIM_PRIORITY;
+ sc.priority = ZONE_RECLAIM_PRIORITY;
do {
- shrink_zone(priority, zone, &sc);
- priority--;
- } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
+ shrink_zone(zone, &sc);
+ sc.priority--;
+ } while (sc.priority >= 0 && sc.nr_reclaimed < nr_pages);
}
slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
--
1.6.5.2
Even when pageout() is called from direct reclaim, we can delegate the
IO to the flusher threads as long as VM pressure is low.
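To illustrate what "low pressure" means here, a small userspace sketch
(assuming DEF_PRIORITY == 12 and a scan target of roughly
lru_size >> priority, as in shrink_zone()): the passes with
priority > DEF_PRIORITY/2 only ask for a tiny fraction of the LRU, so
dirty pages can safely be left to the flusher threads.
#include <stdio.h>

#define DEF_PRIORITY 12

int main(void)
{
	unsigned long lru_pages = 262144;	/* example: 1GB worth of 4K pages */
	int priority;

	for (priority = DEF_PRIORITY; priority >= 0; priority--)
		printf("priority %2d: scan target %6lu pages, dirty pages: %s\n",
		       priority, lru_pages >> priority,
		       priority > DEF_PRIORITY / 2 ?
				"delegate to flusher" : "pageout() allowed");
	return 0;
}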
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b78b49..eab6028 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -623,6 +623,13 @@ static enum page_references page_check_references(struct page *page,
if (current_is_kswapd())
return PAGEREF_RECLAIM_CLEAN;
+ /*
+ * Now VM pressure is not so high. then we can delegate
+ * page cleaning to flusher thread safely.
+ */
+ if (!sc->order && sc->priority > DEF_PRIORITY/2)
+ return PAGEREF_RECLAIM_CLEAN;
+
return PAGEREF_RECLAIM;
}
--
1.6.5.2
> Hi
>
> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
>
> Tend to agree. but I would proposed slightly different algorithm for
> avoind incorrect oom.
>
> for high order allocation
> allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
>
> for low order allocation
> - kswapd: always delegate io to flusher thread
> - direct reclaim: delegate io to flusher thread only if vm pressure is low
>
> This seems more safely. I mean Who want see incorrect oom regression?
> I've made some pathes for this. I'll post it as another mail.
At the moment, kernel compiles and/or backup operations seem to keep
nr_vmscan_write == 0.
Dave, can you please try running the workload where pageout() gets annoying?
On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
>
> Tend to agree. but I would proposed slightly different algorithm for
> avoind incorrect oom.
>
> for high order allocation
> allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
So, the same as current behaviour.
> for low order allocation
> - kswapd: always delegate io to flusher thread
> - direct reclaim: delegate io to flusher thread only if vm pressure is low
IMO, this really doesn't fix either of the problems - the bad IO
patterns nor the stack usage. All it will take is a bit more memory
pressure to trigger stack and IO problems, and the user reporting the
problems is generating an awful lot of memory pressure...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Apr 15, 2010 at 01:35:17PM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence testing and validation
> > > the needs to be done.
> >
> > Tend to agree. but I would proposed slightly different algorithm for
> > avoind incorrect oom.
> >
> > for high order allocation
> > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> >
> > for low order allocation
> > - kswapd: always delegate io to flusher thread
> > - direct reclaim: delegate io to flusher thread only if vm pressure is low
> >
> > This seems more safely. I mean Who want see incorrect oom regression?
> > I've made some pathes for this. I'll post it as another mail.
>
> Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> Dave, can you please try to run your pageout annoying workload?
It's just as easy for you to run and observe the effects. Start with a VM
with 1GB RAM and a 10GB scratch block device:
# mkfs.xfs -f /dev/<blah>
# mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
in one shell:
# while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
in another shell, if you have fs_mark installed, run:
# ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
otherwise run a couple of these in parallel on different directories:
# for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done
Cheers,
Dave.
--
Dave Chinner
[email protected]
> On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence testing and validation
> > > the needs to be done.
> >
> > Tend to agree. but I would proposed slightly different algorithm for
> > avoind incorrect oom.
> >
> > for high order allocation
> > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
>
> SO same as current.
Yes, the same as you proposed.
>
> > for low order allocation
> > - kswapd: always delegate io to flusher thread
> > - direct reclaim: delegate io to flusher thread only if vm pressure is low
>
> IMO, this really doesn't fix either of the problems - the bad IO
> patterns nor the stack usage. All it will take is a bit more memory
> pressure to trigger stack and IO problems, and the user reporting the
> problems is generating an awful lot of memory pressure...
This patch doesn't address stack usage, because
- again, I think all of the stack eaters should go on a diet.
- in a world that allows lumpy reclaim, denying only low-order reclaim
doesn't solve anything.
Please don't forget that a priority=0 reclaim failure invokes the
OOM-killer. I can't imagine anyone wants that.
And which IO workload triggers a <6 priority vmscan?
> On Thu, Apr 15, 2010 at 01:35:17PM +0900, KOSAKI Motohiro wrote:
> > > Hi
> > >
> > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > only on order zero allocations, but allow it for higher order
> > > > allocations. That will prevent the majority of situations where
> > > > direct reclaim blows the stack and interferes with background
> > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > This reduces the scope of impact and hence testing and validation
> > > > the needs to be done.
> > >
> > > Tend to agree. but I would proposed slightly different algorithm for
> > > avoind incorrect oom.
> > >
> > > for high order allocation
> > > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> > >
> > > for low order allocation
> > > - kswapd: always delegate io to flusher thread
> > > - direct reclaim: delegate io to flusher thread only if vm pressure is low
> > >
> > > This seems more safely. I mean Who want see incorrect oom regression?
> > > I've made some pathes for this. I'll post it as another mail.
> >
> > Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> > Dave, can you please try to run your pageout annoying workload?
>
> It's just as easy for you to run and observe the effects. Start with a VM
> with 1GB RAM and a 10GB scratch block device:
>
> # mkfs.xfs -f /dev/<blah>
> # mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
>
> in one shell:
>
> # while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
>
> in another shell, if you have fs_mark installed, run:
>
> # ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
>
> otherwise run a couple of these in parallel on different directories:
>
> # for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done
Thanks.
Unfortunately, I don't have any unused disks, so I'll (probably) try it
next week.
On Thu, Apr 15, 2010 at 03:44:50PM +0900, KOSAKI Motohiro wrote:
> > > Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> > > Dave, can you please try to run your pageout annoying workload?
> >
> > It's just as easy for you to run and observe the effects. Start with a VM
> > with 1GB RAM and a 10GB scratch block device:
> >
> > # mkfs.xfs -f /dev/<blah>
> > # mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
> >
> > in one shell:
> >
> > # while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
> >
> > in another shell, if you have fs_mark installed, run:
> >
> > # ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
> >
> > otherwise run a couple of these in parallel on different directories:
> >
> > # for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done
>
> Thanks.
>
> Unfortunately, I don't have unused disks. So, I'll try it at (probably)
> next week.
A filesystem on a loopback device will work just as well ;)
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
>
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
What's your opinion on trying to cluster the writes done by pageout,
instead of not doing any paging out in kswapd?
Something along these lines:
Cluster writes to disk due to memory pressure.
Write out pages logically adjacent to the one we're paging out so that
we may get better IOs in these situations: these pages are likely to be
contiguous on disk with the one we're writing out, so they should get
merged into a single disk IO.
Signed-off-by: Suleiman Souhlal <[email protected]>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..4e5a613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,8 @@
#include "internal.h"
+#define PAGEOUT_CLUSTER_PAGES 16
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -350,6 +352,8 @@ typedef enum {
static pageout_t pageout(struct page *page, struct address_space *mapping,
enum pageout_io sync_writeback)
{
+ int i;
+
/*
* If the page is dirty, only perform writeback if that write
* will be non-blocking. To prevent this allocation from being
@@ -408,6 +412,37 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
}
/*
+ * Try to write out logically adjacent dirty pages too, if
+ * possible, to get better IOs, as the IO scheduler should
+ * merge them with the original one, if the file is not too
+ * fragmented.
+ */
+ for (i = 1; i < PAGEOUT_CLUSTER_PAGES; i++) {
+ struct page *p2;
+ int err;
+
+ p2 = find_get_page(mapping, page->index + i);
+ if (p2) {
+ if (trylock_page(p2) == 0) {
+ page_cache_release(p2);
+ break;
+ }
+ if (page_mapped(p2))
+ try_to_unmap(p2, 0);
+ if (PageDirty(p2)) {
+ err = write_one_page(p2, 0);
+ page_cache_release(p2);
+ if (err)
+ break;
+ } else {
+ unlock_page(p2);
+ page_cache_release(p2);
+ break;
+ }
+ }
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
>
> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>
> > Now, vmscan pageout() is one of IO throuput degression source.
> > Some IO workload makes very much order-0 allocation and reclaim
> > and pageout's 4K IOs are making annoying lots seeks.
> >
> > At least, kswapd can avoid such pageout() because kswapd don't
> > need to consider OOM-Killer situation. that's no risk.
> >
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> What's your opinion on trying to cluster the writes done by pageout,
> instead of not doing any paging out in kswapd?
> Something along these lines:
Interesting.
So, I'd like to review your patch carefully. Can you please give me one
day? :)
>
> Cluster writes to disk due to memory pressure.
>
> Write out logically adjacent pages to the one we're paging out
> so that we may get better IOs in these situations:
> These pages are likely to be contiguous on disk to the one we're
> writing out, so they should get merged into a single disk IO.
>
> Signed-off-by: Suleiman Souhlal <[email protected]>
> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
>
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.
I've found one bug in this patch myself: the flusher threads don't
page out anon pages, so we need a PageAnon() check ;)
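Something like this, perhaps (a hypothetical, untested userspace sketch of
the corrected condition; the flusher threads only clean file-backed pages,
so anon pages must stay eligible for pageout() to swap):
#include <stdbool.h>
#include <stdio.h>

enum page_references { PAGEREF_RECLAIM, PAGEREF_RECLAIM_CLEAN };

static enum page_references check(bool is_kswapd, bool page_is_anon)
{
	if (is_kswapd && !page_is_anon)
		return PAGEREF_RECLAIM_CLEAN;	/* file page: leave IO to flusher */
	return PAGEREF_RECLAIM;			/* anon page: pageout() to swap */
}

int main(void)
{
	printf("kswapd + file page -> %s\n",
	       check(true, false) == PAGEREF_RECLAIM_CLEAN ? "skip pageout" : "pageout");
	printf("kswapd + anon page -> %s\n",
	       check(true, true) == PAGEREF_RECLAIM_CLEAN ? "skip pageout" : "pageout");
	return 0;
}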
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..d392a50 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
> if (referenced_page)
> return PAGEREF_RECLAIM_CLEAN;
>
> + /*
> + * Delegate pageout IO to flusher thread. They can make more
> + * effective IO pattern.
> + */
> + if (current_is_kswapd())
> + return PAGEREF_RECLAIM_CLEAN;
> +
> return PAGEREF_RECLAIM;
> }
>
> --
> 1.6.5.2
>
>
>
Cc to Johannes
> >
> > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> >
> > > Now, vmscan pageout() is one of IO throuput degression source.
> > > Some IO workload makes very much order-0 allocation and reclaim
> > > and pageout's 4K IOs are making annoying lots seeks.
> > >
> > > At least, kswapd can avoid such pageout() because kswapd don't
> > > need to consider OOM-Killer situation. that's no risk.
> > >
> > > Signed-off-by: KOSAKI Motohiro <[email protected]>
> >
> > What's your opinion on trying to cluster the writes done by pageout,
> > instead of not doing any paging out in kswapd?
> > Something along these lines:
>
> Interesting.
> So, I'd like to review your patch carefully. can you please give me one
> day? :)
Hannes, if I remember correctly, you tried a similar swap-cluster IO
approach a long time ago, but I can't remember why we didn't merge such
a patch. Do you remember anything?
>
>
> >
> > Cluster writes to disk due to memory pressure.
> >
> > Write out logically adjacent pages to the one we're paging out
> > so that we may get better IOs in these situations:
> > These pages are likely to be contiguous on disk to the one we're
> > writing out, so they should get merged into a single disk IO.
> >
> > Signed-off-by: Suleiman Souhlal <[email protected]>
>
>
>
>
On Thu, Apr 15, 2010 at 03:35:14PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > > Hi
> > >
> > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > only on order zero allocations, but allow it for higher order
> > > > allocations. That will prevent the majority of situations where
> > > > direct reclaim blows the stack and interferes with background
> > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > This reduces the scope of impact and hence testing and validation
> > > > the needs to be done.
> > >
> > > Tend to agree. but I would proposed slightly different algorithm for
> > > avoind incorrect oom.
> > >
> > > for high order allocation
> > > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> >
> > SO same as current.
>
> Yes. as same as you propsed.
>
> >
> > > for low order allocation
> > > - kswapd: always delegate io to flusher thread
> > > - direct reclaim: delegate io to flusher thread only if vm pressure is low
> >
> > IMO, this really doesn't fix either of the problems - the bad IO
> > patterns nor the stack usage. All it will take is a bit more memory
> > pressure to trigger stack and IO problems, and the user reporting the
> > problems is generating an awful lot of memory pressure...
>
> This patch doesn't care stack usage. because
> - again, I think all stack eater shold be diet.
Agreed (again), but we've already come to the conclusion that a
stack diet is not enough.
> - under allowing lumpy reclaim world, only deny low order reclaim
> doesn't solve anything.
Yes, I suggested it *as a first step*, not as the end goal. Your
patches don't reach the first step which is fixing the reported
stack problem for order-0 allocations...
> Please don't forget priority=0 recliam failure incvoke OOM-killer.
> I don't imagine anyone want it.
Given that I haven't been able to trigger OOM without writeback from
direct reclaim so far (*) I'm not finding any evidence that it is a
problem or that there are regressions. I want to be able to say
that this change has no known regressions. I want to find the
regression and work to fix them, but without test cases there's no
way I can do this.
This is what I'm getting frustrated about - I want to fix this
problem once and for all, but I can't find out what I need to do to
robustly test such a change so we can have a high degree of
confidence that it doesn't introduce major regressions. Can anyone
help here?
(*) except in one case I've already described where it managed to
allocate enough huge pages to starve the system of order zero pages,
which is what I asked it to do.
> And, Which IO workload trigger <6 priority vmscan?
You're asking me? I've been asking you for workloads that wind up
reclaim priority.... :/
All I can say is that the most common trigger I see for OOM is
copying a large file on a busy system that is running off a single
spindle. When that happens on my laptop I walk away and get a cup
of coffee when that happens and when I come back I pick up all the
broken bits the OOM killer left behind.....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>
> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>
> >Now, vmscan pageout() is one of IO throuput degression source.
> >Some IO workload makes very much order-0 allocation and reclaim
> >and pageout's 4K IOs are making annoying lots seeks.
> >
> >At least, kswapd can avoid such pageout() because kswapd don't
> >need to consider OOM-Killer situation. that's no risk.
> >
> >Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> What's your opinion on trying to cluster the writes done by pageout,
> instead of not doing any paging out in kswapd?
XFS already does this in ->writepage to try to minimise the impact
of the way pageout issues IO. It helps, but it is still not as good
as having all the writeback come from the flusher threads because
it's still pretty much random IO.
And, FWIW, it doesn't solve the stack usage problems, either. In
fact, it will make them worse as write_one_page() puts another
struct writeback_control on the stack...
Cheers,
Dave.
--
Dave Chinner
[email protected]
> On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
> >
> > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> >
> > >Now, vmscan pageout() is one of IO throuput degression source.
> > >Some IO workload makes very much order-0 allocation and reclaim
> > >and pageout's 4K IOs are making annoying lots seeks.
> > >
> > >At least, kswapd can avoid such pageout() because kswapd don't
> > >need to consider OOM-Killer situation. that's no risk.
> > >
> > >Signed-off-by: KOSAKI Motohiro <[email protected]>
> >
> > What's your opinion on trying to cluster the writes done by pageout,
> > instead of not doing any paging out in kswapd?
>
> XFS already does this in ->writepage to try to minimise the impact
> of the way pageout issues IO. It helps, but it is still not as good
> as having all the writeback come from the flusher threads because
> it's still pretty much random IO.
I haven't reviewed such a patch yet, so I'm talking about the generic
case. pageout() doesn't only write out file-backed pages, it also
writes out swap-backed pages, so neither filesystem optimizations nor
the flusher threads remove the value of clustering in pageout().
> And, FWIW, it doesn't solve the stack usage problems, either. In
> fact, it will make them worse as write_one_page() puts another
> struct writeback_control on the stack...
Correct. We need to avoid putting a second writeback_control on the
stack; we probably need to split pageout() into several pieces.
> On Thu, Apr 15, 2010 at 03:35:14PM +0900, KOSAKI Motohiro wrote:
> > > On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > > > Hi
> > > >
> > > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > > only on order zero allocations, but allow it for higher order
> > > > > allocations. That will prevent the majority of situations where
> > > > > direct reclaim blows the stack and interferes with background
> > > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > > This reduces the scope of impact and hence testing and validation
> > > > > the needs to be done.
> > > >
> > > > Tend to agree. but I would proposed slightly different algorithm for
> > > > avoind incorrect oom.
> > > >
> > > > for high order allocation
> > > > allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> > >
> > > SO same as current.
> >
> > Yes. as same as you propsed.
> >
> > >
> > > > for low order allocation
> > > > - kswapd: always delegate io to flusher thread
> > > > - direct reclaim: delegate io to flusher thread only if vm pressure is low
> > >
> > > IMO, this really doesn't fix either of the problems - the bad IO
> > > patterns nor the stack usage. All it will take is a bit more memory
> > > pressure to trigger stack and IO problems, and the user reporting the
> > > problems is generating an awful lot of memory pressure...
> >
> > This patch doesn't care stack usage. because
> > - again, I think all stack eater shold be diet.
>
> Agreed (again), but we've already come to the conclusion that a
> stack diet is not enough.
ok.
> > - under allowing lumpy reclaim world, only deny low order reclaim
> > doesn't solve anything.
>
> Yes, I suggested it *as a first step*, not as the end goal. Your
> patches don't reach the first step which is fixing the reported
> stack problem for order-0 allocations...
I have some diet patches as a separate series. I'll post today's diet
patches in another mail; I didn't want to mix completely unrelated patches.
> > Please don't forget priority=0 recliam failure incvoke OOM-killer.
> > I don't imagine anyone want it.
>
> Given that I haven't been able to trigger OOM without writeback from
> direct reclaim so far (*) I'm not finding any evidence that it is a
> problem or that there are regressions. I want to be able to say
> that this change has no known regressions. I want to find the
> regression and work to fix them, but without test cases there's no
> way I can do this.
>
> This is what I'm getting frustrated about - I want to fix this
> problem once and for all, but I can't find out what I need to do to
> robustly test such a change so we can have a high degree of
> confidence that it doesn't introduce major regressions. Can anyone
> help here?
>
> (*) except in one case I've already described where it mananged to
> allocate enough huge pages to starve the system of order zero pages,
> which is what I asked it to do.
Agreed. I'm sorry about that. Probably nobody in the world has enough
VM test cases, Linux people included. A modern general-purpose OS is
used for a huge variety of purposes on a huge variety of machines, so
I have never seen a VM change with perfectly zero regressions, and I
feel the same frustration every time.
Much of the VM's messiness exists to avoid extreme starvation cases,
but if such a case can be reproduced easily, it's a VM bug ;)
> > And, Which IO workload trigger <6 priority vmscan?
>
> You're asking me? I've been asking you for workloads that wind up
> reclaim priority.... :/
??? Did I misunderstand your last mail?
You wrote:
> IMO, this really doesn't fix either of the problems - the bad IO
> patterns nor the stack usage. All it will take is a bit more memory
> pressure to trigger stack and IO problems, and the user reporting the
> problems is generating an awful lot of memory pressure...
and I asked which workload produces "the bad IO patterns". If that's not
what you meant, which IO pattern were you talking about?
If my understanding is correct, you asked me about a case where vmscan
hurts, and I asked you about your bad IO pattern. Now I'm guessing your
intention was "bad IO patterns" in general, not "the IO patterns" of a
specific workload??
> All I can say is that the most common trigger I see for OOM is
> copying a large file on a busy system that is running off a single
> spindle. When that happens on my laptop I walk away and get a cup
> of coffee when that happens and when I come back I pick up all the
> broken bits the OOM killer left behind.....
As far as I understand, you are talking about a general situation rather
than a specific one, so I'm also answering in general terms. In general,
I think a slowdown is better than the OOM-killer. So, even though we need
more and more improvement, we should always take care to avoid incorrect
OOMs. In other words, I'd prefer step-by-step development.
Now the max_scan argument of shrink_inactive_list() is always no more
than SWAP_CLUSTER_MAX, so we can remove the page-scanning loop inside it
(see the sketch below, after the detail list). This also helps the stack
diet.
Details:
- remove the "while (nr_scanned < max_scan)" loop
- remove nr_freed (now we use nr_reclaimed directly)
- remove nr_scan (now we use nr_scanned directly)
- rename max_scan to nr_to_scan
- pass nr_to_scan into isolate_pages() directly instead of
using SWAP_CLUSTER_MAX
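A quick userspace sketch of why the loop can go (assuming, as the callers
now guarantee, a cap of SWAP_CLUSTER_MAX == 32 and that one isolate_pages()
pass scans up to that many pages):
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

int main(void)
{
	unsigned long max_scan = SWAP_CLUSTER_MAX;	/* what callers now pass */
	unsigned long nr_scanned = 0;
	int iterations = 0;

	while (nr_scanned < max_scan) {
		nr_scanned += SWAP_CLUSTER_MAX;		/* one isolate_pages() batch */
		iterations++;
	}
	printf("loop iterations: %d\n", iterations);	/* always 1 */
	return 0;
}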
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 190 ++++++++++++++++++++++++++++-------------------------------
1 files changed, 89 insertions(+), 101 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eab6028..4de4029 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc,
int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
- unsigned long nr_scanned = 0;
+ unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int lumpy_reclaim = 0;
+ struct page *page;
+ unsigned long nr_taken;
+ unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ unsigned long nr_anon;
+ unsigned long nr_file;
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- do {
- struct page *page;
- unsigned long nr_taken;
- unsigned long nr_scan;
- unsigned long nr_freed;
- unsigned long nr_active;
- unsigned int count[NR_LRU_LISTS] = { 0, };
- int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
- unsigned long nr_anon;
- unsigned long nr_file;
-
- nr_taken = sc->isolate_pages(SWAP_CLUSTER_MAX,
- &page_list, &nr_scan, sc->order, mode,
- zone, sc->mem_cgroup, 0, file);
+ nr_taken = sc->isolate_pages(nr_to_scan,
+ &page_list, &nr_scanned, sc->order,
+ lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+ zone, sc->mem_cgroup, 0, file);
- if (scanning_global_lru(sc)) {
- zone->pages_scanned += nr_scan;
- if (current_is_kswapd())
- __count_zone_vm_events(PGSCAN_KSWAPD, zone,
- nr_scan);
- else
- __count_zone_vm_events(PGSCAN_DIRECT, zone,
- nr_scan);
- }
+ if (scanning_global_lru(sc)) {
+ zone->pages_scanned += nr_scanned;
+ if (current_is_kswapd())
+ __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
+ else
+ __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
+ }
- if (nr_taken == 0)
- goto done;
+ if (nr_taken == 0)
+ goto done;
- nr_active = clear_active_flags(&page_list, count);
- __count_vm_events(PGDEACTIVATE, nr_active);
+ nr_active = clear_active_flags(&page_list, count);
+ __count_vm_events(PGDEACTIVATE, nr_active);
- __mod_zone_page_state(zone, NR_ACTIVE_FILE,
- -count[LRU_ACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_INACTIVE_FILE,
- -count[LRU_INACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_ACTIVE_ANON,
- -count[LRU_ACTIVE_ANON]);
- __mod_zone_page_state(zone, NR_INACTIVE_ANON,
- -count[LRU_INACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_ACTIVE_FILE,
+ -count[LRU_ACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_INACTIVE_FILE,
+ -count[LRU_INACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_ACTIVE_ANON,
+ -count[LRU_ACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON,
+ -count[LRU_INACTIVE_ANON]);
- nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
- nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+ nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
- reclaim_stat->recent_scanned[0] += nr_anon;
- reclaim_stat->recent_scanned[1] += nr_file;
+ reclaim_stat->recent_scanned[0] += nr_anon;
+ reclaim_stat->recent_scanned[1] += nr_file;
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
- nr_scanned += nr_scan;
- nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+ nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+ /*
+ * If we are direct reclaiming for contiguous pages and we do
+ * not reclaim everything in the list, try again and wait
+ * for IO to complete. This will stall high-order allocations
+ * but that should be acceptable to the caller
+ */
+ if (nr_reclaimed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
/*
- * If we are direct reclaiming for contiguous pages and we do
- * not reclaim everything in the list, try again and wait
- * for IO to complete. This will stall high-order allocations
- * but that should be acceptable to the caller
+ * The attempt at page out may have made some
+ * of the pages active, mark them inactive again.
*/
- if (nr_freed < nr_taken && !current_is_kswapd() &&
- lumpy_reclaim) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
-
- /*
- * The attempt at page out may have made some
- * of the pages active, mark them inactive again.
- */
- nr_active = clear_active_flags(&page_list, count);
- count_vm_events(PGDEACTIVATE, nr_active);
-
- nr_freed += shrink_page_list(&page_list, sc,
- PAGEOUT_IO_SYNC);
- }
+ nr_active = clear_active_flags(&page_list, count);
+ count_vm_events(PGDEACTIVATE, nr_active);
- nr_reclaimed += nr_freed;
+ nr_reclaimed += shrink_page_list(&page_list, sc,
+ PAGEOUT_IO_SYNC);
+ }
- local_irq_disable();
- if (current_is_kswapd())
- __count_vm_events(KSWAPD_STEAL, nr_freed);
- __count_zone_vm_events(PGSTEAL, zone, nr_freed);
+ local_irq_disable();
+ if (current_is_kswapd())
+ __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+ __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
- spin_lock(&zone->lru_lock);
- /*
- * Put back any unfreeable pages.
- */
- while (!list_empty(&page_list)) {
- int lru;
- page = lru_to_page(&page_list);
- VM_BUG_ON(PageLRU(page));
- list_del(&page->lru);
- if (unlikely(!page_evictable(page, NULL))) {
- spin_unlock_irq(&zone->lru_lock);
- putback_lru_page(page);
- spin_lock_irq(&zone->lru_lock);
- continue;
- }
- SetPageLRU(page);
- lru = page_lru(page);
- add_page_to_lru_list(zone, page, lru);
- if (is_active_lru(lru)) {
- int file = is_file_lru(lru);
- reclaim_stat->recent_rotated[file]++;
- }
- if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&zone->lru_lock);
- __pagevec_release(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
+ spin_lock(&zone->lru_lock);
+ /*
+ * Put back any unfreeable pages.
+ */
+ while (!list_empty(&page_list)) {
+ int lru;
+ page = lru_to_page(&page_list);
+ VM_BUG_ON(PageLRU(page));
+ list_del(&page->lru);
+ if (unlikely(!page_evictable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ putback_lru_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
}
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
- } while (nr_scanned < max_scan);
+ SetPageLRU(page);
+ lru = page_lru(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (is_active_lru(lru)) {
+ int file = is_file_lru(lru);
+ reclaim_stat->recent_rotated[file]++;
+ }
+ if (!pagevec_add(&pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
done:
spin_unlock_irq(&zone->lru_lock);
--
1.6.5.2
This patch is used by [3/4].
===================================
free_hot_cold_page() and __free_pages_ok() have very similar freeing
preparation. This patch consolidates it.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 40 +++++++++++++++++++++-------------------
1 files changed, 21 insertions(+), 19 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88513c0..ba9aea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}
-static void __free_pages_ok(struct page *page, unsigned int order)
+static int free_pages_prepare(struct page *page, unsigned int order)
{
- unsigned long flags;
int i;
int bad = 0;
- int wasMlocked = __TestClearPageMlocked(page);
trace_mm_page_free_direct(page, order);
kmemcheck_free_shadow(page, order);
- for (i = 0 ; i < (1 << order) ; ++i)
- bad += free_pages_check(page + i);
+ for (i = 0 ; i < (1 << order) ; ++i) {
+ struct page *pg = page + i;
+
+ if (PageAnon(pg))
+ pg->mapping = NULL;
+ bad += free_pages_check(pg);
+ }
if (bad)
- return;
+ return -EINVAL;
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
@@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
arch_free_page(page, order);
kernel_map_pages(page, 1 << order, 0);
+ return 0;
+}
+
+static void __free_pages_ok(struct page *page, unsigned int order)
+{
+ unsigned long flags;
+ int wasMlocked = __TestClearPageMlocked(page);
+
+ if (free_pages_prepare(page, order))
+ return;
+
local_irq_save(flags);
if (unlikely(wasMlocked))
free_page_mlock(page);
@@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
int migratetype;
int wasMlocked = __TestClearPageMlocked(page);
- trace_mm_page_free_direct(page, 0);
- kmemcheck_free_shadow(page, 0);
-
- if (PageAnon(page))
- page->mapping = NULL;
- if (free_pages_check(page))
+ if (free_pages_prepare(page, 0))
return;
- if (!PageHighMem(page)) {
- debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
- debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
- }
- arch_free_page(page, 0);
- kernel_map_pages(page, 1, 0);
-
migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);
local_irq_save(flags);
--
1.6.5.2
Now, vmscan uses __pagevec_free() for batch freeing, but a pagevec
consumes a fair amount of stack (sizeof(long)*8), and the x86_64 stack
is very strictly limited.
So we now plan to use a page->lru list instead of a pagevec to reduce
stack usage, and this patch introduces a new helper function.
It is similar to __pagevec_free(), but it takes a list instead of a
pagevec and it doesn't use the pcp cache, which is a good characteristic
for vmscan.
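A userspace sketch of the batching visible in the helper below (a pthread
mutex standing in for zone->lock; illustrative only): the whole list is
freed under a single lock acquisition rather than taking the lock once
per page.
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long nr_free;

/* one lock round trip per page, as when freeing pages individually */
static void free_one_at_a_time(int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		pthread_mutex_lock(&zone_lock);
		nr_free++;
		pthread_mutex_unlock(&zone_lock);
	}
}

/* one lock round trip for the whole batch, as free_pages_bulk() does */
static void free_bulk(int nr)
{
	int i;

	pthread_mutex_lock(&zone_lock);
	for (i = 0; i < nr; i++)
		nr_free++;
	pthread_mutex_unlock(&zone_lock);
}

int main(void)
{
	free_one_at_a_time(32);
	free_bulk(32);
	printf("freed %lu pages\n", nr_free);
	return 0;
}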
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/gfp.h | 1 +
mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..dbcac56 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -332,6 +332,7 @@ extern void free_hot_cold_page(struct page *page, int cold);
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr),0)
+void free_pages_bulk(struct zone *zone, struct list_head *list);
void page_alloc_init(void);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba9aea7..1f68832 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2049,6 +2049,50 @@ void free_pages(unsigned long addr, unsigned int order)
EXPORT_SYMBOL(free_pages);
+/*
+ * Frees a number of pages from the list
+ * Assumes all pages on list are in same zone and order==0.
+ *
+ * This is similar to __pagevec_free(), but receive list instead pagevec.
+ * and this don't use pcp cache. it is good characteristics for vmscan.
+ */
+void free_pages_bulk(struct zone *zone, struct list_head *list)
+{
+ unsigned long flags;
+ struct page *page;
+ struct page *page2;
+ int nr_pages = 0;
+
+ list_for_each_entry_safe(page, page2, list, lru) {
+ int wasMlocked = __TestClearPageMlocked(page);
+
+ if (free_pages_prepare(page, 0)) {
+ /* Make orphan the corrupted page. */
+ list_del(&page->lru);
+ continue;
+ }
+ if (unlikely(wasMlocked)) {
+ local_irq_save(flags);
+ free_page_mlock(page);
+ local_irq_restore(flags);
+ }
+ nr_pages++;
+ }
+
+ spin_lock_irqsave(&zone->lock, flags);
+ __count_vm_events(PGFREE, nr_pages);
+ zone->all_unreclaimable = 0;
+ zone->pages_scanned = 0;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
+
+ list_for_each_entry_safe(page, page2, list, lru) {
+ /* have to delete it as __free_one_page list manipulates */
+ list_del(&page->lru);
+ __free_one_page(page, zone, 0, page_private(page));
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
/**
* alloc_pages_exact - allocate an exact number physically-contiguous pages.
* @size: the number of bytes to allocate
--
1.6.5.2
On x86_64, sizeof(struct pagevec) is 8*16 = 128 bytes, but
sizeof(struct list_head) is 8*2 = 16 bytes. So replacing the pagevec
with a list reduces stack usage by 112 bytes.
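The arithmetic can be checked with a small userspace mock-up of the two
structures (layout matching x86_64; PAGEVEC_SIZE assumed to be 14 here):
#include <stdio.h>

struct page;				/* only pointers are stored */

struct list_head {
	struct list_head *next, *prev;
};

struct pagevec {
	unsigned long nr;
	unsigned long cold;
	struct page *pages[14];		/* PAGEVEC_SIZE */
};

int main(void)
{
	printf("sizeof(struct pagevec)   = %3zu bytes\n", sizeof(struct pagevec));
	printf("sizeof(struct list_head) = %3zu bytes\n", sizeof(struct list_head));
	printf("stack saved per user     = %3zu bytes\n",
	       sizeof(struct pagevec) - sizeof(struct list_head));
	return 0;
}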
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 22 ++++++++++++++--------
1 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4de4029..fbc26d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -93,6 +93,8 @@ struct scan_control {
unsigned long *scanned, int order, int mode,
struct zone *z, struct mem_cgroup *mem_cont,
int active, int file);
+
+ struct list_head free_batch_list;
};
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -641,13 +643,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
enum pageout_io sync_writeback)
{
LIST_HEAD(ret_pages);
- struct pagevec freed_pvec;
int pgactivate = 0;
unsigned long nr_reclaimed = 0;
cond_resched();
- pagevec_init(&freed_pvec, 1);
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -822,10 +822,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page)) {
- __pagevec_free(&freed_pvec);
- pagevec_reinit(&freed_pvec);
- }
+ list_add(&page->lru, &sc->free_batch_list);
continue;
cull_mlocked:
@@ -849,8 +846,6 @@ keep:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
list_splice(&ret_pages, page_list);
- if (pagevec_count(&freed_pvec))
- __pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1238,6 +1233,11 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
PAGEOUT_IO_SYNC);
}
+ /*
+ * Free unused pages.
+ */
+ free_pages_bulk(zone, &sc->free_batch_list);
+
local_irq_disable();
if (current_is_kswapd())
__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
@@ -1844,6 +1844,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
.nodemask = nodemask,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
return do_try_to_free_pages(zonelist, &sc);
@@ -1864,6 +1865,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
.order = 0,
.mem_cgroup = mem,
.isolate_pages = mem_cgroup_isolate_pages,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
nodemask_t nm = nodemask_of_node(nid);
@@ -1900,6 +1902,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
.nodemask = NULL, /* we don't care the placement */
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -1976,6 +1979,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.order = order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
loop_again:
total_scanned = 0;
@@ -2333,6 +2337,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.swappiness = vm_swappiness,
.order = 0,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
struct zonelist * zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
struct task_struct *p = current;
@@ -2517,6 +2522,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.swappiness = vm_swappiness,
.order = order,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
unsigned long slab_reclaimable;
--
1.6.5.2
On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > profiles we are seeing here....
> > > > > > > >
> > > > > > >
> > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > doing sync IO, then waiting on those pages.
> > > > > >
> > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > of doing page by page spatters of IO to the drive.
> > > >
> > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > making 4k io is not must for pageout. So, probably we can improve it.
> > > >
> > > >
> > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > helpers that filesystems use to do this, like:
> > > > >
> > > > > filemap_write_and_wait(page->mapping);
> > > >
> > > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> > >
> > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > to start IO on a segment of the file, use
> > > filemap_fdatawrite_range(page->mapping, start, end)....
> >
> > That does not help the stack usage issue, the caller ends up in
> > ->writepages. From an IO perspective, it'll be better from a seek point of
> > view but from a VM perspective, it may or may not be cleaning the right pages.
> > So I think this is a red herring.
>
> If you ask it to clean a bunch of pages around the one you want to
> reclaim on the LRU, there is a good chance it will also be cleaning
> pages that are near the end of the LRU or physically close by as
> well. It's not a guarantee, but for the additional IO cost of about
> 10% wall time on that IO to clean the page you need, you also get
> 1-2 orders of magnitude other pages cleaned. That sounds like a
> win any way you look at it...
>
At worst, it'll distort the LRU ordering slightly. Let's say the
file-adjacent page you clean was near the end of the LRU. Before such a
patch, it may have gotten cleaned and done another lap of the LRU.
After, it would be reclaimed sooner. I don't know if we depend on such
behaviour (very doubtful) but it's a subtle enough change. I can't
predict what it'll do for IO congestion. Simplistically, there is more
IO so it's bad but if the write pattern is less seeky and we needed to
write the pages anyway, it might be improved.
> I agree that it doesn't solve the stack problem (Chris' suggestion
> that we enable the bdi flusher interface would fix this);
I'm afraid I'm not familiar with this interface. Can you point me at
some previous discussion so that I am sure I am looking at the right
thing?
> what I'm
> pointing out is that the arguments that it is too hard or there are
> no interfaces available to issue larger IO from reclaim are not at
> all valid.
>
Sure, I'm not resisting fixing this, just your first patch :) There are four
goals here
1. Reduce stack usage
2. Avoid the splicing of subsystem stack usage with direct reclaim
3. Preserve lumpy reclaim's cleaning of contiguous pages
4. Try and not drastically alter LRU aging
1 and 2 are important for you, 3 is important for me and 4 will have to
be dealt with on a case-by-case basis.
Your patch fixes 2, avoids 1 and breaks 3. I haven't thought about 4, but I
guess dirty pages could cycle around more, so it'd need to be cared for.
> > > the deepest call chain in queue_work() needs 700 bytes of stack
> > > to complete, wait_for_completion() requires almost 2k of stack space
> > > at it's deepest, the scheduler has some heavy stack users, etc,
> > > and these are all functions that appear at the top of the stack.
> > >
> >
> > The real issue here then is that stack usage has gone out of control.
>
> That's definitely true, but it shouldn't cloud the fact that most
> ppl want to kill writeback from direct reclaim, too, so killing two
> birds with one stone seems like a good idea.
>
Ah yes, but I at least will resist killing writeback from direct
reclaim because of lumpy reclaim. Again, I recognise the seek pattern
sucks but sometimes there are specific pages we need cleaned.
> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.
>
> Then we can work towards allowing lumpy reclaim to use background
> threads as Chris suggested for doing specific writeback operations
> to solve the remaining problems being seen. Does this seem like a
> reasonable compromise and approach to dealing with the problem?
>
I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
enough or come up with an alternative fix. From the goals above it mitigates
1, mitigates 2 and addresses 3, but with respect to 4 it potentially allows
dirty pages to remain on the LRU until the background cleaner or kswapd
comes along.
One reason why I am edgy about this is that lumpy reclaim can kick in
for low-enough orders too, like order-1 pages for stacks in some cases or
order-2 pages for network cards using jumbo frames or some wireless
cards. The network cards in particular could still cause the stack
overflow but be much harder to reproduce and detect.
> > Disabling ->writepage in direct reclaim does not guarantee that stack
> > usage will not be a problem again. From your traces, page reclaim itself
> > seems to be a big dirty hog.
>
> I couldn't agree more - the kernel still needs to be put on a stack
> usage diet, but the above would give use some breathing space to attack the
> problem before more people start to hit these problems.
>
I'd like stack reduction to be plan a because it buys time without
confining the problem to lumpy reclaim, where it can still hit but is
harder to reproduce.
> > > Good start, but 512 bytes will only catch select and splice read,
> > > and there are 300-400 byte functions in the above list that sit near
> > > the top of the stack....
> > >
> >
> > They will need to be tackled in turn then but obviously there should be
> > a focus on the common paths. The reclaim paths do seem particularly
> > heavy and it's down to a lot of temporary variables. I might not get the
> > time today but what I'm going to try do some time this week is
> >
> > o Look at what temporary variables are copies of other pieces of information
> > o See what variables live for the duration of reclaim but are not needed
> > for all of it (i.e. uninline parts of it so variables do not persist)
> > o See if it's possible to dynamically allocate scan_control
>
> Welcome to my world ;)
>
It's not like the brochure at all :)
> > The last one is the trickiest. Basically, the idea would be to move as much
> > into scan_control as possible. Then, instead of allocating it on the stack,
> > allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> > a semaphore. Limit the number of direct reclaimers that can be active at a
> > time to the number of scan_control variables. kswapd could still allocate
> > its on the stack or with kmalloc.
> >
> > If it works out, it would have two main benefits. Limits the number of
> > processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> > reclaim, there is too much going on. It would also shrink the stack usage
> > particularly if some of the stack variables are moved into scan_control.
> >
> > Maybe someone will beat me to looking at the feasibility of this.
>
> I like the idea - it really sounds like you want a fixed size,
> preallocated mempool that can't be enlarged.
Yep. It would cut down around 1K of stack usage when direct reclaim gets
involved. The "downside" would be a limit on the number of direct
reclaimers that can exist at any given time, but that could be a positive in
some cases.
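To make that concrete, here is a rough sketch of what such a boot-time pool
might look like. The names (sc_pool, sc_used, sc_sem, get_scan_control,
put_scan_control) and the exact locking are invented for illustration; this
is not a proposal-quality patch.
/*
 * Rough sketch only: a boot-time pool of scan_control structures,
 * guarded by a counting semaphore so that at most NR_CPUS direct
 * reclaimers are active and none of them carries the struct on its
 * stack.  All names below are made up.
 */
static struct scan_control sc_pool[NR_CPUS];
static DECLARE_BITMAP(sc_used, NR_CPUS);
static DEFINE_SPINLOCK(sc_pool_lock);
static struct semaphore sc_sem;

static int __init sc_pool_init(void)
{
	sema_init(&sc_sem, NR_CPUS);
	return 0;
}

static struct scan_control *get_scan_control(void)
{
	int i;

	down(&sc_sem);			/* throttles direct reclaimers */
	spin_lock(&sc_pool_lock);
	/* holding the semaphore guarantees a free slot exists */
	i = find_first_zero_bit(sc_used, NR_CPUS);
	__set_bit(i, sc_used);
	spin_unlock(&sc_pool_lock);

	memset(&sc_pool[i], 0, sizeof(sc_pool[i]));
	return &sc_pool[i];
}

static void put_scan_control(struct scan_control *sc)
{
	spin_lock(&sc_pool_lock);
	__clear_bit(sc - sc_pool, sc_used);
	spin_unlock(&sc_pool_lock);
	up(&sc_sem);
}
A real version would have to let kswapd keep its own scan_control and avoid
sleeping on the semaphore for callers that cannot block, but it shows roughly
where the ~1K saving would come from.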
> In fact, I can probably
> use something like this in XFS to save a couple of hundred bytes of
> stack space in the worst hogs....
>
> > > > > This is the sort of thing I'm pointing at when I say that stack
> > > > > usage outside XFS has grown significantly significantly over the
> > > > > past couple of years. Given XFS has remained pretty much the same or
> > > > > even reduced slightly over the same time period, blaming XFS or
> > > > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > > > Regardless of the IO pattern performance issues, writeback via
> > > > > direct reclaim just uses too much stack to be safe these days...
> > > >
> > > > Yeah, My answer is simple, All stack eater should be fixed.
> > > > but XFS seems not innocence too. 3.5K is enough big although
> > > > xfs have use such amount since very ago.
> > >
> > > XFS used to use much more than that - significant effort has been
> > > put into reduce the stack footprint over many years. There's not
> > > much left to trim without rewriting half the filesystem...
> >
> > I don't think he is levelling a complain at XFS in particular - just pointing
> > out that it's heavy too. Still, we should be gratful that XFS is sort of
> > a "Stack Canary". If it dies, everyone else could be in trouble soon :)
>
> Yeah, true. Sorry if I'm being a bit too defensive here - the scars
> from previous discussions like this are showing through....
>
I guessed :)
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
> Cc to Johannes
>
> > >
> > > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> > >
> > > > Now, vmscan pageout() is one of IO throuput degression source.
> > > > Some IO workload makes very much order-0 allocation and reclaim
> > > > and pageout's 4K IOs are making annoying lots seeks.
> > > >
> > > > At least, kswapd can avoid such pageout() because kswapd don't
> > > > need to consider OOM-Killer situation. that's no risk.
> > > >
> > > > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > >
> > > What's your opinion on trying to cluster the writes done by pageout,
> > > instead of not doing any paging out in kswapd?
> > > Something along these lines:
> >
> > Interesting.
> > So, I'd like to review your patch carefully. can you please give me one
> > day? :)
>
> Hannes, if my remember is correct, you tried similar swap-cluster IO
> long time ago. now I can't remember why we didn't merged such patch.
> Do you remember anything?
Oh, quite vividly in fact :) For a lot of swap loads the LRU order
diverged heavily from swap slot order and readaround was a waste of
time.
Of course, the patch looked good, too, but it did not match reality
that well.
I guess 'how about this patch?' won't get us as far as 'how about
those numbers/graphs of several real-life workloads? oh and here
is the patch...'.
> > > Cluster writes to disk due to memory pressure.
> > >
> > > Write out logically adjacent pages to the one we're paging out
> > > so that we may get better IOs in these situations:
> > > These pages are likely to be contiguous on disk to the one we're
> > > writing out, so they should get merged into a single disk IO.
> > >
> > > Signed-off-by: Suleiman Souhlal <[email protected]>
For random IO, LRU order will have nothing to do with mapping/disk order.
On Thu, Apr 15, 2010 at 01:11:37PM +0900, KOSAKI Motohiro wrote:
> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
>
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.
>
Well, there is some risk here. Direct reclaimers may now be cleaning
more pages than they had to previously, which splices subsystems
together, increasing stack usage and causing further problems.
It might not cause OOM-killer issues but it could increase the time
dirty pages spend on the LRU.
Am I missing something?
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..d392a50 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
> if (referenced_page)
> return PAGEREF_RECLAIM_CLEAN;
>
> + /*
> + * Delegate pageout IO to flusher thread. They can make more
> + * effective IO pattern.
> + */
> + if (current_is_kswapd())
> + return PAGEREF_RECLAIM_CLEAN;
> +
> return PAGEREF_RECLAIM;
> }
>
> --
> 1.6.5.2
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
> On Thu, Apr 15, 2010 at 01:11:37PM +0900, KOSAKI Motohiro wrote:
> > Now, vmscan pageout() is one of IO throuput degression source.
> > Some IO workload makes very much order-0 allocation and reclaim
> > and pageout's 4K IOs are making annoying lots seeks.
> >
> > At least, kswapd can avoid such pageout() because kswapd don't
> > need to consider OOM-Killer situation. that's no risk.
> >
>
> Well, there is some risk here. Direct reclaimers may not be cleaning
> more pages than it had to previously except it splices subsystems
> together increasing stack usage and causing further problems.
>
> It might not cause OOM-killer issues but it could increase the time
> dirty pages spend on the LRU.
>
> Am I missing something?
No, you are right. I fully agree with your previous mail, so I need to cool down a bit ;)
On Thu, Apr 15, 2010 at 07:23:04PM +0900, KOSAKI Motohiro wrote:
> Now, max_scan of shrink_inactive_list() is always passed less than
> SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it.
> This patch also help stack diet.
>
Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
stack-o-meter) and got the following. The prereq patches are from
earlier in the thread with the subjects
vmscan: kill prev_priority completely
vmscan: move priority variable into scan_control
It reports:
$ stack-o-meter vmlinux-vanilla vmlinux-1-2patchprereq
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-72 (-72)
function old new delta
kswapd 748 676 -72
and with this patch on top
$ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
function old new delta
shrink_zone 1232 1160 -72
kswapd 748 676 -72
X86-32 based config.
> detail
> - remove "while (nr_scanned < max_scan)" loop
> - remove nr_freed (now, we use nr_reclaimed directly)
> - remove nr_scan (now, we use nr_scanned directly)
> - rename max_scan to nr_to_scan
> - pass nr_to_scan into isolate_pages() directly instead
> using SWAP_CLUSTER_MAX
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
I couldn't spot any problems. I'd consider throwing a
WARN_ON(nr_to_scan > SWAP_CLUSTER_MAX) in case some future change breaks
the assumptions, but otherwise it looks fine.
Acked-by: Mel Gorman <[email protected]>
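For clarity, the suggested guard would amount to something like the following
at the top of the reworked function (a sketch only, not part of the patch
below; the body is elided):
static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
					  struct zone *zone,
					  struct scan_control *sc,
					  int file)
{
	/* the single-pass rework assumes nobody asks for more than this */
	WARN_ON(nr_to_scan > SWAP_CLUSTER_MAX);

	/* ... rest of the function as in the patch below ... */
}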
> ---
> mm/vmscan.c | 190 ++++++++++++++++++++++++++++-------------------------------
> 1 files changed, 89 insertions(+), 101 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eab6028..4de4029 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
> * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
> * of reclaimed pages
> */
> -static unsigned long shrink_inactive_list(unsigned long max_scan,
> +static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone, struct scan_control *sc,
> int file)
> {
> LIST_HEAD(page_list);
> struct pagevec pvec;
> - unsigned long nr_scanned = 0;
> + unsigned long nr_scanned;
> unsigned long nr_reclaimed = 0;
> struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> int lumpy_reclaim = 0;
> + struct page *page;
> + unsigned long nr_taken;
> + unsigned long nr_active;
> + unsigned int count[NR_LRU_LISTS] = { 0, };
> + unsigned long nr_anon;
> + unsigned long nr_file;
>
> while (unlikely(too_many_isolated(zone, file, sc))) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
>
> lru_add_drain();
> spin_lock_irq(&zone->lru_lock);
> - do {
> - struct page *page;
> - unsigned long nr_taken;
> - unsigned long nr_scan;
> - unsigned long nr_freed;
> - unsigned long nr_active;
> - unsigned int count[NR_LRU_LISTS] = { 0, };
> - int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
> - unsigned long nr_anon;
> - unsigned long nr_file;
> -
> - nr_taken = sc->isolate_pages(SWAP_CLUSTER_MAX,
> - &page_list, &nr_scan, sc->order, mode,
> - zone, sc->mem_cgroup, 0, file);
> + nr_taken = sc->isolate_pages(nr_to_scan,
> + &page_list, &nr_scanned, sc->order,
> + lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE,
> + zone, sc->mem_cgroup, 0, file);
>
> - if (scanning_global_lru(sc)) {
> - zone->pages_scanned += nr_scan;
> - if (current_is_kswapd())
> - __count_zone_vm_events(PGSCAN_KSWAPD, zone,
> - nr_scan);
> - else
> - __count_zone_vm_events(PGSCAN_DIRECT, zone,
> - nr_scan);
> - }
> + if (scanning_global_lru(sc)) {
> + zone->pages_scanned += nr_scanned;
> + if (current_is_kswapd())
> + __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> + else
> + __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> + }
>
> - if (nr_taken == 0)
> - goto done;
> + if (nr_taken == 0)
> + goto done;
>
> - nr_active = clear_active_flags(&page_list, count);
> - __count_vm_events(PGDEACTIVATE, nr_active);
> + nr_active = clear_active_flags(&page_list, count);
> + __count_vm_events(PGDEACTIVATE, nr_active);
>
> - __mod_zone_page_state(zone, NR_ACTIVE_FILE,
> - -count[LRU_ACTIVE_FILE]);
> - __mod_zone_page_state(zone, NR_INACTIVE_FILE,
> - -count[LRU_INACTIVE_FILE]);
> - __mod_zone_page_state(zone, NR_ACTIVE_ANON,
> - -count[LRU_ACTIVE_ANON]);
> - __mod_zone_page_state(zone, NR_INACTIVE_ANON,
> - -count[LRU_INACTIVE_ANON]);
> + __mod_zone_page_state(zone, NR_ACTIVE_FILE,
> + -count[LRU_ACTIVE_FILE]);
> + __mod_zone_page_state(zone, NR_INACTIVE_FILE,
> + -count[LRU_INACTIVE_FILE]);
> + __mod_zone_page_state(zone, NR_ACTIVE_ANON,
> + -count[LRU_ACTIVE_ANON]);
> + __mod_zone_page_state(zone, NR_INACTIVE_ANON,
> + -count[LRU_INACTIVE_ANON]);
>
> - nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> - nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> - __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> - __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> + nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> + nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> + __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> + __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
>
> - reclaim_stat->recent_scanned[0] += nr_anon;
> - reclaim_stat->recent_scanned[1] += nr_file;
> + reclaim_stat->recent_scanned[0] += nr_anon;
> + reclaim_stat->recent_scanned[1] += nr_file;
>
> - spin_unlock_irq(&zone->lru_lock);
> + spin_unlock_irq(&zone->lru_lock);
>
> - nr_scanned += nr_scan;
> - nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> +
> + /*
> + * If we are direct reclaiming for contiguous pages and we do
> + * not reclaim everything in the list, try again and wait
> + * for IO to complete. This will stall high-order allocations
> + * but that should be acceptable to the caller
> + */
> + if (nr_reclaimed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
> - * If we are direct reclaiming for contiguous pages and we do
> - * not reclaim everything in the list, try again and wait
> - * for IO to complete. This will stall high-order allocations
> - * but that should be acceptable to the caller
> + * The attempt at page out may have made some
> + * of the pages active, mark them inactive again.
> */
> - if (nr_freed < nr_taken && !current_is_kswapd() &&
> - lumpy_reclaim) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> - /*
> - * The attempt at page out may have made some
> - * of the pages active, mark them inactive again.
> - */
> - nr_active = clear_active_flags(&page_list, count);
> - count_vm_events(PGDEACTIVATE, nr_active);
> -
> - nr_freed += shrink_page_list(&page_list, sc,
> - PAGEOUT_IO_SYNC);
> - }
> + nr_active = clear_active_flags(&page_list, count);
> + count_vm_events(PGDEACTIVATE, nr_active);
>
> - nr_reclaimed += nr_freed;
> + nr_reclaimed += shrink_page_list(&page_list, sc,
> + PAGEOUT_IO_SYNC);
> + }
>
> - local_irq_disable();
> - if (current_is_kswapd())
> - __count_vm_events(KSWAPD_STEAL, nr_freed);
> - __count_zone_vm_events(PGSTEAL, zone, nr_freed);
> + local_irq_disable();
> + if (current_is_kswapd())
> + __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> + __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>
> - spin_lock(&zone->lru_lock);
> - /*
> - * Put back any unfreeable pages.
> - */
> - while (!list_empty(&page_list)) {
> - int lru;
> - page = lru_to_page(&page_list);
> - VM_BUG_ON(PageLRU(page));
> - list_del(&page->lru);
> - if (unlikely(!page_evictable(page, NULL))) {
> - spin_unlock_irq(&zone->lru_lock);
> - putback_lru_page(page);
> - spin_lock_irq(&zone->lru_lock);
> - continue;
> - }
> - SetPageLRU(page);
> - lru = page_lru(page);
> - add_page_to_lru_list(zone, page, lru);
> - if (is_active_lru(lru)) {
> - int file = is_file_lru(lru);
> - reclaim_stat->recent_rotated[file]++;
> - }
> - if (!pagevec_add(&pvec, page)) {
> - spin_unlock_irq(&zone->lru_lock);
> - __pagevec_release(&pvec);
> - spin_lock_irq(&zone->lru_lock);
> - }
> + spin_lock(&zone->lru_lock);
> + /*
> + * Put back any unfreeable pages.
> + */
> + while (!list_empty(&page_list)) {
> + int lru;
> + page = lru_to_page(&page_list);
> + VM_BUG_ON(PageLRU(page));
> + list_del(&page->lru);
> + if (unlikely(!page_evictable(page, NULL))) {
> + spin_unlock_irq(&zone->lru_lock);
> + putback_lru_page(page);
> + spin_lock_irq(&zone->lru_lock);
> + continue;
> }
> - __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> - __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
> -
> - } while (nr_scanned < max_scan);
> + SetPageLRU(page);
> + lru = page_lru(page);
> + add_page_to_lru_list(zone, page, lru);
> + if (is_active_lru(lru)) {
> + int file = is_file_lru(lru);
> + reclaim_stat->recent_rotated[file]++;
> + }
> + if (!pagevec_add(&pvec, page)) {
> + spin_unlock_irq(&zone->lru_lock);
> + __pagevec_release(&pvec);
> + spin_lock_irq(&zone->lru_lock);
> + }
> + }
> + __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> + __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
>
> done:
> spin_unlock_irq(&zone->lru_lock);
> --
> 1.6.5.2
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Apr 15, 2010 at 07:24:05PM +0900, KOSAKI Motohiro wrote:
> This patch is used from [3/4]
>
> ===================================
> Free_hot_cold_page() and __free_pages_ok() have very similar
> freeing preparation. This patch consolidates them.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/page_alloc.c | 40 +++++++++++++++++++++-------------------
> 1 files changed, 21 insertions(+), 19 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 88513c0..ba9aea7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> spin_unlock(&zone->lock);
> }
>
> -static void __free_pages_ok(struct page *page, unsigned int order)
> +static int free_pages_prepare(struct page *page, unsigned int order)
> {
You don't appear to do anything with the return value. bool? Otherwise I
see no problems
Acked-by: Mel Gorman <[email protected]>
> - unsigned long flags;
> int i;
> int bad = 0;
> - int wasMlocked = __TestClearPageMlocked(page);
>
> trace_mm_page_free_direct(page, order);
> kmemcheck_free_shadow(page, order);
>
> - for (i = 0 ; i < (1 << order) ; ++i)
> - bad += free_pages_check(page + i);
> + for (i = 0 ; i < (1 << order) ; ++i) {
> + struct page *pg = page + i;
> +
> + if (PageAnon(pg))
> + pg->mapping = NULL;
> + bad += free_pages_check(pg);
> + }
> if (bad)
> - return;
> + return -EINVAL;
>
> if (!PageHighMem(page)) {
> debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
> @@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> arch_free_page(page, order);
> kernel_map_pages(page, 1 << order, 0);
>
> + return 0;
> +}
> +
> +static void __free_pages_ok(struct page *page, unsigned int order)
> +{
> + unsigned long flags;
> + int wasMlocked = __TestClearPageMlocked(page);
> +
> + if (free_pages_prepare(page, order))
> + return;
> +
> local_irq_save(flags);
> if (unlikely(wasMlocked))
> free_page_mlock(page);
> @@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
> int migratetype;
> int wasMlocked = __TestClearPageMlocked(page);
>
> - trace_mm_page_free_direct(page, 0);
> - kmemcheck_free_shadow(page, 0);
> -
> - if (PageAnon(page))
> - page->mapping = NULL;
> - if (free_pages_check(page))
> + if (free_pages_prepare(page, 0))
> return;
>
> - if (!PageHighMem(page)) {
> - debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
> - debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
> - }
> - arch_free_page(page, 0);
> - kernel_map_pages(page, 1, 0);
> -
> migratetype = get_pageblock_migratetype(page);
> set_page_private(page, migratetype);
> local_irq_save(flags);
> --
> 1.6.5.2
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
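To spell out the "bool?" remark above, the change would look roughly like
this (return points only; the body stays as in the quoted patch, and callers
would test !free_pages_prepare() instead of a non-zero errno they never
propagate):
static bool free_pages_prepare(struct page *page, unsigned int order)
{
	int bad = 0;

	/* ... per-page checks exactly as in the patch, accumulating 'bad' ... */

	if (bad)
		return false;	/* caller skips the actual free */

	/* ... debug checks, arch_free_page(), kernel_map_pages() ... */
	return true;
}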
On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > > profiles we are seeing here....
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > > doing sync IO, then waiting on those pages.
> > > > > > >
> > > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > > of doing page by page spatters of IO to the drive.
> > > > >
> > > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > > making 4k io is not must for pageout. So, probably we can improve it.
> > > > >
> > > > >
> > > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > > helpers that filesystems use to do this, like:
> > > > > >
> > > > > > filemap_write_and_wait(page->mapping);
> > > > >
> > > > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> > > >
> > > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > > to start IO on a segment of the file, use
> > > > filemap_fdatawrite_range(page->mapping, start, end)....
> > >
> > > That does not help the stack usage issue, the caller ends up in
> > > ->writepages. From an IO perspective, it'll be better from a seek point of
> > > view but from a VM perspective, it may or may not be cleaning the right pages.
> > > So I think this is a red herring.
> >
> > If you ask it to clean a bunch of pages around the one you want to
> > reclaim on the LRU, there is a good chance it will also be cleaning
> > pages that are near the end of the LRU or physically close by as
> > well. It's not a guarantee, but for the additional IO cost of about
> > 10% wall time on that IO to clean the page you need, you also get
> > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > win any way you look at it...
> >
>
> At worst, it'll distort the LRU ordering slightly. Lets say the the
> file-adjacent-page you clean was near the end of the LRU. Before such a
> patch, it may have gotten cleaned and done another lap of the LRU.
> After, it would be reclaimed sooner. I don't know if we depend on such
> behaviour (very doubtful) but it's a subtle enough change. I can't
> predict what it'll do for IO congestion. Simplistically, there is more
> IO so it's bad but if the write pattern is less seeky and we needed to
> write the pages anyway, it might be improved.
>
> > I agree that it doesn't solve the stack problem (Chris' suggestion
> > that we enable the bdi flusher interface would fix this);
>
> I'm afraid I'm not familiar with this interface. Can you point me at
> some previous discussion so that I am sure I am looking at the right
> thing?
vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
real code needs to go....just look for the ~ marks.
I mostly meant that the bdi helper threads were the best place to add
knowledge about which pages we want to write for reclaim. We might need
to add a thread dedicated to just doing the VM's dirty work, but that's
where I would start discussing fancy new interfaces.
>
> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> >
>
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
>
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaims cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
>
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.
>
> Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> guess dirty pages can cycle around more so it'd need to be cared for.
I'd like to add one more:
5. Don't dive into filesystem locks during reclaim.
This is different from splicing code paths together, but
the filesystem writepage code has become the center of our attempts at
doing big fat contiguous writes on disk. We push off work as late as we
can until just before the pages go down to disk.
I'll pick on ext4 and btrfs for a minute, just to broaden the scope
outside of XFS. Writepage comes along and the filesystem needs to
actually find blocks on disk for all the dirty pages it has promised to
write.
So, we start a transaction, we take various allocator locks, modify
different metadata, log changed blocks, take a break (logging is hard
work you know, need_resched() triggered by now), stuff it
all into the file's metadata, log that, and finally return.
Each of the steps above can block for a long time. Ext4 solves
this by not doing them. ext4_writepage only writes pages that
are already fully allocated on disk.
Btrfs is much more efficient at not doing them, it just returns right
away for PF_MEMALLOC.
This is a long way of saying the filesystem writepage code is the
opposite of what direct reclaim wants. Direct reclaim wants to
find free RAM now, and if it does end up in the mess described above,
it'll just get stuck for a long time on work entirely unrelated to
finding free pages.
-chris
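For readers unfamiliar with the pattern Chris describes, the btrfs-style
early return looks roughly like this (a sketch, not actual btrfs or ext4
code; example_writepage is a made-up name):
static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	/*
	 * Called from direct reclaim (PF_MEMALLOC): refuse to start
	 * transactions or take allocator locks here.  Put the page back
	 * on the dirty list and let the flusher threads write it later.
	 */
	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	/* normal path: allocate blocks, log metadata, issue the IO ... */
	return 0;
}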
On Thu, Apr 15, 2010 at 07:24:53PM +0900, KOSAKI Motohiro wrote:
> Now, vmscan is using __pagevec_free() for batch freeing. but
> pagevec consume slightly lots stack (sizeof(long)*8), and x86_64
> stack is very strictly limited.
>
> Then, now we are planning to use page->lru list instead pagevec
> for reducing stack. and introduce new helper function.
>
> This is similar to __pagevec_free(), but receive list instead
> pagevec. and this don't use pcp cache. it is good characteristics
> for vmscan.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> include/linux/gfp.h | 1 +
> mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 45 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4c6d413..dbcac56 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -332,6 +332,7 @@ extern void free_hot_cold_page(struct page *page, int cold);
> #define __free_page(page) __free_pages((page), 0)
> #define free_page(addr) free_pages((addr),0)
>
> +void free_pages_bulk(struct zone *zone, struct list_head *list);
> void page_alloc_init(void);
> void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
> void drain_all_pages(void);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ba9aea7..1f68832 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2049,6 +2049,50 @@ void free_pages(unsigned long addr, unsigned int order)
>
> EXPORT_SYMBOL(free_pages);
>
> +/*
> + * Frees a number of pages from the list
> + * Assumes all pages on list are in same zone and order==0.
> + *
> + * This is similar to __pagevec_free(), but receive list instead pagevec.
> + * and this don't use pcp cache. it is good characteristics for vmscan.
> + */
> +void free_pages_bulk(struct zone *zone, struct list_head *list)
> +{
> + unsigned long flags;
> + struct page *page;
> + struct page *page2;
> + int nr_pages = 0;
> +
> + list_for_each_entry_safe(page, page2, list, lru) {
> + int wasMlocked = __TestClearPageMlocked(page);
> +
> + if (free_pages_prepare(page, 0)) {
> + /* Make orphan the corrupted page. */
> + list_del(&page->lru);
> + continue;
> + }
> + if (unlikely(wasMlocked)) {
> + local_irq_save(flags);
> + free_page_mlock(page);
> + local_irq_restore(flags);
> + }
You could clear this under the zone->lock below before calling
__free_one_page. It'd avoid a large number of IRQ enables and disables which
are a problem on some CPUs (P4 and Itanium both blow in this regard according
to PeterZ).
> + nr_pages++;
> + }
> +
> + spin_lock_irqsave(&zone->lock, flags);
> + __count_vm_events(PGFREE, nr_pages);
> + zone->all_unreclaimable = 0;
> + zone->pages_scanned = 0;
> + __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
> +
> + list_for_each_entry_safe(page, page2, list, lru) {
> + /* have to delete it as __free_one_page list manipulates */
> + list_del(&page->lru);
> + __free_one_page(page, zone, 0, page_private(page));
> + }
This has the effect of bypassing the per-cpu lists as well as making the
zone lock hotter. The cache hotness of the data within the page is
probably not a factor, but the cache hotness of the struct page is.
The zone lock getting hotter is a greater problem. Large amounts of page
reclaim or dumping of page cache will now contend on the zone lock,
whereas previously it would have dumped into the per-cpu lists (potentially
but not necessarily avoiding the zone lock).
While there might be a stack saving in the next patch, there would appear
to be definite performance implications in taking this patch.
Functionally, I see no problem but I'd put this sort of patch on the
very long finger until the performance aspects of it could be examined.
> + spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> /**
> * alloc_pages_exact - allocate an exact number physically-contiguous pages.
> * @size: the number of bytes to allocate
> --
> 1.6.5.2
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
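On the IRQ enable/disable point above, one way to get that effect is to keep
the __TestClearPageMlocked() where it is (the flag is, as far as I can see,
among those free_pages_check() complains about, so it has to be cleared before
free_pages_prepare()) and batch the accounting free_page_mlock() would have
done into the zone->lock section. A rough, untested sketch; nr_mlocked is a
made-up local:
void free_pages_bulk(struct zone *zone, struct list_head *list)
{
	unsigned long flags;
	struct page *page;
	struct page *page2;
	int nr_pages = 0;
	int nr_mlocked = 0;

	list_for_each_entry_safe(page, page2, list, lru) {
		int wasMlocked = __TestClearPageMlocked(page);

		if (free_pages_prepare(page, 0)) {
			/* Make orphan the corrupted page. */
			list_del(&page->lru);
			continue;
		}
		if (unlikely(wasMlocked))
			nr_mlocked++;
		nr_pages++;
	}

	spin_lock_irqsave(&zone->lock, flags);
	/* what free_page_mlock() did per page, done once for the batch */
	if (nr_mlocked) {
		__mod_zone_page_state(zone, NR_MLOCK, -nr_mlocked);
		__count_vm_events(UNEVICTABLE_MLOCKFREED, nr_mlocked);
	}
	__count_vm_events(PGFREE, nr_pages);
	zone->all_unreclaimable = 0;
	zone->pages_scanned = 0;
	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

	list_for_each_entry_safe(page, page2, list, lru) {
		/* have to delete it as __free_one_page list manipulates */
		list_del(&page->lru);
		__free_one_page(page, zone, 0, page_private(page));
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}
This relies on the function's stated assumption that all pages on the list
are from the same zone; it doesn't address the per-cpu list bypass.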
Dave Chinner <[email protected]> writes:
>
> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
And also stop it always with 4K stacks.
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.
-Andi
--
[email protected] -- Speaking for myself only.
Mel Gorman <[email protected]> writes:
>
> $ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
> function old new delta
> shrink_zone 1232 1160 -72
> kswapd 748 676 -72
And the next time someone adds a new feature to these code paths or
the compiler inlines differently these 72 bytes are easily there
again. It's not really a long term solution. Code is tending to get
more complicated all the time. I consider it unlikely this trend will
stop any time soon.
So just doing some stack micro optimizations doesn't really help
all that much.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Apr 15, 2010 at 05:01:36PM +0200, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
> >
> > $ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink
> > add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
> > function old new delta
> > shrink_zone 1232 1160 -72
> > kswapd 748 676 -72
>
> And the next time someone adds a new feature to these code paths or
> the compiler inlines differently these 72 bytes are easily there
> again. It's not really a long term solution. Code is tending to get
> more complicated all the time. I consider it unlikely this trend will
> stop any time soon.
>
The same logic applies when/if page writeback is split so that it is
handled by a separate thread.
> So just doing some stack micro optimizations doesn't really help
> all that much.
>
It's a buying-time venture, I'll agree, but as both approaches are only
about reducing stack usage they wouldn't be long-term solutions by your
criteria. What do you suggest?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
> It's a buying-time venture, I'll agree but as both approaches are only
> about reducing stack stack they wouldn't be long-term solutions by your
> criteria. What do you suggest?
(from easy to more complicated):
- Disable direct reclaim with 4K stacks
- Do direct reclaim only on separate stacks
- Add interrupt stacks to any 8K stack architectures.
- Get rid of 4K stacks completely
- Think about any other stackings that could give large scale recursion
and find ways to run them on separate stacks too.
- Long term: maybe we need 16K stacks at some point, depending on how
good the VM gets. Alternative would be to stop making Linux more complicated,
but that's unlikely to happen.
-Andi
--
[email protected] -- Speaking for myself only.
On Apr 15, 2010, at 3:30 AM, Johannes Weiner wrote:
> On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
>>
>> Hannes, if my remember is correct, you tried similar swap-cluster IO
>> long time ago. now I can't remember why we didn't merged such patch.
>> Do you remember anything?
>
> Oh, quite vividly in fact :) For a lot of swap loads the LRU order
> diverged heavily from swap slot order and readaround was a waste of
> time.
>
> Of course, the patch looked good, too, but it did not match reality
> that well.
>
> I guess 'how about this patch?' won't get us as far as 'how about
> those numbers/graphs of several real-life workloads? oh and here
> is the patch...'.
>
>>>> Cluster writes to disk due to memory pressure.
>>>>
>>>> Write out logically adjacent pages to the one we're paging out
>>>> so that we may get better IOs in these situations:
>>>> These pages are likely to be contiguous on disk to the one
>>>> we're
>>>> writing out, so they should get merged into a single disk IO.
>>>>
>>>> Signed-off-by: Suleiman Souhlal <[email protected]>
>
> For random IO, LRU order will have nothing to do with mapping/disk
> order.
Right, that's why the patch writes out contiguous pages in mapping order.
If they are contiguous on disk with the original page, then writing them
out as well should be essentially free (when it comes to disk time).
There is almost no waste of memory regardless of the access patterns, as
far as I can tell.
This patch is just a proof of concept and could be improved by getting
help from the filesystem/swap code to ensure that the additional pages
we're writing out really are contiguous with the original one.
-- Suleiman
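As a very rough illustration of the idea (not the actual proof-of-concept
patch): when pageout() writes one page, a helper could try to push a handful
of dirty neighbours from the same mapping. The helper name and the
CLUSTER_PAGES constant below are invented, and a real version would need to
recheck page->mapping after locking and honour the ->writepage return codes:
#define CLUSTER_PAGES	32	/* made-up tunable */

static void pageout_cluster(struct address_space *mapping, pgoff_t index,
			    struct writeback_control *wbc)
{
	pgoff_t i;

	for (i = index + 1; i < index + CLUSTER_PAGES; i++) {
		struct page *p = find_get_page(mapping, i);

		if (!p)
			break;	/* hole in the file, stop clustering */

		if (PageDirty(p) && !PageWriteback(p) && trylock_page(p)) {
			if (clear_page_dirty_for_io(p))
				/* ->writepage unlocks the page */
				mapping->a_ops->writepage(p, wbc);
			else
				unlock_page(p);
		}
		page_cache_release(p);
	}
}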
On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>>
>> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>>
>>> Now, vmscan pageout() is one of IO throuput degression source.
>>> Some IO workload makes very much order-0 allocation and reclaim
>>> and pageout's 4K IOs are making annoying lots seeks.
>>>
>>> At least, kswapd can avoid such pageout() because kswapd don't
>>> need to consider OOM-Killer situation. that's no risk.
>>>
>>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>>
>> What's your opinion on trying to cluster the writes done by pageout,
>> instead of not doing any paging out in kswapd?
>
> XFS already does this in ->writepage to try to minimise the impact
> of the way pageout issues IO. It helps, but it is still not as good
> as having all the writeback come from the flusher threads because
> it's still pretty much random IO.
Doesn't the randomness become irrelevant if you can cluster enough
pages?
> And, FWIW, it doesn't solve the stack usage problems, either. In
> fact, it will make them worse as write_one_page() puts another
> struct writeback_control on the stack...
Sorry, this patch was not meant to solve the stack usage problems.
-- Suleiman
On Thu, Apr 15, 2010 at 09:42:17AM -0400, Chris Mason wrote:
> I'd like to add one more:
>
> 5. Don't dive into filesystem locks during reclaim.
>
> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk. We push off work as late as we
> can until just before the pages go down to disk.
>
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS. Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
>
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() triggered a by now), stuff it
> all into the file's metadata, log that, and finally return.
>
> Each of the steps above can block for a long time. Ext4 solves
> this by not doing them. ext4_writepage only writes pages that
> are already fully allocated on disk.
>
> Btrfs is much more efficient at not doing them, it just returns right
> away for PF_MEMALLOC.
This is a real problem, BTW. One of the problems we've been fighting
inside Google is that, because ext4_writepage() refuses to write pages
that are subject to delayed allocation, it can cause the OOM killer to
get invoked.
I had thought this was because of some evil games we're playing for
container support that makes zones small, but just last night at the
LF Collaboration Summit reception, a technologist from a major
financial industry customer reported to me that they had run into the
exact same problem: they were running Oracle, which was pinning down 3
gigs of memory, and when they tried writing a very big file using ext4,
writepage() was not able to reclaim enough pages, so the kernel fell
back to invoking the OOM killer, and things got ugly in a hurry...
One of the things I was proposing internally to try as a long-term
we-gotta-fix writeback is that we need some kind of signal so that we
can do the lumpy reclaim (a) in a separate process, to avoid a lock
inversion problem and the gee-its-going-to-take-a-long-time problem
which Chris mentioned, and (b) to try to cluster I/O so that we're not
dribbling out writes to the disk in small, seeky, 4k writes, which is
really a disaster from a performance standpoint. Maybe the VM guys
don't care about this, but this sort of thing tends to get us
filesystem guys all up in a lather not just because of the really
sucky performance, but also because it tends to mean that the system
can thrash itself to death in low memory situations.
- Ted
On Thu, 15 Apr 2010 14:15:33 BST, Mel Gorman said:
> Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
> stack-o-meter) and got the following. The prereq patches are from
> earlier in the thread with the subjects
Think that's a script worth having in-tree?
On Thu, Apr 15, 2010 at 10:27:09AM -0700, Suleiman Souhlal wrote:
>
> On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:
>
> >On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
> >>
> >>On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> >>
> >>>Now, vmscan pageout() is one of IO throuput degression source.
> >>>Some IO workload makes very much order-0 allocation and reclaim
> >>>and pageout's 4K IOs are making annoying lots seeks.
> >>>
> >>>At least, kswapd can avoid such pageout() because kswapd don't
> >>>need to consider OOM-Killer situation. that's no risk.
> >>>
> >>>Signed-off-by: KOSAKI Motohiro <[email protected]>
> >>
> >>What's your opinion on trying to cluster the writes done by pageout,
> >>instead of not doing any paging out in kswapd?
> >
> >XFS already does this in ->writepage to try to minimise the impact
> >of the way pageout issues IO. It helps, but it is still not as good
> >as having all the writeback come from the flusher threads because
> >it's still pretty much random IO.
>
> Doesn't the randomness become irrelevant if you can cluster enough
> pages?
No. If you are doing full disk seeks between random chunks, then you
still lose a large amount of throughput. e.g. if the seek time is
10ms and your IO time is 10ms for each 4k page, then increasing the
size to 64k makes it 10ms seek and 12ms for the IO. We might increase
throughput but we are still limited to 100 IOs per second. We've
gone from 400kB/s to 6MB/s, but that's still an order of magnitude
short of the 100MB/s that full size IOs with little in the way of seeks
between them will achieve on the same spindle...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > It's a buying-time venture, I'll agree but as both approaches are only
> > about reducing stack stack they wouldn't be long-term solutions by your
> > criteria. What do you suggest?
>
> (from easy to more complicated):
>
> - Disable direct reclaim with 4K stacks
Just to re-iterate: we're blowing the stack with direct reclaim on
x86_64 w/ 8k stacks. The old i386/4k stack problem is a red
herring.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Apr 15, 2010 at 4:33 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Apr 15, 2010 at 10:27:09AM -0700, Suleiman Souhlal wrote:
>>
>> On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:
>>
>> >On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>> >>
>> >>On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>> >>
>> >>>Now, vmscan pageout() is one of IO throuput degression source.
>> >>>Some IO workload makes very much order-0 allocation and reclaim
>> >>>and pageout's 4K IOs are making annoying lots seeks.
>> >>>
>> >>>At least, kswapd can avoid such pageout() because kswapd don't
>> >>>need to consider OOM-Killer situation. that's no risk.
>> >>>
>> >>>Signed-off-by: KOSAKI Motohiro <[email protected]>
>> >>
>> >>What's your opinion on trying to cluster the writes done by pageout,
>> >>instead of not doing any paging out in kswapd?
>> >
>> >XFS already does this in ->writepage to try to minimise the impact
>> >of the way pageout issues IO. It helps, but it is still not as good
>> >as having all the writeback come from the flusher threads because
>> >it's still pretty much random IO.
>>
>> Doesn't the randomness become irrelevant if you can cluster enough
>> pages?
>
> No. If you are doing full disk seeks between random chunks, then you
> still lose a large amount of throughput. e.g. if the seek time is
> 10ms and your IO time is 10ms for each 4k page, then increasing the
> size ito 64k makes it 10ms seek and 12ms for the IO. We might increase
> throughput but we are still limited to 100 IOs per second. We've
> gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> short of the 100MB/s full size IOs with little in way of seeks
> between them will acheive on the same spindle...
What I meant was that, theoretically speaking, you could increase the
maximum number of pages that get clustered so that you could get
100MB/s, although it most likely wouldn't be a good idea with the
current patch.
-- Suleiman
On Tue, 13 Apr 2010 10:17:58 +1000
Dave Chinner <[email protected]> wrote:
> From: Dave Chinner <[email protected]>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
>
> Reported-by: John Berthels <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
Hmm. Then, if a memory cgroup is filled with dirty pages, it can't kick writeback
and has to wait for someone else's writeback?
How long will this take?
# mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/A
# echo 20M > /cgroup/A/memory.limit_in_bytes
# echo $$ > /cgroup/A/tasks
# dd if=/dev/zero of=./tmpfile bs=4096 count=1000000
Can memcg ask the writeback thread to "Wake Up Now! and Write this out!" effectively?
Thanks,
-Kame
> ---
> mm/vmscan.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * writeout. So in laptop mode, write out the whole world.
> */
> writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> - if (total_scanned > writeback_threshold) {
> + if (total_scanned > writeback_threshold)
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> - sc->may_writepage = 1;
> - }
>
> /* Take a nap, wait for some writeback to complete */
> if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> {
> struct scan_control sc = {
> .gfp_mask = gfp_mask,
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .may_unmap = 1,
> .may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> struct zone *zone, int nid)
> {
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> {
> struct zonelist *zonelist;
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> struct reclaim_state reclaim_state;
> int priority;
> struct scan_control sc = {
> - .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> + .may_writepage = (current_is_kswapd() &&
> + (zone_reclaim_mode & RECLAIM_WRITE)),
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = 1,
> .nr_to_reclaim = max_t(unsigned long, nr_pages,
> --
> 1.6.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > If you ask it to clean a bunch of pages around the one you want to
> > reclaim on the LRU, there is a good chance it will also be cleaning
> > pages that are near the end of the LRU or physically close by as
> > well. It's not a guarantee, but for the additional IO cost of about
> > 10% wall time on that IO to clean the page you need, you also get
> > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > win any way you look at it...
>
> At worst, it'll distort the LRU ordering slightly. Lets say the the
> file-adjacent-page you clean was near the end of the LRU. Before such a
> patch, it may have gotten cleaned and done another lap of the LRU.
> After, it would be reclaimed sooner. I don't know if we depend on such
> behaviour (very doubtful) but it's a subtle enough change. I can't
> predict what it'll do for IO congestion. Simplistically, there is more
> IO so it's bad but if the write pattern is less seeky and we needed to
> write the pages anyway, it might be improved.
Fundamentally, we have so many pages on the LRU, getting a few out
of order at the back end of it is going to be in the noise. If we
trade off "perfect" LRU behaviour for cleaning pages an order of
magnitude faster, reclaim will find candidate pages for a whole lot
faster. And if we have more clean pages available, faster, overall
system throughput is going to improve and be much less likely to
fall into deep, dark holes where the OOM-killer is the light at the
end.....
[ snip questions Chris answered ]
> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> >
>
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
>
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaims cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
>
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.
#4 is important to me, too, because it has a direct impact on large
file IO workloads. However, it is gross changes in behaviour that
concern me, not subtle, probably-in-the-noise changes that you're
concerned about. :)
> Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> guess dirty pages can cycle around more so it'd need to be cared for.
Well, you keep saying that they break #3, but I haven't seen any
test cases or results showing that. I've been unable to confirm that
lumpy reclaim is broken by disallowing writeback in my testing, so
I'm interested to know what tests you are running that show it is
broken...
> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
> >
> > Then we can work towards allowing lumpy reclaim to use background
> > threads as Chris suggested for doing specific writeback operations
> > to solve the remaining problems being seen. Does this seem like a
> > reasonable compromise and approach to dealing with the problem?
> >
>
> I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
> enough or come up with an alternative fix. From the goals above it mitigates
> 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
> the LRU with 4 until the background cleaner or kswapd comes along.
We've been through this already, but I'll repeat it again in the
hope it sinks in: reducing stack usage is not sufficient to stay
within an 8k stack if we can enter writeback with an arbitrary
amount of stack already consumed.
We've already got a report of 9k of stack usage (7200 bytes left on
an order-2 stack) and this is without a complex storage stack - it's
just a partition on a SATA drive. We can easily add another 1k,
possibly 2k to that stack depth with a complex storage subsystem.
Trimming this much (3-4k) is simply not feasible in a callchain that
is 50-70 functions deep...
> One reason why I am edgy about this is that lumpy reclaim can kick in
> for low-enough orders too like order-1 pages for stacks in some cases or
> order-2 pages for network cards using jumbo frames or some wireless
> cards. The network cards in particular could still cause the stack
> overflow but be much harder to reproduce and detect.
So push lumpy reclaim into a separate thread. It already blocks, so
waiting for some other thread to do the work won't change anything.
Separating high-order reclaim from LRU reclaim is probably a good
idea, anyway - they use different algorithms and while the two are
intertwined it's hard to optimise/improve either....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, 16 Apr 2010 10:13:39 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> Hmm. Then, if memory cgroup is filled by dirty pages, it can't kick writeback
> and has to wait for someone else's writeback ?
>
> How long this will take ?
> # mount -t cgroup none /cgroup -o memory
> # mkdir /cgroup/A
> # echo 20M > /cgroup/A/memory.limit_in_bytes
> # echo $$ > /cgroup/A/tasks
> # dd if=/dev/zero of=./tmpfile bs=4096 count=1000000
>
> Can memcg ask writeback thread to "Wake Up Now! and Write this out!" effectively ?
>
Hmm.. I saw an oom-kill while testing several cases, but performance itself
seems not to be far different with or without the patch.
But I'm unhappy with the oom-kill, so some tweak for memcg will be necessary
if we go with this.
Thanks,
-Kame
On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > It's a buying-time venture, I'll agree but as both approaches are only
> > > about reducing stack stack they wouldn't be long-term solutions by your
> > > criteria. What do you suggest?
> >
> > (from easy to more complicated):
> >
> > - Disable direct reclaim with 4K stacks
>
> Just to re-iterate: we're blowing the stack with direct reclaim on
> x86_64 w/ 8k stacks. The old i386/4k stack problem is a red
> herring.
Yes that's known, but on 4K it will definitely not work at all.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Apr 15, 2010 at 02:22:01PM -0400, [email protected] wrote:
> On Thu, 15 Apr 2010 14:15:33 BST, Mel Gorman said:
>
> > Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
> > stack-o-meter) and got the following. The prereq patches are from
> > earlier in the thread with the subjects
>
> Think that's a script worth having in-tree?
Ahh, it's a hatchet-job at the moment. I copied bloat-o-meter and
altered one function. I made a TODO note to extend bloat-o-meter
properly and that would be worth merging.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
> No. If you are doing full disk seeks between random chunks, then you
> still lose a large amount of throughput. e.g. if the seek time is
> 10ms and your IO time is 10ms for each 4k page, then increasing the
> size ito 64k makes it 10ms seek and 12ms for the IO. We might increase
> throughput but we are still limited to 100 IOs per second. We've
> gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> short of the 100MB/s full size IOs with little in way of seeks
> between them will acheive on the same spindle...
The usual armwaving numbers for ops/sec for an ATA disk are in the 200
ops/sec range so that seems horribly credible.
But then I've never quite understood why our anonymous paging isn't
sorting stuff as best it can and then using the drive as a log structure
with in memory metadata so it can stream the pages onto disk. Read
performance is going to be similar (maybe better if you have a log tidy
when idle), write ought to be far better.
Alan
On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > It's a buying-time venture, I'll agree but as both approaches are only
> > about reducing stack stack they wouldn't be long-term solutions by your
> > criteria. What do you suggest?
>
> (from easy to more complicated):
>
> - Disable direct reclaim with 4K stacks
Do not like. While I can see why 4K stacks are a serious problem, I'd
sooner see 4K stacks disabled than have the kernel behave so differently
for direct reclaim. It'd be tricky to spot regressions in reclaim that
were due to this .config option.
> - Do direct reclaim only on separate stacks
This is looking more and more attractive.
> - Add interrupt stacks to any 8K stack architectures.
This is a similar but separate problem. It's similar in that interrupt
stacks can splice subsystems together in terms of stack usage.
> - Get rid of 4K stacks completely
Why would we *not* do this? I can't remember the original reasoning
behind 4K stacks but am guessing it helped fork-orientated workloads in
startup times in the days before lumpy reclaim and better fragmentation
control.
Who typically enables this option?
> - Think about any other stackings that could give large scale recursion
> and find ways to run them on separate stacks too.
The patch series I threw up about reducing stack was a cut-down
approach. Instead of using separate stacks, keep the stack usage out of
the main caller path where possible.
> - Long term: maybe we need 16K stacks at some point, depending on how
> good the VM gets. Alternative would be to stop making Linux more complicated,
> but that's unlikely to happen.
>
Make this Plan D if nothing else works out and we still hit a wall?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > It's a buying-time venture, I'll agree but as both approaches are only
> > > about reducing stack stack they wouldn't be long-term solutions by your
> > > criteria. What do you suggest?
> >
> > (from easy to more complicated):
> >
> > - Disable direct reclaim with 4K stacks
>
> Just to re-iterate: we're blowing the stack with direct reclaim on
> x86_64 w/ 8k stacks.
Yep, that is not being disputed. By the way, what did you use to
generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
I used a modified bloat-o-meter to gather my data but it'd be nice to
be sure I'm seeing the same things as you (minus XFS unless I
specifically set it up).
> The old i386/4k stack problem is a red
> herring.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Apr 15, 2010 at 09:42:17AM -0400, Chris Mason wrote:
> On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> > On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > > > profiles we are seeing here....
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > > > doing sync IO, then waiting on those pages.
> > > > > > > >
> > > > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > > > because seeks are evil and direct reclaim makes seeks. I'd really loev
> > > > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > > > of doing page by page spatters of IO to the drive.
> > > > > >
> > > > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > > > making 4k io is not must for pageout. So, probably we can improve it.
> > > > > >
> > > > > >
> > > > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > > > helpers that filesystems use to do this, like:
> > > > > > >
> > > > > > > filemap_write_and_wait(page->mapping);
> > > > > >
> > > > > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > > > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> > > > >
> > > > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > > > to start IO on a segment of the file, use
> > > > > filemap_fdatawrite_range(page->mapping, start, end)....
> > > >
> > > > That does not help the stack usage issue, the caller ends up in
> > > > ->writepages. From an IO perspective, it'll be better from a seek point of
> > > > view but from a VM perspective, it may or may not be cleaning the right pages.
> > > > So I think this is a red herring.
> > >
> > > If you ask it to clean a bunch of pages around the one you want to
> > > reclaim on the LRU, there is a good chance it will also be cleaning
> > > pages that are near the end of the LRU or physically close by as
> > > well. It's not a guarantee, but for the additional IO cost of about
> > > 10% wall time on that IO to clean the page you need, you also get
> > > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > > win any way you look at it...
> > >
> >
> > At worst, it'll distort the LRU ordering slightly. Let's say the
> > file-adjacent-page you clean was near the end of the LRU. Before such a
> > patch, it may have gotten cleaned and done another lap of the LRU.
> > After, it would be reclaimed sooner. I don't know if we depend on such
> > behaviour (very doubtful) but it's a subtle enough change. I can't
> > predict what it'll do for IO congestion. Simplistically, there is more
> > IO so it's bad but if the write pattern is less seeky and we needed to
> > write the pages anyway, it might be improved.
> >
> > > I agree that it doesn't solve the stack problem (Chris' suggestion
> > > that we enable the bdi flusher interface would fix this);
> >
> > I'm afraid I'm not familiar with this interface. Can you point me at
> > some previous discussion so that I am sure I am looking at the right
> > thing?
>
> vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
> real code needs to go....just look for the ~ marks.
>
I must be blind. What tree is this in? I can't see it v2.6.34-rc4,
mmotm or google.
> I mostly meant that the bdi helper threads were the best place to add
> knowledge about which pages we want to write for reclaim. We might need
> to add a thread dedicated to just doing the VM's dirty work, but that's
> where I would start discussing fancy new interfaces.
>
> >
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > >
> >
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> >
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> >
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
> >
> > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
>
> I'd like to add one more:
>
> 5. Don't dive into filesystem locks during reclaim.
>
Good add. It's not a new problem either. This came up at least two years
ago at around the first VM/FS summit and the response was along the lines
of shuffling uncomfortably :/
> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk. We push off work as late as we
> can until just before the pages go down to disk.
>
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS. Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
>
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() triggered a by now), stuff it
> all into the file's metadata, log that, and finally return.
>
> Each of the steps above can block for a long time. Ext4 solves
> this by not doing them. ext4_writepage only writes pages that
> are already fully allocated on disk.
>
> Btrfs is much more efficient at not doing them, it just returns right
> away for PF_MEMALLOC.
>
> This is a long way of saying the filesystem writepage code is the
> opposite of what direct reclaim wants. Direct reclaim wants to
> find free ram now, and if it does end up in the mess describe above,
> it'll just get stuck for a long time on work entirely unrelated to
> finding free pages.
>
Ok, good summary, thanks. I was only partially aware of some of these.
i.e. I knew it was a problem but was not sensitive to how bad it was.
Your last point is interesting because lumpy reclaim for large orders under
heavy pressure can make the system stutter badly (e.g. during a huge
page pool resize). I had blamed just plain IO but messing around with
locks and transactions could have been a large factor and I didn't go
looking for it.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
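For reference, the bail-out Chris describes looks roughly like the sketch
below (illustrative only, not the actual btrfs or ext4 code): when
->writepage is entered from reclaim, the filesystem refuses to do allocation
or transaction work and simply hands the page back.

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
        /*
         * Called from direct reclaim or kswapd: don't start a
         * transaction, take allocator locks or convert delayed
         * allocations here; redirty the page and let the flusher
         * threads deal with it on their own stack.
         */
        if (current->flags & PF_MEMALLOC) {
                redirty_page_for_writepage(wbc, page);
                unlock_page(page);
                return 0;
        }

        /* ... normal block allocation and IO submission ... */
        return 0;
}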
On Fri, Apr 16, 2010 at 02:14:12PM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> > On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > If you ask it to clean a bunch of pages around the one you want to
> > > reclaim on the LRU, there is a good chance it will also be cleaning
> > > pages that are near the end of the LRU or physically close by as
> > > well. It's not a guarantee, but for the additional IO cost of about
> > > 10% wall time on that IO to clean the page you need, you also get
> > > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > > win any way you look at it...
> >
> > At worst, it'll distort the LRU ordering slightly. Let's say the
> > file-adjacent-page you clean was near the end of the LRU. Before such a
> > patch, it may have gotten cleaned and done another lap of the LRU.
> > After, it would be reclaimed sooner. I don't know if we depend on such
> > behaviour (very doubtful) but it's a subtle enough change. I can't
> > predict what it'll do for IO congestion. Simplistically, there is more
> > IO so it's bad but if the write pattern is less seeky and we needed to
> > write the pages anyway, it might be improved.
>
> Fundamentally, we have so many pages on the LRU, getting a few out
> of order at the back end of it is going to be in the noise. If we
> trade off "perfect" LRU behaviour for cleaning pages an order of
haha, I don't think anyone pretends the LRU behaviour is perfect.
Altering its existing behaviour tends to be done with great care but
from what I gather that is often a case of "better the devil you know".
> magnitude faster, reclaim will find candidate pages for a whole lot
> faster. And if we have more clean pages available, faster, overall
> system throughput is going to improve and be much less likely to
> fall into deep, dark holes where the OOM-killer is the light at the
> end.....
>
> [ snip questions Chris answered ]
>
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > >
> >
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> >
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> >
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
>
> #4 is important to me, too, because that has direct impact on large
> file IO workloads. However, it is gross changes in behaviour that
> concern me, not subtle, probably-in-the-noise changes that you're
> concerned about. :)
>
I'm also less concerned with this aspect. I brought it up because it was
a factor. I don't think it'll cause us problems but if problems do
arise, it's nice to have a few potential candidates to examine in
advance.
> > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
>
> Well, you keep saying that they break #3, but I haven't seen any
> test cases or results showing that. I've been unable to confirm that
> lumpy reclaim is broken by disallowing writeback in my testing, so
> I'm interested to know what tests you are running that show it is
> broken...
>
Ok, I haven't actually tested this. The machines I use are tied up
retesting the compaction patches at the moment. The reason why I reckon
it'll be a problem is that when these sync-writeback changes were
introduced, it significantly helped lumpy reclaim for huge pages. I am
making an assumption that backing out those changes will hurt it.
I'll test for real on Monday and see what falls out.
> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence testing and validation
> > > the needs to be done.
> > >
> > > Then we can work towards allowing lumpy reclaim to use background
> > > threads as Chris suggested for doing specific writeback operations
> > > to solve the remaining problems being seen. Does this seem like a
> > > reasonable compromise and approach to dealing with the problem?
> > >
> >
> > I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
> > enough or come up with an alternative fix. From the goals above it mitigates
> > 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
> > the LRU with 4 until the background cleaner or kswapd comes along.
>
> We've been through this already, but I'll repeat it again in the
> hope it sinks in: reducing stack usage is not sufficient to stay
> within an 8k stack if we can enter writeback with an arbitrary
> amount of stack already consumed.
>
> We've already got a report of 9k of stack usage (7200 bytes left on
> an order-2 stack) and this is without a complex storage stack - it's
> just a partition on a SATA drive. We can easily add another 1k,
> possibly 2k to that stack depth with a complex storage subsystem.
> Trimming this much (3-4k) is simply not feasible in a callchain that
> is 50-70 functions deep...
>
Ok, based on this, I'll stop working on the stack-reduction patches.
I'll test what I have and push it but I won't bring it further for the
moment and instead look at putting writeback into its own thread. If
someone else works on it in the meantime, I'll review and test from the
perspective of lumpy reclaim.
> > One reason why I am edgy about this is that lumpy reclaim can kick in
> > for low-enough orders too like order-1 pages for stacks in some cases or
> > order-2 pages for network cards using jumbo frames or some wireless
> > cards. The network cards in particular could still cause the stack
> > overflow but be much harder to reproduce and detect.
>
> So push lumpy reclaim into a separate thread. It already blocks, so
> waiting for some other thread to do the work won't change anything.
No, it wouldn't. As long as it can wait on the right pages, it doesn't
really matter who does the work.
> Separating high-order reclaim from LRU reclaim is probably a good
> idea, anyway - they use different algorithms and while the two are
> intertwined it's hard to optimise/improve either....
>
They are not a million miles apart either. Lumpy reclaim uses the LRU to
select a cursor page and then reclaims around it. Improvements on LRU tend
to help lumpy reclaim as well. It's why during the tests I run I can often
allocate 80-95% of memory as huge pages on x86-64 as opposed to when anti-frag
was being developed first where getting 30% was a cause for celebration :)
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Apr 15, 2010 at 11:43:48AM +0900, KOSAKI Motohiro wrote:
> > I already have some patches to remove trivial parts of struct scan_control,
> > namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest
> > needs a deeper look.
>
> Seems interesting. but scan_control diet is not so effective. How much
> bytes can we diet by it?
Not much, it cuts 16 bytes on x86 32 bit. The bigger gain is the code
clarification it comes with. There is too much state to keep track of
in reclaim.
On Fri, Apr 16, 2010 at 03:57:07PM +0100, Mel Gorman wrote:
> On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> > On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > > It's a buying-time venture, I'll agree but as both approaches are only
> > > > about reducing stack stack they wouldn't be long-term solutions by your
> > > > criteria. What do you suggest?
> > >
> > > (from easy to more complicated):
> > >
> > > - Disable direct reclaim with 4K stacks
> >
> > Just to re-iterate: we're blowing the stack with direct reclaim on
> > x86_64 w/ 8k stacks.
>
> Yep, that is not being disputed. By the way, what did you use to
> generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
> I used a modified bloat-o-meter to gather my data but it'd be nice to
> be sure I'm seeing the same things as you (minus XFS unless I
> specifically set it up).
I'm using the tracing subsystem to get them. Doesn't everyone use
that now? ;)
$ grep STACK .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_STACK_TRACER=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
Then:
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
<run workloads>
Monitor the worst recorded stack usage as it changes via:
# cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     5584     288   get_page_from_freelist+0x5c0/0x830
  1)     5296     272   __alloc_pages_nodemask+0x102/0x730
  2)     5024      48   kmem_getpages+0x62/0x160
  3)     4976      96   cache_grow+0x308/0x330
  4)     4880      96   cache_alloc_refill+0x27f/0x2c0
  5)     4784      96   __kmalloc+0x241/0x250
  6)     4688     112   vring_add_buf+0x233/0x420
......
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Apr 16, 2010 at 10:50:02AM +0100, Alan Cox wrote:
> > No. If you are doing full disk seeks between random chunks, then you
> > still lose a large amount of throughput. e.g. if the seek time is
> > 10ms and your IO time is 10ms for each 4k page, then increasing the
> > size ito 64k makes it 10ms seek and 12ms for the IO. We might increase
> > throughput but we are still limited to 100 IOs per second. We've
> > gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> > short of the 100MB/s full size IOs with little in way of seeks
> > between them will acheive on the same spindle...
>
> The usual armwaving numbers for ops/sec for an ATA disk are in the 200
> ops/sec range so that seems horribly credible.
Yeah, in my experience 7200rpm SATA will get you 200 ops/s when you
are doing really small seeks as the typical minimum seek time is
around 4-5ms. Average seek time, however, is usually in the range of
10ms, because full head sweep + spindle rotation seeks take in the
order of 15ms.
Hence small random IO tends to result in seek times nearer the
average seek time than the minimum, so that's what I tend to use for
determining the number of ops/s a disk will sustain.
> But then I've never quite understood why our anonymous paging isn't
> sorting stuff as best it can and then using the drive as a log structure
> with in memory metadata so it can stream the pages onto disk. Read
> performance is going to be similar (maybe better if you have a log tidy
> when idle), write ought to be far better.
Sounds like a worthy project for someone to sink their teeth into.
Lots of people would like to have a system that can page out at
hundreds of megabytes a second....
Cheers,
Dave.
--
Dave Chinner
[email protected]
There are two issues here: stack utilisation and poor IO patterns in
direct reclaim. They are different.
The poor IO patterns thing is a regression. Some time several years
ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
dirty-page writeback than it used to. AFAIK nobody attempted to work
out why, nor attempted to try to fix it.
Doing writearound in pageout() might help. The kernel was in fact
doing that around 2.5.10, but I took it out again because it wasn't
obviously beneficial.
Writearound is hard to do, because direct-reclaim doesn't have an easy
way of pinning the address_space: it can disappear and get freed under
your feet. I was able to make this happen under intense MM loads. The
current page-at-a-time pageout code pins the address_space by taking a
lock on one of its pages. Once that lock is released, we cannot touch
*mapping.
And lo, the pageout() code is presently buggy:
        res = mapping->a_ops->writepage(page, &wbc);
        if (res < 0)
                handle_write_error(mapping, page, res);
The ->writepage can/will unlock the page, and we're passing a hand
grenade into handle_write_error().
Any attempt to implement writearound in pageout will need to find a way
to safely pin that address_space. One way is to take a temporary ref
on mapping->host, but IIRC that introduced nasties with inode_lock.
Certainly it'll put more load on that worrisomely-singleton lock.
Regarding simply not doing any writeout in direct reclaim (Dave's
initial proposal): the problem is that pageout() will clean a page in
the target zone. Normal writeout won't do that, so we could get into a
situation where vast amounts of writeout is happening, but none of it
is cleaning pages in the zone which we're trying to allocate from.
It's quite possibly livelockable, too.
Doing writearound (if we can get it going) will solve that adequately
(assuming that the target page gets reliably written), but it won't
help the stack usage problem.
To solve the IO-pattern thing I really do think we should first work
out ytf we started doing much more IO off the LRU. What caused it? Is
it really unavoidable?
To solve the stack-usage thing: dunno, really. One could envisage code
which skips pageout() if we're using more than X amount of stack, but
that sucks. Another possibility might be to hand the target page over
to another thread (I suppose kswapd will do) and then synchronise with
that thread - get_page()+wait_on_page_locked() is one way. The helper
thread could of course do writearound.
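The hand-off in the last paragraph could look something like this sketch
(hypothetical; queue_page_for_helper() is an invented name for whatever
mechanism passes the locked page to kswapd or a flusher thread):

static void handoff_pageout(struct page *page)
{
        get_page(page);                 /* keep the page and mapping pinned */
        queue_page_for_helper(page);    /* invented: hand over the locked page */
        wait_on_page_locked(page);      /* the helper's ->writepage unlocks it */
        put_page(page);
}

The helper runs ->writepage (and any writearound) on its own stack, so the
direct reclaimer only pays the cost of sleeping.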
On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> The poor IO patterns thing is a regression. Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to. AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.
I just know that we XFS guys have been complaining about it a lot..
But that was mostly a tuning issue - before writeout mostly happened
from pdflush. If we got into kswapd or direct reclaim we already
did get horrible I/O patterns - it just happened far less often.
> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone. Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from.
> It's quite possibly livelockable, too.
As Chris mentioned currently btrfs and ext4 do not actually do delalloc
conversions from this path, so for typical workloads the amount of
writeout that can happen from this path is extremely limited. And unless
we get things fixed we will have to do the same for XFS. I'd be much
more happy if we could just sort it out at the VM level, because this
means we have one sane place for this kind of policy instead of three
or more hacks down inside the filesystems. It's rather interesting
that all people on the modern fs side completely agree here on what the
problem is, but it seems rather hard to convince the VM side to do
anything about it.
> To solve the stack-usage thing: dunno, really. One could envisage code
> which skips pageout() if we're using more than X amount of stack, but
> that sucks.
And it doesn't solve other issues, like the whole lock taking problem.
> Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way. The helper
> thread could of course do writearound.
Allowing the flusher threads to do targeted writeout would be the
best from the FS POV. We'll still have one source of the I/O, just
with another knob for selecting the exact region to write out.
We can still synchronously wait for the I/O for lumpy reclaim if really
necessary.
On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
<[email protected]> wrote:
>
> There are two issues here: stack utilisation and poor IO patterns in
> direct reclaim. They are different.
>
> The poor IO patterns thing is a regression. Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to. AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.
I for one am looking very seriously at this problem together with Bruce.
We plan to have a discussion on this topic at the next LSF meeting
in Boston.
--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group
EMC²
where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]
On Sun, 18 Apr 2010 15:05:26 -0400, Christoph Hellwig <[email protected]>
wrote:
> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
>> The poor IO patterns thing is a regression. Some time several years
>> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
>> dirty-page writeback than it used to. AFAIK nobody attempted to work
>> out why, nor attempted to try to fix it.
>
> I just know that we XFS guys have been complaining about it a lot..
I know also that the ext3 and reiserfs guys complained about this issue
as well.
--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group
EMC²
where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]
On Sun, 18 Apr 2010 15:05:26 -0400 Christoph Hellwig <[email protected]> wrote:
> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> > The poor IO patterns thing is a regression. Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to. AFAIK nobody attempted to work
> > out why, nor attempted to try to fix it.
>
> I just know that we XFS guys have been complaining about it a lot..
>
> But that was mostly a tuning issue - before writeout mostly happened
> from pdflush. If we got into kswapd or direct reclaim we already
> did get horrible I/O patterns - it just happened far less often.
Right. It's intended that the great majority of writeout be performed
by the fs flusher threads and by the write()r in balance_dirty_pages().
Writeout off the LRU is supposed to be a rare emergency case.
This got broken.
> > Regarding simply not doing any writeout in direct reclaim (Dave's
> > initial proposal): the problem is that pageout() will clean a page in
> > the target zone. Normal writeout won't do that, so we could get into a
> > situation where vast amounts of writeout is happening, but none of it
> > is cleaning pages in the zone which we're trying to allocate from.
> > It's quite possibly livelockable, too.
>
> As Chris mentioned currently btrfs and ext4 do not actually do delalloc
> conversions from this path, so for typical workloads the amount of
> writeout that can happen from this path is extremely limited. And unless
> we get things fixed we will have to do the same for XFS. I'd be much
> more happy if we could just sort it out at the VM level, because this
> means we have one sane place for this kind of policy instead of three
> or more hacks down inside the filesystems. It's rather interesting
> that all people on the modern fs side completely agree here on what the
> problem is, but it seems rather hard to convince the VM side to do
> anything about it.
>
> > To solve the stack-usage thing: dunno, really. One could envisage code
> > which skips pageout() if we're using more than X amount of stack, but
> > that sucks.
>
> And it doesn't solve other issues, like the whole lock taking problem.
>
> > Another possibility might be to hand the target page over
> > to another thread (I suppose kswapd will do) and then synchronise with
> > that thread - get_page()+wait_on_page_locked() is one way. The helper
> > thread could of course do writearound.
>
> Allowing the flusher threads to do targeted writeout would be the
> best from the FS POV. We'll still have one source of the I/O, just
> with another knob for selecting the exact region to write out.
> We can still synchronously wait for the I/O for lumpy reclaim if really
> necessary.
Yeah, but it's all bandaids. The first thing we should do is work out
why writeout-off-the-LRU increased so much and fix that.
Handing writeout off to separate threads might be used to solve the
stack consumption problem but we shouldn't use it to "solve" the
excess-writeout-from-page-reclaim problem.
On Sun, Apr 18, 2010 at 12:31:09PM -0400, Andrew Morton wrote:
> Yeah, but it's all bandaids. The first thing we should do is work out
> why writeout-off-the-LRU increased so much and fix that.
>
> Handing writeout off to separate threads might be used to solve the
> stack consumption problem but we shouldn't use it to "solve" the
> excess-writeout-from-page-reclaim problem.
I think both of them are really serious issues. Exposing the whole
stack and lock problems with direct reclaim is a bit of a positive
side-effect of the writeout tuning mess-up. Without it the problems
would still be just as harmful, just happening even less often and
thus getting even less attention.
On Sun, 2010-04-18 at 15:10 -0400, Sorin Faibish wrote:
> On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
> <[email protected]> wrote:
>
> >
> > There are two issues here: stack utilisation and poor IO patterns in
> > direct reclaim. They are different.
> >
> > The poor IO patterns thing is a regression. Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to. AFAIK nobody attempted to work
> > out why, nor attempted to try to fix it.
> I for one am looking very seriously at this problem together with Bruce.
> We plan to have a discussion on this topic at the next LSF meeting
> in Boston.
As luck would have it, the Memory Management summit is co-located with
the Storage and Filesystem workshop ... how about just planning to lock
all the protagonists in a room if it's not solved by August. The less
extreme might even like to propose topics for the plenary sessions ...
James
On Sun, 18 Apr 2010 17:30:36 -0400, James Bottomley
<[email protected]> wrote:
> On Sun, 2010-04-18 at 15:10 -0400, Sorin Faibish wrote:
>> On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
>> <[email protected]> wrote:
>>
>> >
>> > There are two issues here: stack utilisation and poor IO patterns in
>> > direct reclaim. They are different.
>> >
>> > The poor IO patterns thing is a regression. Some time several years
>> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
>> > dirty-page writeback than it used to. AFAIK nobody attempted to work
>> > out why, nor attempted to try to fix it.
>
>> I for one am looking very seriously at this problem together with Bruce.
>> We plan to have a discussion on this topic at the next LSF meeting
>> in Boston.
>
> As luck would have it, the Memory Management summit is co-located with
> the Storage and Filesystem workshop ... how about just planning to lock
> all the protagonists in a room if it's not solved by August. The less
> extreme might even like to propose topics for the plenary sessions ...
Let's work together to get this done. This is a very good idea. I will try
to bring some facts about the current state by instrumenting the kernel
to sample the dirty page dynamics at higher time granularity. This will
allow us to better expose the problem, or the lack of one. :)
/Sorin
--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group
EMC²
where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]
On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
>
> There are two issues here: stack utilisation and poor IO patterns in
> direct reclaim. They are different.
>
> The poor IO patterns thing is a regression. Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to. AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.
I think that part of the problem is that at roughly the same time
writeback started on a long down hill slide as well, and we've
really only fixed that in the last couple of kernel releases. Also,
it tends to take more than just writing a few large files to invoke
the LRU-based writeback code, so it is generally not invoked in
filesystem "performance" testing. Hence my bet is on the fact that
the effects of LRU-based writeback are rarely noticed in common
testing.
IOWs, low memory testing is not something a lot of people do. Add to
that the fact that most fs people, including me, have been treating
the VM as a black box that a bunch of other people have been taking
care of and hence really just been hoping it does the right thing,
and we've got a recipe for an unnoticed descent into a Bad Place.
[snip]
> Any attempt to implement writearound in pageout will need to find a way
> to safely pin that address_space. One way is to take a temporary ref
> on mapping->host, but IIRC that introduced nasties with inode_lock.
> Certainly it'll put more load on that worrisomely-singleton lock.
A problem already solved in the background flusher threads....
> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone. Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from.
> It's quite possibly livelockable, too.
That's true, but seeing as we can't safely do writeback from
reclaim, we need some method of telling the background threads to
write a certain region of an inode. Perhaps some extension of a
struct writeback_control?
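One way to picture such an extension, purely as an illustration (the
structure and function names are invented, this is not an existing
interface): reclaim fills in a small work item describing the inode and the
page range it wants cleaned, queues it to the bdi flusher thread, and lumpy
reclaim can wait on the completion if it needs to.

struct reclaim_writeback_work {
        struct inode            *inode;   /* pinned by the submitter */
        pgoff_t                 start;    /* first page index to clean */
        pgoff_t                 end;      /* last page index to clean */
        struct completion       done;     /* lumpy reclaim may wait on this */
};

/* runs in the bdi flusher thread, on its own stack */
static void flusher_clean_range(struct reclaim_writeback_work *work)
{
        struct writeback_control wbc = {
                .sync_mode      = WB_SYNC_NONE,
                .nr_to_write    = work->end - work->start + 1,
                .range_start    = (loff_t)work->start << PAGE_SHIFT,
                .range_end      = ((loff_t)(work->end + 1) << PAGE_SHIFT) - 1,
        };

        do_writepages(work->inode->i_mapping, &wbc);
        complete(&work->done);
}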
> Doing writearound (if we can get it going) will solve that adequately
> (assuming that the target page gets reliably written), but it won't
> help the stack usage problem.
>
>
> To solve the IO-pattern thing I really do think we should first work
> out ytf we started doing much more IO off the LRU. What caused it? Is
> it really unavoidable?
/me wonders who has the time and expertise to do that archeology
> To solve the stack-usage thing: dunno, really. One could envisage code
> which skips pageout() if we're using more than X amount of stack, but
Which, if we have to set it as low as 1.5k of stack used, may as
well just skip pageout()....
> that sucks. Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way. The helper
> thread could of course do writearound.
I'm fundamentally opposed to pushing IO to another place in the VM
when it could be just as easily handed to the flusher threads.
Also, consider that there's only one kswapd thread in a given
context (e.g. per CPU), but we can scale the number of flusher
threads as need be....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, 19 Apr 2010 10:35:56 +1000
Dave Chinner <[email protected]> wrote:
> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> >
> > There are two issues here: stack utilisation and poor IO patterns in
> > direct reclaim. They are different.
> >
> > The poor IO patterns thing is a regression. Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to. AFAIK nobody attempted to
> > work out why, nor attempted to try to fix it.
>
> I think that part of the problem is that at roughly the same time
> writeback started on a long down hill slide as well, and we've
> really only fixed that in the last couple of kernel releases. Also,
> it tends to take more than just writing a few large files to invoke
> the LRU-based writeback code, so it is generally not invoked in
> filesystem "performance" testing. Hence my bet is on the fact that
> the effects of LRU-based writeback are rarely noticed in common
> testing.
>
Would this also be the time where we started real dirty accounting, and
started playing with the dirty page thresholds?
Background writeback is that interesting tradeoff between writing out
to make the VM easier (and the data safe) and the chance of someone
either rewriting the same data (as benchmarks do regularly... not sure
about real workloads) or deleting the temporary file.
Maybe we need to do the background dirty writes a bit more aggressive...
or play with heuristics where we get an adaptive timeout (say, if the
file got closed by the last opener, then do a shorter timeout)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
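Arjan's adaptive-timeout idea might be sketched as below (an invented helper
for illustration; the intervals are made-up tunables and i_writecount is
just used as a cheap "does anyone still have this file open for writing"
test):

/* pick a shorter dirty expiry once the last writer has gone away */
static unsigned long dirty_expire_for(struct inode *inode)
{
        if (atomic_read(&inode->i_writecount) == 0)
                return 5 * HZ;          /* flush soon, nobody is rewriting it */
        return 30 * HZ;                 /* roughly the default expiry */
}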
On Sun, Apr 18, 2010 at 05:49:44PM -0700, Arjan van de Ven wrote:
> On Mon, 19 Apr 2010 10:35:56 +1000
> Dave Chinner <[email protected]> wrote:
>
> > On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> > >
> > > There are two issues here: stack utilisation and poor IO patterns in
> > > direct reclaim. They are different.
> > >
> > > The poor IO patterns thing is a regression. Some time several years
> > > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > > dirty-page writeback than it used to. AFAIK nobody attempted to
> > > work out why, nor attempted to try to fix it.
> >
> > I think that part of the problem is that at roughly the same time
> > writeback started on a long down hill slide as well, and we've
> > really only fixed that in the last couple of kernel releases. Also,
> > it tends to take more than just writing a few large files to invoke
> > the LRU-based writeback code, so it is generally not invoked in
> > filesystem "performance" testing. Hence my bet is on the fact that
> > the effects of LRU-based writeback are rarely noticed in common
> > testing.
>
> Would this also be the time where we started real dirty accounting, and
> started playing with the dirty page thresholds?
Yes, I think that was introduced in 2.6.16/17, so it's definitely in
the ballpark.
> Background writeback is that interesting tradeoff between writing out
> to make the VM easier (and the data safe) and the chance of someone
> either rewriting the same data (as benchmarks do regularly... not sure
> about real workloads) or deleting the temporary file.
>
> Maybe we need to do the background dirty writes a bit more aggressive...
> or play with heuristics where we get an adaptive timeout (say, if the
> file got closed by the last opener, then do a shorter timeout)
Realistically, I'm concerned about preventing the worst case
behaviour from occurring - making the background writes more
aggressive without preventing writeback in LRU order simply means it
will be harder to test the VM corner case that triggers these
writeout patterns...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sun, Apr 18, 2010 at 04:30:36PM -0500, James Bottomley wrote:
> > I for one am looking very seriously at this problem together with Bruce.
> > We plan to have a discussion on this topic at the next LSF meeting
> > in Boston.
>
> As luck would have it, the Memory Management summit is co-located with
> the Storage and Filesystem workshop ... how about just planning to lock
> all the protagonists in a room if it's not solved by August. The less
> extreme might even like to propose topics for the plenary sessions ...
I'd personally hope that this is solved long before the LSF/VM
workshops.... but if not, yes, we should definitely tackle it then.
- Ted
On Mon, 19 Apr 2010 11:08:05 +1000
Dave Chinner <[email protected]> wrote:
> > Maybe we need to do the background dirty writes a bit more
> > aggressive... or play with heuristics where we get an adaptive
> > timeout (say, if the file got closed by the last opener, then do a
> > shorter timeout)
>
> Realistically, I'm concerned about preventing the worst case
> behaviour from occurring - making the background writes more
> aggressive without preventing writeback in LRU order simply means it
> will be harder to test the VM corner case that triggers these
> writeout patterns...
while I appreciate that the worst case should not be uber horrific...
I care a LOT about getting the normal case right... and am willing to
sacrifice the worst case for that.. (obviously not to infinity, it
needs to be bounded)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Fri, Apr 16, 2010 at 04:05:10PM +0100, Mel Gorman wrote:
> > vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
> > real code needs to go....just look for the ~ marks.
> >
>
> I must be blind. What tree is this in? I can't see it v2.6.34-rc4,
> mmotm or google.
>
Bah, Johannes corrected my literal mind. har de har har :)
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > guess dirty pages can cycle around more so it'd need to be cared for.
> >
> > Well, you keep saying that they break #3, but I haven't seen any
> > test cases or results showing that. I've been unable to confirm that
> > lumpy reclaim is broken by disallowing writeback in my testing, so
> > I'm interested to know what tests you are running that show it is
> > broken...
> >
>
> Ok, I haven't actually tested this. The machines I use are tied up
> retesting the compaction patches at the moment. The reason why I reckon
> it'll be a problem is that when these sync-writeback changes were
> introduced, it significantly helped lumpy reclaim for huge pages. I am
> making an assumption that backing out those changes will hurt it.
>
> I'll test for real on Monday and see what falls out.
>
One machine has completed the test and the results are as expected. When
allocating huge pages under stress, your patch drops the success rates
significantly. On X86-64, it showed
STRESS-HIGHALLOC
                  stress-highalloc      stress-highalloc
                  enable-directreclaim  disable-directreclaim
Under Load 1      89.00 ( 0.00)         73.00 (-16.00)
Under Load 2      90.00 ( 0.00)         85.00 (-5.00)
At Rest           90.00 ( 0.00)         90.00 ( 0.00)
So with direct reclaim, it gets 89% of memory as huge pages at the first
attempt but 73% with your patch applied. The "Under Load 2" test happens
immediately after. With the start kernel, the first and second attempts
are usually the same or very close together. With your patch applied,
there are big differences as it was no longer trying to clean pages.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Apr 15, 2010 at 3:30 AM, Johannes Weiner <[email protected]> wrote:
> On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
>> Cc to Johannes
>>
>> > >
>> > > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>> > >
>> > > > Now, vmscan pageout() is one of IO throuput degression source.
>> > > > Some IO workload makes very much order-0 allocation and reclaim
>> > > > and pageout's 4K IOs are making annoying lots seeks.
>> > > >
>> > > > At least, kswapd can avoid such pageout() because kswapd don't
>> > > > need to consider OOM-Killer situation. that's no risk.
>> > > >
>> > > > Signed-off-by: KOSAKI Motohiro <[email protected]>
>> > >
>> > > What's your opinion on trying to cluster the writes done by pageout,
>> > > instead of not doing any paging out in kswapd?
>> > > Something along these lines:
>> >
>> > Interesting.
>> > So, I'd like to review your patch carefully. can you please give me one
>> > day? :)
>>
>> Hannes, if my remember is correct, you tried similar swap-cluster IO
>> long time ago. now I can't remember why we didn't merged such patch.
>> Do you remember anything?
>
> Oh, quite vividly in fact :)  For a lot of swap loads the LRU order
> diverged heavily from swap slot order and readaround was a waste of
> time.
>
> Of course, the patch looked good, too, but it did not match reality
> that well.
>
> I guess 'how about this patch?' won't get us as far as 'how about
> those numbers/graphs of several real-life workloads? oh and here
> is the patch...'.
Hannes,
We recently ran into this problem while running some experiments on
the ext4 filesystem. We experienced the scenario where we are writing a
large file or just opening a large file with limited memory allocation
(using containers), and the process got OOMed. The memory assigned to
the container is reasonably large, and the OOM can not be reproduced
on ext2 with the same configurations.
Later we figured this might be due to the delayed block allocation
from ext4. Vmscan sends a single page to ext4->writepage(), then ext4
punts if the block is DA'ed and re-dirties the page. On the other
hand, the flusher thread uses ext4->writepages(), which does include the
block allocation.
We looked at the OOM log under ext4; all pages within the container
were on the inactive list and either Dirty or WriteBack. Also, the zones
were all marked "all_unreclaimable", which indicates the reclaim path
has scanned the LRU many times without making progress. If the
delayed block allocation is the cause of pageout() not being able to
flush dirty pages, which then triggers OOMs, should we signal the fs to
force write out dirty pages under memory pressure?
--Ying
>
>> > >      Cluster writes to disk due to memory pressure.
>> > >
>> > >      Write out logically adjacent pages to the one we're paging out
>> > >      so that we may get better IOs in these situations:
>> > >      These pages are likely to be contiguous on disk to the one we're
>> > >      writing out, so they should get merged into a single disk IO.
>> > >
>> > >      Signed-off-by: Suleiman Souhlal <[email protected]>
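
As a rough illustration of the clustering idea (a speculative sketch, not
Suleiman's actual patch), pageout() could kick writeback for a window of
logically adjacent pages in the same mapping. PAGEOUT_CLUSTER and
cluster_pageout() are made-up names, and the window size is an arbitrary
assumption.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

#define PAGEOUT_CLUSTER	16	/* assumed window size, in pages */

static void cluster_pageout(struct page *page)
{
	struct address_space *mapping = page_mapping(page);
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= PAGEOUT_CLUSTER,
		.range_start	= (loff_t)page->index << PAGE_CACHE_SHIFT,
		.range_end	= ((loff_t)(page->index + PAGEOUT_CLUSTER)
					<< PAGE_CACHE_SHIFT) - 1,
	};

	/*
	 * Ask the filesystem to write a small range around the target
	 * page so adjacent dirty pages can merge into one disk IO.
	 */
	if (mapping && mapping->a_ops->writepages)
		mapping->a_ops->writepages(mapping, &wbc);
}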
>
> For random IO, LRU order will have nothing to do with mapping/disk order.
>
On Mon, Apr 19, 2010 at 04:20:34PM +0100, Mel Gorman wrote:
> On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > > guess dirty pages can cycle around more so it'd need to be cared for.
> > >
> > > Well, you keep saying that they break #3, but I haven't seen any
> > > test cases or results showing that. I've been unable to confirm that
> > > lumpy reclaim is broken by disallowing writeback in my testing, so
> > > I'm interested to know what tests you are running that show it is
> > > broken...
> > >
> >
> > Ok, I haven't actually tested this. The machines I use are tied up
> > retesting the compaction patches at the moment. The reason why I reckon
> > it'll be a problem is that when these sync-writeback changes were
> > introduced, it significantly helped lumpy reclaim for huge pages. I am
> > making an assumption that backing out those changes will hurt it.
> >
> > I'll test for real on Monday and see what falls out.
> >
>
> One machine has completed the test and the results are as expected. When
> allocating huge pages under stress, your patch drops the success rates
> significantly. On X86-64, it showed
>
> STRESS-HIGHALLOC
>                  stress-highalloc       stress-highalloc
>              enable-directreclaim  disable-directreclaim
> Under Load 1        89.00 ( 0.00)         73.00 (-16.00)
> Under Load 2        90.00 ( 0.00)          85.00 (-5.00)
> At Rest             90.00 ( 0.00)          90.00 ( 0.00)
>
> So with direct reclaim, it gets 89% of memory as huge pages at the first
> attempt but 73% with your patch applied. The "Under Load 2" test happens
> immediately after. With the unpatched kernel, the first and second attempts
> are usually the same or very close together. With your patch applied,
> there are big differences as it was no longer trying to clean pages.
What was the machine config you were testing on (RAM, CPUs, etc)?
And what are these loads? Do you have a script that generates
them? If so, can you share them, please?
OOC, what was the effect on the background load - did it go faster
or slower when writeback was disabled? i.e. did we trade off more
large pages for better overall throughput?
Also, I'm curious as to the repeatability of the tests you are
doing. I found that from run to run I could see a *massive*
variance in the results. e.g. one run might only get ~80 huge
pages at the first attempt, while the next run from the same initial
conditions might get 440 huge pages at the first attempt. I saw
the same variance with or without writeback from direct reclaim
enabled. Hence only after averaging over tens of runs could I see
any sort of trend emerge, and it makes me wonder if your testing is
also seeing this sort of variance....
FWIW, if we look at the results of the test I did, it showed a 20%
improvement in large page allocation with a 15% increase in load
throughput, while you're showing a 16% degradation in large page
allocation. Effectively we've got two workloads that show results
at either end of the spectrum (perhaps they are best case vs worst
case) but there's no real in-between. What other tests can we run to
get a better picture of the effect?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Apr 23, 2010 at 11:06:32AM +1000, Dave Chinner wrote:
> On Mon, Apr 19, 2010 at 04:20:34PM +0100, Mel Gorman wrote:
> > On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > > > guess dirty pages can cycle around more so it'd need to be cared for.
> > > >
> > > > Well, you keep saying that they break #3, but I haven't seen any
> > > > test cases or results showing that. I've been unable to confirm that
> > > > lumpy reclaim is broken by disallowing writeback in my testing, so
> > > > I'm interested to know what tests you are running that show it is
> > > > broken...
> > > >
> > >
> > > Ok, I haven't actually tested this. The machines I use are tied up
> > > retesting the compaction patches at the moment. The reason why I reckon
> > > it'll be a problem is that when these sync-writeback changes were
> > > introduced, it significantly helped lumpy reclaim for huge pages. I am
> > > making an assumption that backing out those changes will hurt it.
> > >
> > > I'll test for real on Monday and see what falls out.
> > >
> >
> > One machine has completed the test and the results are as expected. When
> > allocating huge pages under stress, your patch drops the success rates
> > significantly. On X86-64, it showed
> >
> > STRESS-HIGHALLOC
> >                  stress-highalloc       stress-highalloc
> >              enable-directreclaim  disable-directreclaim
> > Under Load 1        89.00 ( 0.00)         73.00 (-16.00)
> > Under Load 2        90.00 ( 0.00)          85.00 (-5.00)
> > At Rest             90.00 ( 0.00)          90.00 ( 0.00)
> >
> > So with direct reclaim, it gets 89% of memory as huge pages at the first
> > attempt but 73% with your patch applied. The "Under Load 2" test happens
> > immediately after. With the unpatched kernel, the first and second attempts
> > are usually the same or very close together. With your patch applied,
> > there are big differences as it was no longer trying to clean pages.
>
> What was the machine config you were testing on (RAM, CPUs, etc)?
2G RAM, AMD Phenom with 4 cores.
> And what are these loads?
Compile-based loads that fill up memory, put it under heavy memory
pressure and also dirty it. While they are running, a kernel module
is loaded that starts allocating huge pages one at a time so that accurate
timing and the state of the system can be gathered at allocation time. The
number of allocation attempts is 90% of the number of huge pages that exist
in the system.
> Do you have a script that generates
> them? If so, can you share them, please?
>
Yes, but unfortunately they are not in a publishable state. Parts of
them depend on an automation harness that I don't hold the copyright to.
> OOC, what was the effect on the background load - did it go faster
> or slower when writeback was disabled?
Unfortunately, I don't know what the effect on the underlying load is,
as it runs for longer than the huge page allocation attempts do. The test's
objective is to check how well lumpy reclaim works under memory pressure.
However, the time it takes to allocate a huge page is higher with direct
reclaim writeback disabled (i.e. your patch) early in the test, up until
about 40% of memory was allocated as huge pages. After that, the latencies
with disable-directreclaim are lower until it gives up, while the latencies
with enable-directreclaim increase.
In other words, with direct reclaim writing back pages, lumpy reclaim is a
lot more determined to get the pages cleaned, waiting on them if necessary. A
compromise patch might be to wait for the page's dirty bit to clear (a
wait_on_page_dirty, so to speak) instead of queueing the IO ourselves and
using wait_on_page_writeback. How long it stalled would depend heavily on
the rate at which pages were being cleaned in the background.
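
A minimal sketch of that compromise, assuming such a primitive existed: the
kernel of this era has wait_on_page_writeback() but no wait_on_page_dirty(),
so the polling loop and the function name below are purely illustrative of
letting the flusher threads issue the IO while lumpy reclaim only waits.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>

static void lumpy_wait_for_clean(struct page *page)
{
	/* let the background flusher, not reclaim, issue the writeback */
	while (PageDirty(page) || PageWriteback(page)) {
		if (PageWriteback(page))
			wait_on_page_writeback(page);
		else
			congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}
}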
> i.e. did we trade off more
> large pages for better overall throughput?
>
> Also, I'm curious as to the repeatability of the tests you are
> doing. I found that from run to run I could see a *massive*
> variance in the results. e.g. one run might only get ~80 huge
> pages at the first attempt, while the next run from the same initial
> conditions might get 440 huge pages at the first attempt.
You are using the nr_hugepages interface and writing a large number to it
so you are also triggering the hugetlbfs retry-logic and have little control
over how many times the allocator gets called on each attempt. How many huge
pages it allocates depends on how much progress it is able to make during
lumpy reclaim.
It's why the tests I run allocate huge pages one at a time and measure
the latency of each attempt. The results tend to be quite reproducible:
success figures are the same between runs and the rate of allocation
success is generally comparable as well.
Your test could do something similar by only ever requesting one additional
page at a time; that is good enough to measure allocation latency. Gathering
other system state at the time of failure is not very important here
(whereas it was important during anti-frag development, hence the use of a
kernel module).
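
For illustration, a small userspace sketch of that "one page at a time"
approach: it repeatedly raises nr_hugepages by one and times each attempt.
The sysctl path is the standard one, but the number of attempts and the
general shape of the program are assumptions, not Mel's actual harness.

#include <stdio.h>
#include <time.h>

#define NR_HUGEPAGES "/proc/sys/vm/nr_hugepages"

static long read_nr_hugepages(void)
{
	FILE *f = fopen(NR_HUGEPAGES, "r");
	long nr = -1;

	if (f) {
		if (fscanf(f, "%ld", &nr) != 1)
			nr = -1;
		fclose(f);
	}
	return nr;
}

int main(void)
{
	long attempts = 500;	/* assumed number of allocation attempts */
	long i, base = read_nr_hugepages();

	if (base < 0)
		return 1;

	for (i = 1; i <= attempts; i++) {
		struct timespec t0, t1;
		FILE *f = fopen(NR_HUGEPAGES, "w");

		if (!f)
			return 1;
		/* the write (flushed at fclose) blocks while the pool grows */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		fprintf(f, "%ld\n", base + i);
		fclose(f);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("attempt %ld: pool now %ld, latency %.3f ms\n",
		       i, read_nr_hugepages(),
		       (t1.tv_sec - t0.tv_sec) * 1000.0 +
		       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	}
	return 0;
}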
> I saw
> the same variance with or without writeback from direct reclaim
> enabled. Hence only after averaging over tens of runs could I see
> any sort of trend emerge, and it makes me wonder if your testing is
> also seeing this sort of variance....
>
Typically, there is not much variance between tests. Maybe 1-2% in allocation
success rates.
> FWIW, if we look results of the test I did, it showed a 20%
> improvement in large page allocation with a 15% increase in load
> throughput, while you're showing a 16% degradation in large page
> allocation.
With writeback, lumpy reclaim takes a range of pages, cleans them and waits
for the IO before moving on. This causes a seeky IO pattern and takes time.
It also causes a fair amount of thrashing.
With your patch, lumpy reclaim would just skip over ranges with dirty pages
until it found clean pages in a suitable range. When there is plenty of
usable memory early in the test, it probably scans more but causes less
IO, so it appears faster. Later in the test, it scans more but eventually
encounters too many dirty pages and gives up. Hence, its success rates will
be more random because they depend on where exactly the dirty pages were.
If this is accurate, it will always be the case that your patch causes less
disruption in the system and appears faster due to the lack of IO, but it
will be less predictable and give up more easily, so it will have lower
success rates when there are dirty pages in the system.
> Effectively we've got two workloads that show results
> at either end of the spectrum (perhaps they are best case vs worst
> case) but there's no real in-between. What other tests can we run to
> get a better picture of the effect?
>
The underlying workload is only important in how many pages it is
dirtying at any given time. Heck, at one point my test workload was a
single process that created a mapping the size of physical memory and in
test a) would constantly read it and in test b) would constantly write
it. Lumpy reclaim with dirty-page-writeback was always more predictable
and had higher success rates.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab