2010-06-08 09:02:34

by Mel Gorman

Subject: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

I finally got a chance last week to revisit the topic of direct reclaim
avoiding writing out pages. As it came up during previous discussions, I also
had a stab at making the VM write ranges of pages instead of individual
pages. I am not proposing this for merging yet; first I want to see what
people think of the general direction and whether we can agree that it is
the right one.

To summarise, there are two big problems with page reclaim right now. The
first is that page reclaim uses a_ops->writepage to write a page back
under the page lock, which is inefficient from an IO perspective due to
the seeky patterns it generates. The second is that direct reclaim calling
into the filesystem splices two potentially deep call paths together and can
overflow the stack on complex storage or filesystems. This series is an early
draft at tackling both of these problems and is in three stages.

The first 4 patches are a forward-port of trace points that are partly
based on trace points defined by Larry Woodman but never merged. They trace
parts of kswapd, direct reclaim, LRU page isolation and page writeback. The
tracepoints can be used to evaluate what is happening within reclaim and
whether things are getting better or worse. They do not have to be part of
the final series but might be useful during discussion.

Patch 5 writes out contiguous ranges of pages where possible using
a_ops->writepages. When writing a range, the inode is pinned and the page
lock released before submitting to writepages(). This potentially generates
a better IO pattern and it should avoid a lock inversion where the filesystem
wants a page lock that the VM is already holding. The downside of writing
ranges is that the VM may generate more IO than strictly necessary.
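
As a rough illustration of the mechanism (a simplified sketch rather than
the patch code; first_page and last_page stand in for the two ends of a
contiguous dirty run found in the list):

	struct writeback_control wbc = {
		.sync_mode   = WB_SYNC_NONE,
		.nr_to_write = LONG_MAX,
		.range_start = page_offset(first_page),
		.range_end   = page_offset(last_page) + PAGE_CACHE_SIZE - 1,
	};
	/* Pin the inode while the page lock is still held, then drop the lock */
	struct inode *inode = igrab(mapping->host);

	unlock_page(first_page);
	if (inode) {
		do_writepages(mapping, &wbc);
		iput(inode);
	}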

Patch 6 prevents direct reclaim from writing out pages at all; instead, dirty
pages are put back on the LRU lists. For lumpy reclaim, the caller will
briefly wait for dirty pages to be written back before trying to reclaim
them a second time.
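
In sketch form (simplified from patch 6, with the surrounding list handling
omitted), the split of responsibility looks something like:

	if (current_is_kswapd()) {
		/* kswapd still queues the IO itself, in ranges where possible */
		clean_page_list(&dirty_pages, sc);
	} else if (sync_writeback == PAGEOUT_IO_SYNC) {
		/* direct lumpy reclaim asks the flusher threads and waits briefly */
		wakeup_flusher_threads(nr_dirty);
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	}
	/* ordinary direct reclaim just puts the dirty pages back on the LRU */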

The last patch increases kswapd's responsibility somewhat because it is now
cleaning pages on behalf of direct reclaimers, but kswapd seemed a better fit
than the background flushers as it knows where the pages needing cleaning
are. As the IO is asynchronous, it should not cause kswapd to stall (at least
until the queue is congested), but the order in which pages are reclaimed
from the LRU is altered: dirty pages that would have been reclaimed by direct
reclaimers get another lap on the LRU. The dirty pages could have been put
on a dedicated list, but that increased counter overhead and the number of
lists, and it is unclear whether it is necessary.

The series has survived performance and stress testing, particularly around
high-order allocations on X86, X86-64 and PPC64. The tests showed that while
lumpy reclaim has a slightly lower success rate when allocating huge pages,
the rates were still very acceptable, reclaim was a lot less disruptive and
allocation latency was lower.

Comments?

.../trace/postprocess/trace-vmscan-postprocess.pl | 623 ++++++++++++++++++++
include/trace/events/gfpflags.h | 37 ++
include/trace/events/kmem.h | 38 +--
include/trace/events/vmscan.h | 184 ++++++
mm/vmscan.c | 299 ++++++++--
5 files changed, 1092 insertions(+), 89 deletions(-)
create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
create mode 100644 include/trace/events/gfpflags.h
create mode 100644 include/trace/events/vmscan.h


2010-06-08 09:02:32

by Mel Gorman

Subject: [PATCH 2/6] tracing, vmscan: Add trace events for LRU page isolation

This patch adds an event for when pages are isolated en masse from the
LRU lists. This event augments the information available on LRU traffic
and can be used to evaluate lumpy reclaim.
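
For reference, a single event rendered through the format string added below
looks roughly like this (the values are purely illustrative):

  mm_vmscan_lru_isolate: isolate_mode=1 order=9 nr_requested=32 nr_scanned=45 nr_taken=32 contig_taken=28 contig_dirty=3 contig_failed=1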

Signed-off-by: Mel Gorman <[email protected]>
---
include/trace/events/vmscan.h | 46 +++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 14 ++++++++++++
2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f76521f..a331454 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -109,6 +109,52 @@ TRACE_EVENT(mm_vmscan_direct_reclaim_end,
TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
);

+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+ TP_PROTO(int order,
+ unsigned long nr_requested,
+ unsigned long nr_scanned,
+ unsigned long nr_taken,
+ unsigned long nr_lumpy_taken,
+ unsigned long nr_lumpy_dirty,
+ unsigned long nr_lumpy_failed,
+ int isolate_mode),
+
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+ TP_STRUCT__entry(
+ __field(int, order)
+ __field(unsigned long, nr_requested)
+ __field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_taken)
+ __field(unsigned long, nr_lumpy_taken)
+ __field(unsigned long, nr_lumpy_dirty)
+ __field(unsigned long, nr_lumpy_failed)
+ __field(int, isolate_mode)
+ ),
+
+ TP_fast_assign(
+ __entry->order = order;
+ __entry->nr_requested = nr_requested;
+ __entry->nr_scanned = nr_scanned;
+ __entry->nr_taken = nr_taken;
+ __entry->nr_lumpy_taken = nr_lumpy_taken;
+ __entry->nr_lumpy_dirty = nr_lumpy_dirty;
+ __entry->nr_lumpy_failed = nr_lumpy_failed;
+ __entry->isolate_mode = isolate_mode;
+ ),
+
+ TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+ __entry->isolate_mode,
+ __entry->order,
+ __entry->nr_requested,
+ __entry->nr_scanned,
+ __entry->nr_taken,
+ __entry->nr_lumpy_taken,
+ __entry->nr_lumpy_dirty,
+ __entry->nr_lumpy_failed)
+);
+
#endif /* _TRACE_VMSCAN_H */

/* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6bfb579..25bf05a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -917,6 +917,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
unsigned long *scanned, int order, int mode, int file)
{
unsigned long nr_taken = 0;
+ unsigned long nr_lumpy_taken = 0, nr_lumpy_dirty = 0, nr_lumpy_failed = 0;
unsigned long scan;

for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -994,12 +995,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
nr_taken++;
+ nr_lumpy_taken++;
+ if (PageDirty(cursor_page))
+ nr_lumpy_dirty++;
scan++;
+ } else {
+ if (mode == ISOLATE_BOTH &&
+ page_count(cursor_page))
+ nr_lumpy_failed++;
}
}
}

*scanned = scan;
+
+ trace_mm_vmscan_lru_isolate(order,
+ nr_to_scan, scan,
+ nr_taken,
+ nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+ mode);
return nr_taken;
}

--
1.7.1

2010-06-08 09:02:51

by Mel Gorman

Subject: [PATCH 6/6] vmscan: Do not writeback pages in direct reclaim

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem is complex. Stack overflows have already
been observed on XFS, but the problem is not XFS-specific.

This patch prevents direct reclaim from writing back pages by not setting
may_writepage in scan_control. Instead, dirty pages are placed back on the
LRU lists to be written either by the BDI flusher threads or by kswapd. If
dirty pages are encountered during direct lumpy reclaim, the process kicks
the background flusher threads and waits briefly before trying again.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 69 ++++++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 50 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2eb2a6..3565610 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -725,6 +725,9 @@ writeout:
list_splice(&ret_pages, page_list);
}

+/* Direct lumpy reclaim waits up to a second for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 10
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -734,10 +737,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
{
LIST_HEAD(putback_pages);
LIST_HEAD(dirty_pages);
- struct list_head *ret_list = page_list;
struct pagevec freed_pvec;
int pgactivate;
- bool cleaned = false;
+ int cleaned = 0;
+ unsigned long nr_dirty;
unsigned long nr_reclaimed = 0;

pgactivate = 0;
@@ -746,6 +749,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
pagevec_init(&freed_pvec, 1);

restart_dirty:
+ nr_dirty = 0;
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -837,12 +841,17 @@ restart_dirty:
if (PageDirty(page)) {
/*
* On the first pass, dirty pages are put on a separate
- * list. IO is then queued based on ranges of pages for
- * each unique mapping in the list
+ * list. If kswapd, IO is then queued based on ranges of
+ * pages for each unique mapping in the list. Direct
+ * reclaimers put the dirty pages back on the list for
+ * cleaning by kswapd
*/
- if (!cleaned) {
- /* Keep locked for clean_page_list */
+ if (cleaned < MAX_SWAP_CLEAN_WAIT) {
+ /* Keep locked for kswapd to call clean_page_list */
+ if (!current_is_kswapd())
+ unlock_page(page);
list_add(&page->lru, &dirty_pages);
+ nr_dirty++;
goto keep_dirty;
}

@@ -959,15 +968,38 @@ keep_dirty:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}

- if (!cleaned && !list_empty(&dirty_pages)) {
- clean_page_list(&dirty_pages, sc);
- page_list = &dirty_pages;
- cleaned = true;
- goto restart_dirty;
+ if (cleaned < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
+ /*
+ * Only kswapd cleans pages. Direct reclaimers entering the filesystem
+ * potentially splices two expensive call-chains and busts the stack
+ * so instead they go to sleep to give background cleaning a chance
+ */
+ list_splice(&dirty_pages, page_list);
+ INIT_LIST_HEAD(&dirty_pages);
+ if (current_is_kswapd()) {
+ cleaned = MAX_SWAP_CLEAN_WAIT;
+ clean_page_list(page_list, sc);
+ goto restart_dirty;
+ } else {
+ cleaned++;
+ /*
+ * If lumpy reclaiming, kick the background flusher and wait
+ * for the pages to be cleaned
+ *
+ * XXX: kswapd won't find these isolated pages but the
+ * background flusher does not prioritise pages. It'd
+ * be nice to prioritise a list of pages somehow
+ */
+ if (sync_writeback == PAGEOUT_IO_SYNC) {
+ wakeup_flusher_threads(nr_dirty);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ goto restart_dirty;
+ }
+ }
}
- BUG_ON(!list_empty(&dirty_pages));

- list_splice(&putback_pages, ret_list);
+ list_splice(&dirty_pages, page_list);
+ list_splice(&putback_pages, page_list);
if (pagevec_count(&freed_pvec))
__pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
@@ -1988,10 +2020,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* writeout. So in laptop mode, write out the whole world.
*/
writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
- if (total_scanned > writeback_threshold) {
+ if (total_scanned > writeback_threshold)
wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
- sc->may_writepage = 1;
- }

/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -2040,7 +2070,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
unsigned long nr_reclaimed;
struct scan_control sc = {
.gfp_mask = gfp_mask,
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
@@ -2069,7 +2099,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
struct zone *zone, int nid)
{
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.swappiness = swappiness,
@@ -2743,7 +2773,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
struct reclaim_state reclaim_state;
int priority;
struct scan_control sc = {
- .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+ .may_writepage = (current_is_kswapd() &&
+ (zone_reclaim_mode & RECLAIM_WRITE)),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.nr_to_reclaim = max_t(unsigned long, nr_pages,
--
1.7.1

2010-06-08 09:02:53

by Mel Gorman

Subject: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible

Page reclaim cleans individual pages using a_ops->writepage() because, from
the VM perspective, it knows that pages in a particular zone must be freed
soon, it considers the target page to be the oldest and it does not want
to wait while the background flushers clean other pages. From a filesystem
perspective this is extremely inefficient as it generates a very seeky
IO pattern, leading to the perverse situation where it can take longer to
clean all dirty pages than it otherwise would.

This patch recognises that there are cases where a number of pages
belonging to the same inode are being written out. When this happens and
writepages() is implemented, the range of pages will be written out with
a_ops->writepages. The inode is pinned and the page lock released before
submitting the range to the filesystem. While this potentially means that
more pages are cleaned than strictly necessary, the expectation is that the
filesystem will be able to write out the pages more efficiently and improve
overall performance.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 220 +++++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 176 insertions(+), 44 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 58527c4..b2eb2a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,55 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;

+int write_reclaim_page(struct page *page, struct address_space *mapping,
+ enum pageout_io sync_writeback)
+{
+ int res;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = SWAP_CLUSTER_MAX,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nonblocking = 1,
+ .for_reclaim = 1,
+ };
+
+ if (!clear_page_dirty_for_io(page))
+ return PAGE_CLEAN;
+
+ SetPageReclaim(page);
+ res = mapping->a_ops->writepage(page, &wbc);
+ /*
+ * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
+ * the page lock has been dropped by ->writepage, that mapping could
+ * be anything
+ */
+ if (res < 0)
+ handle_write_error(mapping, page, res);
+ if (res == AOP_WRITEPAGE_ACTIVATE) {
+ ClearPageReclaim(page);
+ return PAGE_ACTIVATE;
+ }
+
+ /*
+ * Wait on writeback if requested to. This happens when
+ * direct reclaiming a large contiguous area and the
+ * first attempt to free a range of pages fails.
+ */
+ if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+ wait_on_page_writeback(page);
+
+ if (!PageWriteback(page)) {
+ /* synchronous write or broken a_ops? */
+ ClearPageReclaim(page);
+ }
+ trace_mm_vmscan_writepage(page,
+ sync_writeback == PAGEOUT_IO_SYNC);
+ inc_zone_page_state(page, NR_VMSCAN_WRITE);
+
+ return PAGE_SUCCESS;
+}
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
@@ -367,45 +416,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
if (!may_write_to_queue(mapping->backing_dev_info))
return PAGE_KEEP;

- if (clear_page_dirty_for_io(page)) {
- int res;
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_NONE,
- .nr_to_write = SWAP_CLUSTER_MAX,
- .range_start = 0,
- .range_end = LLONG_MAX,
- .nonblocking = 1,
- .for_reclaim = 1,
- };
-
- SetPageReclaim(page);
- res = mapping->a_ops->writepage(page, &wbc);
- if (res < 0)
- handle_write_error(mapping, page, res);
- if (res == AOP_WRITEPAGE_ACTIVATE) {
- ClearPageReclaim(page);
- return PAGE_ACTIVATE;
- }
-
- /*
- * Wait on writeback if requested to. This happens when
- * direct reclaiming a large contiguous area and the
- * first attempt to free a range of pages fails.
- */
- if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
- wait_on_page_writeback(page);
-
- if (!PageWriteback(page)) {
- /* synchronous write or broken a_ops? */
- ClearPageReclaim(page);
- }
- trace_mm_vmscan_writepage(page,
- sync_writeback == PAGEOUT_IO_SYNC);
- inc_zone_page_state(page, NR_VMSCAN_WRITE);
- return PAGE_SUCCESS;
- }
-
- return PAGE_CLEAN;
+ return write_reclaim_page(page, mapping, sync_writeback);
}

/*
@@ -621,20 +632,120 @@ static enum page_references page_check_references(struct page *page,
}

/*
+ * Clean a list of pages in contiguous ranges where possible. It is expected
+ * that all the pages on page_list have been locked as part of isolation from
+ * the LRU
+ *
+ * XXX: Is there a problem with holding multiple page locks like this?
+ */
+static noinline_for_stack void clean_page_list(struct list_head *page_list,
+ struct scan_control *sc)
+{
+ LIST_HEAD(ret_pages);
+ struct page *cursor, *page, *tmp;
+
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ };
+
+ if (!sc->may_writepage)
+ return;
+
+ /* Write the pages out to disk in ranges where possible */
+ while (!list_empty(page_list)) {
+ struct address_space *mapping;
+ bool may_enter_fs;
+
+ cursor = lru_to_page(page_list);
+ list_del(&cursor->lru);
+ list_add(&cursor->lru, &ret_pages);
+ mapping = page_mapping(cursor);
+ if (!mapping || !may_write_to_queue(mapping->backing_dev_info)) {
+ unlock_page(cursor);
+ continue;
+ }
+
+ may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
+ (PageSwapCache(cursor) && (sc->gfp_mask & __GFP_IO));
+ if (!may_enter_fs) {
+ unlock_page(cursor);
+ continue;
+ }
+
+ wbc.nr_to_write = LONG_MAX;
+ wbc.range_start = page_offset(cursor);
+ wbc.range_end = page_offset(cursor) + PAGE_CACHE_SIZE - 1;
+
+ /* Only search if there is an inode to pin the address_space with */
+ if (!mapping->host)
+ goto writeout;
+
+ /* Only search if the address_space is smart about ranges */
+ if (!mapping->a_ops->writepages)
+ goto writeout;
+
+ /* Find a range of pages to clean within this list */
+ list_for_each_entry_safe(page, tmp, page_list, lru) {
+ if (!PageDirty(page) || PageWriteback(page))
+ continue;
+ if (page_mapping(page) != mapping)
+ continue;
+
+ list_del(&page->lru);
+ unlock_page(page);
+ list_add(&page->lru, &ret_pages);
+
+ wbc.range_start = min(wbc.range_start, page_offset(page));
+ wbc.range_end = max(wbc.range_end,
+ (page_offset(page) + PAGE_CACHE_SIZE - 1));
+ }
+
+writeout:
+ if (wbc.range_start == wbc.range_end - PAGE_CACHE_SIZE + 1) {
+ /* Write single page */
+ switch (write_reclaim_page(cursor, mapping, PAGEOUT_IO_ASYNC)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ case PAGE_CLEAN:
+ unlock_page(cursor);
+ break;
+ case PAGE_SUCCESS:
+ break;
+ }
+ } else {
+ /* Grab inode under page lock before writing range */
+ struct inode *inode = igrab(mapping->host);
+ unlock_page(cursor);
+ if (inode) {
+ do_writepages(mapping, &wbc);
+ iput(inode);
+ }
+ }
+ }
+ list_splice(&ret_pages, page_list);
+}
+
+/*
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
struct scan_control *sc,
enum pageout_io sync_writeback)
{
- LIST_HEAD(ret_pages);
+ LIST_HEAD(putback_pages);
+ LIST_HEAD(dirty_pages);
+ struct list_head *ret_list = page_list;
struct pagevec freed_pvec;
- int pgactivate = 0;
+ int pgactivate;
+ bool cleaned = false;
unsigned long nr_reclaimed = 0;

+ pgactivate = 0;
cond_resched();

pagevec_init(&freed_pvec, 1);
+
+restart_dirty:
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -723,7 +834,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

- if (PageDirty(page)) {
+ if (PageDirty(page)) {
+ /*
+ * On the first pass, dirty pages are put on a separate
+ * list. IO is then queued based on ranges of pages for
+ * each unique mapping in the list
+ */
+ if (!cleaned) {
+ /* Keep locked for clean_page_list */
+ list_add(&page->lru, &dirty_pages);
+ goto keep_dirty;
+ }
+
if (references == PAGEREF_RECLAIM_CLEAN)
goto keep_locked;
if (!may_enter_fs)
@@ -832,10 +954,20 @@ activate_locked:
keep_locked:
unlock_page(page);
keep:
- list_add(&page->lru, &ret_pages);
+ list_add(&page->lru, &putback_pages);
+keep_dirty:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
- list_splice(&ret_pages, page_list);
+
+ if (!cleaned && !list_empty(&dirty_pages)) {
+ clean_page_list(&dirty_pages, sc);
+ page_list = &dirty_pages;
+ cleaned = true;
+ goto restart_dirty;
+ }
+ BUG_ON(!list_empty(&dirty_pages));
+
+ list_splice(&putback_pages, ret_list);
if (pagevec_count(&freed_pvec))
__pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
--
1.7.1

2010-06-08 09:02:56

by Mel Gorman

Subject: [PATCH 4/6] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events

This patch adds a simple post-processing script for the reclaim-related
trace events. It can be used to give an indication of how much traffic
there is on the LRU lists and how severe latencies due to reclaim are.
Example output looks like the following

Reclaim latencies expressed as order-latency_in_ms
uname-3942 9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311 0-662.097999999998 0-2.79700000002049 \
0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
9-1.54499999992549 9-771.985000000335 9-4.88700000010431 9-15.0649999999441 \
9-0.879999999888241 9-252.01500000013 9-1381.03600000031 9-545.689999999944 \
9-3438.0129999998 9-3343.70099999988
bench-stresshig-3942 9-7063.33900000004 9-129960.482 9-2062.27500000002 \
9-3845.59399999992 9-171.82799999998 9-16493.821 9-7615.23900000006 \
9-10217.848 9-983.138000000035 9-2698.39999999991 9-4016.1540000001 \
9-5522.37700000009 9-21630.429 \
9-15061.048 9-10327.953 9-542.69700000016 9-317.652000000002 \
9-8554.71699999995 9-1786.61599999992 9-1899.31499999994 9-2093.41899999999 \
9-4992.62400000007 9-942.648999999976 9-1923.98300000001 9-3.7980000001844 \
9-5.99899999983609 9-0.912000000011176 9-1603.67700000014 9-1.98300000000745 \
9-3.96500000008382 9-0.902999999932945 9-2802.72199999983 9-1078.24799999991 \
9-2155.82900000014 9-10.058999999892 9-1984.723 9-1687.97999999998 \
9-1136.05300000007 9-3183.61699999985 9-458.731000000145 9-6.48600000003353 \
9-1013.25200000009 9-8415.22799999989 9-10065.584 9-2076.79600000009 \
9-3792.65699999989 9-71.2010000001173 9-2560.96999999997 9-2260.68400000012 \
9-2862.65799999982 9-1255.81500000018 9-15.7440000001807 9-4.33499999996275 \
9-1446.63800000004 9-238.635000000009 9-60.1790000000037 9-4.38800000003539 \
9-639.567000000039 9-306.698000000091 9-31.4070000001229 9-74.997999999905 \
9-632.725999999791 9-1625.93200000003 9-931.266000000061 9-98.7749999999069 \
9-984.606999999844 9-225.638999999966 9-421.316000000108 9-653.744999999879 \
9-572.804000000004 9-769.158999999985 9-603.918000000063 9-4.28499999991618 \
9-626.21399999992 9-1721.25 9-0.854999999981374 9-572.39599999995 \
9-681.881999999983 9-1345.12599999993 9-363.666999999899 9-3823.31099999999 \
9-2991.28200000012 9-4.27099999994971 9-309.76500000013 9-3068.35700000008 \
9-788.25 9-3515.73999999999 9-2065.96100000013 9-286.719999999972 \
9-316.076000000117 9-344.151000000071 9-2.51000000000931 9-306.688000000082 \
9-1515.00099999993 9-336.528999999864 9-793.491999999853 9-457.348999999929 \
9-13620.155 9-119.933999999892 9-35.0670000000391 9-918.266999999993 \
9-828.569000000134 9-4863.81099999999 9-105.222000000067 9-894.23900000006 \
9-110.964999999851 9-0.662999999942258 9-12753.3150000002 9-12.6129999998957 \
9-13368.0899999999 9-12.4199999999255 9-1.00300000002608 9-1.41100000008009 \
9-10300.5290000001 9-16.502000000095 9-30.7949999999255 9-6283.0140000002 \
9-4320.53799999994 9-6826.27300000004 9-3.07299999985844 9-1497.26799999992 \
9-13.4040000000969 9-3.12999999988824 9-3.86100000003353 9-11.3539999998175 \
9-0.10799999977462 9-21.780999999959 9-209.695999999996 9-299.647000000114 \
9-6.01699999999255 9-20.8349999999627 9-22.5470000000205 9-5470.16800000006 \
9-7.60499999998137 9-0.821000000229105 9-1.56600000010803 9-14.1669999998994 \
9-0.209000000031665 9-1.82300000009127 9-1.70000000018626 9-19.9429999999702 \
9-124.266999999993 9-0.0389999998733401 9-6.71400000015274 9-16.7710000001825 \
9-31.0409999999683 9-0.516999999992549 9-115.888000000035 9-5.19900000002235 \
9-222.389999999898 9-11.2739999999758 9-80.9050000000279 9-8.14500000001863 \
9-4.44599999999627 9-0.218999999808148 9-0.715000000083819 9-0.233000000007451
\
9-48.2630000000354 9-248.560999999987 9-374.96800000011 9-644.179000000004 \
9-0.835999999893829 9-79.0060000000522 9-128.447999999858 9-0.692000000039116 \
9-5.26500000013039 9-128.449000000022 9-2.04799999995157 9-12.0990000001621 \
9-8.39899999997579 9-10.3860000001732 9-11.9310000000987 9-53.4450000000652 \
9-0.46999999997206 9-2.96299999998882 9-17.9699999999721 9-0.776000000070781 \
9-25.2919999998994 9-33.1110000000335 9-0.434000000124797 9-0.641000000061467 \
9-0.505000000121072 9-1.12800000002608 9-149.222000000067 9-1.17599999997765 \
9-3247.33100000001 9-10.7439999999478 9-153.523000000045 9-1.38300000014715 \
9-794.762000000104 9-3.36199999996461 9-128.765999999829 9-181.543999999994 \
9-78149.8229999999 9-176.496999999974 9-89.9940000001807 9-9.12700000009499 \
9-250.827000000048 9-0.224999999860302 9-0.388999999966472 9-1.16700000036508 \
9-32.1740000001155 9-12.6800000001676 9-0.0720000001601875 9-0.274999999906868
\
9-0.724000000394881 9-266.866000000387 9-45.5709999999963 9-4.54399999976158 \
9-8.27199999988079 9-4.38099999958649 9-0.512000000104308 9-0.0640000002458692
\
9-5.20000000018626 9-0.0839999997988343 9-12.816000000108 9-0.503000000026077 \
9-0.507999999914318 9-6.23999999975786 9-3.35100000025705 9-18.8530000001192 \
9-25.2220000000671 9-68.2309999996796 9-98.9939999999478 9-0.441000000108033 \
9-4.24599999981001 9-261.702000000048 9-3.01599999982864 9-0.0749999997206032 \
9-0.0370000000111759 9-4.375 9-3.21800000034273 9-11.3960000001825 \
9-0.0540000000037253 9-0.286000000312924 9-0.865999999921769 \
9-0.294999999925494 9-6.45999999996275 9-4.31099999975413 9-128.248999999836 \
9-0.282999999821186 9-102.155000000261 9-0.0860000001266599 \
9-0.0540000000037253 9-0.935000000055879 9-0.0670000002719462 \
9-5.8640000000596 9-19.9860000000335 9-4.18699999991804 9-0.566000000108033 \
9-2.55099999997765 9-0.702000000048429 9-131.653999999631 9-0.638999999966472 \
9-14.3229999998584 9-183.398000000045 9-178.095999999903 9-3.22899999981746 \
9-7.31399999978021 9-22.2400000002235 9-11.7979999999516 9-108.10599999968 \
9-99.0159999998286 9-102.640999999829 9-38.414000000339
Process Direct Wokeup Pages Pages Pages
details Rclms Kswapd Scanned Sync-IO ASync-IO
cc1-30800 0 1 0 0 0 wakeup-0=1
cc1-24260 0 1 0 0 0 wakeup-0=1
cc1-24152 0 12 0 0 0 wakeup-0=12
cc1-8139 0 1 0 0 0 wakeup-0=1
cc1-4390 0 1 0 0 0 wakeup-0=1
cc1-4648 0 7 0 0 0 wakeup-0=7
cc1-4552 0 3 0 0 0 wakeup-0=3
dd-4550 0 31 0 0 0 wakeup-0=31
date-4898 0 1 0 0 0 wakeup-0=1
cc1-6549 0 7 0 0 0 wakeup-0=7
as-22202 0 17 0 0 0 wakeup-0=17
cc1-6495 0 9 0 0 0 wakeup-0=9
cc1-8299 0 1 0 0 0 wakeup-0=1
cc1-6009 0 1 0 0 0 wakeup-0=1
cc1-2574 0 2 0 0 0 wakeup-0=2
cc1-30568 0 1 0 0 0 wakeup-0=1
cc1-2679 0 6 0 0 0 wakeup-0=6
sh-13747 0 12 0 0 0 wakeup-0=12
cc1-22193 0 18 0 0 0 wakeup-0=18
cc1-30725 0 2 0 0 0 wakeup-0=2
as-4392 0 2 0 0 0 wakeup-0=2
cc1-28180 0 14 0 0 0 wakeup-0=14
cc1-13697 0 2 0 0 0 wakeup-0=2
cc1-22207 0 8 0 0 0 wakeup-0=8
cc1-15270 0 179 0 0 0 wakeup-0=179
cc1-22011 0 82 0 0 0 wakeup-0=82
cp-14682 0 1 0 0 0 wakeup-0=1
as-11926 0 2 0 0 0 wakeup-0=2
cc1-6016 0 5 0 0 0 wakeup-0=5
make-18554 0 13 0 0 0 wakeup-0=13
cc1-8292 0 12 0 0 0 wakeup-0=12
make-24381 0 1 0 0 0 wakeup-1=1
date-18681 0 33 0 0 0 wakeup-0=33
cc1-32276 0 1 0 0 0 wakeup-0=1
timestamp-outpu-2809 0 253 0 0 0 wakeup-0=240 wakeup-1=13
date-18624 0 7 0 0 0 wakeup-0=7
cc1-30960 0 9 0 0 0 wakeup-0=9
cc1-4014 0 1 0 0 0 wakeup-0=1
cc1-30706 0 22 0 0 0 wakeup-0=22
uname-3942 4 1 306 0 17 direct-9=4 wakeup-9=1
cc1-28207 0 1 0 0 0 wakeup-0=1
cc1-30563 0 9 0 0 0 wakeup-0=9
cc1-22214 0 10 0 0 0 wakeup-0=10
cc1-28221 0 11 0 0 0 wakeup-0=11
cc1-28123 0 6 0 0 0 wakeup-0=6
kswapd0-311 0 7 357302 0 34233 wakeup-0=7
cc1-5988 0 7 0 0 0 wakeup-0=7
as-30734 0 161 0 0 0 wakeup-0=161
cc1-22004 0 45 0 0 0 wakeup-0=45
date-4590 0 4 0 0 0 wakeup-0=4
cc1-15279 0 213 0 0 0 wakeup-0=213
date-30735 0 1 0 0 0 wakeup-0=1
cc1-30583 0 4 0 0 0 wakeup-0=4
cc1-32324 0 2 0 0 0 wakeup-0=2
cc1-23933 0 3 0 0 0 wakeup-0=3
cc1-22001 0 36 0 0 0 wakeup-0=36
bench-stresshig-3942 287 287 80186 6295 12196 direct-9=287 wakeup-9=287
cc1-28170 0 7 0 0 0 wakeup-0=7
date-7932 0 92 0 0 0 wakeup-0=92
cc1-22222 0 6 0 0 0 wakeup-0=6
cc1-32334 0 16 0 0 0 wakeup-0=16
cc1-2690 0 6 0 0 0 wakeup-0=6
cc1-30733 0 9 0 0 0 wakeup-0=9
cc1-32298 0 2 0 0 0 wakeup-0=2
cc1-13743 0 18 0 0 0 wakeup-0=18
cc1-22186 0 4 0 0 0 wakeup-0=4
cc1-28214 0 11 0 0 0 wakeup-0=11
cc1-13735 0 1 0 0 0 wakeup-0=1
updatedb-8173 0 18 0 0 0 wakeup-0=18
cc1-13750 0 3 0 0 0 wakeup-0=3
cat-2808 0 2 0 0 0 wakeup-0=2
cc1-15277 0 169 0 0 0 wakeup-0=169
date-18317 0 1 0 0 0 wakeup-0=1
cc1-15274 0 197 0 0 0 wakeup-0=197
cc1-30732 0 1 0 0 0 wakeup-0=1

Kswapd Kswapd Order Pages Pages Pages
Instance Wakeups Re-wakeup Scanned Sync-IO ASync-IO
kswapd0-311 91 24 357302 0 34233 wake-0=31 wake-1=1 wake-9=59 rewake-0=10 rewake-1=1 rewake-9=13

Summary
Direct reclaims: 291
Direct reclaim pages scanned: 437794
Direct reclaim write sync I/O: 6295
Direct reclaim write async I/O: 46446
Wake kswapd requests: 2152
Time stalled direct reclaim: 519.163009000002 seconds

Kswapd wakeups: 91
Kswapd pages scanned: 357302
Kswapd reclaim write sync I/O: 0
Kswapd reclaim write async I/O: 34233
Time kswapd awake: 5282.749757 seconds

Signed-off-by: Mel Gorman <[email protected]>
---
.../trace/postprocess/trace-vmscan-postprocess.pl | 623 ++++++++++++++++++++
1 files changed, 623 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..d415764
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,623 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+# --read-procstat If the trace lacks process info, get it from /proc
+# --ignore-pid Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <[email protected]>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2;
+use constant MM_VMSCAN_KSWAPD_WAKE => 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP => 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6;
+use constant MM_VMSCAN_LRU_ISOLATE => 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC => 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC => 9;
+use constant EVENT_UNKNOWN => 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN => 15;
+use constant STATE_DIRECT_ORDER => 16;
+use constant STATE_KSWAPD_BEGIN => 17;
+use constant STATE_KSWAPD_ORDER => 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY => 19;
+use constant HIGH_KSWAPD_LATENCY => 20;
+use constant HIGH_KSWAPD_REWAKEUP => 21;
+use constant HIGH_NR_SCANNED => 22;
+use constant HIGH_NR_TAKEN => 23;
+use constant HIGH_NR_RECLAIM => 24;
+use constant HIGH_NR_CONTIG_DIRTY => 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+ my $current_time = time;
+ if ($current_time - 2 > $sigint_received) {
+ print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+ $sigint_report = 1;
+ } else {
+ if (!$sigint_exit) {
+ print "Second SIGINT received quickly, exiting\n";
+ }
+ $sigint_exit++;
+ }
+
+ if ($sigint_exit > 3) {
+ print "Many SIGINTs received, exiting now without report\n";
+ exit;
+ }
+
+ $sigint_received = $current_time;
+ $sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+ 'ignore-pid' => \$opt_ignorepid,
+ 'read-procstat' => \$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regex's
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regex
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+# (process_pid) (cpus ) ( time ) (tpoint ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+ my $event = shift;
+ my $default = shift;
+ my $regex;
+
+ # Read the event format or use the default
+ if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+ print("WARNING: Event $event format string not found\n");
+ return $default;
+ } else {
+ my $line;
+ while (!eof(FORMAT)) {
+ $line = <FORMAT>;
+ $line =~ s/, REC->.*//;
+ if ($line =~ /^print fmt:\s"(.*)".*/) {
+ $regex = $1;
+ $regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+ $regex =~ s/%p/\([0-9a-f]*\)/g;
+ $regex =~ s/%d/\([-0-9]*\)/g;
+ $regex =~ s/%ld/\([-0-9]*\)/g;
+ $regex =~ s/%lu/\([0-9]*\)/g;
+ }
+ }
+ }
+
+ # Can't handle the print_flags stuff but in the context of this
+ # script, it really doesn't matter
+ $regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+ # Verify fields are in the right order
+ my $tuple;
+ foreach $tuple (split /\s/, $regex) {
+ my ($key, $value) = split(/=/, $tuple);
+ my $expected = shift;
+ if ($key ne $expected) {
+ print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+ $regex =~ s/$key=\((.*)\)/$key=$1/;
+ }
+ }
+
+ if (defined shift) {
+ die("Fewer fields than expected in format");
+ }
+
+ return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+ "vmscan/mm_vmscan_direct_reclaim_begin",
+ $regex_direct_begin_default,
+ "order", "may_writepage",
+ "gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+ "vmscan/mm_vmscan_direct_reclaim_end",
+ $regex_direct_end_default,
+ "nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+ "vmscan/mm_vmscan_kswapd_wake",
+ $regex_kswapd_wake_default,
+ "nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+ "vmscan/mm_vmscan_kswapd_sleep",
+ $regex_kswapd_sleep_default,
+ "nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+ "vmscan/mm_vmscan_wakeup_kswapd",
+ $regex_wakeup_kswapd_default,
+ "nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+ "vmscan/mm_vmscan_lru_isolate",
+ $regex_lru_isolate_default,
+ "isolate_mode", "order",
+ "nr_requested", "nr_scanned", "nr_taken",
+ "contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+ "vmscan/mm_vmscan_lru_shrink_inactive",
+ $regex_lru_shrink_inactive_default,
+ "nid", "zid",
+ "lru",
+ "nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+ "vmscan/mm_vmscan_lru_shrink_active",
+ $regex_lru_shrink_active_default,
+ "nid", "zid",
+ "lru",
+ "nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+ "vmscan/mm_vmscan_writepage",
+ $regex_writepage_default,
+ "page", "pfn", "sync_io");
+
+sub read_statline($) {
+ my $pid = $_[0];
+ my $statline;
+
+ if (open(STAT, "/proc/$pid/stat")) {
+ $statline = <STAT>;
+ close(STAT);
+ }
+
+ if ($statline eq '') {
+ $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+ }
+
+ return $statline;
+}
+
+sub guess_process_pid($$) {
+ my $pid = $_[0];
+ my $statline = $_[1];
+
+ if ($pid == 0) {
+ return "swapper-0";
+ }
+
+ if ($statline !~ /$regex_statname/o) {
+ die("Failed to math stat line for process name :: $statline");
+ }
+ return "$1-$pid";
+}
+
+# Convert sec.usec timestamp format
+sub timestamp_to_ms($) {
+ my $timestamp = $_[0];
+
+ my ($sec, $usec) = split (/\./, $timestamp);
+ return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+ my $traceevent;
+ my $process_pid;
+ my $cpus;
+ my $timestamp;
+ my $tracepoint;
+ my $details;
+ my $statline;
+
+ # Read each line of the event log
+EVENT_PROCESS:
+ while ($traceevent = <STDIN>) {
+ if ($traceevent =~ /$regex_traceevent/o) {
+ $process_pid = $1;
+ $timestamp = $3;
+ $tracepoint = $4;
+
+ $process_pid =~ /(.*)-([0-9]*)$/;
+ my $process = $1;
+ my $pid = $2;
+
+ if ($process eq "") {
+ $process = $last_procmap{$pid};
+ $process_pid = "$process-$pid";
+ }
+ $last_procmap{$pid} = $process;
+
+ if ($opt_read_procstat) {
+ $statline = read_statline($pid);
+ if ($opt_read_procstat && $process eq '') {
+ $process_pid = guess_process_pid($pid, $statline);
+ }
+ }
+ } else {
+ next;
+ }
+
+ # Perl Switch() sucks majorly
+ if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+ $timestamp = timestamp_to_ms($timestamp);
+ $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+ $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+ $details = $5;
+ if ($details !~ /$regex_direct_begin/o) {
+ print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+ print " $details\n";
+ print " $regex_direct_begin\n";
+ next;
+ }
+ my $order = $1;
+ $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+ $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+ } elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+ # Count the event itself
+ my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+ $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+ # Record how long direct reclaim took this time
+ $timestamp = timestamp_to_ms($timestamp);
+ my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+ my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+ $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+ } elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+ $details = $5;
+ if ($details !~ /$regex_kswapd_wake/o) {
+ print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+ print " $details\n";
+ print " $regex_kswapd_wake\n";
+ next;
+ }
+
+ my $order = $2;
+ $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+ if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+ $timestamp = timestamp_to_ms($timestamp);
+ $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+ $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+ $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+ } else {
+ $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+ $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+ }
+ } elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+ # Count the event itself
+ my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+ $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+ # Record how long kswapd was awake
+ $timestamp = timestamp_to_ms($timestamp);
+ my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+ my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+ $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+ $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+ } elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+ $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+ $details = $5;
+ if ($details !~ /$regex_wakeup_kswapd/o) {
+ print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+ print " $details\n";
+ print " $regex_wakeup_kswapd\n";
+ next;
+ }
+ my $order = $3;
+ $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+ } elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+ $details = $5;
+ if ($details !~ /$regex_lru_isolate/o) {
+ print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+ print " $details\n";
+ print " $regex_lru_isolate/o\n";
+ next;
+ }
+ my $nr_scanned = $4;
+ my $nr_contig_dirty = $7;
+ $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+ $perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+ } elsif ($tracepoint eq "mm_vmscan_writepage") {
+ $details = $5;
+ if ($details !~ /$regex_writepage/o) {
+ print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+ print " $details\n";
+ print " $regex_writepage\n";
+ next;
+ }
+
+ my $sync_io = $3;
+ if ($sync_io) {
+ $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+ } else {
+ $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+ }
+ } else {
+ $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+ }
+
+ if ($sigint_pending) {
+ last EVENT_PROCESS;
+ }
+ }
+}
+
+sub dump_stats {
+ my $hashref = shift;
+ my %stats = %$hashref;
+
+ # Dump per-process stats
+ my $process_pid;
+ my $max_strlen = 0;
+
+ # Get the maximum process name
+ foreach $process_pid (keys %perprocesspid) {
+ my $len = length($process_pid);
+ if ($len > $max_strlen) {
+ $max_strlen = $len;
+ }
+ }
+ $max_strlen += 2;
+
+ # Print out latencies
+ if (!$opt_ignorepid) {
+ printf("\n");
+ printf("Reclaim latencies expressed as order-latency_in_ms\n");
+ foreach $process_pid (keys %stats) {
+
+ if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+ !$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+ next;
+ }
+
+ printf "%-" . $max_strlen . "s ", $process_pid;
+ my $index = 0;
+ while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+ defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+ if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+ printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+ my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+ $total_direct_latency += $latency;
+ } else {
+ printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+ my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+ $total_kswapd_latency += $latency;
+ }
+ $index++;
+ }
+ print "\n";
+ }
+ }
+
+ # Print out process activity
+ printf("\n");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Sync-IO", "ASync-IO");
+ foreach $process_pid (keys %stats) {
+
+ if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+ next;
+ }
+
+ $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+ $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+ $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+ $total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+ $total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+ printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u",
+ $process_pid,
+ $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+ $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+ $stats{$process_pid}->{HIGH_NR_SCANNED},
+ $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+ $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+ if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+ print " ";
+ for (my $order = 0; $order < 20; $order++) {
+ my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+ if ($count != 0) {
+ print "direct-$order=$count ";
+ }
+ }
+ }
+ if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+ print " ";
+ for (my $order = 0; $order < 20; $order++) {
+ my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+ if ($count != 0) {
+ print "wakeup-$order=$count ";
+ }
+ }
+ }
+ if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+ print " ";
+ my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+ if ($count != 0) {
+ print "contig-dirty=$count ";
+ }
+ }
+
+ print "\n";
+ }
+
+ # Print out kswapd activity
+ printf("\n");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+ foreach $process_pid (keys %stats) {
+
+ if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+ next;
+ }
+
+ $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+ $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+ $total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+ $total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+ printf("%-" . $max_strlen . "s %8d %10d %8u %8i %8u",
+ $process_pid,
+ $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+ $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+ $stats{$process_pid}->{HIGH_NR_SCANNED},
+ $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+ $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+ if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+ print " ";
+ for (my $order = 0; $order < 20; $order++) {
+ my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+ if ($count != 0) {
+ print "wake-$order=$count ";
+ }
+ }
+ }
+ if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+ print " ";
+ for (my $order = 0; $order < 20; $order++) {
+ my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+ if ($count != 0) {
+ print "rewake-$order=$count ";
+ }
+ }
+ }
+ printf("\n");
+ }
+
+ # Print out summaries
+ $total_direct_latency /= 1000;
+ $total_kswapd_latency /= 1000;
+ print "\nSummary\n";
+ print "Direct reclaims: $total_direct_reclaim\n";
+ print "Direct reclaim pages scanned: $total_direct_nr_scanned\n";
+ print "Direct reclaim write sync I/O: $total_direct_writepage_sync\n";
+ print "Direct reclaim write async I/O: $total_direct_writepage_async\n";
+ print "Wake kswapd requests: $total_wakeup_kswapd\n";
+ print "Time stalled direct reclaim: $total_direct_latency seconds\n";
+ print "\n";
+ print "Kswapd wakeups: $total_kswapd_wake\n";
+ print "Kswapd pages scanned: $total_kswapd_nr_scanned\n";
+ print "Kswapd reclaim write sync I/O: $total_kswapd_writepage_sync\n";
+ print "Kswapd reclaim write async I/O: $total_kswapd_writepage_async\n";
+ print "Time kswapd awake: $total_kswapd_latency seconds\n";
+}
+
+sub aggregate_perprocesspid() {
+ my $process_pid;
+ my $process;
+ undef %perprocess;
+
+ foreach $process_pid (keys %perprocesspid) {
+ $process = $process_pid;
+ $process =~ s/-([0-9])*$//;
+ if ($process eq '') {
+ $process = "NO_PROCESS_NAME";
+ }
+
+ $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+ $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+ $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+ $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+ $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+
+ for (my $order = 0; $order < 20; $order++) {
+ $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+ $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+ $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+ }
+ }
+}
+
+sub report() {
+ if (!$opt_ignorepid) {
+ dump_stats(\%perprocesspid);
+ } else {
+ aggregate_perprocesspid();
+ dump_stats(\%perprocess);
+ }
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+ my $sigint_processed;
+ do {
+ $sigint_processed = 0;
+ process_events();
+
+ # Handle pending signals if any
+ if ($sigint_pending) {
+ my $current_time = time;
+
+ if ($sigint_exit) {
+ print "Received exit signal\n";
+ $sigint_pending = 0;
+ }
+ if ($sigint_report) {
+ if ($current_time >= $sigint_received + 2) {
+ report();
+ $sigint_report = 0;
+ $sigint_pending = 0;
+ $sigint_processed = 1;
+ }
+ }
+ }
+ } while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
--
1.7.1

2010-06-08 09:03:38

by Mel Gorman

Subject: [PATCH 3/6] tracing, vmscan: Add trace event when a page is written

This patch adds a trace event for when page reclaim queues a page for IO and
records whether it is synchronous or asynchronous. Excessive synchronous
IO for a process can result in noticeable stalls during direct reclaim.
Excessive IO from page reclaim may indicate that the system is seriously
under-provisioned for the number of dirty pages that exist.
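
An event rendered through the format string added below looks roughly like
this (illustrative values only):

  mm_vmscan_writepage: page=ffffea0001a2f8c0 pfn=107326 sync_io=0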

Signed-off-by: Mel Gorman <[email protected]>
---
include/trace/events/vmscan.h | 23 +++++++++++++++++++++++
mm/vmscan.c | 2 ++
2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a331454..b26daa9 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -154,6 +154,29 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
__entry->nr_lumpy_dirty,
__entry->nr_lumpy_failed)
);
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+ TP_PROTO(struct page *page,
+ int sync_io),
+
+ TP_ARGS(page, sync_io),
+
+ TP_STRUCT__entry(
+ __field(struct page *, page)
+ __field(int, sync_io)
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->sync_io = sync_io;
+ ),
+
+ TP_printk("page=%p pfn=%lu sync_io=%d",
+ __entry->page,
+ page_to_pfn(__entry->page),
+ __entry->sync_io)
+);

#endif /* _TRACE_VMSCAN_H */

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25bf05a..58527c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -399,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
/* synchronous write or broken a_ops? */
ClearPageReclaim(page);
}
+ trace_mm_vmscan_writepage(page,
+ sync_writeback == PAGEOUT_IO_SYNC);
inc_zone_page_state(page, NR_VMSCAN_WRITE);
return PAGE_SUCCESS;
}
--
1.7.1

2010-06-08 09:03:40

by Mel Gorman

Subject: [PATCH 1/6] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim

This patch adds two trace events for kswapd waking up and going to sleep for
the purposes of tracking kswapd activity, and two trace events for direct
reclaim beginning and ending. The information can be used to work out how
much time a process or the system is spending on the reclamation of pages
and, in the case of direct reclaim, how many pages were reclaimed for that
process. High-frequency triggering of these events could point to memory
pressure problems.
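
For reference, the events render roughly as follows (illustrative values
only):

  mm_vmscan_kswapd_wake: nid=0 order=9
  mm_vmscan_kswapd_sleep: nid=0
  mm_vmscan_direct_reclaim_begin: order=9 may_writepage=1 gfp_flags=GFP_HIGHUSER_MOVABLE
  mm_vmscan_direct_reclaim_end: nr_reclaimed=512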

Signed-off-by: Mel Gorman <[email protected]>
---
include/trace/events/gfpflags.h | 37 +++++++++++++
include/trace/events/kmem.h | 38 +-------------
include/trace/events/vmscan.h | 115 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 24 +++++++--
4 files changed, 173 insertions(+), 41 deletions(-)
create mode 100644 include/trace/events/gfpflags.h
create mode 100644 include/trace/events/vmscan.h

diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we wil get:
+ *
+ * GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags) \
+ (flags) ? __print_flags(flags, "|", \
+ {(unsigned long)GFP_HIGHUSER_MOVABLE, "GFP_HIGHUSER_MOVABLE"}, \
+ {(unsigned long)GFP_HIGHUSER, "GFP_HIGHUSER"}, \
+ {(unsigned long)GFP_USER, "GFP_USER"}, \
+ {(unsigned long)GFP_TEMPORARY, "GFP_TEMPORARY"}, \
+ {(unsigned long)GFP_KERNEL, "GFP_KERNEL"}, \
+ {(unsigned long)GFP_NOFS, "GFP_NOFS"}, \
+ {(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \
+ {(unsigned long)GFP_NOIO, "GFP_NOIO"}, \
+ {(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \
+ {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \
+ {(unsigned long)__GFP_IO, "GFP_IO"}, \
+ {(unsigned long)__GFP_COLD, "GFP_COLD"}, \
+ {(unsigned long)__GFP_NOWARN, "GFP_NOWARN"}, \
+ {(unsigned long)__GFP_REPEAT, "GFP_REPEAT"}, \
+ {(unsigned long)__GFP_NOFAIL, "GFP_NOFAIL"}, \
+ {(unsigned long)__GFP_NORETRY, "GFP_NORETRY"}, \
+ {(unsigned long)__GFP_COMP, "GFP_COMP"}, \
+ {(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \
+ {(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \
+ {(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \
+ {(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \
+ {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
+ {(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"} \
+ ) : "GFP_NOWAIT"
+
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@

#include <linux/types.h>
#include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- * GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags) \
- (flags) ? __print_flags(flags, "|", \
- {(unsigned long)GFP_HIGHUSER_MOVABLE, "GFP_HIGHUSER_MOVABLE"}, \
- {(unsigned long)GFP_HIGHUSER, "GFP_HIGHUSER"}, \
- {(unsigned long)GFP_USER, "GFP_USER"}, \
- {(unsigned long)GFP_TEMPORARY, "GFP_TEMPORARY"}, \
- {(unsigned long)GFP_KERNEL, "GFP_KERNEL"}, \
- {(unsigned long)GFP_NOFS, "GFP_NOFS"}, \
- {(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \
- {(unsigned long)GFP_NOIO, "GFP_NOIO"}, \
- {(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \
- {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \
- {(unsigned long)__GFP_IO, "GFP_IO"}, \
- {(unsigned long)__GFP_COLD, "GFP_COLD"}, \
- {(unsigned long)__GFP_NOWARN, "GFP_NOWARN"}, \
- {(unsigned long)__GFP_REPEAT, "GFP_REPEAT"}, \
- {(unsigned long)__GFP_NOFAIL, "GFP_NOFAIL"}, \
- {(unsigned long)__GFP_NORETRY, "GFP_NORETRY"}, \
- {(unsigned long)__GFP_COMP, "GFP_COMP"}, \
- {(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \
- {(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \
- {(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \
- {(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \
- {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
- {(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"} \
- ) : "GFP_NOWAIT"
+#include "gfpflags.h"

DECLARE_EVENT_CLASS(kmem_alloc,

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f76521f
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,115 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+ TP_PROTO(int nid),
+
+ TP_ARGS(nid),
+
+ TP_STRUCT__entry(
+ __field( int, nid )
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ ),
+
+ TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+ TP_PROTO(int nid, int order),
+
+ TP_ARGS(nid, order),
+
+ TP_STRUCT__entry(
+ __field( int, nid )
+ __field( int, order )
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->order = order;
+ ),
+
+ TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+ TP_PROTO(int nid, int zid, int order),
+
+ TP_ARGS(nid, zid, order),
+
+ TP_STRUCT__entry(
+ __field( int, nid )
+ __field( int, zid )
+ __field( int, order )
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->zid = zid;
+ __entry->order = order;
+ ),
+
+ TP_printk("nid=%d zid=%d order=%d",
+ __entry->nid,
+ __entry->zid,
+ __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+ TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+ TP_ARGS(order, may_writepage, gfp_flags),
+
+ TP_STRUCT__entry(
+ __field( int, order )
+ __field( int, may_writepage )
+ __field( gfp_t, gfp_flags )
+ ),
+
+ TP_fast_assign(
+ __entry->order = order;
+ __entry->may_writepage = may_writepage;
+ __entry->gfp_flags = gfp_flags;
+ ),
+
+ TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+ __entry->order,
+ __entry->may_writepage,
+ show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+ TP_PROTO(unsigned long nr_reclaimed),
+
+ TP_ARGS(nr_reclaimed),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, nr_reclaimed )
+ ),
+
+ TP_fast_assign(
+ __entry->nr_reclaimed = nr_reclaimed;
+ ),
+
+ TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..6bfb579 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@

#include "internal.h"

+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -1886,6 +1889,7 @@ out:
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
+ unsigned long nr_reclaimed;
struct scan_control sc = {
.gfp_mask = gfp_mask,
.may_writepage = !laptop_mode,
@@ -1898,7 +1902,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.nodemask = nodemask,
};

- return do_try_to_free_pages(zonelist, &sc);
+ trace_mm_vmscan_direct_reclaim_begin(order,
+ sc.may_writepage,
+ gfp_mask);
+
+ nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+ trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+ return nr_reclaimed;
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2297,9 +2309,10 @@ static int kswapd(void *p)
* premature sleep. If not, then go fully
* to sleep until explicitly woken up
*/
- if (!sleeping_prematurely(pgdat, order, remaining))
+ if (!sleeping_prematurely(pgdat, order, remaining)) {
+ trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
schedule();
- else {
+ } else {
if (remaining)
count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
else
@@ -2319,8 +2332,10 @@ static int kswapd(void *p)
* We can speed up thawing tasks if we don't call balance_pgdat
* after returning from the refrigerator
*/
- if (!ret)
+ if (!ret) {
+ trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
balance_pgdat(pgdat, order);
+ }
}
return 0;
}
@@ -2340,6 +2355,7 @@ void wakeup_kswapd(struct zone *zone, int order)
return;
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
+ trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
return;
if (!waitqueue_active(&pgdat->kswapd_wait))
--
1.7.1

2010-06-08 09:08:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 08, 2010 at 10:02:19AM +0100, Mel Gorman wrote:
> seeky patterns. The second is that direct reclaim calling the filesystem
> splices two potentially deep call paths together and potentially overflows
> the stack on complex storage or filesystems. This series is an early draft
> at tackling both of these problems and is in three stages.

Btw, one more thing came up when I discussed the issue again with Dave
recently:

- we also need to care about ->releasepage. At least for XFS it
can end up in the same deep allocator chain as ->writepage because
it does all the extent state conversions, even if it doesn't
start I/O. I haven't managed yet to decode the ext4/btrfs codepaths
for ->releasepage yet to figure out how they release a page that
covers a delayed allocated or unwritten range.

2010-06-08 09:28:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 08, 2010 at 05:08:11AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 08, 2010 at 10:02:19AM +0100, Mel Gorman wrote:
> > seeky patterns. The second is that direct reclaim calling the filesystem
> > splices two potentially deep call paths together and potentially overflows
> > the stack on complex storage or filesystems. This series is an early draft
> > at tackling both of these problems and is in three stages.
>
> Btw, one more thing came up when I discussed the issue again with Dave
> recently:
>
> - we also need to care about ->releasepage. At least for XFS it
> can end up in the same deep allocator chain as ->writepage because
> it does all the extent state conversions, even if it doesn't
> start I/O.

Dang.

> I haven't managed yet to decode the ext4/btrfs codepaths
> for ->releasepage yet to figure out how they release a page that
> covers a delayed allocated or unwritten range.
>

If ext4/btrfs are also very deep call-chains and this series is going more
or less the right direction, then avoiding calling ->releasepage from direct
reclaim is one, somewhat unfortunate, option. The second is to avoid it on
a per-filesystem basis for direct reclaim using PF_MEMALLOC to detect
reclaimers and PF_KSWAPD to tell the difference between direct
reclaimers and kswapd.
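
To illustrate that distinction, here is a minimal sketch of the sort of
check it implies (the helper name is invented; only the PF_MEMALLOC and
PF_KSWAPD flags come from the suggestion above):

#include <linux/sched.h>        /* current, PF_MEMALLOC, PF_KSWAPD */

/*
 * Sketch only: a direct reclaimer runs with PF_MEMALLOC set but
 * PF_KSWAPD clear, so a filesystem could refuse the deep allocator
 * work in ->releasepage for that case while still doing it for
 * kswapd and for ordinary callers.
 */
static inline bool releasepage_may_do_allocator_work(void)
{
        if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC)
                return false;   /* direct reclaim: keep the stack shallow */
        return true;            /* kswapd or a normal caller */
}

Whether such a check belongs in each filesystem or in the VM itself is
part of the question above.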

Either way, these pages could be treated similar to dirty pages on the
dirty_pages list.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-09 02:56:42

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, 8 Jun 2010 10:02:19 +0100
Mel Gorman <[email protected]> wrote:

> I finally got a chance last week to visit the topic of direct reclaim
> avoiding the writing out pages. As it came up during discussions the last
> time, I also had a stab at making the VM writing ranges of pages instead
> of individual pages. I am not proposing for merging yet until I want to see
> what people think of this general direction and if we can agree on if this
> is the right one or not.
>
> To summarise, there are two big problems with page reclaim right now. The
> first is that page reclaim uses a_op->writepage to write a back back
> under the page lock which is inefficient from an IO perspective due to
> seeky patterns. The second is that direct reclaim calling the filesystem
> splices two potentially deep call paths together and potentially overflows
> the stack on complex storage or filesystems. This series is an early draft
> at tackling both of these problems and is in three stages.
>
> The first 4 patches are a forward-port of trace points that are partly
> based on trace points defined by Larry Woodman but never merged. They trace
> parts of kswapd, direct reclaim, LRU page isolation and page writeback. The
> tracepoints can be used to evaluate what is happening within reclaim and
> whether things are getting better or worse. They do not have to be part of
> the final series but might be useful during discussion.
>
> Patch 5 writes out contiguous ranges of pages where possible using
> a_ops->writepages. When writing a range, the inode is pinned and the page
> lock released before submitting to writepages(). This potentially generates
> a better IO pattern and it should avoid a lock inversion problem within the
> filesystem that wants the same page lock held by the VM. The downside with
> writing ranges is that the VM may not be generating more IO than necessary.
>
> Patch 6 prevents direct reclaim writing out pages at all and instead dirty
> pages are put back on the LRU. For lumpy reclaim, the caller will briefly
> wait on dirty pages to be written out before trying to reclaim the dirty
> pages a second time.
>
> The last patch increases the responsibility of kswapd somewhat because
> it's now cleaning pages on behalf of direct reclaimers but kswapd seemed
> a better fit than background flushers to clean pages as it knows where the
> pages needing cleaning are. As it's async IO, it should not cause kswapd to
> stall (at least until the queue is congested) but the order that pages are
> reclaimed on the LRU is altered. Dirty pages that would have been reclaimed
> by direct reclaimers are getting another lap on the LRU. The dirty pages
> could have been put on a dedicated list but this increased counter overhead
> and the number of lists and it is unclear if it is necessary.
>
> The series has survived performance and stress testing, particularly around
> high-order allocations on X86, X86-64 and PPC64. The results of the tests
> showed that while lumpy reclaim has a slightly lower success rate when
> allocating huge pages but it was still very acceptable rates, reclaim was
> a lot less disruptive and allocation latency was lower.
>
> Comments?
>

My concern is how memcg should work. IOW, what changes will be necessary for
memcg to work with the new vmscan logic as no-direct-writeback.

Maybe an ideal solution will be
- support buffered I/O tracking in I/O cgroup.
- flusher threads should work with I/O cgroup.
- memcg itself should support dirty ratio. and add a trigger to kick flusher
threads for dirty pages in a memcg.
But I know it's a long way.

How the new logic works with memcg ? Because memcg doesn't trigger kswapd,
memcg has to wait for a flusher thread make pages clean ?
Or memcg should have kswapd-for-memcg ?

Is it okay to call writeback directly when !scanning_global_lru() ?
memcg's reclaim routine is only called from specific positions, so, I guess
no stack problem. But we just have I/O pattern problem.

Thanks,
-Kame

2010-06-09 09:52:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Wed, Jun 09, 2010 at 11:52:11AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 8 Jun 2010 10:02:19 +0100
> > <SNIP>
> >
> > Patch 5 writes out contiguous ranges of pages where possible using
> > a_ops->writepages. When writing a range, the inode is pinned and the page
> > lock released before submitting to writepages(). This potentially generates
> > a better IO pattern and it should avoid a lock inversion problem within the
> > filesystem that wants the same page lock held by the VM. The downside with
> > writing ranges is that the VM may not be generating more IO than necessary.
> >
> > Patch 6 prevents direct reclaim writing out pages at all and instead dirty
> > pages are put back on the LRU. For lumpy reclaim, the caller will briefly
> > wait on dirty pages to be written out before trying to reclaim the dirty
> > pages a second time.
> >
> > The last patch increases the responsibility of kswapd somewhat because
> > it's now cleaning pages on behalf of direct reclaimers but kswapd seemed
> > a better fit than background flushers to clean pages as it knows where the
> > pages needing cleaning are. As it's async IO, it should not cause kswapd to
> > stall (at least until the queue is congested) but the order that pages are
> > reclaimed on the LRU is altered. Dirty pages that would have been reclaimed
> > by direct reclaimers are getting another lap on the LRU. The dirty pages
> > could have been put on a dedicated list but this increased counter overhead
> > and the number of lists and it is unclear if it is necessary.
> >
> > <SNIP>
>
> My concern is how memcg should work. IOW, what changes will be necessary for
> memcg to work with the new vmscan logic as no-direct-writeback.
>

At worst, memcg waits on background flushers to clean their pages but
obviously this could lead to stalls in containers if it happened to be full
of dirty pages.

Do you have test scenarios already setup for functional and performance
regression testing of containers? If so, can you run tests with this series
and see what sort of impact you find? I haven't done performance testing
with containers to date so I don't know what the expected values are.

> Maybe an ideal solution will be
> - support buffered I/O tracking in I/O cgroup.
> - flusher threads should work with I/O cgroup.
> - memcg itself should support dirty ratio. and add a trigger to kick flusher
> threads for dirty pages in a memcg.
> But I know it's a long way.
>

I'm not very familiar with memcg I'm afraid or its requirements so I am
having trouble guessing which of these would behave the best. You could take
a gamble on having memcg doing writeback in direct reclaim but you may run
into the same problem of overflowing stacks.

I'm not sure how a flusher thread would work just within a cgroup. It
would have to do a lot of searching to find the pages it needs
considering that it's looking at inodes rather than pages.

One possibility I guess would be to create a flusher-like thread if a direct
reclaimer finds that the dirty pages in the container are above the dirty
ratio. It would scan and clean all dirty pages in the container LRU on behalf
of dirty reclaimers.

Another possibility would be to have kswapd work in containers.
Specifically, if wakeup_kswapd() is called with a cgroup that it's added
to a list. kswapd gives priority to global reclaim but would
occasionally check if there is a container that needs kswapd on a
pending list and if so, work within the container. Is there a good
reason why kswapd does not work within container groups?

Finally, you could just allow reclaim within a memcg do writeback. Right
now, the check is based on current_is_kswapd() but I could create a helper
function that also checked for sc->mem_cgroup. Direct reclaim from the
page allocator never appears to work within a container group (which
raises questions in itself such as why a process in a container would
reclaim pages outside the container?) so it would remain safe.
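
As a rough sketch, the helper mentioned above might look something like
this (the name is invented; current_is_kswapd() and sc->mem_cgroup are
the existing checks referred to, and it would sit next to struct
scan_control in mm/vmscan.c):

/*
 * Sketch only: allow writeback from kswapd and from memcg (container)
 * reclaim, but not from direct reclaimers entering via the page
 * allocator.
 */
static bool reclaim_can_writeback(struct scan_control *sc)
{
        return current_is_kswapd() || sc->mem_cgroup != NULL;
}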

> How the new logic works with memcg ? Because memcg doesn't trigger kswapd,
> memcg has to wait for a flusher thread make pages clean ?

Right now, memcg has to wait for a flusher thread to make pages clean.

> Or memcg should have kswapd-for-memcg ?
>
> Is it okay to call writeback directly when !scanning_global_lru() ?
> memcg's reclaim routine is only called from specific positions, so, I guess
> no stack problem.

It's a judgement call from you really. I see that direct reclaimers do
not set mem_cgroup so it's down to - are you reasonably sure that all
the paths that reclaim based on a container are not deep? I looked
around for a while and the bulk appeared to be in the fault path so I
would guess "yes" but as I'm not familiar with the memcg implementation
I'll have missed a lot.

> But we just have I/O pattern problem.

True.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-10 00:43:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Wed, 9 Jun 2010 10:52:00 +0100
Mel Gorman <[email protected]> wrote:

> On Wed, Jun 09, 2010 at 11:52:11AM +0900, KAMEZAWA Hiroyuki wrote:

> > > <SNIP>
> >
> > My concern is how memcg should work. IOW, what changes will be necessary for
> > memcg to work with the new vmscan logic as no-direct-writeback.
> >
>
> At worst, memcg waits on background flushers to clean their pages but
> obviously this could lead to stalls in containers if it happened to be full
> of dirty pages.
>
yes.

> Do you have test scenarios already setup for functional and performance
> regression testing of containers? If so, can you run tests with this series
> and see what sort of impact you find? I haven't done performance testing
> with containers to date so I don't know what the expected values are.
>
Maybe kernbench is enough. I think it does enough write and malloc.
'Limit' size for the test depends on your host. I sometimes do this on
an 8-CPU SMP box.

# mount -t cgroup none /cgroups -o memory
# mkdir /cgroups/A
# echo $$ > /cgroups/A
# echo 300M > /cgroups/memory.limit_in_bytes
# make -j 8 or make -j 16

Comparing size of swap and speed will be interesting.
(Above 300M is enough small because my test machine has 24G memory.)

Or
# mount -t cgroup none /cgroups -o memory
# mkdir /cgroups/A
# echo $$ > /cgroups/A
# echo 50M > /cgroups/memory.limit_in_bytes
# dd if=/dev/zero of=./tmpfile bs=65536 count=100000

or something like that. When I tested the original patch for "avoiding
writeback" by Dave Chinner, I saw 2 OOMs in 10 tests.
Without the patch, I never saw an OOM.



> > Maybe an ideal solution will be
> > - support buffered I/O tracking in I/O cgroup.
> > - flusher threads should work with I/O cgroup.
> > - memcg itself should support dirty ratio. and add a trigger to kick flusher
> > threads for dirty pages in a memcg.
> > But I know it's a long way.
> >
>
> I'm not very familiar with memcg I'm afraid or its requirements so I am
> having trouble guessing which of these would behave the best. You could take
> a gamble on having memcg doing writeback in direct reclaim but you may run
> into the same problem of overflowing stacks.
>
maybe.

> I'm not sure how a flusher thread would work just within a cgroup. It
> would have to do a lot of searching to find the pages it needs
> considering that it's looking at inodes rather than pages.
>
yes. So, I(we) need some way for coloring inode for selectable writeback.
But people in this area are very nervous about performance (me too ;), I've
not found the answer yet.


> One possibility I guess would be to create a flusher-like thread if a direct
> reclaimer finds that the dirty pages in the container are above the dirty
> ratio. It would scan and clean all dirty pages in the container LRU on behalf
> of dirty reclaimers.
>
Yes, that's possible. But Andrew recommends not adding more threads. So,
I'll use workqueue if necessary.

> Another possibility would be to have kswapd work in containers.
> Specifically, if wakeup_kswapd() is called with a cgroup that it's added
> to a list. kswapd gives priority to global reclaim but would
> occasionally check if there is a container that needs kswapd on a
> pending list and if so, work within the container. Is there a good
> reason why kswapd does not work within container groups?
>
One reason is node v.s. memcg.
Because memcg doesn't limit memory placement, a container can contain pages
from the all nodes. So,it's a bit problem which node's kswapd we should run .
(but yes, maybe small problem.)
Another is memory-reclaim-prioirty between memcg.
(I don't want to add such a knob...)

Maybe it's time to consider about that.
Now, we're using kswapd for softlimit. I think similar hints for kswapd
should work. yes.

> Finally, you could just allow reclaim within a memcg do writeback. Right
> now, the check is based on current_is_kswapd() but I could create a helper
> function that also checked for sc->mem_cgroup. Direct reclaim from the
> page allocator never appears to work within a container group (which
> raises questions in itself such as why a process in a container would
> reclaim pages outside the container?) so it would remain safe.
>
isolate_lru_pages() for memcg finds only pages in a memcg ;)


> > How the new logic works with memcg ? Because memcg doesn't trigger kswapd,
> > memcg has to wait for a flusher thread make pages clean ?
>
> Right now, memcg has to wait for a flusher thread to make pages clean.
>
ok.


> > Or memcg should have kswapd-for-memcg ?
> >
> > Is it okay to call writeback directly when !scanning_global_lru() ?
> > memcg's reclaim routine is only called from specific positions, so, I guess
> > no stack problem.
>
> It's a judgement call from you really. I see that direct reclaimers do
> not set mem_cgroup so it's down to - are you reasonably sure that all
> the paths that reclaim based on a container are not deep?

One concerns is add_to_page_cache(). If it's called in deep stack, my assumption
is wrong.

> I looked
> around for a while and the bulk appeared to be in the fault path so I
> would guess "yes" but as I'm not familiar with the memcg implementation
> I'll have missed a lot.
>
> > But we just have I/O pattern problem.
>
> True.
>

Okay, I'll consider about how to kick kswapd via memcg or flusher-for-memcg.
Please go ahead as you want. I love good I/O pattern, too.

Thanks,
-Kame

2010-06-10 01:10:56

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Thu, Jun 10, 2010 at 09:38:42AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 9 Jun 2010 10:52:00 +0100
> Mel Gorman <[email protected]> wrote:
>
> > On Wed, Jun 09, 2010 at 11:52:11AM +0900, KAMEZAWA Hiroyuki wrote:
>
> > > > <SNIP>
> > >
> > > My concern is how memcg should work. IOW, what changes will be necessary for
> > > memcg to work with the new vmscan logic as no-direct-writeback.
> > >
> >
> > At worst, memcg waits on background flushers to clean their pages but
> > obviously this could lead to stalls in containers if it happened to be full
> > of dirty pages.
> >
>
> yes.
>

I'd like to have a better intuitive idea of how bad this really is. My current
understanding is that we don't know how bad it really is but it "feels vaguely
bad". I am focusing on the global scenario right now but only because I know
it's important and that any complex IO or filesystem can overflow the stack.

> > Do you have test scenarios already setup for functional and performance
> > regression testing of containers? If so, can you run tests with this series
> > and see what sort of impact you find? I haven't done performance testing
> > with containers to date so I don't know what the expected values are.
> >
> Maybe kernbench is enough.

I think kernbench only stresses reclaim in very specific scenarios. I wouldn't
think it is reliable for this sort of evaluation. I've been using sysbench
and the high-order stress allocation to evaluate IO and lumpy reclaim. I'm
missing a test on stack usage because I lack a test situation with complex
IO (I lack the resources to set up such a thing) or a complex FS (because I
haven't set up one).

> I think it does enough write and malloc.
> 'Limit' size for the test depends on your host. I sometimes do this on
> an 8-CPU SMP box.
>
> # mount -t cgroup none /cgroups -o memory
> # mkdir /cgroups/A
> # echo $$ > /cgroups/A
> # echo 300M > /cgroups/memory.limit_in_bytes
> # make -j 8 or make -j 16
>

That sort of scenario would be barely pushed by kernbench. For a single
kernel build, it's about 250-400M depending on the .config but it's still
a bit unreliable. Critically, it's not the sort of workload that would have
lots of long-lived mappings that would hurt a workload a lot if it was being
paged out.

> Comparing size of swap and speed will be interesting.
> (Above 300M is enough small because my test machine has 24G memory.)
>
> Or
> # mount -t cgroup none /cgroups -o memory
> # mkdir /cgroups/A
> # echo $$ > /cgroups/A
> # echo 50M > /cgroups/memory.limit_in_bytes
> # dd if=/dev/zero of=./tmpfile bs=65536 count=100000
>

That would push it more, but you're still talking about short-lived
processes with a lowish footprint that might not hurt in an obvious
manner in a writeback situation.

A more reliable measure would be sysbench sized to the container, I
imagine.

> or something like that. When I tested the original patch for "avoiding
> writeback" by Dave Chinner, I saw 2 OOMs in 10 tests.
> Without the patch, I never saw an OOM.
>

My patches build on Dave's approach somewhat by waiting in lumpy reclaim
for the IO to happen and in the general case by batching the dirty pages
together for kswapd.

> > > Maybe an ideal solution will be
> > > - support buffered I/O tracking in I/O cgroup.
> > > - flusher threads should work with I/O cgroup.
> > > - memcg itself should support dirty ratio. and add a trigger to kick flusher
> > > threads for dirty pages in a memcg.
> > > But I know it's a long way.
> > >
> >
> > I'm not very familiar with memcg I'm afraid or its requirements so I am
> > having trouble guessing which of these would behave the best. You could take
> > a gamble on having memcg doing writeback in direct reclaim but you may run
> > into the same problem of overflowing stacks.
> >
>
> maybe.
>

Maybe it would be reasonable as a starting point but we'd have to be
very careful of the stack usage figures? I'm leaning towards this
approach to start with.

I'm preparing another release that takes my two most important patches
about reclaim but also reduces usage in page reclaim (a combination of
two previously released series). In combination, it might be ok for the
memcg paths to reclaim pages from a stack perspective although the IO
pattern might still blow.

> > I'm not sure how a flusher thread would work just within a cgroup. It
> > would have to do a lot of searching to find the pages it needs
> > considering that it's looking at inodes rather than pages.
> >
>
> yes. So, I(we) need some way for coloring inode for selectable writeback.
> But people in this area are very nervous about performance (me too ;), I've
> not found the answer yet.
>

I worry that too much targetting of writing back a specific inode would
have other consequences.

>
> > One possibility I guess would be to create a flusher-like thread if a direct
> > reclaimer finds that the dirty pages in the container are above the dirty
> > ratio. It would scan and clean all dirty pages in the container LRU on behalf
> > of dirty reclaimers.
> >
>
> Yes, that's possible. But Andrew recommends not adding more threads. So,
> I'll use workqueue if necessary.
>

I also was not happy with adding more threads unless we had to. In a
sense, I preferred adding logic to kswapd that switched between global
and container reclaim, but too much of how it behaves depends on "how
many containers are there".

In this sense, I would lean more towards letting containers write back
pages in reclaim and see what the stack usage looks like.

> > Another possibility would be to have kswapd work in containers.
> > Specifically, if wakeup_kswapd() is called with a cgroup that it's added
> > to a list. kswapd gives priority to global reclaim but would
> > occasionally check if there is a container that needs kswapd on a
> > pending list and if so, work within the container. Is there a good
> > reason why kswapd does not work within container groups?
> >
>
> One reason is node v.s. memcg.
> Because memcg doesn't limit memory placement, a container can contain pages
> from the all nodes. So,it's a bit problem which node's kswapd we should run .
> (but yes, maybe small problem.)

I would hope it's small. I would expect a correlation between containers
and the nodes they have access to.

> Another is memory-reclaim-prioirty between memcg.
> (I don't want to add such a knob...)
>
> Maybe it's time to consider about that.
> Now, we're using kswapd for softlimit. I think similar hints for kswapd
> should work. yes.
>
> > Finally, you could just allow reclaim within a memcg do writeback. Right
> > now, the check is based on current_is_kswapd() but I could create a helper
> > function that also checked for sc->mem_cgroup. Direct reclaim from the
> > page allocator never appears to work within a container group (which
> > raises questions in itself such as why a process in a container would
> > reclaim pages outside the container?) so it would remain safe.
> >
>
> isolate_lru_pages() for memcg finds only pages in a memcg ;)
>

Ok.

>
> > > How the new logic works with memcg ? Because memcg doesn't trigger kswapd,
> > > memcg has to wait for a flusher thread make pages clean ?
> >
> > Right now, memcg has to wait for a flusher thread to make pages clean.
> >
>
> ok.
>
>
> > > Or memcg should have kswapd-for-memcg ?
> > >
> > > Is it okay to call writeback directly when !scanning_global_lru() ?
> > > memcg's reclaim routine is only called from specific positions, so, I guess
> > > no stack problem.
> >
> > It's a judgement call from you really. I see that direct reclaimers do
> > not set mem_cgroup so it's down to - are you reasonably sure that all
> > the paths that reclaim based on a container are not deep?
>
> One concerns is add_to_page_cache(). If it's called in deep stack, my assumption
> is wrong.
>

But I wouldn't expect that in general to be very deep. Maybe I'm wrong.

> > I looked
> > around for a while and the bulk appeared to be in the fault path so I
> > would guess "yes" but as I'm not familiar with the memcg implementation
> > I'll have missed a lot.
> >
> > > But we just have I/O pattern problem.
> >
> > True.
> >
>
> Okay, I'll consider about how to kick kswapd via memcg or flusher-for-memcg.
> Please go ahead as you want. I love good I/O pattern, too.
>

For the moment, I'm strongly leaning towards allowing memcg to write
back pages. The IO pattern might not be great, but it would be in line
with current behaviour. The critical question is really "is it possible
to overflow the stack?".

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-10 01:34:28

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Thu, 10 Jun 2010 02:10:35 +0100
Mel Gorman <[email protected]> wrote:
> > # mount -t cgroup none /cgroups -o memory
> > # mkdir /cgroups/A
> > # echo $$ > /cgroups/A
> > # echo 300M > /cgroups/memory.limit_in_bytes
> > # make -j 8 or make -j 16
> >
>
> That sort of scenario would be barely pushed by kernbench. For a single
> kernel build, it's about 250-400M depending on the .config but it's still
> a bit unreliable. Critically, it's not the sort of workload that would have
> lots of long-lived mappings that would hurt a workload a lot if it was being
> paged out.

You're right. In my defence, my usual concern is the amount of swap-out
and OOM under rapid/heavy pressure, because those are easily visible to
users. So I use short-lived process tests.

> Maybe it would be reasonable as a starting point but we'd have to be
> very careful of the stack usage figures? I'm leaning towards this
> approach to start with.
>
> I'm preparing another release that takes my two most important patches
> about reclaim but also reduces usage in page relcaim (a combination of
> two previously released series). In combination, it might be ok for the
> memcg paths to reclaim pages from a stack perspective although the IO
> pattern might still blow.

sounds nice.

> > > I'm not sure how a flusher thread would work just within a cgroup. It
> > > would have to do a lot of searching to find the pages it needs
> > > considering that it's looking at inodes rather than pages.
> > >
> >
> > yes. So, I(we) need some way for coloring inode for selectable writeback.
> > But people in this area are very nervous about performance (me too ;), I've
> > not found the answer yet.
> >
>
> I worry that too much targetting of writing back a specific inode would
> have other consequences.

I personally think this (writeback scheduling) is a job for the I/O cgroup.
So I guess what memcg can do is dirty-ratio limiting, at most. The user has
to set a well-balanced combination of memory and I/O cgroups.
Sorry for mixing the two topics together.


> > Okay, I'll consider about how to kick kswapd via memcg or flusher-for-memcg.
> > Please go ahead as you want. I love good I/O pattern, too.
> >
>
> For the moment, I'm strongly leaning towards allowing memcg to write
> back pages. The IO pattern might not be great, but it would be in line
> with current behaviour. The critical question is really "is it possible
> to overflow the stack?".
>

Because I don't use XFS, I don't have a reliable answer right now. But, at
least, memcg's memory reclaim will never be called on top of do_select(),
which uses 1000 bytes of stack.

We have to consider a long-term fix for I/O patterns under memcg, but
please update global reclaim first. We did it that way when splitting the
LRU into ANON and FILE. I don't want memcg to be a burden on making
vmscan.c better.

Thanks,
-Kame

2010-06-11 05:58:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, 8 Jun 2010 10:02:19 +0100 Mel Gorman <[email protected]> wrote:

> To summarise, there are two big problems with page reclaim right now. The
> first is that page reclaim uses a_op->writepage to write a back back
> under the page lock which is inefficient from an IO perspective due to
> seeky patterns.

No it isn't. If we have a pile of file-contiguous, disk-contiguous
dirty pages on the tail of the LRU then the single writepage()s will
work just fine due to request merging.



Look. This is getting very frustrating. I keep saying the same thing
and keep getting ignored. Once more:

WE BROKE IT!

PLEASE STOP WRITING CODE!

FIND OUT HOW WE BROKE IT!

Loud enough yet?

It used to be the case that only very small amounts of IO occurred in
page reclaim - the vast majority of writeback happened within
write()->balance_dirty_pages(). Then (and I think it was around 2.6.12)
we broke it, and page reclaim started doing lots of writeout.

So the thing to do is to either find out how we broke it and see if it
can be repaired, or change the VM so that it doesn't do so much
LRU-based writeout. Rather than fiddling around trying to make the
we-broke-it code run its brokenness faster.

2010-06-11 06:11:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible

On Tue, 8 Jun 2010 10:02:24 +0100 Mel Gorman <[email protected]> wrote:

> Page reclaim cleans individual pages using a_ops->writepage() because from
> the VM perspective, it is known that pages in a particular zone must be freed
> soon, it considers the target page to be the oldest and it does not want
> to wait while background flushers cleans other pages. From a filesystem
> perspective this is extremely inefficient as it generates a very seeky
> IO pattern leading to the perverse situation where it can take longer to
> clean all dirty pages than it would have otherwise.
>
> This patch recognises that there are cases where a number of pages
> belonging to the same inode are being written out. When this happens and
> writepages() is implemented, the range of pages will be written out with
> a_ops->writepages. The inode is pinned and the page lock released before
> submitting the range to the filesystem. While this potentially means that
> more pages are cleaned than strictly necessary, the expectation is that the
> filesystem will be able to writeout the pages more efficiently and improve
> overall performance.
>
> ...
>
> + /* Write single page */
> + switch (write_reclaim_page(cursor, mapping, PAGEOUT_IO_ASYNC)) {
> + case PAGE_KEEP:
> + case PAGE_ACTIVATE:
> + case PAGE_CLEAN:
> + unlock_page(cursor);
> + break;
> + case PAGE_SUCCESS:
> + break;
> + }
> + } else {
> + /* Grab inode under page lock before writing range */
> + struct inode *inode = igrab(mapping->host);
> + unlock_page(cursor);
> + if (inode) {
> + do_writepages(mapping, &wbc);
> + iput(inode);

Buggy.


I did this, umm ~8 years ago and ended up reverting it because it was
complex and didn't seem to buy us anything. Of course, that was before
we broke the VM and started writing out lots of LRU pages. That code
was better than your code - it grabbed the address_space and did
writearound around the target page.

The reason this code is buggy is that under extreme memory pressure
(<oldfart>the sort of testing nobody does any more</oldfart>) it can be
the case that this iput() is the final iput() on this inode.

Now go take a look at iput_final(), which I bet has never been executed
on this path in your testing. It takes a large number of high-level
VFS locks. Locks which cannot be taken from deep within page reclaim
without causing various deadlocks.

I did solve that problem before reverting it all but I forget how. By
holding a page lock to pin the address_space rather than igrab(),
perhaps. Go take a look - it was somewhere between 2.5.1 and 2.5.10 if
I vaguely recall correctly.

Or don't take a look - we shouldn't need to do any of this anyway.

2010-06-11 06:17:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 6/6] vmscan: Do not writeback pages in direct reclaim

On Tue, 8 Jun 2010 10:02:25 +0100 Mel Gorman <[email protected]> wrote:

> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
>
> This patch prevents direct reclaim writing back pages by not setting
> may_writepage in scan_control. Instead, dirty pages are placed back on the
> LRU lists for either background writing by the BDI threads or kswapd. If
> in direct lumpy reclaim and dirty pages are encountered, the process will
> kick the background flusher threads before trying again.
>

This wouldn't have worked at all well back in the days when you could
dirty all memory with MAP_SHARED. The balance_dirty_pages() calls on
the fault path will now save us but if for some reason we were ever to
revert those, we'd need to revert this change too, I suspect.


As it stands, it would be wildly incautious to make a change like
this without first working out why we're pulling so many dirty pages
off the LRU tail, and fixing that.

2010-06-11 12:33:41

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Thu, Jun 10, 2010 at 10:57:49PM -0700, Andrew Morton wrote:
> On Tue, 8 Jun 2010 10:02:19 +0100 Mel Gorman <[email protected]> wrote:
>
> > To summarise, there are two big problems with page reclaim right now. The
> > first is that page reclaim uses a_op->writepage to write a back back
> > under the page lock which is inefficient from an IO perspective due to
> > seeky patterns.
>
> No it isn't. If we have a pile of file-contiguous, disk-contiguous
> dirty pages on the tail of the LRU then the single writepage()s will
> work just fine due to request merging.
>

Ok, I was under the mistaken impression that filesystems wanted to be
given ranges of pages where possible. Considering that there has been no
reaction to the patch in question from the filesystem people cc'd, I'll
drop the problem for now.

>
>
> Look. This is getting very frustrating. I keep saying the same thing
> and keep getting ignored. Once more:
>
> WE BROKE IT!
>
> PLEASE STOP WRITING CODE!
>
> FIND OUT HOW WE BROKE IT!
>
> Loud enough yet?
>

Yep. I've started a new series of tests that capture the trace points
during each test to get some data on how many dirty pages are really
being written back. They take a long time to complete, unfortunately.

> It used to be the case that only very small amounts of IO occurred in
> page reclaim - the vast majority of writeback happened within
> write()->balance_dirty_pages(). Then (and I think it was around 2.6.12)
> we broke it, and page reclaim started doing lots of writeout.
>

Ok, I'll work out exactly how many dirty pages are being written back
then. The data I have at the moment covers the whole test, so I cannot
be certain if all the writeback happened during one stress test or
whether it's a common event.

> So the thing to do is to either find out how we broke it and see if it
> can be repaired, or change the VM so that it doesn't do so much
> LRU-based writeout. Rather than fiddling around trying to make the
> we-broke-it code run its brokenness faster.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-11 12:49:56

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible

On Thu, Jun 10, 2010 at 11:10:45PM -0700, Andrew Morton wrote:
> On Tue, 8 Jun 2010 10:02:24 +0100 Mel Gorman <[email protected]> wrote:
>
> > Page reclaim cleans individual pages using a_ops->writepage() because from
> > the VM perspective, it is known that pages in a particular zone must be freed
> > soon, it considers the target page to be the oldest and it does not want
> > to wait while background flushers cleans other pages. From a filesystem
> > perspective this is extremely inefficient as it generates a very seeky
> > IO pattern leading to the perverse situation where it can take longer to
> > clean all dirty pages than it would have otherwise.
> >
> > This patch recognises that there are cases where a number of pages
> > belonging to the same inode are being written out. When this happens and
> > writepages() is implemented, the range of pages will be written out with
> > a_ops->writepages. The inode is pinned and the page lock released before
> > submitting the range to the filesystem. While this potentially means that
> > more pages are cleaned than strictly necessary, the expectation is that the
> > filesystem will be able to writeout the pages more efficiently and improve
> > overall performance.
> >
> > ...
> >
> > + /* Write single page */
> > + switch (write_reclaim_page(cursor, mapping, PAGEOUT_IO_ASYNC)) {
> > + case PAGE_KEEP:
> > + case PAGE_ACTIVATE:
> > + case PAGE_CLEAN:
> > + unlock_page(cursor);
> > + break;
> > + case PAGE_SUCCESS:
> > + break;
> > + }
> > + } else {
> > + /* Grab inode under page lock before writing range */
> > + struct inode *inode = igrab(mapping->host);
> > + unlock_page(cursor);
> > + if (inode) {
> > + do_writepages(mapping, &wbc);
> > + iput(inode);
>
> Buggy.

It's buggy all right. Under heavy stress on one machine using XFS, it locks
up. I set up the XFS-based tests after I posted the series, which is why I
missed it.

>
> I did this, umm ~8 years ago and ended up reverting it because it was
> complex and didn't seem to buy us anything. Of course, that was before
> we broke the VM and started writing out lots of LRU pages. That code
> was better than your code - it grabbed the address_space and did
> writearound around the target page.
>

I considered duplicating the write-around of the target page but decided
that the VM had no idea whether the surrounding pages needed to be cleaned.
That's why this patch only considered ranges of pages the VM wanted to
clean now.

> The reason this code is buggy is that under extreme memory pressure
> (<oldfart>the sort of testing nobody does any more</oldfart>) it can be
> the case that this iput() is the final iput() on this inode.
>
> Now go take a look at iput_final(), which I bet has never been executed
> on this path in your testing.

I didn't check if iput_final was being hit or not. Certainly the lockup I
experienced was under heavy load when a lot of files were being created and
deleted, but I hadn't pinned down where it went wrong before this mail. I
think it was because I wasn't unlocking all the pages in the list properly.

> It takes a large number of high-level
> VFS locks. Locks which cannot be taken from deep within page reclaim
> without causing various deadlocks.
>

Can you explain this a bit more please? I can see the inode_lock is very
important in this path for example but am not seeing how page reclaim taking
it would cause a deadlock.

> I did solve that problem before reverting it all but I forget how. By
> holding a page lock to pin the address_space rather than igrab(),
> perhaps.

But this is what I did. That function has a list of locked pages. When I
call igrab(), the page is locked so the address_space should be pinned. I
unlock the page after I call igrab.

> Go take a look - it was somewhere between 2.5.1 and 2.5.10 if
> I vaguely recall correctly.
>
> Or don't take a look - we shouldn't need to do any of this anyway.
>

I'll take a closer look if there is real interest in having the VM use
writepages() but it sounds like it's a waste of time. I'll focus on

a) identifying how many dirty pages the VM is really writing back with
tracepoints
b) not using writepage from direct reclaim because it overflows the
stack

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-11 12:55:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 6/6] vmscan: Do not writeback pages in direct reclaim

On Thu, Jun 10, 2010 at 11:17:06PM -0700, Andrew Morton wrote:
> On Tue, 8 Jun 2010 10:02:25 +0100 Mel Gorman <[email protected]> wrote:
>
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> >
> > This patch prevents direct reclaim writing back pages by not setting
> > may_writepage in scan_control. Instead, dirty pages are placed back on the
> > LRU lists for either background writing by the BDI threads or kswapd. If
> > in direct lumpy reclaim and dirty pages are encountered, the process will
> > kick the background flusher threads before trying again.
> >
>
> This wouldn't have worked at all well back in the days when you could
> dirty all memory with MAP_SHARED.

Yes, it would have been a bucket of fail.

> The balance_dirty_pages() calls on
> the fault path will now save us but if for some reason we were ever to
> revert those, we'd need to revert this change too, I suspect.
>

Quite likely.

> As it stands, it would be wildly incautious to make a change like
> this without first working out why we're pulling so many dirty pages
> off the LRU tail, and fixing that.
>

Ok, I have a series prepared for testing that is in three parts.

Patches 1-4: tracepoints to gather how many dirty pages there really are
being written out on the LRU
Patches 5-10: reduce the stack usage in page reclaim
Patches 9-10: Avoid writing out pages from direct reclaim and instead
kicking background flushers to do the writing

Patches 1-4 on their own should give an accurate view of how many dirty
pages are really being written back and whether it's a real problem or not.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-11 16:25:33

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 6/6] vmscan: Do not writeback pages in direct reclaim

On Thu, Jun 10, 2010 at 11:17:06PM -0700, Andrew Morton wrote:
> As it stands, it would be wildly incautious to make a change like
> this without first working out why we're pulling so many dirty pages
> off the LRU tail, and fixing that.

Note that unlike the writepage vs writepages from kswapd which can
be fixed by the right tuning this is a black or white issue. Writeback
from direct reclaim will kill your stack if the caller happens to be
the wrong one, and just making it happen less often is not a fix - it
must not happen at all.

2010-06-11 16:27:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible

On Thu, Jun 10, 2010 at 11:10:45PM -0700, Andrew Morton wrote:
> I did this, umm ~8 years ago and ended up reverting it because it was
> complex and didn't seem to buy us anything. Of course, that was before
> we broke the VM and started writing out lots of LRU pages. That code
> was better than your code - it grabbed the address_space and did
> writearound around the target page.

> Or don't take a look - we shouldn't need to do any of this anyway.

Doing nearly 100% of the writepage from the flusher threads would
also be preferable from the filesystem point of view - getting I/O
from one thread helps to make it more local and work around all the
stupid I/O controller logic that tries to make our life difficult.

Of course getting rid of ->writepage from the AOPs API one day would
also be nice to simplify the filesystems code, but it's not that
important.

2010-06-11 16:29:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 08, 2010 at 10:28:14AM +0100, Mel Gorman wrote:
> > - we also need to care about ->releasepage. At least for XFS it
> > can end up in the same deep allocator chain as ->writepage because
> > it does all the extent state conversions, even if it doesn't
> > start I/O.
>
> Dang.
>
> > I haven't managed yet to decode the ext4/btrfs codepaths
> > for ->releasepage yet to figure out how they release a page that
> > covers a delayed allocated or unwritten range.
> >
>
> If ext4/btrfs are also very deep call-chains and this series is going more
> or less the right direction, then avoiding calling ->releasepage from direct
> reclaim is one, somewhat unfortunate, option. The second is to avoid it on
> a per-filesystem basis for direct reclaim using PF_MEMALLOC to detect
> reclaimers and PF_KSWAPD to tell the difference between direct
> reclaimers and kswapd.

I went through this a bit more and I can't actually hit that code in
XFS ->releasepage anymore. I've also audited the caller and can't see
how we could theoretically hit it anymore. Do the VM gurus know a case
where we would call ->releasepage on a page that's actually dirty and
hasn't been through block_invalidatepage before?

2010-06-11 16:30:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Fri, Jun 11, 2010 at 01:33:20PM +0100, Mel Gorman wrote:
> Ok, I was under the mistaken impression that filesystems wanted to be
> given ranges of pages where possible. Considering that there has been no
> reaction to the patch in question from the filesystem people cc'd, I'll
> drop the problem for now.

Yes, we'd prefer them if possible. Then again we'd really prefer to
get as much I/O as possible from the flusher threads, and not kswapd.

2010-06-11 17:44:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 6/6] vmscan: Do not writeback pages in direct reclaim

On Fri, 11 Jun 2010 12:25:23 -0400 Christoph Hellwig <[email protected]> wrote:

> On Thu, Jun 10, 2010 at 11:17:06PM -0700, Andrew Morton wrote:
> > As it stands, it would be wildly incautious to make a change like
> > this without first working out why we're pulling so many dirty pages
> > off the LRU tail, and fixing that.
>
> Note that unlike the writepage vs writepages from kswapd which can
> be fixed by the right tuning this is a black or white issue. Writeback
> from direct reclaim will kill your stack if the caller happens to be
> the wrong one, and just making it happen less often is not a fix - it
> must not happen at all.

Of course, but making a change like that in the current VM will cause a
large number of dirty pages to get refiled, so the impact of this
change on some workloads could be quite bad.

If, however, we can get things back to the state where few dirty pages
ever reach the tail of the LRU then the adverse impact of this change
will be much less.

2010-06-11 17:49:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 6/6] vmscan: Do not writeback pages in direct reclaim

On Fri, Jun 11, 2010 at 10:43:31AM -0700, Andrew Morton wrote:
> Of course, but making a change like that in the current VM will cause a
> large number of dirty pages to get refiled, so the impact of this
> change on some workloads could be quite bad.

Note that ext4, btrfs and xfs all error out on ->writepage from reclaim
context. That is both kswapd and direct reclaim because there is no way
to distinguish between the two. Things seem to work fine with these
filesystems, so the issue can't be _that_ bad. Of course reducing this
to just error out from direct reclaim, and fixing the VM to better
cope with it is even better.
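
For reference, one common shape of such a bail-out is roughly the
following (a sketch rather than any one filesystem's actual code; the
function name is made up, and whether a filesystem redirties the page or
returns an error differs):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/writeback.h>

/*
 * Sketch: ->writepage refusing to do work when called from reclaim.
 * PF_MEMALLOC is set for both kswapd and direct reclaim, which is why
 * the check cannot tell the two apart.  The page is redirtied and
 * unlocked so that the VM simply moves on.
 */
static int example_writepage(struct page *page, struct writeback_control *wbc)
{
        if (current->flags & PF_MEMALLOC) {
                redirty_page_for_writepage(wbc, page);
                unlock_page(page);
                return 0;
        }

        /* ... the normal writeback path would go here ... */
        unlock_page(page);
        return 0;
}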

2010-06-11 18:13:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 6/6] vmscan: Do not writeback pages in direct reclaim

On Fri, Jun 11, 2010 at 01:49:00PM -0400, Christoph Hellwig wrote:
> On Fri, Jun 11, 2010 at 10:43:31AM -0700, Andrew Morton wrote:
> > Of course, but making a change like that in the current VM will cause a
> > large number of dirty pages to get refiled, so the impact of this
> > change on some workloads could be quite bad.
>
> Note that ext4, btrfs and xfs all error out on ->writepage from reclaim
> context. That is both kswapd and direct reclaim because there is no way
> to distinguish between the two.

What's wrong with PF_KSWAPD?

> Things seem to work fine with these
> filesystems, so the issue can't be _that_ bad. Of course reducing this
> to just error out from direct reclaim, and fixing the VM to better
> cope with it is even better.
>

I have some preliminary figures, but tests are still ongoing. Right now,
it doesn't seem as bad as was expected. Only my ppc64 machine has finished
its tests, so here is what I found. The tests I used were kernbench,
iozone, simple-writeback and stress-highalloc.

This data is based on the tracepoints. Three kernels are tested on a new
patch stack (not posted yet but bits and pieces of it have)

traceonly   - Just the tracepoints
stackreduce - Reduces the stack usage of page reclaim in general.
              This is more or less a series originally posted over a
              month ago and picked up again in the interest of allowing
              kswapd to do writeback at some point in the future.
nodirect    - Avoid writing any pages from direct reclaim. This is the
              last patch from this series, juggled slightly.

kernbench FTrace Reclaim Statistics

traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 0 0 0
Direct reclaim pages scanned 0 0 0
Direct reclaim write sync I/O 0 0 0
Direct reclaim write async I/O 0 0 0
Wake kswapd requests 0 0 0
Kswapd wakeups 0 0 0
Kswapd pages scanned 0 0 0
Kswapd reclaim write sync I/O 0 0 0
Kswapd reclaim write async I/O 0 0 0
Time stalled direct reclaim 0.00 0.00 0.00
Time kswapd awake 0.00 0.00 0.00

No surprises, kernbench is not memory intensive so reclaim didn't happen


iozone FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 0 0 0
Direct reclaim pages scanned 0 0 0
Direct reclaim write sync I/O 0 0 0
Direct reclaim write async I/O 0 0 0
Wake kswapd requests 0 0 0
Kswapd wakeups 0 0 0
Kswapd pages scanned 0 0 0
Kswapd reclaim write sync I/O 0 0 0
Kswapd reclaim write async I/O 0 0 0
Time stalled direct reclaim 0.00 0.00 0.00
Time kswapd awake 0.00 0.00 0.00

Again, not very surprising. Memory pressure was not a factor for iozone.

simple-writeback FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 4098 2436 5670
Direct reclaim pages scanned 393664 215821 505483
Direct reclaim write sync I/O 0 0 0
Direct reclaim write async I/O 0 0 0
Wake kswapd requests 865097 728976 1036147
Kswapd wakeups 639 561 585
Kswapd pages scanned 11123648 10383929 10561818
Kswapd reclaim write sync I/O 0 0 0
Kswapd reclaim write async I/O 3595 0 19068
Time stalled direct reclaim 2843.74 2771.71 32.76
Time kswapd awake 347.58 8865.65 433.27

This is a dd-orientated benchmark that was intended to just generate IO.
On a 4-core machine it starts with 4 jobs. Each iteration of the test
increases the number of jobs until a total of 64 are running. The total
amount of data written is 4*PhysicalMemory. dd was run with conv=fsync so
the timing figures would be a bit more stable but unfortunately, the figures
from the VM with respect to reclaim are not very stable. The intention was
to create a lot of dirty data and see what fell out.

Interestingly, direct reclaim didn't write pages in any of the kernels and
kswapd was not writing out a crazy amount either, implying that in this test
at least, there were not many dirty pages on the LRU. Disabling writeback in
direct reclaim did mean that processes were stalled less but that is hardly
a surprise.

stress-highorder FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 2143 2184 847
Direct reclaim pages scanned 181293 191127 136593
Direct reclaim write sync I/O 13709 15617 0
Direct reclaim write async I/O 26686 28058 0
Wake kswapd requests 234 217 17271
Kswapd wakeups 200 192 145
Kswapd pages scanned 10810122 9822064 3104526
Kswapd reclaim write sync I/O 0 0 0
Kswapd reclaim write async I/O 790109 762967 236092
Time stalled direct reclaim 1956.76 1810.06 1395.76
Time kswapd awake 1171.50 1174.24 484.00

This test starts a number of simultaneous kernel compiles whose total size
exceeds physical memory and then tries to allocate as many huge pages as there
is physical memory. This stresses page reclaim, particularly lumpy reclaim.

As expected, with direct reclaim not able to write back, the IO counts
for it are 0 and it stalled less as you might expect. What was very
unexpected is that kswapd wrote fewer pages with direct reclaim disabled
than with it enabled.

I think with this test, much of the direct reclaim IO was due to lumpy
reclaim. So far, I'm not seeing as many dirty pages on the LRU as I was
expecting - even with lumpy reclaim in the mix. Does anyone have a test in mind
that is known to cause serious problems with dirty pages on the LRU?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-11 18:15:52

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Fri, Jun 11, 2010 at 12:29:12PM -0400, Christoph Hellwig wrote:
> On Tue, Jun 08, 2010 at 10:28:14AM +0100, Mel Gorman wrote:
> > > - we also need to care about ->releasepage. At least for XFS it
> > > can end up in the same deep allocator chain as ->writepage because
> > > it does all the extent state conversions, even if it doesn't
> > > start I/O.
> >
> > Dang.
> >
> > > I haven't managed yet to decode the ext4/btrfs codepaths
> > > for ->releasepage yet to figure out how they release a page that
> > > covers a delayed allocated or unwritten range.
> > >
> >
> > If ext4/btrfs are also very deep call-chains and this series is going more
> > or less the right direction, then avoiding calling ->releasepage from direct
> > reclaim is one, somewhat unfortunate, option. The second is to avoid it on
> > a per-filesystem basis for direct reclaim using PF_MEMALLOC to detect
> > reclaimers and PF_KSWAPD to tell the difference between direct
> > reclaimers and kswapd.
>
> I went through this a bit more and I can't actually hit that code in
> XFS ->releasepage anymore. I've also audited the caller and can't see
> how we could theoretically hit it anymore. Do the VM gurus know a case
> where we would call ->releasepage on a page that's actually dirty and
> hasn't been through block_invalidatepage before?
>

Not a clue I'm afraid as I haven't dealt much with the interactions
between VM and FS in the past. Nick?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-11 18:17:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Fri, Jun 11, 2010 at 12:30:26PM -0400, Christoph Hellwig wrote:
> On Fri, Jun 11, 2010 at 01:33:20PM +0100, Mel Gorman wrote:
> > Ok, I was under the mistaken impression that filesystems wanted to be
> > given ranges of pages where possible. Considering that there has been no
> > reaction to the patch in question from the filesystem people cc'd, I'll
> > drop the problem for now.
>
> Yes, we'd prefer them if possible. Then again we'd really prefer to
> get as much I/O as possible from the flusher threads, and not kswapd.
>

Ok, for the moment I'll put it on the maybe pile and drop it from the
series. We can revisit if a use is found for it and we're happy that
there wasn't some other bug leaving dirty pages on the LRU for too long.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-11 19:08:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible

On Fri, 11 Jun 2010 13:49:36 +0100
Mel Gorman <[email protected]> wrote:

> > It takes a large number of high-level
> > VFS locks. Locks which cannot be taken from deep within page reclaim
> > without causing various deadlocks.
> >
>
> Can you explain this a bit more please? I can see the inode_lock is very
> important in this path for example but am not seeing how page reclaim taking
> it would cause a deadlock.

iput_final() takes a lot more locks than inode_lock. It can get down
into truncate_inode_pages() and can run journal commits and does
lock_page() and presumably takes i_mutex somewhere. We'd need to check
all the fs-specific ->clear_inode, ->delete_inode, maybe others. It
can do a synchronous write_inode_now() in generic_detach_inode(). We
seem to run about half the kernel code under iput_final() :(

I don't recall specifically what deadlock we were hitting, and it being
eight years ago, it's not necessarily still there.

> > I did solve that problem before reverting it all but I forget how. By
> > holding a page lock to pin the address_space rather than igrab(),
> > perhaps.
>
> But this is what I did. That function has a list of locked pages. When I
> call igrab(), the page is locked so the address_space should be pinned. I
> unlock the page after I call igrab.

Right, so you end up with an inode/address_space which has no locked
pages and on which you hold a refcount. When that refcount gets
dropped with iput(), the code can run iput_final().

<grovels around for a while>

OK, 2.5.48's mm/page-writeback.c has:

/*
 * A library function, which implements the vm_writeback a_op. It's fairly
 * lame at this time. The idea is: the VM wants to liberate this page,
 * so we pass the page to the address_space and give the fs the opportunity
 * to write out lots of pages around this one. It allows extent-based
 * filesytems to do intelligent things. It lets delayed-allocate filesystems
 * perform better file layout. It lets the address_space opportunistically
 * write back disk-contiguous pages which are in other zones.
 *
 * FIXME: the VM wants to start I/O against *this* page. Because its zone
 * is under pressure. But this function may start writeout against a
 * totally different set of pages. Unlikely to be a huge problem, but if it
 * is, we could just writepage the page if it is still (PageDirty &&
 * !PageWriteback) (See below).
 *
 * Another option is to just reposition page->mapping->dirty_pages so we
 * *know* that the page will be written. That will work fine, but seems
 * unpleasant. (If the page is not for-sure on ->dirty_pages we're dead).
 * Plus it assumes that the address_space is performing writeback in
 * ->dirty_pages order.
 *
 * So. The proper fix is to leave the page locked-and-dirty and to pass
 * it all the way down.
 */
int generic_vm_writeback(struct page *page, struct writeback_control *wbc)
{
        struct inode *inode = page->mapping->host;

        /*
         * We don't own this inode, and we don't want the address_space
         * vanishing while writeback is walking its pages.
         */
        inode = igrab(inode);
        unlock_page(page);

        if (inode) {
                do_writepages(inode->i_mapping, wbc);

                /*
                 * This iput() will internally call ext2_discard_prealloc(),
                 * which is rather bogus. But there is no other way of
                 * dropping our ref to the inode. However, there's no harm
                 * in dropping the prealloc, because there probably isn't any.
                 * Just a waste of cycles.
                 */
                iput(inode);
#if 0
                if (!PageWriteback(page) && PageDirty(page)) {
                        lock_page(page);
                        if (!PageWriteback(page) && test_clear_page_dirty(page)) {
                                int ret;

                                ret = page->mapping->a_ops->writepage(page);
                                if (ret == -EAGAIN)
                                        __set_page_dirty_nobuffers(page);
                        } else {
                                unlock_page(page);
                        }
                }
#endif
        }
        return 0;
}

and that still uses igrab :(

I'm pretty sure I did fix this at some stage in some tree, don't recall
where or how, but I think the fix involved not using igrab/iput, but
instead ensuring that the code retained at least one locked page until
it had finished touching the address_space.

> > Go take a look - it was somewhere between 2.5.1 and 2.5.10 if
> > I vaguely recall correctly.
> >
> > Or don't take a look - we shouldn't need to do any of this anyway.
> >
>
> I'll take a closer look if there is real interest in having the VM use
> writepages() but it sounds like it's a waste of time.

Well. The main problem is that we're doing too much IO off the LRU of
course.

But a secondary problem is that the pages which are coming off the LRU
may not be well-ordered wrt their on-disk layout. Seeky writes to a
database will do this, as may seeky writes from /usr/bin/ld, etc. And
seeky metadata writes to /dev/sda1! So writing in LRU-based ordering
can generate crappy IO patterns.

Doing a pgoff_t-based writearound around the target page was an attempt
to straighten all that out. And in some circumstances it really should
provide large reductions in seek traffic, and would still be a good
area of investigation. But if we continue to submit IO in the order in
which pages fall off the tail of the LRU, I don't think there's much to
be gained in the area of improved IO patterns. There might be CPU
consumption benefits, doing less merging work in the block layer.
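
The shape of it is simple enough - the following is only a sketch, with the
helper name and window size made up, and it glosses over how the
address_space is pinned, which is the very problem discussed above:

#include <linux/pagemap.h>
#include <linux/writeback.h>

/*
 * Write a window of pages around the target page's file offset rather
 * than submitting pages in the order they fall off the LRU.
 */
static void writearound(struct page *page, unsigned long window)
{
        struct address_space *mapping = page->mapping;
        pgoff_t index = page->index;
        pgoff_t start = index > window / 2 ? index - window / 2 : 0;
        struct writeback_control wbc = {
                .sync_mode   = WB_SYNC_NONE,
                .nr_to_write = window,
                .range_start = (loff_t)start << PAGE_CACHE_SHIFT,
                .range_end   = ((loff_t)(start + window) << PAGE_CACHE_SHIFT) - 1,
        };

        do_writepages(mapping, &wbc);
}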

> I'll focus on
>
> a) identifying how many dirty pages the VM is really writing back with
> tracepoints
> b) not using writepage from direct reclaim because it overflows the
> stack

OK.

This stuff takes a lot of time. You see a blob of 1000 dirty pages
fall off the tail of the LRU and then need to work out how the heck
they got there and what could be done to prevent that, and to improve
the clean-to-dirty ratio of those pages.

Obviously another approach would be just to bisect the thing - write a
little patch to backport /proc/vmstat:nr_vmscan_write into old kernels,
pick a simple workload which causes "excessive" increments in
nr_vmscan_write then go for it. Bit of a PITA.

2010-06-11 19:14:29

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Fri, Jun 11, 2010 at 12:29:12PM -0400, Christoph Hellwig wrote:
> On Tue, Jun 08, 2010 at 10:28:14AM +0100, Mel Gorman wrote:
> > > - we also need to care about ->releasepage. At least for XFS it
> > > can end up in the same deep allocator chain as ->writepage because
> > > it does all the extent state conversions, even if it doesn't
> > > start I/O.
> >
> > Dang.
> >
> > > I haven't managed yet to decode the ext4/btrfs codepaths
> > > for ->releasepage yet to figure out how they release a page that
> > > covers a delayed allocated or unwritten range.
> > >
> >
> > If ext4/btrfs are also very deep call-chains and this series is going more
> > or less the right direction, then avoiding calling ->releasepage from direct
> > reclaim is one, somewhat unfortunate, option. The second is to avoid it on
> > a per-filesystem basis for direct reclaim using PF_MEMALLOC to detect
> > reclaimers and PF_KSWAPD to tell the difference between direct
> > reclaimers and kswapd.
>
> I went through this a bit more and I can't actually hit that code in
> XFS ->releasepage anymore. I've also audited the caller and can't see
> how we could theoretically hit it anymore. Do the VM gurus know a case
> where we would call ->releasepage on a page that's actually dirty and
> hasn't been through block_invalidatepage before?

Which part of xfs releasepage are you trying to avoid?

        dirty = xfs_page_state_convert(inode, page, &wbc, 0, 0);
        if (dirty == 0 && !unwritten)
                goto free_buffers;

I'd expect the above was fixed by page_mkwrite, which should be dealing
with all the funny corners that we used to have to mess with in
releasepage.

btrfs_release_page does no allocations, it only checks to see if the
page is busy somehow (dirty/writeback etc).

-chris

2010-06-11 20:44:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible

On Fri, Jun 11, 2010 at 12:07:30PM -0700, Andrew Morton wrote:
> On Fri, 11 Jun 2010 13:49:36 +0100
> Mel Gorman <[email protected]> wrote:
>
> > > It takes a large number of high-level
> > > VFS locks. Locks which cannot be taken from deep within page reclaim
> > > without causing various deadlocks.
> > >
> >
> > Can you explain this a bit more please? I can see the inode_lock is very
> > important in this path for example but am not seeing how page reclaim taking
> > it would cause a deadlock.
>
> iput_final() takes a lot more locks than inode_lock. It can get down
> into truncate_inode_pages() and can run journal commits and does
> lock_page() and presumably takes i_mutex somewhere.

i_mutex was something I failed to consider. I can see how that could
conceivably deadlock if it was held when page reclaim was entered and
then taken again from direct reclaim later. I don't know if this actually
happens, but it's possible I guess.

> We'd need to check
> all the fs-specific ->clear_inode, ->delete_inode, maybe others. It
> can do a synchronous write_inode_now() in generic_detach_inode(). We
> seem to run about half the kernel code under iput_final() :(
>
> I don't recall specifically what deadlock we were hitting, and it being eight
> years ago it's not necessarily still there.
>

I know now what to keep an eye out for at least. Thanks.

> > > I did solve that problem before reverting it all but I forget how. By
> > > holding a page lock to pin the address_space rather than igrab(),
> > > perhaps.
> >
> > But this is what I did. That function has a list of locked pages. When I
> > call igrab(), the page is locked so the address_space should be pinned. I
> > unlock the page after I call igrab.
>
> Right, so you end up with an inode/address_space which has no locked
> pages and on which you hold a refcount. When that refcount gets
> dropped with iput(), the code can run iput_final().
>
> <grovels around for a while>
>
> OK, 2.5.48's mm/page-writeback.c has:
>
> /*
> * A library function, which implements the vm_writeback a_op. It's fairly
> * lame at this time. The idea is: the VM wants to liberate this page,
> * so we pass the page to the address_space and give the fs the opportunity
> * to write out lots of pages around this one. It allows extent-based
> * filesytems to do intelligent things. It lets delayed-allocate filesystems
> * perform better file layout. It lets the address_space opportunistically
> * write back disk-contiguous pages which are in other zones.
> *
> * FIXME: the VM wants to start I/O against *this* page. Because its zone
> * is under pressure. But this function may start writeout against a
> * totally different set of pages. Unlikely to be a huge problem, but if it
> * is, we could just writepage the page if it is still (PageDirty &&
> * !PageWriteback) (See below).
> *
> * Another option is to just reposition page->mapping->dirty_pages so we
> * *know* that the page will be written. That will work fine, but seems
> * unpleasant. (If the page is not for-sure on ->dirty_pages we're dead).
> * Plus it assumes that the address_space is performing writeback in
> * ->dirty_pages order.
> *
> * So. The proper fix is to leave the page locked-and-dirty and to pass
> * it all the way down.
> */
> int generic_vm_writeback(struct page *page, struct writeback_control *wbc)
> {
> struct inode *inode = page->mapping->host;
>
> /*
> * We don't own this inode, and we don't want the address_space
> * vanishing while writeback is walking its pages.
> */
> inode = igrab(inode);
> unlock_page(page);
>
> if (inode) {
> do_writepages(inode->i_mapping, wbc);
>
> /*
> * This iput() will internally call ext2_discard_prealloc(),
> * which is rather bogus. But there is no other way of
> * dropping our ref to the inode. However, there's no harm
> * in dropping the prealloc, because there probably isn't any.
> * Just a waste of cycles.
> */
> iput(inode);
> #if 0
> if (!PageWriteback(page) && PageDirty(page)) {
> lock_page(page);
> if (!PageWriteback(page)&&test_clear_page_dirty(page)) {
> int ret;
>
> ret = page->mapping->a_ops->writepage(page);
> if (ret == -EAGAIN)
> __set_page_dirty_nobuffers(page);
> } else {
> unlock_page(page);
> }
> }
> #endif
> }
> return 0;
> }
>
> and that still uses igrab :(
>
> I'm pretty sure I did fix this at some stage in some tree, don't recall
> where or how, but I think the fix involved not using igrab/iput, but
> instead ensuring that the code retained at least one locked page until
> it had finished touching the address_space.
>

I'll do a further investigation later, divided into two parts. First,
whether we can still hit this problem in theory and second, whether
whatever way you fixed it in the past still works.

> > > Go take a look - it was somewhere between 2.5.1 and 2.5.10 if
> > > I vaguely recall correctly.
> > >
> > > Or don't take a look - we shouldn't need to do any of this anyway.
> > >
> >
> > I'll take a closer look if there is real interest in having the VM use
> > writepages() but it sounds like it's a waste of time.
>
> Well. The main problem is that we're doing too much IO off the LRU of
> course.
>

What would be considered "too much IO"? In the tests I was running, I know
I can sometimes get a chunk of dirty pages at the end of the LRU but it's
rare and only under load. To trigger it with dd, 64 jobs had to be running
which in combination were writing files 4 times the size of physical memory.
Even then, it was kswapd that did much of the work as can be seen here

traceonly stackreduce nodirect
Direct reclaims 4098 2436 5670
Direct reclaim pages scanned 393664 215821 505483
Direct reclaim write sync I/O 0 0 0
Direct reclaim write async I/O 0 0 0
Wake kswapd requests 865097 728976 1036147
Kswapd wakeups 639 561 585
Kswapd pages scanned 11123648 10383929 10561818
Kswapd reclaim write sync I/O 0 0 0
Kswapd reclaim write async I/O 3595 0 19068
Time stalled direct reclaim 2843.74 2771.71 32.76
Time kswapd awake 347.58 8865.65 433.27

What workload is considered most problematic? Next time, I'll also run a
read/write sysbench test against postgres but each of these tests takes a
long time to complete so it'd be nice to narrow it down.

The worst I saw, with large amounts of writeout, was during the stress tests
for high-order allocations when lumpy reclaim is a big factor. Otherwise,
it didn't seem too bad.

> But a secondary problem is that the pages which are coming off the LRU
> may not be well-ordered wrt their on-disk layout. Seeky writes to a
> database will do this, as may seeky writes from /usr/bin/ld, etc. And
> seeky metadata writes to /dev/sda1! So writing in LRU-based ordering
> can generate crappy IO patterns.
>

Based on the tests I've seen so far, databases are the most plausible
way of having dirty pages at the end of the LRU. Granted, if you load up
the machine with enough compile jobs and dd, you'll see dirty pages at
the end of the LRU too but that is hardly a surprise.

But by and large, what I've seen suggests that lumpy reclaim when it
happens is a source of writeback from page reclaim but otherwise it's
not a major problem.

> Doing a pgoff_t-based writearound around the target page was an attempt
> to straighten all that out. And in some circumstances it really should
> provide large reductions in seek traffic, and would still be a good
> area of investigation. But if we continue to submit IO in the order in
> which pages fall off the tail of the LRU, I don't think there's much to
> be gained in the area of improved IO patterns. There might be CPU
> consumption benefits, doing less merging work in the block layer.
>

Ok.

> > I'll focus on
> >
> > a) identifying how many dirty pages the VM is really writing back with
> > tracepoints
> > b) not using writepage from direct reclaim because it overflows the
> > stack
>
> OK.
>
> This stuff takes a lot of time. You see a blob of 1000 dirty pages
> fall off the tail of the LRU and then need to work out how the heck
> they got there and what could be done to prevent that, and to improve
> the clean-to-dirty ratio of those pages.
>

Again, I'm not really seeing this pattern for the workloads I've tried
unless lumpy reclaim was a major factor. I'll see what sysbench shows
up.

> Obviously another approach would be just to bisect the thing - write a
> little patch to backport /proc/vmstat:nr_vmscan_write into old kernels,
> pick a simple workload which causes "excessive" increments in
> nr_vmscan_write then go for it. Bit of a PITA.
>

I don't think it's feasible really. Even a basic test of this takes 4
hours to complete. Testing everything from 2.6.12 or doing a bisection
would be a woeful PITA.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-11 21:35:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible

On Fri, 11 Jun 2010 21:44:11 +0100
Mel Gorman <[email protected]> wrote:

> > Well. The main problem is that we're doing too much IO off the LRU of
> > course.
> >
>
> What would be considered "too much IO"?

Enough to slow things down ;)

This problem used to hurt a lot. Since those times we've decreased the
default value of /proc/sys/vm/dirty*ratio by a lot, which surely
papered over this problem a lot. We shouldn't forget that those ratios
_are_ tunable, after all. If we make a change which explodes the
kernel when someone's tuned to 40% then that's a problem and we'll need
to scratch our heads over the magnitude of that problem.

As for a workload which triggers the problem on a large machine which
is tuned to 20%/10%: dunno. If we're reliably activating pages when
dirtying them then perhaps it's no longer a problem with the default
tuning. I'd do some testing with mem=256M though - that has a habit of
triggering weirdnesses.

btw, I'm trying to work out if zap_pte_range() really needs to run
set_page_dirty(). Didn't (pte_dirty() && !PageDirty()) pages get
themselves stamped out?

2010-06-12 00:17:50

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 5/6] vmscan: Write out ranges of pages contiguous to the inode where possible

On Fri, Jun 11, 2010 at 02:33:37PM -0700, Andrew Morton wrote:
> On Fri, 11 Jun 2010 21:44:11 +0100
> Mel Gorman <[email protected]> wrote:
>
> > > Well. The main problem is that we're doing too much IO off the LRU of
> > > course.
> > >
> >
> > What would be considered "too much IO"?
>
> Enough to slow things down ;)
>

I like it. We don't know what it is, but we'll know when we see it :)

> This problem used to hurt a lot. Since those times we've decreased the
> default value of /proc/sys/vm/dirty*ratio by a lot, which surely
> papered over this problem a lot. We shouldn't forget that those ratios
> _are_ tunable, after all. If we make a change which explodes the
> kernel when someone's tuned to 40% then that's a problem and we'll need
> to scratch our heads over the magnitude of that problem.
>

Ok. What could be done is finalise the tracepoints (they are counting some
stuff they shouldn't) and merge them. They can measure the amount of time
kswapd was awake and, critically, how long direct reclaim was going on. A test
could be to monitor the tracepoints and vmstat, start whatever the workload is,
generate a report and see what percentage of time was spent in direct reclaim
in comparison to the total. For the IO, it would be a comparison of the IO
generated by page reclaim in comparison to total IO. We'd need to decide on
"goodness" values for these ratios but at least it would be measurable
and, broadly speaking, the lower the better and preferably 0 for both.

> As for a workload which triggers the problem on a large machine which
> is tuned to 20%/10%: dunno. If we're reliably activating pages when
> dirtying them then perhaps it's no longer a problem with the default
> tuning. I'd do some testing with mem=256M though - that has a habit of
> triggering weirdnesses.
>

Will do. I was testing with 2G which is probably too much.

> btw, I'm trying to work out if zap_pte_range() really needs to run
> set_page_dirty(). Didn't (pte_dirty() && !PageDirty()) pages get
> themselves stamped out?
>

I don't remember anything specific in that area. Will check it out if
someone doesn't have the quick answer.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 14:00:48

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

Hi Mel,

I know lots of people don't like direct reclaim, but I personally do
and I think if memory pressure is hard enough we should eventually
enter direct reclaim full force including ->writepage to avoid false
positive OOM failures. Transparent hugepage allocation in fact won't
even wakeup kswapd that would be insist to create hugepages and shrink
an excessive amount of memory (especially before memory compaction was
merged, it shall be tried again but if memory compaction fails in
kswapd context, definitely kswapd should immediately stop and not go
ahead trying the create hugepages the blind way, kswapd
order-awareness the blind way is surely detrimental and pointless).

When memory pressure is low, not going into ->writepage may be
beneficial from a latency perspective too. (but again it depends how
much it matters to go in LRU and how beneficial is the cache, to know
if it's worth taking clean cache away even if hotter than dirty cache)

About the stack overflow, did you ever get any stack-debug error? We've
plenty of instrumentation and ->writepage definitely runs with irq
enable, so if there's any issue, it can't possibly be unnoticed. The
worry about stack overflow shall be backed by numbers.

You posted lots of latency numbers (surely latency will improve but
it's only safe approach on light memory pressure, on heavy pressure
it'll early-oom not to call ->writepage, and if cache is very
important and system has little ram, not going in lru order may also
screw fs-cache performance), but I didn't see any max-stack usage hard
numbers, to back the claim that we're going to overflow.

In any case I'd prefer to be able to still call ->writepage if memory
pressure is high (at some point when priority going down and
collecting clean cache doesn't still satisfy the allocation), during
allocations in direct reclaim and increase the THREAD_SIZE than doing
this purely for stack reasons as the VM will lose reliability if we
forbid ->writepage at all in direct reclaim. Throttling on kswapd is
possible but it's probably less efficient and on the stack we know
exactly which kind of memory we should allocate, kswapd doesn't and it
works global.

2010-06-15 14:11:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> collecting clean cache doesn't still satisfy the allocation), during
> allocations in direct reclaim and increase the THREAD_SIZE than doing
> this purely for stack reasons as the VM will lose reliability if we

This basically means doubling the stack size, as you can splice together
two extremely stack hungry codepaths in the worst case. Do you really
want order 2 stack allocations?

2010-06-15 14:22:57

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 10:11:22AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> > collecting clean cache doesn't still satisfy the allocation), during
> > allocations in direct reclaim and increase the THREAD_SIZE than doing
> > this purely for stack reasons as the VM will lose reliability if we
>
> This basically means doubling the stack size, as you can splice together
> two extremely stack hungry codepaths in the worst case. Do you really
> want order 2 stack allocations?

If we were forbidden to call ->writepage just because of stack
overflow, yes, as I don't think it's a big deal with memory compaction and
I see it as too limiting a design to allow ->writepage only from
kernel threads. ->writepage is also called by the pagecache layer,
msync etc., not just by kswapd.

But let's defer this until we have some resemblance of hard numbers on
worst-case stack usage measured during the aforementioned workload. I
didn't read all the details as I'm quite against this design, but I
didn't see any stack usage numbers or any sign of the stack-overflow
debugging triggering. I'd suggest measuring the max stack usage first and
worrying later.
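
To be concrete about the measurement, something as dumb as the following
(a sketch only, helper name made up, assuming a downward-growing stack
where end_of_stack() returns the lowest usable address) dropped at the
direct reclaim entry point would already give numbers:

#include <linux/sched.h>
#include <linux/kernel.h>

static void report_reclaim_stack_left(void)
{
        unsigned long sp = (unsigned long)&sp;
        unsigned long left = sp - (unsigned long)end_of_stack(current);

        if (left < 2048)
                printk(KERN_WARNING
                       "direct reclaim entered with %lu bytes of stack left\n",
                       left);
}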

And if ->writepage is a stack hog in some fs, I'd rather see
->writepage made less stack hungry (with a proper warning at runtime
with the debug option enabled) than vetoed. The VM itself shouldn't be a
stack hog already. I don't see a particular reason why writepage
should be so stack hungry compared to the rest of the kernel; it just
has to do I/O, and if it requires complex data structures it should
kmalloc those and stay light on stack like everybody else.

And if anything, I'm worried more about slab shrinking than ->writepage
as that enters the vfs layer and then the lowlevel fs to collect the
dentry, inode etc...

2010-06-15 14:44:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 04:22:19PM +0200, Andrea Arcangeli wrote:
> If we were forbidden to call ->writepage just because of stack
> overflow, yes, as I don't think it's a big deal with memory compaction and
> I see it as too limiting a design to allow ->writepage only from
> kernel threads. ->writepage is also called by the pagecache layer,
> msync etc., not just by kswapd.

Other callers of ->writepage are fine because they come from a
controlled environment with relatively little stack usage. The problem
with direct reclaim is that we splice multiple stack hogs ontop of each
other.

Direct reclaim can come from any point that does memory allocations,
including those that absolutely have to because their stack "quota"
is almost used up. Let's look at a worst case scenario:

We're in a deep stack codepath, say

(1) core_sys_select, which has to kmalloc the array if it doesn't
fit on the huge stack variable. All fine by now, it stays in it's
stack quota.
(2) That code now calls into the slab allocator, which doesn't find free
space in the large slab, and then calls into kmem_getpages, adding
more stack usage.
(3) That calls into alloc_pages_exact_node which adds stack usage of
the page allocator.
(4) no free pages in the zone anymore, and direct reclaim is invoked,
adding the stack usage of the reclaim code, which currently is
quite heavy.
(5) direct reclaim calls into foofs ->writepage. foofs_writepage
notices the page is delayed allocated and needs to convert it.
It now has to start a transaction, then call the extent management
code to convert the extent, which calls into the space management
code, which calls into the buffercache for the metadata buffers,
which needs to submit a bio to read/write the metadata.
(6) The metadata buffer goes through submit_bio and the block layer
code. Because we're doing a synchronous writeout it gets directly
dispatched to the block layer.
(7) for extra fun add a few remapping layers for raid or similar to
add to the stack usage.
(8) The lowlevel block driver is iscsi or something similar, so after
going through the scsi layer adding more stack it now goes through
the networking layer with tcp and ipv4 (if you're unlucky ipv6)
code
(9) we finally end up in the lowlevel networking driver (except that we
would have long overflown the stack)

And for extra fun:

(10) Just when we're way down that stack an IRQ comes in on the CPU that
we're executing on. Because we don't enable irqstacks for the only
sensible stack configuration (yeah, still bitter about the patch
for that getting ignored) it goes on the same stack above.


And note that the above does not only happen with ext4/btrfs/xfs that
have delayed allocations. With every other filesystem it can also
happen, just a lot less likely - when writing to a file through shared
mmaps we still have to call the allocator from ->writepage in
ext2/ext3/reiserfs/etc.

And seriously, if the VM isn't stopped from calling ->writepage from
reclaim context we FS people will simply ignore any ->writepage from
reclaim context. Been there, done that and never again.

Just wondering, what filesystems do your hugepage testing systems use?
If it's any of the ext4/btrfs/xfs above you're already seeing the
filesystem refuse ->writepage from both kswapd and direct reclaim,
so Mel's series will allow us to reclaim pages from more contexts
than before.

2010-06-15 14:51:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> I know lots of people don't like direct reclaim,

It's not direct reclaim that is the problem per-se, it's direct reclaim
calling writepage and splicing two potentially deep call chains
together.

> but I personally do
> and I think if memory pressure is hard enough we should eventually
> enter direct reclaim full force including ->writepage to avoid false
> positive OOM failures.

Be that as it may, filesystems that have deep call paths for their
->writepage are ignoring both kswapd and direct reclaim so on XFS and
btrfs for example, this "full force" effect is not being reached.

> Transparent hugepage allocation in fact won't
> even wakeup kswapd that would be insist to create hugepages and shrink
> an excessive amount of memory (especially before memory compaction was
> merged, it shall be tried again but if memory compaction fails in
> kswapd context, definitely kswapd should immediately stop and not go
> ahead trying the create hugepages the blind way, kswapd
> order-awareness the blind way is surely detrimental and pointless).
>

kswapd does end up freeing a lot of memory in response to lumpy reclaim
because it also tries to restore watermarks for a high-order page. This
is disruptive to the system and something I'm going to revisit but it's
a separate topic for another discussion. I can see why transparent
hugepage support would not want this disruptive effect to occur whereas
it might make sense when resizing the hugepage pool.

> When memory pressure is low, not going into ->writepage may be
> beneficial from a latency perspective too. (but again it depends how
> much it matters to go in LRU and how beneficial is the cache, to know
> if it's worth taking clean cache away even if hotter than dirty cache)
>
> About the stack overflow, did you ever get any stack-debug error?

Not an error. Got a report from Dave Chinner though and it's what kicked
off this whole routine in the first place. I've been recording stack
usage figures but not reporting them. In reclaim I'm getting to about 5K
deep but this was on simple storage and XFS was ignoring attempts for
reclaim to writeback.

http://lkml.org/lkml/2010/4/13/121

Here is one of my own stack traces though

Depth Size Location (49 entries)
----- ---- --------
0) 5064 304 get_page_from_freelist+0x2e4/0x722
1) 4760 240 __alloc_pages_nodemask+0x15f/0x6a7
2) 4520 48 kmem_getpages+0x61/0x12c
3) 4472 96 cache_grow+0xca/0x272
4) 4376 80 cache_alloc_refill+0x1d4/0x226
5) 4296 64 kmem_cache_alloc+0x129/0x1bc
6) 4232 16 mempool_alloc_slab+0x16/0x18
7) 4216 144 mempool_alloc+0x56/0x104
8) 4072 16 scsi_sg_alloc+0x48/0x4a [scsi_mod]
9) 4056 96 __sg_alloc_table+0x58/0xf8
10) 3960 32 scsi_init_sgtable+0x37/0x8f [scsi_mod]
11) 3928 32 scsi_init_io+0x24/0xce [scsi_mod]
12) 3896 48 scsi_setup_fs_cmnd+0xbc/0xc4 [scsi_mod]
13) 3848 144 sd_prep_fn+0x1d3/0xc13 [sd_mod]
14) 3704 64 blk_peek_request+0xe2/0x1a6
15) 3640 96 scsi_request_fn+0x87/0x522 [scsi_mod]
16) 3544 32 __blk_run_queue+0x88/0x14b
17) 3512 48 elv_insert+0xb7/0x254
18) 3464 48 __elv_add_request+0x9f/0xa7
19) 3416 128 __make_request+0x3f4/0x476
20) 3288 192 generic_make_request+0x332/0x3a4
21) 3096 64 submit_bio+0xc4/0xcd
22) 3032 80 _xfs_buf_ioapply+0x222/0x252 [xfs]
23) 2952 48 xfs_buf_iorequest+0x84/0xa1 [xfs]
24) 2904 32 xlog_bdstrat+0x47/0x4d [xfs]
25) 2872 64 xlog_sync+0x21a/0x329 [xfs]
26) 2808 48 xlog_state_release_iclog+0x9b/0xa8 [xfs]
27) 2760 176 xlog_write+0x356/0x506 [xfs]
28) 2584 96 xfs_log_write+0x5a/0x86 [xfs]
29) 2488 368 xfs_trans_commit_iclog+0x165/0x2c3 [xfs]
30) 2120 80 _xfs_trans_commit+0xd8/0x20d [xfs]
31) 2040 240 xfs_iomap_write_allocate+0x247/0x336 [xfs]
32) 1800 144 xfs_iomap+0x31a/0x345 [xfs]
33) 1656 48 xfs_map_blocks+0x3c/0x40 [xfs]
34) 1608 256 xfs_page_state_convert+0x2c4/0x597 [xfs]
35) 1352 64 xfs_vm_writepage+0xf5/0x12f [xfs]
36) 1288 32 __writepage+0x17/0x34
37) 1256 288 write_cache_pages+0x1f3/0x2f8
38) 968 16 generic_writepages+0x24/0x2a
39) 952 64 xfs_vm_writepages+0x4f/0x5c [xfs]
40) 888 16 do_writepages+0x21/0x2a
41) 872 48 writeback_single_inode+0xd8/0x2f4
42) 824 112 writeback_inodes_wb+0x41a/0x51e
43) 712 176 wb_writeback+0x13d/0x1b7
44) 536 128 wb_do_writeback+0x150/0x167
45) 408 80 bdi_writeback_task+0x43/0x117
46) 328 48 bdi_start_fn+0x76/0xd5
47) 280 96 kthread+0x82/0x8a
48) 184 184 kernel_thread_helper+0x4/0x10

XFS as you can see is quite deep there. Now consider if
get_page_from_freelist() there had entered direct reclaim and then tried
to writeback a page. That's the problem that is being worried about.


> We've
> plenty of instrumentation and ->writepage definitely runs with irq
> enable, so if there's any issue, it can't possibly be unnoticed. The
> worry about stack overflow shall be backed by numbers.
>
> You posted lots of latency numbers (surely latency will improve but
> it's only safe approach on light memory pressure, on heavy pressure
> it'll early-oom not to call ->writepage, and if cache is very
> important and system has little ram, not going in lru order may also
> screw fs-cache performance),

I also haven't been able to trigger a new OOM as a result of the patch
but maybe I'm missing something. To trigger an OOM, the bulk of the LRU
would have to be dirty and the direct reclaimer making no further
progress but if the bulk of the LRU has been dirtied like this, are we
not already in trouble?

We could have it that direct reclaimers kick the flusher threads when they
encounter dirty pages and then go to sleep, but this will increase latency
and, considering the number of dirty pages direct reclaimers should be
seeing, I'm not sure it's necessary.
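
If it did turn out to be necessary, the shape would be small - the following
is only an untested sketch with a made-up helper name:

#include <linux/writeback.h>
#include <linux/backing-dev.h>

/*
 * Rather than doing the writeback itself, a direct reclaimer that runs
 * into dirty pages kicks the flusher threads and backs off briefly.
 */
static void punt_dirty_pages_to_flushers(long nr_dirty)
{
        if (!nr_dirty)
                return;

        wakeup_flusher_threads(nr_dirty);       /* start background writeback */
        congestion_wait(BLK_RW_ASYNC, HZ/10);   /* give it a moment to make progress */
}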

> but I didn't see any max-stack usage hard
> numbers, to back the claim that we're going to overflow.
>

I hadn't posted them because they had been posted previously and I
didn't think they were that interesting as such because it wasn't being
disputed.

> In any case I'd prefer to be able to still call ->writepage if memory
> pressure is high (at some point when priority going down and
> collecting clean cache doesn't still satisfy the allocation),

Well, kswapd is still writing pages if the pressure is high enough that
the flusher threads are not doing it and a direct reclaimer will wait on
congestion_wait() if the pressure gets high enough (PRIORITY < 2).

> during
> allocations in direct reclaim and increase the THREAD_SIZE than doing
> this purely for stack reasons as the VM will lose reliability if we
> forbid ->writepage at all in direct reclaim.

Well, we've lost that particular reliability already on btrfs and xfs
because they are ignoring the VM and increasing THREAD_SIZE would
increase the order used for stack allocations which causes problems of
its own.

The VM would lose a lot of reliability if we weren't throttling on pages
being dirtied in the fault path but because we are doing that, I don't
currently believe we are losing reliability by not writing back pages in
direct reclaim.

> Throttling on kswapd is
> possible but it's probably less efficient and on the stack we know
> exactly which kind of memory we should allocate, kswapd doesn't and it
> works global.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 14:56:06

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On 06/15/2010 10:51 AM, Mel Gorman wrote:
> On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
>> Hi Mel,
>>
>> I know lots of people don't like direct reclaim,
>
> It's not direct reclaim that is the problem per-se, it's direct reclaim
> calling writepage and splicing two potentially deep call chains
> together.

I have talked to Mel on IRC, and the above means:

"calling alloc_pages from an already deep stack frame,
and then going into direct reclaim"

That explanation would have been helpful in email :)

--
All rights reversed

2010-06-15 15:08:09

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 03:51:34PM +0100, Mel Gorman wrote:
> On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> > When memory pressure is low, not going into ->writepage may be
> > beneficial from a latency perspective too. (but again it depends how
> > much it matters to go in LRU and how beneficial is the cache, to know
> > if it's worth taking clean cache away even if hotter than dirty cache)
> >
> > About the stack overflow, did you ever get any stack-debug error?
>
> Not an error. Got a report from Dave Chinner though and it's what kicked
> off this whole routine in the first place. I've been recording stack
> usage figures but not reporting them. In reclaim I'm getting to about 5K
> deep but this was on simple storage and XFS was ignoring attempts for
> reclaim to writeback.
>
> http://lkml.org/lkml/2010/4/13/121
>
> Here is one of my own stack traces though
>
> Depth Size Location (49 entries)
> ----- ---- --------
> 0) 5064 304 get_page_from_freelist+0x2e4/0x722
> 1) 4760 240 __alloc_pages_nodemask+0x15f/0x6a7
> 2) 4520 48 kmem_getpages+0x61/0x12c
> 3) 4472 96 cache_grow+0xca/0x272
> 4) 4376 80 cache_alloc_refill+0x1d4/0x226
> 5) 4296 64 kmem_cache_alloc+0x129/0x1bc
> 6) 4232 16 mempool_alloc_slab+0x16/0x18
> 7) 4216 144 mempool_alloc+0x56/0x104
> 8) 4072 16 scsi_sg_alloc+0x48/0x4a [scsi_mod]
> 9) 4056 96 __sg_alloc_table+0x58/0xf8
> 10) 3960 32 scsi_init_sgtable+0x37/0x8f [scsi_mod]
> 11) 3928 32 scsi_init_io+0x24/0xce [scsi_mod]
> 12) 3896 48 scsi_setup_fs_cmnd+0xbc/0xc4 [scsi_mod]
> 13) 3848 144 sd_prep_fn+0x1d3/0xc13 [sd_mod]
> 14) 3704 64 blk_peek_request+0xe2/0x1a6
> 15) 3640 96 scsi_request_fn+0x87/0x522 [scsi_mod]
> 16) 3544 32 __blk_run_queue+0x88/0x14b
> 17) 3512 48 elv_insert+0xb7/0x254
> 18) 3464 48 __elv_add_request+0x9f/0xa7
> 19) 3416 128 __make_request+0x3f4/0x476
> 20) 3288 192 generic_make_request+0x332/0x3a4
> 21) 3096 64 submit_bio+0xc4/0xcd
> 22) 3032 80 _xfs_buf_ioapply+0x222/0x252 [xfs]
> 23) 2952 48 xfs_buf_iorequest+0x84/0xa1 [xfs]
> 24) 2904 32 xlog_bdstrat+0x47/0x4d [xfs]
> 25) 2872 64 xlog_sync+0x21a/0x329 [xfs]
> 26) 2808 48 xlog_state_release_iclog+0x9b/0xa8 [xfs]
> 27) 2760 176 xlog_write+0x356/0x506 [xfs]
> 28) 2584 96 xfs_log_write+0x5a/0x86 [xfs]
> 29) 2488 368 xfs_trans_commit_iclog+0x165/0x2c3 [xfs]
> 30) 2120 80 _xfs_trans_commit+0xd8/0x20d [xfs]
> 31) 2040 240 xfs_iomap_write_allocate+0x247/0x336 [xfs]
> 32) 1800 144 xfs_iomap+0x31a/0x345 [xfs]
> 33) 1656 48 xfs_map_blocks+0x3c/0x40 [xfs]
> 34) 1608 256 xfs_page_state_convert+0x2c4/0x597 [xfs]
> 35) 1352 64 xfs_vm_writepage+0xf5/0x12f [xfs]
> 36) 1288 32 __writepage+0x17/0x34
> 37) 1256 288 write_cache_pages+0x1f3/0x2f8
> 38) 968 16 generic_writepages+0x24/0x2a
> 39) 952 64 xfs_vm_writepages+0x4f/0x5c [xfs]
> 40) 888 16 do_writepages+0x21/0x2a
> 41) 872 48 writeback_single_inode+0xd8/0x2f4
> 42) 824 112 writeback_inodes_wb+0x41a/0x51e
> 43) 712 176 wb_writeback+0x13d/0x1b7
> 44) 536 128 wb_do_writeback+0x150/0x167
> 45) 408 80 bdi_writeback_task+0x43/0x117
> 46) 328 48 bdi_start_fn+0x76/0xd5
> 47) 280 96 kthread+0x82/0x8a
> 48) 184 184 kernel_thread_helper+0x4/0x10
>
> XFS as you can see is quite deep there. Now consider if
> get_page_from_freelist() there had entered direct reclaim and then tried
> to writeback a page. That's the problem that is being worried about.

It would be a problem because it should be !__GFP_IO at that point so
something would be seriously broken if it called ->writepage again.

2010-06-15 15:09:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 10:43:42AM -0400, Christoph Hellwig wrote:
> Other callers of ->writepage are fine because they come from a
> controlled environment with relatively little stack usage. The problem
> with direct reclaim is that we splice multiple stack hogs ontop of each
> other.

It's not like we're doing a stack recursive algorithm in kernel. These
have to be "controlled hogs", so we must have space to run 4/5 of them
on top of each other, that's the whole point.

I'm aware the ->writepage can run on any alloc_pages, but frankly I
don't see a whole lot of difference between regular kernel code paths
or msync. Sure they can be at higher stack usage, but not like with
only 1000bytes left.

> And seriously, if the VM isn't stopped from calling ->writepage from
> reclaim context we FS people will simply ignore any ->writepage from
> reclaim context. Been there, done that and never again.
>
> Just wondering, what filesystems do your hugepage testing systems use?
> If it's any of the ext4/btrfs/xfs above you're already seeing the
> filesystem refuse ->writepage from both kswapd and direct reclaim,
> so Mel's series will allow us to reclaim pages from more contexts
> than before.

fs ignoring ->writepage during memory pressure (even from kswapd) is
broken, this is not up to the fs to decide. I'm using ext4 on most of
my testing, it works ok, but it doesn't make it right (in fact if
performance declines without that hack, it may prove VM needs fixing,
it doesn't justify the hack).

If you don't throttle against kswapd, or if even kswapd can't turn a
dirty page into a clean one, you can get oom false positives. Anything
is better than that. (provided you've proper stack instrumentation to
notice when there is risk of a stack overflow, it's ages since I've seen
a stack overflow debug detector report)

The irq stack must be enabled and this isn't about direct reclaim but
about irqs in general and their potential nesting with softirq calls
too.

Also note, there's nothing that prevents us from switching the stack
to something else the moment we enter direct reclaim. It doesn't need
to be physically contiguous. Just allocate a couple of 4k pages and
switch to them every time a new hog starts in VM context. The only
real complexity is in the stack unwind but if irqstack can cope with
it sure stack unwind can cope with more "special" stacks too.

Ignoring ->writepage on VM invocations at best can only hide VM
inefficiencies with the downside of breaking the VM in corner cases
with heavy VM pressure.

Crippling down the kernel by vetoing ->writepage to me looks very
wrong, but I'd be totally supportive of a "special" writepage stack or
special iscsi stack etc...

2010-06-15 15:11:00

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Wed, Jun 16, 2010 at 01:08:00AM +1000, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 03:51:34PM +0100, Mel Gorman wrote:
> > On Tue, Jun 15, 2010 at 04:00:11PM +0200, Andrea Arcangeli wrote:
> > > When memory pressure is low, not going into ->writepage may be
> > > beneficial from a latency perspective too. (but again it depends how
> > > much it matters to go in LRU and how beneficial is the cache, to know
> > > if it's worth taking clean cache away even if hotter than dirty cache)
> > >
> > > About the stack overflow, did you ever get any stack-debug error?
> >
> > Not an error. Got a report from Dave Chinner though and it's what kicked
> > off this whole routine in the first place. I've been recording stack
> > usage figures but not reporting them. In reclaim I'm getting to about 5K
> > deep but this was on simple storage and XFS was ignoring attempts for
> > reclaim to writeback.
> >
> > http://lkml.org/lkml/2010/4/13/121
> >
> > Here is one of my own stack traces though
> >
> > Depth Size Location (49 entries)
> > ----- ---- --------
> > 0) 5064 304 get_page_from_freelist+0x2e4/0x722
> > 1) 4760 240 __alloc_pages_nodemask+0x15f/0x6a7
> > 2) 4520 48 kmem_getpages+0x61/0x12c
> > 3) 4472 96 cache_grow+0xca/0x272
> > 4) 4376 80 cache_alloc_refill+0x1d4/0x226
> > 5) 4296 64 kmem_cache_alloc+0x129/0x1bc
> > 6) 4232 16 mempool_alloc_slab+0x16/0x18
> > 7) 4216 144 mempool_alloc+0x56/0x104
> > 8) 4072 16 scsi_sg_alloc+0x48/0x4a [scsi_mod]
> > 9) 4056 96 __sg_alloc_table+0x58/0xf8
> > 10) 3960 32 scsi_init_sgtable+0x37/0x8f [scsi_mod]
> > 11) 3928 32 scsi_init_io+0x24/0xce [scsi_mod]
> > 12) 3896 48 scsi_setup_fs_cmnd+0xbc/0xc4 [scsi_mod]
> > 13) 3848 144 sd_prep_fn+0x1d3/0xc13 [sd_mod]
> > 14) 3704 64 blk_peek_request+0xe2/0x1a6
> > 15) 3640 96 scsi_request_fn+0x87/0x522 [scsi_mod]
> > 16) 3544 32 __blk_run_queue+0x88/0x14b
> > 17) 3512 48 elv_insert+0xb7/0x254
> > 18) 3464 48 __elv_add_request+0x9f/0xa7
> > 19) 3416 128 __make_request+0x3f4/0x476
> > 20) 3288 192 generic_make_request+0x332/0x3a4
> > 21) 3096 64 submit_bio+0xc4/0xcd
> > 22) 3032 80 _xfs_buf_ioapply+0x222/0x252 [xfs]
> > 23) 2952 48 xfs_buf_iorequest+0x84/0xa1 [xfs]
> > 24) 2904 32 xlog_bdstrat+0x47/0x4d [xfs]
> > 25) 2872 64 xlog_sync+0x21a/0x329 [xfs]
> > 26) 2808 48 xlog_state_release_iclog+0x9b/0xa8 [xfs]
> > 27) 2760 176 xlog_write+0x356/0x506 [xfs]
> > 28) 2584 96 xfs_log_write+0x5a/0x86 [xfs]
> > 29) 2488 368 xfs_trans_commit_iclog+0x165/0x2c3 [xfs]
> > 30) 2120 80 _xfs_trans_commit+0xd8/0x20d [xfs]
> > 31) 2040 240 xfs_iomap_write_allocate+0x247/0x336 [xfs]
> > 32) 1800 144 xfs_iomap+0x31a/0x345 [xfs]
> > 33) 1656 48 xfs_map_blocks+0x3c/0x40 [xfs]
> > 34) 1608 256 xfs_page_state_convert+0x2c4/0x597 [xfs]
> > 35) 1352 64 xfs_vm_writepage+0xf5/0x12f [xfs]
> > 36) 1288 32 __writepage+0x17/0x34
> > 37) 1256 288 write_cache_pages+0x1f3/0x2f8
> > 38) 968 16 generic_writepages+0x24/0x2a
> > 39) 952 64 xfs_vm_writepages+0x4f/0x5c [xfs]
> > 40) 888 16 do_writepages+0x21/0x2a
> > 41) 872 48 writeback_single_inode+0xd8/0x2f4
> > 42) 824 112 writeback_inodes_wb+0x41a/0x51e
> > 43) 712 176 wb_writeback+0x13d/0x1b7
> > 44) 536 128 wb_do_writeback+0x150/0x167
> > 45) 408 80 bdi_writeback_task+0x43/0x117
> > 46) 328 48 bdi_start_fn+0x76/0xd5
> > 47) 280 96 kthread+0x82/0x8a
> > 48) 184 184 kernel_thread_helper+0x4/0x10
> >
> > XFS as you can see is quite deep there. Now consider if
> > get_page_from_freelist() there had entered direct reclaim and then tried
> > to writeback a page. That's the problem that is being worried about.
>
> It would be a problem because it should be !__GFP_IO at that point so
> something would be seriously broken if it called ->writepage again.
>

True, ignore this as Christoph's example makes more sense.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 15:25:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 05:08:50PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 10:43:42AM -0400, Christoph Hellwig wrote:
> > Other callers of ->writepage are fine because they come from a
> > controlled environment with relatively little stack usage. The problem
> > with direct reclaim is that we splice multiple stack hogs ontop of each
> > other.
>
> It's not like we're doing a stack recursive algorithm in kernel. These
> have to be "controlled hogs", so we must have space to run 4/5 of them
> on top of each other, that's the whole point.

We're not doing a full recursion. We're splicing a codepath that
normally could use the full stack (fs writeback / block I/O) into
a random other code path that could use the full stack, and adding
some quite stack-heavy allocator / reclaim code in between.

>
> I'm aware the ->writepage can run on any alloc_pages, but frankly I
> don't see a whole lot of difference between regular kernel code paths
> or msync. Sure they can be at higher stack usage, but not like with
> only 1000bytes left.

msync does not use any significant amount of stack:

0xc01f53b3 sys_msync [vmlinux]: 40
0xc022b165 vfs_fsync [vmlinux]: 12
0xc022b053 vfs_fsync_range [vmlinux]: 24
0xc01d7e63 filemap_write_and_wait_range [vmlinux]: 28
0xc01d7df3 __filemap_fdatawrite_range [vmlinux]: 56

and then we already enter ->writepages. Direct reclaim on the other
hand can happen from context that already is say 4 or 6 kilobytes
into stack usage. And the callchain from kmalloc() into ->writepage
alone adds another 0.7k of stack usage. There's not much left for
the filesystem after this.

> If you don't throttle against kswapd, or if even kswapd can't turn a
> dirty page into a clean one, you can get oom false positives. Anything
> is better than that. (provided you've proper stack instrumentation to
> notice when there is risk of a stack overflow, it's ages since I've seen
> a stack overflow debug detector report)

I've never seen the stack overflow detector trigger on this, but I've
seen lots of real life stack overflows on the mailing lists. End
users don't run with it enabled normally, and most testing workloads
don't seem to hit direct reclaim enough to actually trigger this
reproducibly.

> Also note, there's nothing that prevents us from switching the stack
> to something else the moment we enter direct reclaim. It doesn't need
> to be physically contiguous. Just allocate a couple of 4k pages and
> switch to them every time a new hog starts in VM context. The only
> real complexity is in the stack unwind but if irqstack can cope with
> it sure stack unwind can cope with more "special" stacks too.

Which is a lot more complicated than offloading the page cleaning
from direct reclaim to dedicated threads - be that the flusher threads
or kswapd.

> Ignoring ->writepage on VM invocations at best can only hide VM
> inefficiencies with the downside of breaking the VM in corner cases
> with heavy VM pressure.

It allows the system to survive in case direct reclaim is called instead
of crashing with a stack overflow. And at least in my testing the
VM seems to cope rather well with not being able to write out
filesystem pages from direct reclaim. That doesn't mean that this
behaviour can't be further improved on.

2010-06-15 15:38:58

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 05:08:50PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 10:43:42AM -0400, Christoph Hellwig wrote:
> > Other callers of ->writepage are fine because they come from a
> > controlled environment with relatively little stack usage. The problem
> > with direct reclaim is that we splice multiple stack hogs ontop of each
> > other.
>
> It's not like we're doing a stack recursive algorithm in kernel. These
> have to be "controlled hogs", so we must have space to run 4/5 of them
> on top of each other, that's the whole point.
>
> I'm aware the ->writepage can run on any alloc_pages, but frankly I
> don't see a whole lot of difference between regular kernel code paths
> or msync. Sure they can be at higher stack usage, but not like with
> only 1000bytes left.
>

That is pretty much what Dave is claiming here at
http://lkml.org/lkml/2010/4/13/121 where if mempool_alloc_slab() needed
to allocate a page and writepage was entered, there would have been a
problem.

I disagreed with his fix which is what led to this series as an alternative.

> > And seriously, if the VM isn't stopped from calling ->writepage from
> > reclaim context we FS people will simply ignore any ->writepage from
> > reclaim context. Been there, done that and never again.
> >
> > Just wondering, what filesystems do your hugepage testing systems use?
> > If it's any of the ext4/btrfs/xfs above you're already seeing the
> > filesystem refuse ->writepage from both kswapd and direct reclaim,
> > so Mel's series will allow us to reclaim pages from more contexts
> > than before.
>
> fs ignoring ->writepage during memory pressure (even from kswapd) is
> broken, this is not up to the fs to decide. I'm using ext4 on most of
> my testing, it works ok, but it doesn't make it right (if fact if
> performance declines without that hack, it may prove VM needs fixing,
> it doesn't justify the hack).
>

Broken or not, it's what some of them are doing to avoid stack
overflows. Worse, they are ignoring both kswapd and direct reclaim when they
only really needed to ignore kswapd. With this series at least, the
check for PF_MEMALLOC in ->writepage can be removed.

> If you don't throttle against kswapd, or if even kswapd can't turn a
> dirty page into a clean one, you can get oom false positives. Anything
> is better than that.

This series would at least allow kswapd to turn dirty pages into clean
ones so it's an improvement.

> (provided you've proper stack instrumentation to
> notice when there is risk of a stack overflow, it's ages I never seen
> a stack overflow debug detector report)
>
> The irq stack must be enabled and this isn't about direct reclaim but
> about irqs in general and their potential nesting with softirq calls
> too.
>
> Also note, there's nothing that prevents us from switching the stack
> to something else the moment we enter direct reclaim.

Other than a lack of code to do it :/

If you really feel strongly about this, you could follow on the series
by extending clean_page_list() to switch stack if !kswapd.
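
A very rough sketch of what that follow-up might look like (purely
illustrative; call_on_new_stack() is a made-up stand-in for an arch helper
that runs a function on a separately allocated stack, much like the irq
stack entry code does, and clean_page_list()/scan_control are from this
series and mm/vmscan.c):

/*
 * Hypothetical sketch only. call_on_new_stack() does not exist; it stands
 * in for an arch-specific helper that switches to a freshly allocated
 * stack, calls the function and switches back.
 */
static void clean_page_list_stacksafe(struct list_head *page_list,
				      struct scan_control *sc)
{
	/* kswapd runs with a nearly empty stack, call directly */
	if (current->flags & PF_KSWAPD) {
		clean_page_list(page_list, sc);
		return;
	}

	/* direct reclaim may already be deep into its stack: switch first */
	call_on_new_stack(clean_page_list, page_list, sc);
}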

> It doesn't need
> to be physically contiguous. Just allocate a couple of 4k pages and
> switch to them every time a new hog starts in VM context. The only
> real complexity is in the stack unwind but if irqstack can cope with
> it sure stack unwind can cope with more "special" stacks too.
>
> Ignoring ->writepage on VM invocations at best can only hide VM
> inefficiencies with the downside of breaking the VM in corner cases
> with heavy VM pressure.
>

This has actually been the case for a while. I vaguely recall FS people
complaining about writepage from direct reclaim at some conference or
other two years ago.

> Crippling down the kernel by vetoing ->writepage to me looks very
> wrong, but I'd be totally supportive of a "special" writepage stack or
> special iscsi stack etc...
>

I'm not sure the complexity is justified based on the data I've seen so
far. What I had in mind instead was changing

if (reclaim_can_writeback(sc)) {
	cleaned = MAX_SWAP_CLEAN_WAIT;
	clean_page_list(page_list, sc);
	goto restart_dirty;
} else {
	cleaned++;
	/*
	 * If lumpy reclaiming, kick the background flusher and wait
	 * for the pages to be cleaned
	 *
	 * XXX: kswapd won't find these isolated pages but the
	 * background flusher does not prioritise pages. It'd
	 * be nice to prioritise a list of pages somehow
	 */
	if (sync_writeback == PAGEOUT_IO_SYNC) {
		wakeup_flusher_threads(nr_dirty);
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		goto restart_dirty;
	}
}

to

if (reclaim_can_writeback(sc)) {
	cleaned = MAX_SWAP_CLEAN_WAIT;
	clean_page_list(page_list, sc);
	goto restart_dirty;
} else {
	cleaned++;
	wakeup_flusher_threads(nr_dirty);
	congestion_wait(BLK_RW_ASYNC, HZ/10);

	/*
	 * If not in lumpy reclaim, just try these pages one more
	 * time before isolating more pages from the LRU
	 */
	if (sync_writeback != PAGEOUT_IO_SYNC)
		cleaned = MAX_SWAP_CLEAN_WAIT;
	goto restart_dirty;
}

i.e. when direct reclaim encounters N dirty pages, unconditionally ask the
flusher threads to clean that number of pages, throttle by waiting for them
to be cleaned, reclaim them if they get cleaned or otherwise scan more pages
on the LRU.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 15:45:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 11:25:26AM -0400, Christoph Hellwig wrote:
> hand can happen from context that already is say 4 or 6 kilobytes
> into stack usage. And the callchain from kmalloc() into ->writepage

Mel's stack trace of 5k was still not realistic as it doesn't call
writepage there. I was just asking about the 6k example vs msync.

Plus shrink dcache/inodes may also invoke I/O and end up with all
those hogs.

> I've never seen the stack overflow detector trigger on this, but I've
> seen lots of real life stack overflows on the mailing lists. End
> users don't run with it enabled normally, and most testing workloads
> don't seem to hit direct reclaim enough to actually trigger this
> reproducibly.

How do you know it's a stack overflow if it's not the stack overflow
detector firing before the fact? It could be bad ram too, usually.

> Which is a lot more complicated than loading off the page cleaning
> from direct reclaim to dedicated threads - be that the flusher threads
> or kswapd.

More complicated for sure. But surely I like that more than vetoing
->writepage from VM context, especially if it's a fs decision. fs
shouldn't decide that.

> It allows the system to survive in case direct reclaim is called instead
> of crashing with a stack overflow. And at least in my testing the
> VM seems to cope rather well with not beeing able to write out
> filesystem pages from direct reclaim. That doesn't mean that this
> behaviour can't be further improved on.

Agreed. Surely it seems to work ok for me too, but it may hide VM
issues, it makes the VM less reliable against potential false positive
OOM, and it's better if we just teach the VM to switch stack before
invoking the freeing methods, so it automatically solves dcache/icache
collection ending up writing data etc...

Then if we don't want to call ->writepage we won't do it for other
reasons, but we can solve this in a generic and reliable way that
covers not just ->writepage but all source I/O, including swapout over
iscsi, vfs etc...

2010-06-15 16:14:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 04:38:38PM +0100, Mel Gorman wrote:
> That is pretty much what Dave is claiming here at
> http://lkml.org/lkml/2010/4/13/121 where if mempool_alloc_slab() needed

This stack trace shows writepage called by shrink_page_list... that
contradicts Christoph's claim that xfs already won't writepage if
invoked by direct reclaim.

> to allocate a page and writepage was entered, there would have been a
> a problem.

There can't be a problem if a page wasn't available in the mempool, because
we can't nest two writepage calls on top of each other or it'd deadlock on
fs locks; that's the reason for GFP_NOFS, as noted in the email.

Surely this shows the writepage going very close to the stack
size... probably not enough to trigger the stack detector but close
enough to worry! Agreed.

I think we just need to switch stack on do_try_to_free_pages to solve
it, and not just writepage or the filesystems.

> Broken or not, it's what some of them are doing to avoid stack
> overflows. Worst, they are ignoring both kswapd and direct reclaim when they
> only really needed to ignore kswapd. With this series at least, the
> check for PF_MEMALLOC in ->writepage can be removed

I don't get how we end up in xfs_buf_ioapply above though if xfs
writepage is a noop on PF_MEMALLOC. Definitely PF_MEMALLOC is set
before try_to_free_pages but in the above trace writepage still runs
and submits the I/O.

> This series would at least allow kswapd to turn dirty pages into clean
> ones so it's an improvement.

Not saying it's not an improvement, but still it's not necessarily the
right direction.

> Other than a lack of code to do it :/

;)

> If you really feel strongly about this, you could follow on the series
> by extending clean_page_list() to switch stack if !kswapd.
>
> This has actually been the case for a while. I vaguely recall FS people

Again, that's not what it looks like from the stack trace. Also grepping for
PF_MEMALLOC in fs/xfs shows nothing. In fact it's ext4_write_inode
that skips the write if PF_MEMALLOC is set, not writepage apparently
(only did a quick grep so I might be wrong). I suspect
ext4_write_inode is the case I just mentioned about slab shrink, not
->writepage ;).

Inodes are small; it's no big deal to keep an inode pinned and not
slab-reclaimable because it is dirty, while skipping real writepage under
memory pressure could really open a regression in OOM false positives!
One pagecache page is much bigger than one inode, and there can be plenty
more dirty pagecache than inodes.

> i.e. when direct reclaim encounters N dirty pages, unconditionally ask the
> flusher threads to clean that number of pages, throttle by waiting for them
> to be cleaned, reclaim them if they get cleaned or otherwise scan more pages
> on the LRU.

Not bad at all... throttling is what makes it safe too. Problem is all
the rest that isn't solved by this and could be solved with a stack
switch, that's my main reason for considering this a ->writepage only
hack not complete enough to provide a generic solution for reclaim
issues ending up in fs->dm->iscsi/bio. I also suspect xfs is more of a stack
hog than others (it might not be a coincidence the 7k happens with xfs
writepage) and could be lightened up a bit by looking into it.

2010-06-15 16:22:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 06:14:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 04:38:38PM +0100, Mel Gorman wrote:
> > That is pretty much what Dave is claiming here at
> > http://lkml.org/lkml/2010/4/13/121 where if mempool_alloc_slab() needed
>
> This stack trace shows writepage called by shrink_page_list... that
> contradict Christoph's claim that xfs already won't writepage if
> invoked by direct reclaim.

We only recently did that - before that we tried to get the VM fixed
multiple times but finally had to bite the bullet and follow ext4 and
btrfs in that regard.

> Again not what looks like from the stack trace. Also grepping for
> PF_MEMALLOC in fs/xfs shows nothing. In fact it's ext4_write_inode
> that skips the write if PF_MEMALLOC is set, not writepage apparently
> (only did a quick grep so I might be wrong). I suspect
> ext4_write_inode is the case I just mentioned about slab shrink, not
> ->writepage ;).

ext4 in fact does not check PF_MEMALLOC but simply refuses to write
out anything in ->writepage in most cases. There is a corner case,
when the page doesn't have any buffers attached, where it would have
written out data without actually calling the allocator. I suspect
this code actually is a leftover, as we don't normally strip buffers
from a page that had them before.

> inodes are small, it's no big deal to keep an inode pinned and not
> slab-reclaimable because dirty, while skipping real writepage in
> memory pressure could really open a regression in oom false positives!
> One pagecache much bigger than one inode and there can be plenty more
> dirty pagecache than inodes.

At least for XFS ->write_inode is really simple these days. If it's
a synchronous writeout, which won't happen from this path, it logs the
inode, which is far less harmful than the whole allocator code, and
for write = 0 it only adds it to the delayed write queue, which doesn't
call into the I/O stack at all.

2010-06-15 16:26:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 05:45:16PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 11:25:26AM -0400, Christoph Hellwig wrote:
> > hand can happen from context that already is say 4 or 6 kilobytes
> > into stack usage. And the callchain from kmalloc() into ->writepage
>
> Mel's stack trace of 5k was still not realistic as it doesn't call
> writepage there. I was just asking the 6k example vs msync.

FYI here is the most recent one that Michael Monnerie reported after he
hit it on a production machine. It's what finally prompted us to add
the check in ->writepage:

[21877.948005] BUG: scheduling while atomic: rsync/2345/0xffff8800
[21877.948005] Modules linked in: af_packet nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 ramzswap xvmalloc lzo_decompress lzo_compress loop dm_mod reiserfs xfs exportfs xennet xenblk cdrom
[21877.948005] Pid: 2345, comm: rsync Not tainted 2.6.31.12-0.2-xen #1
[21877.948005] Call Trace:
[21877.949649] [<ffffffff800119b9>] try_stack_unwind+0x189/0x1b0
[21877.949659] [<ffffffff8000f466>] dump_trace+0xa6/0x1e0
[21877.949666] [<ffffffff800114c4>] show_trace_log_lvl+0x64/0x90
[21877.949676] [<ffffffff80011513>] show_trace+0x23/0x40
[21877.949684] [<ffffffff8046b92c>] dump_stack+0x81/0x9e
[21877.949695] [<ffffffff8003f398>] __schedule_bug+0x78/0x90
[21877.949702] [<ffffffff8046c97c>] thread_return+0x1d7/0x3fb
[21877.949709] [<ffffffff8046cf85>] schedule_timeout+0x195/0x200
[21877.949717] [<ffffffff8046be2b>] wait_for_common+0x10b/0x230
[21877.949726] [<ffffffff8046c09b>] wait_for_completion+0x2b/0x50
[21877.949768] [<ffffffffa009e741>] xfs_buf_iowait+0x31/0x80 [xfs]
[21877.949894] [<ffffffffa009ea30>] _xfs_buf_read+0x70/0x80 [xfs]
[21877.949992] [<ffffffffa009ef8b>] xfs_buf_read_flags+0x8b/0xd0 [xfs]
[21877.950089] [<ffffffffa0091ab9>] xfs_trans_read_buf+0x1e9/0x320 [xfs]
[21877.950174] [<ffffffffa005b278>] xfs_btree_read_buf_block+0x68/0xe0 [xfs]
[21877.950232] [<ffffffffa005b99e>] xfs_btree_lookup_get_block+0x8e/0x110 [xfs]
[21877.950281] [<ffffffffa005c0af>] xfs_btree_lookup+0xdf/0x4d0 [xfs]
[21877.950329] [<ffffffffa0042b77>] xfs_alloc_lookup_eq+0x27/0x50 [xfs]
[21877.950361] [<ffffffffa0042f09>] xfs_alloc_fixup_trees+0x249/0x370 [xfs]
[21877.950397] [<ffffffffa0044c30>] xfs_alloc_ag_vextent_near+0x4e0/0x9a0 [xfs]
[21877.950432] [<ffffffffa00451f5>] xfs_alloc_ag_vextent+0x105/0x160 [xfs]
[21877.950471] [<ffffffffa0045bb4>] xfs_alloc_vextent+0x3b4/0x4b0 [xfs]
[21877.950504] [<ffffffffa0058da8>] xfs_bmbt_alloc_block+0xf8/0x210 [xfs]
[21877.950550] [<ffffffffa005e3b7>] xfs_btree_split+0xc7/0x720 [xfs]
[21877.950597] [<ffffffffa005ef8c>] xfs_btree_make_block_unfull+0x15c/0x1c0 [xfs]
[21877.950643] [<ffffffffa005f3ff>] xfs_btree_insrec+0x40f/0x5c0 [xfs]
[21877.950689] [<ffffffffa005f651>] xfs_btree_insert+0xa1/0x1b0 [xfs]
[21877.950748] [<ffffffffa005325e>] xfs_bmap_add_extent_delay_real+0x82e/0x12a0 [xfs]
[21877.950787] [<ffffffffa00540f4>] xfs_bmap_add_extent+0x424/0x450 [xfs]
[21877.950833] [<ffffffffa00573f3>] xfs_bmapi+0xda3/0x1320 [xfs]
[21877.950879] [<ffffffffa007c248>] xfs_iomap_write_allocate+0x1d8/0x3f0 [xfs]
[21877.950953] [<ffffffffa007d089>] xfs_iomap+0x2c9/0x300 [xfs]
[21877.951021] [<ffffffffa009a1b8>] xfs_map_blocks+0x38/0x60 [xfs]
[21877.951108] [<ffffffffa009b93a>] xfs_page_state_convert+0x3fa/0x720 [xfs]
[21877.951204] [<ffffffffa009bde4>] xfs_vm_writepage+0x84/0x160 [xfs]
[21877.951301] [<ffffffff800e3603>] pageout+0x143/0x2b0
[21877.951308] [<ffffffff800e514e>] shrink_page_list+0x26e/0x650
[21877.951314] [<ffffffff800e5803>] shrink_inactive_list+0x2d3/0x7c0
[21877.951320] [<ffffffff800e5d4b>] shrink_list+0x5b/0x110
[21877.951325] [<ffffffff800e5f71>] shrink_zone+0x171/0x250
[21877.951330] [<ffffffff800e60d3>] shrink_zones+0x83/0x120
[21877.951336] [<ffffffff800e620e>] do_try_to_free_pages+0x9e/0x380
[21877.951342] [<ffffffff800e6607>] try_to_free_pages+0x77/0xa0
[21877.951349] [<ffffffff800dbfa3>] __alloc_pages_slowpath+0x2d3/0x5c0
[21877.951355] [<ffffffff800dc3e1>] __alloc_pages_nodemask+0x151/0x160
[21877.951362] [<ffffffff800d44b7>] __page_cache_alloc+0x27/0x50
[21877.951368] [<ffffffff800d68ca>] grab_cache_page_write_begin+0x9a/0xe0
[21877.951376] [<ffffffff8014bdfe>] block_write_begin+0xae/0x120
[21877.951396] [<ffffffffa009ac24>] xfs_vm_write_begin+0x34/0x50 [xfs]
[21877.951482] [<ffffffff800d4b31>] generic_perform_write+0xc1/0x1f0
[21877.951489] [<ffffffff800d5d00>] generic_file_buffered_write+0x90/0x160
[21877.951512] [<ffffffffa00a4711>] xfs_write+0x521/0xb60 [xfs]
[21877.951624] [<ffffffffa009fb80>] xfs_file_aio_write+0x70/0xa0 [xfs]
[21877.951711] [<ffffffff80118c42>] do_sync_write+0x102/0x160
[21877.951718] [<ffffffff80118fc8>] vfs_write+0xd8/0x1c0
[21877.951723] [<ffffffff8011995b>] sys_write+0x5b/0xa0
[21877.951729] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[21877.951736] [<00007fc41b0fab10>] 0x7fc41b0fab10
[21877.951750] BUG: unable to handle kernel paging request at 0000000108743280
[21877.951755] IP: [<ffffffff80034832>] dequeue_task+0x72/0x110
[21877.951766] PGD 31c6f067 PUD 0
[21877.951770] Thread overran stack, or stack corrupted

2010-06-15 16:31:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 06:14:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 15, 2010 at 04:38:38PM +0100, Mel Gorman wrote:
> > That is pretty much what Dave is claiming here at
> > http://lkml.org/lkml/2010/4/13/121 where if mempool_alloc_slab() needed
>
> This stack trace shows writepage called by shrink_page_list... that
> contradict Christoph's claim that xfs already won't writepage if
> invoked by direct reclaim.
>

See this

STATIC int
xfs_vm_writepage(
	struct page		*page,
	struct writeback_control *wbc)
{
	int			error;
	int			need_trans;
	int			delalloc, unmapped, unwritten;
	struct inode		*inode = page->mapping->host;

	trace_xfs_writepage(inode, page, 0);

	/*
	 * Refuse to write the page out if we are called from reclaim
	 * context.
	 *
	 * This is primarily to avoid stack overflows when called from deep
	 * used stacks in random callers for direct reclaim, but disabling
	 * reclaim for kswap is a nice side-effect as kswapd causes rather
	 * suboptimal I/O patters, too.
	 *
	 * This should really be done by the core VM, but until that happens
	 * filesystems like XFS, btrfs and ext4 have to take care of this
	 * by themselves.
	 */
	if (current->flags & PF_MEMALLOC)
		goto out_fail;
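
The out_fail path isn't quoted above, but it boils down to putting the page
back without doing any I/O, roughly something like:

out_fail:
	/* leave the page dirty for a later, shallower writeback pass */
	redirty_page_for_writepage(wbc, page);
	unlock_page(page);
	return 0;

so from the VM's point of view the page simply stays dirty.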


> > to allocate a page and writepage was entered, there would have been a
> > a problem.
>
> There can't be a problem if a page wasn't available in mempool because
> we can't nest two writepage on top of the other or it'd deadlock on fs
> locks and this is the reason of GFP_NOFS, like noticed in the email.
>

Indeed, this is another case where we wouldn't have burst the stack, just
come dangerously close. As Dave pointed out, we might have been in trouble
if the storage was also complicated, but there isn't specific proof - just
a lot of strong evidence.

My 5K example is poor, I'll admit, but the storage is also a bit simple.
Just one disk, no md, networking or anything else. This is why the data
I showed focused on how many dirty pages were being encountered during LRU
scanning, stalls and the like rather than the stack usage itself.

> Surely this shows the writepage going very close to the stack
> size... probably not enough to trigger the stack detector but close
> enough to worry! Agreed.
>
> I think we just need to switch stack on do_try_to_free_pages to solve
> it, and not just writepage or the filesystems.
>

Again, we're missing the code to do it, and I'm missing data showing that
not writing pages in direct reclaim is really a bad idea.

> > Broken or not, it's what some of them are doing to avoid stack
> > overflows. Worst, they are ignoring both kswapd and direct reclaim when they
> > only really needed to ignore kswapd. With this series at least, the
> > check for PF_MEMALLOC in ->writepage can be removed
>
> I don't get how we end up in xfs_buf_ioapply above though if xfs
> writepage is a noop on PF_MEMALLOC. Definitely PF_MEMALLOC is set
> before try_to_free_pages but in the above trace writepage still runs
> and submit the I/O.
>
> > This series would at least allow kswapd to turn dirty pages into clean
> > ones so it's an improvement.
>
> Not saying it's not an improvement, but still it's not necessarily the
> right direction.
>
> > Other than a lack of code to do it :/
>
> ;)
>
> > If you really feel strongly about this, you could follow on the series
> > by extending clean_page_list() to switch stack if !kswapd.
> >
> > This has actually been the case for a while. I vaguely recall FS people
>
> Again not what looks like from the stack trace. Also grepping for
> PF_MEMALLOC in fs/xfs shows nothing.

fs/xfs/linux-2.6/xfs_aops.c

> In fact it's ext4_write_inode
> that skips the write if PF_MEMALLOC is set, not writepage apparently
> (only did a quick grep so I might be wrong). I suspect
> ext4_write_inode is the case I just mentioned about slab shrink, not
> ->writepage ;).
>

After grepping through fs/, it was only xfs and btrfs that I saw were
specifically disabling writepage from reclaim context.

> inodes are small, it's no big deal to keep an inode pinned and not
> slab-reclaimable because dirty, while skipping real writepage in
> memory pressure could really open a regression in oom false positives!
> One pagecache much bigger than one inode and there can be plenty more
> dirty pagecache than inodes.
>
> > i.e. when direct reclaim encounters N dirty pages, unconditionally ask the
> > flusher threads to clean that number of pages, throttle by waiting for them
> > to be cleaned, reclaim them if they get cleaned or otherwise scan more pages
> > on the LRU.
>
> Not bad at all... throttling is what makes it safe too. Problem is all
> the rest that isn't solved by this and could be solved with a stack
> switch, that's my main reason for considering this a ->writepage only
> hack not complete enough to provide a generic solution for reclaim
> issues ending up in fs->dm->iscsi/bio. I also suspect xfs is more hog
> than others (might not be a coicidence the 7k happens with xfs
> writepage) and could be lightened up a bit by looking into it.
>

Other than the whole "lacking the code" thing, it's still not clear that
writing from direct reclaim is absolutely necessary for VM stability considering
it's been ignored today by at least two filesystems. I can add the throttling
logic if it'd make you happier, but I know it'd be at least two weeks
before I could start from scratch on a stack-switch-based solution, and a
PITA considering that I'm not convinced it's necessary :)

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 16:34:27

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

> > > That is pretty much what Dave is claiming here at
> > > http://lkml.org/lkml/2010/4/13/121 where if mempool_alloc_slab() needed
> >
> > This stack trace shows writepage called by shrink_page_list... that
> > contradict Christoph's claim that xfs already won't writepage if
> > invoked by direct reclaim.
> >
>
> See this
>
> STATIC int
> xfs_vm_writepage(
> 	struct page		*page,
> 	struct writeback_control *wbc)
> {
> 	int			error;
> 	int			need_trans;
> 	int			delalloc, unmapped, unwritten;
> 	struct inode		*inode = page->mapping->host;
>
> 	trace_xfs_writepage(inode, page, 0);
>
> 	/*
> 	 * Refuse to write the page out if we are called from reclaim
> 	 * context.
> 	 *
> 	 * This is primarily to avoid stack overflows when called from deep
> 	 * used stacks in random callers for direct reclaim, but disabling
> 	 * reclaim for kswap is a nice side-effect as kswapd causes rather
> 	 * suboptimal I/O patters, too.
> 	 *
> 	 * This should really be done by the core VM, but until that happens
> 	 * filesystems like XFS, btrfs and ext4 have to take care of this
> 	 * by themselves.
> 	 */
> 	if (current->flags & PF_MEMALLOC)
> 		goto out_fail;
>

My apologies. I didn't realise this was added so recently. I thought it
had been there for a while already so....

> > Not bad at all... throttling is what makes it safe too. Problem is all
> > the rest that isn't solved by this and could be solved with a stack
> > switch, that's my main reason for considering this a ->writepage only
> > hack not complete enough to provide a generic solution for reclaim
> > issues ending up in fs->dm->iscsi/bio. I also suspect xfs is more hog
> > than others (might not be a coicidence the 7k happens with xfs
> > writepage) and could be lightened up a bit by looking into it.
> >
>
> Other than the whole "lacking the code" thing and it's still not clear that
> writing from direct reclaim is absolutly necessary for VM stability considering
> it's been ignored today by at least two filesystems.

I retract this point as well because in reality, we have little data on
the full consequences of not writing pages from direct reclaim. Early
data based on the tests I've run indicate that the number of pages
direct reclaim writes is so small that it's not a problem but there is a
strong case for adding throttling at least.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 16:35:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 05:30:44PM +0100, Mel Gorman wrote:
> After grepping through fs/, it was only xfs and btrfs that I saw were
> specfically disabling writepage from reclaim context.

ext4 doesn't specifically disable writeback from reclaim context, but
in a rather convoluted way disables basically all writeback through
->writepage. The only thing allowed is overwrites of already allocated
blocks.

In addition to that, reiserfs also refuses to write back pages
from reclaim context if they require a transaction, which is the case
if the file was written to through mmap.

2010-06-15 16:45:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 06:14:19PM +0200, Andrea Arcangeli wrote:
> Again not what looks like from the stack trace. Also grepping for
> PF_MEMALLOC in fs/xfs shows nothing. In fact it's ext4_write_inode
> that skips the write if PF_MEMALLOC is set, not writepage apparently
> (only did a quick grep so I might be wrong). I suspect
> ext4_write_inode is the case I just mentioned about slab shrink, not
> ->writepage ;).
>
> inodes are small, it's no big deal to keep an inode pinned and not
> slab-reclaimable because dirty, while skipping real writepage in
> memory pressure could really open a regression in oom false positives!
> One pagecache much bigger than one inode and there can be plenty more
> dirty pagecache than inodes.

Btw, those comments in ext3/ext4 don't make much sense. The only
time iput_final ever calls into ->write_inode is when the filesystem
is being unmounted, which never happens with PF_MEMALLOC set.

2010-06-15 16:50:38

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On 06/15/2010 12:26 PM, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 05:45:16PM +0200, Andrea Arcangeli wrote:
>> On Tue, Jun 15, 2010 at 11:25:26AM -0400, Christoph Hellwig wrote:
>>> hand can happen from context that already is say 4 or 6 kilobytes
>>> into stack usage. And the callchain from kmalloc() into ->writepage
>>
>> Mel's stack trace of 5k was still not realistic as it doesn't call
>> writepage there. I was just asking the 6k example vs msync.
>
> FYI here is the most recent one that Michael Monnerie reported after he
> hit it on a production machine. It's what finally prompted us to add
> the check in ->writepage:
>
> [21877.948005] BUG: scheduling while atomic: rsync/2345/0xffff8800
> [21877.948005] Modules linked in: af_packet nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 ramzswap xvmalloc lzo_decompress lzo_compress loop dm_mod reiserfs xfs exportfs xennet xenblk cdrom
> [21877.948005] Pid: 2345, comm: rsync Not tainted 2.6.31.12-0.2-xen #1
> [21877.948005] Call Trace:
> [21877.949649] [<ffffffff800119b9>] try_stack_unwind+0x189/0x1b0
> [21877.949659] [<ffffffff8000f466>] dump_trace+0xa6/0x1e0
> [21877.949666] [<ffffffff800114c4>] show_trace_log_lvl+0x64/0x90
> [21877.949676] [<ffffffff80011513>] show_trace+0x23/0x40
> [21877.949684] [<ffffffff8046b92c>] dump_stack+0x81/0x9e
> [21877.949695] [<ffffffff8003f398>] __schedule_bug+0x78/0x90
> [21877.949702] [<ffffffff8046c97c>] thread_return+0x1d7/0x3fb
> [21877.949709] [<ffffffff8046cf85>] schedule_timeout+0x195/0x200
> [21877.949717] [<ffffffff8046be2b>] wait_for_common+0x10b/0x230
> [21877.949726] [<ffffffff8046c09b>] wait_for_completion+0x2b/0x50
> [21877.949768] [<ffffffffa009e741>] xfs_buf_iowait+0x31/0x80 [xfs]
> [21877.949894] [<ffffffffa009ea30>] _xfs_buf_read+0x70/0x80 [xfs]
> [21877.949992] [<ffffffffa009ef8b>] xfs_buf_read_flags+0x8b/0xd0 [xfs]
> [21877.950089] [<ffffffffa0091ab9>] xfs_trans_read_buf+0x1e9/0x320 [xfs]
> [21877.950174] [<ffffffffa005b278>] xfs_btree_read_buf_block+0x68/0xe0 [xfs]
> [21877.950232] [<ffffffffa005b99e>] xfs_btree_lookup_get_block+0x8e/0x110 [xfs]
> [21877.950281] [<ffffffffa005c0af>] xfs_btree_lookup+0xdf/0x4d0 [xfs]
> [21877.950329] [<ffffffffa0042b77>] xfs_alloc_lookup_eq+0x27/0x50 [xfs]
> [21877.950361] [<ffffffffa0042f09>] xfs_alloc_fixup_trees+0x249/0x370 [xfs]
> [21877.950397] [<ffffffffa0044c30>] xfs_alloc_ag_vextent_near+0x4e0/0x9a0 [xfs]
> [21877.950432] [<ffffffffa00451f5>] xfs_alloc_ag_vextent+0x105/0x160 [xfs]
> [21877.950471] [<ffffffffa0045bb4>] xfs_alloc_vextent+0x3b4/0x4b0 [xfs]
> [21877.950504] [<ffffffffa0058da8>] xfs_bmbt_alloc_block+0xf8/0x210 [xfs]
> [21877.950550] [<ffffffffa005e3b7>] xfs_btree_split+0xc7/0x720 [xfs]
> [21877.950597] [<ffffffffa005ef8c>] xfs_btree_make_block_unfull+0x15c/0x1c0 [xfs]
> [21877.950643] [<ffffffffa005f3ff>] xfs_btree_insrec+0x40f/0x5c0 [xfs]
> [21877.950689] [<ffffffffa005f651>] xfs_btree_insert+0xa1/0x1b0 [xfs]
> [21877.950748] [<ffffffffa005325e>] xfs_bmap_add_extent_delay_real+0x82e/0x12a0 [xfs]
> [21877.950787] [<ffffffffa00540f4>] xfs_bmap_add_extent+0x424/0x450 [xfs]
> [21877.950833] [<ffffffffa00573f3>] xfs_bmapi+0xda3/0x1320 [xfs]
> [21877.950879] [<ffffffffa007c248>] xfs_iomap_write_allocate+0x1d8/0x3f0 [xfs]
> [21877.950953] [<ffffffffa007d089>] xfs_iomap+0x2c9/0x300 [xfs]
> [21877.951021] [<ffffffffa009a1b8>] xfs_map_blocks+0x38/0x60 [xfs]
> [21877.951108] [<ffffffffa009b93a>] xfs_page_state_convert+0x3fa/0x720 [xfs]
> [21877.951204] [<ffffffffa009bde4>] xfs_vm_writepage+0x84/0x160 [xfs]
> [21877.951301] [<ffffffff800e3603>] pageout+0x143/0x2b0
> [21877.951308] [<ffffffff800e514e>] shrink_page_list+0x26e/0x650
> [21877.951314] [<ffffffff800e5803>] shrink_inactive_list+0x2d3/0x7c0
> [21877.951320] [<ffffffff800e5d4b>] shrink_list+0x5b/0x110
> [21877.951325] [<ffffffff800e5f71>] shrink_zone+0x171/0x250
> [21877.951330] [<ffffffff800e60d3>] shrink_zones+0x83/0x120
> [21877.951336] [<ffffffff800e620e>] do_try_to_free_pages+0x9e/0x380
> [21877.951342] [<ffffffff800e6607>] try_to_free_pages+0x77/0xa0
> [21877.951349] [<ffffffff800dbfa3>] __alloc_pages_slowpath+0x2d3/0x5c0
> [21877.951355] [<ffffffff800dc3e1>] __alloc_pages_nodemask+0x151/0x160
> [21877.951362] [<ffffffff800d44b7>] __page_cache_alloc+0x27/0x50
> [21877.951368] [<ffffffff800d68ca>] grab_cache_page_write_begin+0x9a/0xe0
> [21877.951376] [<ffffffff8014bdfe>] block_write_begin+0xae/0x120
> [21877.951396] [<ffffffffa009ac24>] xfs_vm_write_begin+0x34/0x50 [xfs]

This is already in a filesystem. Why does ->writepage get
called a second time? Shouldn't this have a gfp_mask
without __GFP_FS set?

> [21877.951482] [<ffffffff800d4b31>] generic_perform_write+0xc1/0x1f0
> [21877.951489] [<ffffffff800d5d00>] generic_file_buffered_write+0x90/0x160
> [21877.951512] [<ffffffffa00a4711>] xfs_write+0x521/0xb60 [xfs]
> [21877.951624] [<ffffffffa009fb80>] xfs_file_aio_write+0x70/0xa0 [xfs]
> [21877.951711] [<ffffffff80118c42>] do_sync_write+0x102/0x160
> [21877.951718] [<ffffffff80118fc8>] vfs_write+0xd8/0x1c0
> [21877.951723] [<ffffffff8011995b>] sys_write+0x5b/0xa0
> [21877.951729] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
> [21877.951736] [<00007fc41b0fab10>] 0x7fc41b0fab10
> [21877.951750] BUG: unable to handle kernel paging request at 0000000108743280
> [21877.951755] IP: [<ffffffff80034832>] dequeue_task+0x72/0x110
> [21877.951766] PGD 31c6f067 PUD 0
> [21877.951770] Thread overran stack, or stack corrupted
>


--
All rights reversed

2010-06-15 16:54:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
> This is already in a filesystem. Why does ->writepage get
> called a second time? Shouldn't this have a gfp_mask
> without __GFP_FS set?

Why would it? GFP_NOFS is not for all filesystem code, but only for
code where we can't re-enter the filesystem due to deadlock potential.

Except for a few filesystems that have transactions open inside
->aio_write no one uses GFP_NOFS from that path.
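
To make the distinction concrete, here is a minimal generic example (not
from any particular filesystem; the function name is made up):

#include <linux/slab.h>
#include <linux/types.h>

/*
 * GFP_NOFS is only for allocations made while holding locks or a
 * transaction that reclaim-driven fs writeback might need, i.e. where
 * re-entering the filesystem would deadlock.  Everything else,
 * including the buffered write path, keeps __GFP_FS set so that
 * reclaim is free to wait on (or trigger) filesystem activity.
 */
static void *fs_alloc_example(size_t size, bool inside_transaction)
{
	return kmalloc(size, inside_transaction ? GFP_NOFS : GFP_KERNEL);
}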

2010-06-15 16:55:16

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 05:34:07PM +0100, Mel Gorman wrote:
> My apologies. I didn't realise this was added so recently. I thought for
> a while already so....

It was also my fault I didn't grep with -r (as most fs layouts don't
have the writepage implementation under an inner linux-2.6/ dir ;),
but it's still recent, it was added on Jun 03...

I wonder if anybody tested swapon ./swapfile_on_xfs after such a
change during heavy memory pressure leading to OOM (but not reaching
it).

Christoph says ext4 also does the same thing, but the lack of a PF_MEMALLOC
check there rings a bell; I can't judge without understanding ext4
better. Surely ext4 had more testing than this xfs change of last week, so
taking ext4 as an example is a better idea if it does the same
thing. Taking the xfs change as an example is not ok anymore considering
when it was added...

> I retract this point as well because in reality, we have little data on
> the full consequences of not writing pages from direct reclaim. Early
> data based on the tests I've run indicate that the number of pages
> direct reclaim writes is so small that it's not a problem but there is a
> strong case for adding throttling at least.

A "cp /dev/zero ." on xfs filesystem, during a gcc build on same xfs,
plus some swapping with swapfile over same xfs, sounds good test for
that. I doubt anybody run that considering how young that is.

2010-06-15 16:56:12

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
> On 06/15/2010 12:26 PM, Christoph Hellwig wrote:
> >On Tue, Jun 15, 2010 at 05:45:16PM +0200, Andrea Arcangeli wrote:
> >[21877.951204] [<ffffffffa009bde4>] xfs_vm_writepage+0x84/0x160 [xfs]
> >[21877.951301] [<ffffffff800e3603>] pageout+0x143/0x2b0
> >[21877.951308] [<ffffffff800e514e>] shrink_page_list+0x26e/0x650
> >[21877.951314] [<ffffffff800e5803>] shrink_inactive_list+0x2d3/0x7c0
> >[21877.951320] [<ffffffff800e5d4b>] shrink_list+0x5b/0x110
> >[21877.951325] [<ffffffff800e5f71>] shrink_zone+0x171/0x250
> >[21877.951330] [<ffffffff800e60d3>] shrink_zones+0x83/0x120
> >[21877.951336] [<ffffffff800e620e>] do_try_to_free_pages+0x9e/0x380
> >[21877.951342] [<ffffffff800e6607>] try_to_free_pages+0x77/0xa0
> >[21877.951349] [<ffffffff800dbfa3>] __alloc_pages_slowpath+0x2d3/0x5c0
> >[21877.951355] [<ffffffff800dc3e1>] __alloc_pages_nodemask+0x151/0x160
> >[21877.951362] [<ffffffff800d44b7>] __page_cache_alloc+0x27/0x50
> >[21877.951368] [<ffffffff800d68ca>] grab_cache_page_write_begin+0x9a/0xe0
> >[21877.951376] [<ffffffff8014bdfe>] block_write_begin+0xae/0x120
> >[21877.951396] [<ffffffffa009ac24>] xfs_vm_write_begin+0x34/0x50 [xfs]
>
> This is already in a filesystem. Why does ->writepage get
> called a second time? Shouldn't this have a gfp_mask
> without __GFP_FS set?

No, we're allowed to use __GFP_FS with i_mutex held.

>
> >[21877.951482] [<ffffffff800d4b31>] generic_perform_write+0xc1/0x1f0
> >[21877.951489] [<ffffffff800d5d00>] generic_file_buffered_write+0x90/0x160
> >[21877.951512] [<ffffffffa00a4711>] xfs_write+0x521/0xb60 [xfs]
> >[21877.951624] [<ffffffffa009fb80>] xfs_file_aio_write+0x70/0xa0 [xfs]
> >[21877.951711] [<ffffffff80118c42>] do_sync_write+0x102/0x160
> >[21877.951718] [<ffffffff80118fc8>] vfs_write+0xd8/0x1c0
> >[21877.951723] [<ffffffff8011995b>] sys_write+0x5b/0xa0
> >[21877.951729] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
> >[21877.951736] [<00007fc41b0fab10>] 0x7fc41b0fab10
> >[21877.951750] BUG: unable to handle kernel paging request at 0000000108743280
> >[21877.951755] IP: [<ffffffff80034832>] dequeue_task+0x72/0x110
> >[21877.951766] PGD 31c6f067 PUD 0
> >[21877.951770] Thread overran stack, or stack corrupted
> >
>
>
> --
> All rights reversed

2010-06-15 17:37:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 03:51:34PM +0100, Mel Gorman wrote:
> kswapd does end up freeing a lot of memory in response to lumpy reclaim
> because it also tries to restore watermarks for a high-order page. This
> is disruptive to the system and something I'm going to revisit but it's
> a separate topic for another discussion. I can see why transparent
> hugepage support would not want this disruptive effect to occur where as
> it might make sense when resizing the hugepage pool.

On a related topic, I also had to nuke lumpy reclaim. It's pointless
with memory compaction, and it halts the system and makes it unusable
under all normal loads unless allocations are run like hugetlbfs does
(all at once at app startup and never again, so the hang is limited to
the first minute when the app starts). With a dynamic approach like THP
the system becomes unusable. Nothing should fail when a large order
allocation fails (I mean the large order that activates lumpy reclaim),
so there's no point grinding the system to an unusable state in order
to generate those large order pages, considering lumpy reclaim's
effectiveness is next to irrelevant compared to compaction, and in turn
not worth it.

> Depth Size Location (49 entries)
> ----- ---- --------
> 0) 5064 304 get_page_from_freelist+0x2e4/0x722
> 1) 4760 240 __alloc_pages_nodemask+0x15f/0x6a7
> 2) 4520 48 kmem_getpages+0x61/0x12c
> 3) 4472 96 cache_grow+0xca/0x272
> 4) 4376 80 cache_alloc_refill+0x1d4/0x226
> 5) 4296 64 kmem_cache_alloc+0x129/0x1bc
> 6) 4232 16 mempool_alloc_slab+0x16/0x18
> 7) 4216 144 mempool_alloc+0x56/0x104
> 8) 4072 16 scsi_sg_alloc+0x48/0x4a [scsi_mod]
> 9) 4056 96 __sg_alloc_table+0x58/0xf8
> 10) 3960 32 scsi_init_sgtable+0x37/0x8f [scsi_mod]
> 11) 3928 32 scsi_init_io+0x24/0xce [scsi_mod]
> 12) 3896 48 scsi_setup_fs_cmnd+0xbc/0xc4 [scsi_mod]
> 13) 3848 144 sd_prep_fn+0x1d3/0xc13 [sd_mod]
> 14) 3704 64 blk_peek_request+0xe2/0x1a6
> 15) 3640 96 scsi_request_fn+0x87/0x522 [scsi_mod]
> 16) 3544 32 __blk_run_queue+0x88/0x14b
> 17) 3512 48 elv_insert+0xb7/0x254
> 18) 3464 48 __elv_add_request+0x9f/0xa7
> 19) 3416 128 __make_request+0x3f4/0x476
> 20) 3288 192 generic_make_request+0x332/0x3a4
> 21) 3096 64 submit_bio+0xc4/0xcd
> 22) 3032 80 _xfs_buf_ioapply+0x222/0x252 [xfs]
> 23) 2952 48 xfs_buf_iorequest+0x84/0xa1 [xfs]
> 24) 2904 32 xlog_bdstrat+0x47/0x4d [xfs]
> 25) 2872 64 xlog_sync+0x21a/0x329 [xfs]
> 26) 2808 48 xlog_state_release_iclog+0x9b/0xa8 [xfs]
> 27) 2760 176 xlog_write+0x356/0x506 [xfs]
> 28) 2584 96 xfs_log_write+0x5a/0x86 [xfs]
> 29) 2488 368 xfs_trans_commit_iclog+0x165/0x2c3 [xfs]
> 30) 2120 80 _xfs_trans_commit+0xd8/0x20d [xfs]
> 31) 2040 240 xfs_iomap_write_allocate+0x247/0x336 [xfs]
> 32) 1800 144 xfs_iomap+0x31a/0x345 [xfs]
> 33) 1656 48 xfs_map_blocks+0x3c/0x40 [xfs]
> 34) 1608 256 xfs_page_state_convert+0x2c4/0x597 [xfs]
> 35) 1352 64 xfs_vm_writepage+0xf5/0x12f [xfs]
> 36) 1288 32 __writepage+0x17/0x34
> 37) 1256 288 write_cache_pages+0x1f3/0x2f8
> 38) 968 16 generic_writepages+0x24/0x2a
> 39) 952 64 xfs_vm_writepages+0x4f/0x5c [xfs]
> 40) 888 16 do_writepages+0x21/0x2a
> 41) 872 48 writeback_single_inode+0xd8/0x2f4
> 42) 824 112 writeback_inodes_wb+0x41a/0x51e
> 43) 712 176 wb_writeback+0x13d/0x1b7
> 44) 536 128 wb_do_writeback+0x150/0x167
> 45) 408 80 bdi_writeback_task+0x43/0x117
> 46) 328 48 bdi_start_fn+0x76/0xd5
> 47) 280 96 kthread+0x82/0x8a
> 48) 184 184 kernel_thread_helper+0x4/0x10
>
> XFS as you can see is quite deep there. Now consider if
> get_page_from_freelist() there had entered direct reclaim and then tried
> to writeback a page. That's the problem that is being worried about.

As said in the other email this can't be a problem, 5k is very ok there
and there's zero risk as writepage can't reenter itself or the fs would
lock up.

Even the above trace already shows that 5k is used just for xfs
writepage itself, so that means generic kernel code can't exceed 3k. I
agree it's too risky (at least with xfs, dunno if ext4 also eats ~5k
just for writepage + bio).

> I also haven't been able to trigger a new OOM as a result of the patch
> but maybe I'm missing something. To trigger an OOM, the bulk of the LRU

Well, you're throttling and waiting for I/O from the kernel thread, so it
should be fully safe and zero risk for OOM regressions, agreed!

But if we make changes to tackle this "risk", I'd prefer if we also allow
removing the PF_MEMALLOC check in ext4_write_inode.. and instead allow
it to run when __GFP_FS|__GFP_IO is set.
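
Something like this, purely as a sketch (reclaim_may_write() is made up;
scan_control is the one local to mm/vmscan.c):

/*
 * Illustrative only: gate writeback from reclaim on the allocation's
 * gfp_mask instead of a PF_MEMALLOC check in every filesystem.
 */
static bool reclaim_may_write(struct scan_control *sc)
{
	return (sc->gfp_mask & (__GFP_FS | __GFP_IO)) == (__GFP_FS | __GFP_IO);
}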

> I hadn't posted them because they had been posted previously and I
> didn't think they were that interesting as such because it wasn't being
> disputed.

No problem, I didn't notice those prev reports, the links you posted
have been handy to find them more quickly ;), that's surely more than
enough, thanks!

2010-06-15 17:38:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 12:26:00PM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 05:45:16PM +0200, Andrea Arcangeli wrote:
> > On Tue, Jun 15, 2010 at 11:25:26AM -0400, Christoph Hellwig wrote:
> > > hand can happen from context that already is say 4 or 6 kilobytes
> > > into stack usage. And the callchain from kmalloc() into ->writepage
> >
> > Mel's stack trace of 5k was still not realistic as it doesn't call
> > writepage there. I was just asking the 6k example vs msync.
>
> FYI here is the most recent one that Michael Monnerie reported after he
> hit it on a production machine. It's what finally prompted us to add
> the check in ->writepage:
>
> [21877.948005] BUG: scheduling while atomic: rsync/2345/0xffff8800
> [21877.948005] Modules linked in: af_packet nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 ramzswap xvmalloc lzo_decompress lzo_compress loop dm_mod reiserfs xfs exportfs xennet xenblk cdrom
> [21877.948005] Pid: 2345, comm: rsync Not tainted 2.6.31.12-0.2-xen #1
> [21877.948005] Call Trace:
> [21877.949649] [<ffffffff800119b9>] try_stack_unwind+0x189/0x1b0
> [21877.949659] [<ffffffff8000f466>] dump_trace+0xa6/0x1e0
> [21877.949666] [<ffffffff800114c4>] show_trace_log_lvl+0x64/0x90
> [21877.949676] [<ffffffff80011513>] show_trace+0x23/0x40
> [21877.949684] [<ffffffff8046b92c>] dump_stack+0x81/0x9e
> [21877.949695] [<ffffffff8003f398>] __schedule_bug+0x78/0x90
> [21877.949702] [<ffffffff8046c97c>] thread_return+0x1d7/0x3fb
> [21877.949709] [<ffffffff8046cf85>] schedule_timeout+0x195/0x200
> [21877.949717] [<ffffffff8046be2b>] wait_for_common+0x10b/0x230
> [21877.949726] [<ffffffff8046c09b>] wait_for_completion+0x2b/0x50
> [21877.949768] [<ffffffffa009e741>] xfs_buf_iowait+0x31/0x80 [xfs]
> [21877.949894] [<ffffffffa009ea30>] _xfs_buf_read+0x70/0x80 [xfs]
> [21877.949992] [<ffffffffa009ef8b>] xfs_buf_read_flags+0x8b/0xd0 [xfs]
> [21877.950089] [<ffffffffa0091ab9>] xfs_trans_read_buf+0x1e9/0x320 [xfs]
> [21877.950174] [<ffffffffa005b278>] xfs_btree_read_buf_block+0x68/0xe0 [xfs]
> [21877.950232] [<ffffffffa005b99e>] xfs_btree_lookup_get_block+0x8e/0x110 [xfs]
> [21877.950281] [<ffffffffa005c0af>] xfs_btree_lookup+0xdf/0x4d0 [xfs]
> [21877.950329] [<ffffffffa0042b77>] xfs_alloc_lookup_eq+0x27/0x50 [xfs]
> [21877.950361] [<ffffffffa0042f09>] xfs_alloc_fixup_trees+0x249/0x370 [xfs]
> [21877.950397] [<ffffffffa0044c30>] xfs_alloc_ag_vextent_near+0x4e0/0x9a0 [xfs]
> [21877.950432] [<ffffffffa00451f5>] xfs_alloc_ag_vextent+0x105/0x160 [xfs]
> [21877.950471] [<ffffffffa0045bb4>] xfs_alloc_vextent+0x3b4/0x4b0 [xfs]
> [21877.950504] [<ffffffffa0058da8>] xfs_bmbt_alloc_block+0xf8/0x210 [xfs]
> [21877.950550] [<ffffffffa005e3b7>] xfs_btree_split+0xc7/0x720 [xfs]
> [21877.950597] [<ffffffffa005ef8c>] xfs_btree_make_block_unfull+0x15c/0x1c0 [xfs]
> [21877.950643] [<ffffffffa005f3ff>] xfs_btree_insrec+0x40f/0x5c0 [xfs]
> [21877.950689] [<ffffffffa005f651>] xfs_btree_insert+0xa1/0x1b0 [xfs]
> [21877.950748] [<ffffffffa005325e>] xfs_bmap_add_extent_delay_real+0x82e/0x12a0 [xfs]
> [21877.950787] [<ffffffffa00540f4>] xfs_bmap_add_extent+0x424/0x450 [xfs]
> [21877.950833] [<ffffffffa00573f3>] xfs_bmapi+0xda3/0x1320 [xfs]
> [21877.950879] [<ffffffffa007c248>] xfs_iomap_write_allocate+0x1d8/0x3f0 [xfs]
> [21877.950953] [<ffffffffa007d089>] xfs_iomap+0x2c9/0x300 [xfs]
> [21877.951021] [<ffffffffa009a1b8>] xfs_map_blocks+0x38/0x60 [xfs]
> [21877.951108] [<ffffffffa009b93a>] xfs_page_state_convert+0x3fa/0x720 [xfs]
> [21877.951204] [<ffffffffa009bde4>] xfs_vm_writepage+0x84/0x160 [xfs]
> [21877.951301] [<ffffffff800e3603>] pageout+0x143/0x2b0
> [21877.951308] [<ffffffff800e514e>] shrink_page_list+0x26e/0x650
> [21877.951314] [<ffffffff800e5803>] shrink_inactive_list+0x2d3/0x7c0
> [21877.951320] [<ffffffff800e5d4b>] shrink_list+0x5b/0x110
> [21877.951325] [<ffffffff800e5f71>] shrink_zone+0x171/0x250
> [21877.951330] [<ffffffff800e60d3>] shrink_zones+0x83/0x120
> [21877.951336] [<ffffffff800e620e>] do_try_to_free_pages+0x9e/0x380
> [21877.951342] [<ffffffff800e6607>] try_to_free_pages+0x77/0xa0

If we switch stack here, we're done...

I surely agree Mel's series is much safer than the recent change that
adds the PF_MEMALLOC check. Also note I grepped current mainline, so this
xfs change is not recent but _very_ recent and probably hasn't been
tested with heavy VM pressure to verify it doesn't introduce early
OOM.

Definitely go with Mel's code rather than a blind PF_MEMALLOC check in
writepage. But I'd prefer if we switch stack and solve
ext4_write_inode too etc..

> [21877.951349] [<ffffffff800dbfa3>] __alloc_pages_slowpath+0x2d3/0x5c0
> [21877.951355] [<ffffffff800dc3e1>] __alloc_pages_nodemask+0x151/0x160
> [21877.951362] [<ffffffff800d44b7>] __page_cache_alloc+0x27/0x50
> [21877.951368] [<ffffffff800d68ca>] grab_cache_page_write_begin+0x9a/0xe0
> [21877.951376] [<ffffffff8014bdfe>] block_write_begin+0xae/0x120
> [21877.951396] [<ffffffffa009ac24>] xfs_vm_write_begin+0x34/0x50 [xfs]
> [21877.951482] [<ffffffff800d4b31>] generic_perform_write+0xc1/0x1f0
> [21877.951489] [<ffffffff800d5d00>] generic_file_buffered_write+0x90/0x160
> [21877.951512] [<ffffffffa00a4711>] xfs_write+0x521/0xb60 [xfs]
> [21877.951624] [<ffffffffa009fb80>] xfs_file_aio_write+0x70/0xa0 [xfs]
> [21877.951711] [<ffffffff80118c42>] do_sync_write+0x102/0x160
> [21877.951718] [<ffffffff80118fc8>] vfs_write+0xd8/0x1c0
> [21877.951723] [<ffffffff8011995b>] sys_write+0x5b/0xa0
> [21877.951729] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
> [21877.951736] [<00007fc41b0fab10>] 0x7fc41b0fab10
> [21877.951750] BUG: unable to handle kernel paging request at 0000000108743280
> [21877.951755] IP: [<ffffffff80034832>] dequeue_task+0x72/0x110
> [21877.951766] PGD 31c6f067 PUD 0
> [21877.951770] Thread overran stack, or stack corrupted
>

2010-06-15 17:38:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 05:30:44PM +0100, Mel Gorman wrote:
> See this
>
> STATIC int
> xfs_vm_writepage(
> 	struct page		*page,
> 	struct writeback_control *wbc)
> {
> 	int			error;
> 	int			need_trans;
> 	int			delalloc, unmapped, unwritten;
> 	struct inode		*inode = page->mapping->host;
>
> 	trace_xfs_writepage(inode, page, 0);
>
> 	/*
> 	 * Refuse to write the page out if we are called from reclaim
> 	 * context.
> 	 *
> 	 * This is primarily to avoid stack overflows when called from deep
> 	 * used stacks in random callers for direct reclaim, but disabling
> 	 * reclaim for kswap is a nice side-effect as kswapd causes rather
> 	 * suboptimal I/O patters, too.
> 	 *
> 	 * This should really be done by the core VM, but until that happens
> 	 * filesystems like XFS, btrfs and ext4 have to take care of this
> 	 * by themselves.
> 	 */
> 	if (current->flags & PF_MEMALLOC)
> 		goto out_fail;

so it's under xfs/linux-2.6... ;) I guess this dates back to the
xfs/irix, xfs/freebsd days, no prob.

> Again, missing the code to do it and am missing data showing that not
> writing pages in direct reclaim is really a bad idea.

Your code is functionally fine, my point is it's not just writepage as
shown by the PF_MEMALLOC check in ext4.

> Other than the whole "lacking the code" thing and it's still not clear that
> writing from direct reclaim is absolutly necessary for VM stability considering
> it's been ignored today by at least two filesystems. I can add the throttling
> logic if it'd make you happied but I know it'd be at least two weeks
> before I could start from scratch on a
> stack-switch-based-solution and a PITA considering that I'm not convinced
> it's necessary :)

The reason things are working, I think, is wait_on_page_writeback. By the
time lots of ram is full with dirty pages, pdflush and friends will have
submitted the I/O, and the VM will still wait for that I/O to complete.
Waiting eats no stack, submitting I/O does. So that explains why everything
works fine.
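
Rough sketch of the distinction I mean (not the actual shrink_page_list()
code, just an illustration):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/*
 * Illustrative only: waiting on writeback somebody else submitted is a
 * shallow operation, while calling ->writepage pulls the whole
 * fs -> block -> driver path onto the reclaimer's stack.
 */
static void reclaim_dirty_page_sketch(struct page *page,
				      struct writeback_control *wbc)
{
	if (PageWriteback(page)) {
		/* shallow: just sleep on the page's writeback bit */
		wait_on_page_writeback(page);
	} else if (PageDirty(page)) {
		/* deep: fs ->writepage -> bio -> driver, all on this stack */
		page->mapping->a_ops->writepage(page, wbc);
	}
}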

It'd be interesting to verify that things don't fall apart with
current xfs if you swapon ./file_on_xfs instead of /dev/something.

2010-06-15 17:44:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 06:37:47PM +0200, Andrea Arcangeli wrote:
> It'd be interesting to verify that things don't fall apart with
> current xfs if you swapon ./file_on_xfs instead of /dev/something.

I can give it a try, but I don't see why it would make any difference.
Swap files bypass the filesystem completely during the I/O phase as the
swap code builds an extent map during swapon and then submits bios
by itself. That also means no allocator calls or other forms of
metadata updates.

2010-06-15 19:14:25

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On 06/15/2010 12:54 PM, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
>> This is already in a filesystem. Why does ->writepage get
>> called a second time? Shouldn't this have a gfp_mask
>> without __GFP_FS set?
>
> Why would it? GFP_NOFS is not for all filesystem code, but only for
> code where we can't re-enter the filesystem due to deadlock potential.

Why? How about because you know the stack is not big enough
to have the XFS call path on it twice? :)

Isn't the whole purpose of this patch series to prevent writepage
from being called by the VM, when invoked from a deep callstack
like xfs writepage?

That sounds a lot like simply wanting to not have GFP_FS...

--
All rights reversed

2010-06-15 19:17:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 03:13:09PM -0400, Rik van Riel wrote:
> Why? How about because you know the stack is not big enough
> to have the XFS call path on it twice? :)
>
> Isn't the whole purpose of this patch series to prevent writepage
> from being called by the VM, when invoked from a deep callstack
> like xfs writepage?

It's not invoked from xfs writepage, but from xfs_file_aio_write via
generic_file_buffered_write. Which isn't actually all that deep a
callstack, just an example of one that's already bad enough to overflow
the stack.

> That sounds a lot like simply wanting to not have GFP_FS...

There's no point in sprinkling random GFP_NOFS flags. It's not just
the filesystem code that uses a lot of stack.

2010-06-15 19:46:17

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 03:17:16PM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 03:13:09PM -0400, Rik van Riel wrote:
> > Why? How about because you know the stack is not big enough
> > to have the XFS call path on it twice? :)
> >
> > Isn't the whole purpose of this patch series to prevent writepage
> > from being called by the VM, when invoked from a deep callstack
> > like xfs writepage?
>
> It's not invoked from xfs writepage, but from xfs_file_aio_write via
> generic_file_buffered_write. Which isn't actually an all that deep
> callstack, just en example of one that's alread bad enough to overflow
> the stack.

Keep in mind that both ext4 and btrfs have similar checks in their
writepage path. I think Dave Chinner's stack analysis was very clear
here, there's no room in the stack for any filesystem and direct reclaim
to live happily together.

Circling back to an older thread:

> 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 33) 3120 384 shrink_page_list+0x65e/0x840
> 34) 2736 528 shrink_zone+0x63f/0xe10
> 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> 36) 2096 128 try_to_free_pages+0x77/0x80
> 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> 38) 1728 48 alloc_pages_current+0x8c/0xe0
> 39) 1680 16 __get_free_pages+0xe/0x50
> 40) 1664 48 __pollwait+0xca/0x110
> 41) 1616 32 unix_poll+0x28/0xc0
> 42) 1584 16 sock_poll+0x1d/0x20
> 43) 1568 912 do_select+0x3d6/0x700
> 44) 656 416 core_sys_select+0x18c/0x2c0
> 45) 240 112 sys_select+0x4f/0x110
> 46) 128 128 system_call_fastpath+0x16/0x1b

So, before xfs can hand this work off to one of its 16 btrees, push
it through the hand-tuned IRIX simulator or even think about spreading
the work across 512 cpus (whoops, I guess that's just btrfs), we've used
up quite a lot of the stack.

I'm not against direct reclaim, but I think we have to admit that it has
to be done from another stack context. Handoff to a different
thread, whatever.

When the reclaim does happen, it would be really nice if I/Os were done
in large-ish clusters. Small I/Os reclaim less memory in more time and
slow everything down.
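
(A minimal sketch of what clustered writeout could look like with the
writeback API of the time; assumptions: WB_SYNC_NONE is acceptable for
reclaim-driven writeout, error handling is omitted, and do_writepages()
is used so mappings without ->writepages still fall back to
generic_writepages().)

    static int write_range_sketch(struct address_space *mapping,
                                  pgoff_t start, pgoff_t end)
    {
            struct writeback_control wbc = {
                    .sync_mode   = WB_SYNC_NONE,
                    .nr_to_write = end - start + 1,
                    .range_start = (loff_t)start << PAGE_CACHE_SHIFT,
                    .range_end   = ((loff_t)(end + 1) << PAGE_CACHE_SHIFT) - 1,
            };

            /* one ranged ->writepages call instead of one ->writepage per page */
            return do_writepages(mapping, &wbc);
    }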

-chris

2010-06-16 07:57:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Tue, Jun 15, 2010 at 03:13:09PM -0400, Rik van Riel wrote:
> On 06/15/2010 12:54 PM, Christoph Hellwig wrote:
> >On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
> >>This is already in a filesystem. Why does ->writepage get
> >>called a second time? Shouldn't this have a gfp_mask
> >>without __GFP_FS set?
> >
> >Why would it? GFP_NOFS is not for all filesystem code, but only for
> >code where we can't re-enter the filesystem due to deadlock potential.
>
> Why? How about because you know the stack is not big enough
> to have the XFS call path on it twice? :)
>
> Isn't the whole purpose of this patch series to prevent writepage
> from being called by the VM, when invoked from a deep callstack
> like xfs writepage?
>
> That sounds a lot like simply wanting to not have GFP_FS...

The buffered write path uses __GFP_FS by design because huge amounts
of (dirty) memory can be allocated when doing pagecache writes. It
would be nasty if that was not allowed to wait for filesystem
activity.
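
(For what it's worth, the knob for narrowing a mapping's allocation mask
does exist; an illustration-only use of the pagemap helpers is below.
The point above is exactly that applying it to regular files' page cache
would be a bad trade.)

    /*
     * Illustration only: forbid page cache allocations for this inode
     * from re-entering the filesystem.  Some filesystems do this for
     * their internal metadata mappings; doing it for ordinary file
     * writes would mean buffered-write allocations could never wait
     * on filesystem activity.
     */
    mapping_set_gfp_mask(inode->i_mapping,
                         mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);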

2010-06-16 17:00:45

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On 06/16/2010 03:57 AM, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 03:13:09PM -0400, Rik van Riel wrote:
>> On 06/15/2010 12:54 PM, Christoph Hellwig wrote:
>>> On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
>>>> This is already in a filesystem. Why does ->writepage get
>>>> called a second time? Shouldn't this have a gfp_mask
>>>> without __GFP_FS set?
>>>
>>> Why would it? GFP_NOFS is not for all filesystem code, but only for
>>> code where we can't re-enter the filesystem due to deadlock potential.
>>
>> Why? How about because you know the stack is not big enough
>> to have the XFS call path on it twice? :)
>>
>> Isn't the whole purpose of this patch series to prevent writepage
>> from being called by the VM, when invoked from a deep callstack
>> like xfs writepage?
>>
>> That sounds a lot like simply wanting to not have GFP_FS...
>
> The buffered write path uses __GFP_FS by design because huge amounts
> of (dirty) memory can be allocated when doing pagecache writes. It
> would be nasty if that was not allowed to wait for filesystem
> activity.

__GFP_IO can wait for filesystem activity

__GFP_FS can kick off new filesystem activity

At least, that's how I remember it from when I last looked
at that code in detail. Things may have changed subtly.

--
All rights reversed

2010-06-16 17:05:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

On Wed, Jun 16, 2010 at 12:59:54PM -0400, Rik van Riel wrote:
> __GFP_IO can wait for filesystem activity

Hmm, I think it's about submitting I/O, not about waiting. At some point
you may not enter the FS because of the FS locks you already hold
(like within writepage itself), but you can still submit I/O through
the blkdev layer.

> __GFP_FS can kick off new filesystem activity

Yes, that's for dcache/icache/writepage or anything that can re-enter
the fs locks and deadlock, IIRC.
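
(Roughly how reclaim applies that distinction, simplified from
shrink_page_list() in mm/vmscan.c of this era; not the exact code.)

    /*
     * __GFP_IO: the caller may submit (swap) I/O at all.
     * __GFP_FS: the caller may re-enter the filesystem, i.e. have
     *           pageout() call ->writepage on a file-backed page.
     */
    may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
                   (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

    if (PageDirty(page)) {
            if (!may_enter_fs)
                    goto keep_locked;       /* leave it dirty for another pass */
            /* otherwise pageout() may end up in mapping->a_ops->writepage() */
    }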