2011-02-20 14:44:02

by Minchan Kim

Subject: [PATCH v6 0/4] fadvise(DONTNEED) support

Recently, there was a reported problem about thrashing.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It happens with backup workloads (e.g. nightly rsync).
Such a workload creates use-once pages but touches them twice,
which promotes the pages to the active list and ends up evicting
the working set. Application developers would like to use
POSIX_FADV_NOREUSE, but Linux, like other OSes, doesn't support it.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As an alternative, application developers use POSIX_FADV_DONTNEED.
But it has a problem: if the kernel encounters a page that is still
being written out during invalidate_mapping_pages, it cannot drop it.
That makes the hint hard to use, since applications always have to
sync their data before calling fadvise(POSIX_FADV_DONTNEED) to make
sure the pages are discardable. As a result, they cannot benefit from
the kernel's deferred writes and they see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
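
For illustration, the usage pattern this enables looks roughly like the
following userspace sketch (the helper name, file handling and error
handling are hypothetical, not taken from rsync or Ben's patch):

/*
 * Hypothetical helper: write a backup destination file and hint that
 * its page cache will not be reused.  Without this series the DONTNEED
 * hint is ignored for pages still under deferred write-out unless an
 * expensive fdatasync() is issued first.
 */
#include <fcntl.h>
#include <unistd.h>

static int write_backup(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	/* No fdatasync(): keep the benefit of the kernel's deferred writes. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	return close(fd);
}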

In fact, invalidation is a big hint to the reclaimer: it means we
don't use the page any more. So the idea in this series is to move
invalidated pages that could not be freed onto the inactive list.
That helps reclaim efficiency considerably and prevents eviction of
the working set.

My experiment is as follows.

Test Environment :
DRAM : 2G, CPU : Intel(R) Core(TM)2 CPU
Rsync backup directory size : 16G

The rsync version is 3.0.7.
The rsync patch is Ben's fadvise patch.
The stress scenario runs the following jobs in parallel.

1. git clone linux-2.6
2. make all -j4 linux-mmotm
3. rsync src dst

nrns : no-patched rsync + no stress
prns : patched rsync + no stress
nrs : no-patched rsync + stress
prs : patched rsync + stress

For profiling, I added some vmstat counters.
pginvalidate : the total number of pages moved by invalidate_mapping_pages
pgreclaim : the number of those pages moved to the inactive list's tail via PG_reclaim

NRNS PRNS NRS PRS
Elapsed time 36:01.49 37:13.58 01:23:24 01:21:45
nr_vmscan_write 184 1 296 509
pgactivate 76559 84714 445214 463143
pgdeactivate 19360 40184 74302 91423
pginvalidate 0 2240333 0 1769147
pgreclaim 0 1849651 0 1650796
pgfault 406208 421860 72485217 70334416
pgmajfault 212 334 5149 3688
pgsteal_dma 0 0 0 0
pgsteal_normal 2645174 1545116 2521098 1578651
pgsteal_high 5162080 2562269 6074720 3137294
pgsteal_movable 0 0 0 0
pgscan_kswapd_dma 0 0 0 0
pgscan_kswapd_normal 2641732 1545374 2499894 1557882
pgscan_kswapd_high 5143794 2567981 5999585 3051150
pgscan_kswapd_movable 0 0 0 0
pgscan_direct_dma 0 0 0 0
pgscan_direct_normal 3643 0 21613 21238
pgscan_direct_high 20174 1783 76980 87848
pgscan_direct_movable 0 0 0 0
pginodesteal 130 1029 3510 24100
slabs_scanned 1421824 1648128 1870720 1880320
kswapd_steal 7785153 4105620 8498332 4608372
kswapd_inodesteal 189432 474052 342835 472503
pageoutrun 100687 52282 145712 70946
allocstall 22 1 149 163
pgrotated 0 2231408 2932 1765393
unevictable_pgs_scanned 0 0 0 0

In the stress test (NRS vs PRS), pgsteal_[normal|high] are reduced by 37% and 48%,
and pgscan_kswapd_[normal|high] are reduced by 37% and 49%.
That means that although the VM scans a smaller window, it can reclaim enough pages
to work well and avoid evicting pages unnecessarily.
rsync's elapsed time is reduced by 1.5 minutes, but I think rsync's fadvise usage
isn't good yet, because in [NRNS vs PRNS] it takes one minute longer.
I think that is caused by unnecessary fadvise system calls; if rsync's fadvise
were smarter, the effect would be much better than now.
Major page faults are reduced by 28%, which is good.
What I can't understand is why inode steal is increased.
If anyone knows, please explain it to me.
Anyway, this patch series improves reclaim efficiency very much.

Recently, Steven Barrett applied this series to his project kernel,
the "Liquorix kernel", and reported the following, along with one problem.
(The problem is solved by [3/3]; see that patch's description.)

" I've been having really good results with your new patch set that
mitigates the problem where a backup utility or something like that
reads each file once and eventually evicting the original working set
out of the page cache.
...
...
These patches solved some problems on a friend's desktop.
He said that his wife wanted to send me kisses and hugs because their
computer was so responsive after the patches were applied.
"
So I think this patch series solves a real problem.

- [1/3] moves an invalidated page that is dirty/writeback and on the active
list to the head of the inactive list.
- [2/3] moves a memcg reclaimable page to the tail of the inactive list.
- [3/3] moves an invalidated page to the tail of the inactive list when its
writeout completes, so it can be reclaimed as soon as possible.

These patches are based on mmotm-02-04.

Changelog since v5:
- Remove the vmstat profiling patch for the final merge

Changelog since v4:
- Remove the patches related to madvise and the swap.c cleanup patch
(I will separate the madvise issue from this series and repost it after this series is merged)

Minchan Kim (3):
deactivate invalidated pages
memcg: move memcg reclaimable page into tail of inactive list
Reclaim invalidated page ASAP

include/linux/memcontrol.h | 6 ++
include/linux/swap.h | 1 +
mm/memcontrol.c | 27 ++++++++++
mm/page-writeback.c | 12 ++++-
mm/swap.c | 116 +++++++++++++++++++++++++++++++++++++++++++-
mm/truncate.c | 17 +++++--
6 files changed, 172 insertions(+), 7 deletions(-)


2011-02-20 14:44:10

by Minchan Kim

Subject: [PATCH v6 1/3] deactivate invalidated pages

Recently, there was a reported problem about thrashing.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It happens with backup workloads (e.g. nightly rsync).
Such a workload creates use-once pages but touches them twice,
which promotes the pages to the active list and ends up evicting
the working set.

Some application developers want POSIX_FADV_NOREUSE support,
but other OSes don't support it, either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As an alternative, application developers use POSIX_FADV_DONTNEED.
But it has a problem: if the kernel encounters a page that is still
being written out during invalidate_mapping_pages, it cannot drop it.
That makes the hint hard to use, since applications always have to
sync their data before calling fadvise(POSIX_FADV_DONTNEED) to make
sure the pages are discardable. As a result, they cannot benefit from
the kernel's deferred writes and they see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)

In fact, invalidation is a big hint to the reclaimer: it means we
don't use the page any more. So let's move a page that is still being
written to the head of the inactive list if we can't truncate it
right now.

The reason this patch moves the page to the head of the LRU is that a
dirty/writeback page will be flushed sooner or later anyway. It gives
the flusher time to write the page out and prevents the single-page
writeout from pageout(), which is less effective than the flusher's
writeout.

Originally, I reused Peter's lru_demote with some changes, so I added
his Signed-off-by.

Reported-by: Ben Gamari <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
Changelog since v5:
- call deactivate_page out of lock_page - suggested by Balbir

Changelog since v4:
- Change function comments - suggested by Johannes
- Change function name - suggested by Johannes
- Drop only dirty/writeback pages to the deactivate pagevec - suggested by Johannes
- Add acked-by

Changelog since v3:
- Change function comments - suggested by Johannes
- Change function name - suggested by Johannes
- add only dirty/writeback pages to the deactivate pagevec

Changelog since v2:
- leave mapped pages alone - suggested by Mel
- move the PG_reclaim-related part to the next patch

Changelog since v1:
- modify description
- correct typo
- add some comment


include/linux/swap.h | 1 +
mm/swap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 17 ++++++++---
3 files changed, 91 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..c335055 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,7 @@ extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index c02f936..4aea806 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,7 @@ int page_cluster;

static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -347,6 +348,60 @@ void add_page_to_unevictable_list(struct page *page)
}

/*
+ * If the page can not be invalidated, it is moved to the
+ * inactive list to speed up its reclaim. It is moved to the
+ * head of the list, rather than the tail, to give the flusher
+ * threads some time to write it out, as this is much more
+ * effective than the single-page writeout from reclaim.
+ */
+static void lru_deactivate(struct page *page, struct zone *zone)
+{
+ int lru, file;
+
+ if (!PageLRU(page) || !PageActive(page))
+ return;
+
+ /* Some processes are using the page */
+ if (page_mapped(page))
+ return;
+
+ file = page_is_file_cache(page);
+ lru = page_lru_base_type(page);
+ del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(zone, page, lru);
+ __count_vm_event(PGDEACTIVATE);
+
+ update_page_reclaim_stat(zone, page, file, 0);
+}
+
+static void ____pagevec_lru_deactivate(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ lru_deactivate(page, zone);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
+
+/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
* disabled; or "cpu" is being hot-unplugged, and is already dead.
@@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
pagevec_move_tail(pvec);
local_irq_restore(flags);
}
+
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ ____pagevec_lru_deactivate(pvec);
+}
+
+/**
+ * deactivate_page - forcefully deactivate a page
+ * @page: page to deactivate
+ *
+ * This function hints the VM that @page is a good reclaim candidate,
+ * for example if its invalidation fails due to the page being dirty
+ * or under writeback.
+ */
+void deactivate_page(struct page *page)
+{
+ if (likely(get_page_unless_zero(page))) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ if (!pagevec_add(pvec, page))
+ ____pagevec_lru_deactivate(pvec);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
}

void lru_add_drain(void)
diff --git a/mm/truncate.c b/mm/truncate.c
index 4d415b3..35a379b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -328,11 +328,12 @@ EXPORT_SYMBOL(truncate_inode_pages);
* pagetables.
*/
unsigned long invalidate_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t end)
+ pgoff_t start, pgoff_t end)
{
struct pagevec pvec;
pgoff_t next = start;
- unsigned long ret = 0;
+ unsigned long ret;
+ unsigned long count = 0;
int i;

pagevec_init(&pvec, 0);
@@ -359,9 +360,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (lock_failed)
continue;

- ret += invalidate_inode_page(page);
-
+ ret = invalidate_inode_page(page);
unlock_page(page);
+ /*
+ * Invalidation is a hint that the page is no longer
+ * of interest and try to speed up its reclaim.
+ */
+ if (!ret)
+ deactivate_page(page);
+ count += ret;
if (next > end)
break;
}
@@ -369,7 +376,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
mem_cgroup_uncharge_end();
cond_resched();
}
- return ret;
+ return count;
}
EXPORT_SYMBOL(invalidate_mapping_pages);

--
1.7.1

2011-02-20 14:44:13

by Minchan Kim

Subject: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list

The rotate_reclaimable_page function moves just written out
pages, which the VM wanted to reclaim, to the end of the
inactive list. That way the VM will find those pages first
next time it needs to free memory.
This patch applies the same rule to memcg.
It can help prevent unnecessary eviction of a memcg's working pages.

Acked-by: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
Changelog since v5:
- change page_lru_base_type to page_lru - suggested by Kame

Changelog since v4:
- add acked-by and reviewed-by
- change description - suggested by Rik

include/linux/memcontrol.h | 6 ++++++
mm/memcontrol.c | 27 +++++++++++++++++++++++++++
mm/swap.c | 3 ++-
3 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3da48ae..5a5ce70 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
@@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
return ;
}

+static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+ return ;
+}
+
static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
{
return ;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 686f1ce..8a97571 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
mem_cgroup_del_lru_list(page, page_lru(page));
}

+/*
+ * Writeback is about to end against a page which has been marked for immediate
+ * reclaim. If it still appears to be reclaimable, move it to the tail of the
+ * inactive list.
+ */
+void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;
+ enum lru_list lru = page_lru(page);
+
+ if (mem_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(page);
+ /*
+ * Used bit is set without atomic ops but after smp_wmb().
+ * For making pc->mem_cgroup visible, insert smp_rmb() here.
+ */
+ smp_rmb();
+ /* unused or root page is not rotated. */
+ if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+ return;
+ mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
+ list_move_tail(&pc->lru, &mz->lists[lru]);
+}
+
void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz;
diff --git a/mm/swap.c b/mm/swap.c
index 4aea806..1b9e4eb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,8 +200,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
spin_lock(&zone->lru_lock);
}
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
- int lru = page_lru_base_type(page);
+ enum lru_list lru = page_lru_base_type(page);
list_move_tail(&page->lru, &zone->lru[lru].list);
+ mem_cgroup_rotate_reclaimable_page(page);
pgmoved++;
}
}
--
1.7.1

2011-02-20 14:44:23

by Minchan Kim

Subject: [PATCH v6 3/3] Reclaim invalidated page ASAP

invalidate_mapping_pages is a big hint to the reclaimer: it means the
user doesn't want to use the page any more. So, in order to prevent
working set eviction, this patch moves such a page to the tail of the
inactive list by setting PG_reclaim.

Please remember that pages on the inactive list are part of the working
set just like those on the active list. If we don't move invalidated
pages to the inactive list's tail, pages near that tail can be evicted
even though we have a big clue about which pages are useless. That's
totally bad.

Currently PG_readahead and PG_reclaim share a bit. Commit fe3cba17
added ClearPageReclaim to clear_page_dirty_for_io to prevent a
readahead marker page from being reclaimed too quickly.

In this series, PG_reclaim is used for invalidated pages, too. If the
VM finds that a page was invalidated and is dirty, it sets PG_reclaim
so the page is reclaimed as soon as possible. But when the dirty page
is later written back, clear_page_dirty_for_io clears PG_reclaim
unconditionally, which defeats this series' goal.

I think it's okay to clear PG_readahead when the page is dirtied rather
than at writeback time, so this patch moves the ClearPageReclaim call
from clear_page_dirty_for_io to set_page_dirty. In v4, clearing the
flag in set_page_dirty had a problem reported by Steven Barrett, caused
by compound pages: some drivers (e.g. audio) call set_page_dirty on a
compound page that isn't on the LRU, and my patch then did
ClearPageReclaim on the compound page. Without
CONFIG_PAGEFLAGS_EXTENDED, that corrupts the PageTail flag.

I think it doesn't affect THP, and it passes my tests with THP enabled,
but I've Cc'ed Andrea to double-check.

Reported-by: Steven Barrett <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/page-writeback.c | 12 +++++++++++-
mm/swap.c | 41 ++++++++++++++++++++++++++++++++++++++---
2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..b437fe6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1211,6 +1211,17 @@ int set_page_dirty(struct page *page)

if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
+ /*
+ * readahead/lru_deactivate_page could remain
+ * PG_readahead/PG_reclaim due to race with end_page_writeback
+ * About readahead, if the page is written, the flags would be
+ * reset. So no problem.
+ * About lru_deactivate_page, if the page is redirty, the flag
+ * will be reset. So no problem. but if the page is used by readahead
+ * it will confuse readahead and make it restart the size rampup
+ * process. But it's a trivial problem.
+ */
+ ClearPageReclaim(page);
#ifdef CONFIG_BLOCK
if (!spd)
spd = __set_page_dirty_buffers;
@@ -1266,7 +1277,6 @@ int clear_page_dirty_for_io(struct page *page)

BUG_ON(!PageLocked(page));

- ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.
diff --git a/mm/swap.c b/mm/swap.c
index 1b9e4eb..0a33714 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -354,26 +354,61 @@ void add_page_to_unevictable_list(struct page *page)
* head of the list, rather than the tail, to give the flusher
* threads some time to write it out, as this is much more
* effective than the single-page writeout from reclaim.
+ *
+ * If the page isn't page_mapped and is dirty/writeback, the page
+ * can be reclaimed asap by using PG_reclaim.
+ *
+ * 1. active, mapped page -> none
+ * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
+ * 3. inactive, mapped page -> none
+ * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
+ * 5. inactive, clean -> inactive, tail
+ * 6. Others -> none
+ *
+ * In case 4, the page is moved to the inactive list's head because the
+ * VM expects it to be written out by the flusher threads, which is much
+ * more effective than the single-page writeout from reclaim.
*/
static void lru_deactivate(struct page *page, struct zone *zone)
{
int lru, file;
+ bool active;

- if (!PageLRU(page) || !PageActive(page))
+ if (!PageLRU(page))
return;

/* Some processes are using the page */
if (page_mapped(page))
return;

+ active = PageActive(page);
+
file = page_is_file_cache(page);
lru = page_lru_base_type(page);
- del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+ del_page_from_lru_list(zone, page, lru + active);
ClearPageActive(page);
ClearPageReferenced(page);
add_page_to_lru_list(zone, page, lru);
- __count_vm_event(PGDEACTIVATE);

+ if (PageWriteback(page) || PageDirty(page)) {
+ /*
+ * PG_reclaim could be raced with end_page_writeback
+ * It can make readahead confusing. But race window
+ * is _really_ small and it's non-critical problem.
+ */
+ SetPageReclaim(page);
+ } else {
+ /*
+ * The page's writeback has already completed;
+ * move the page to the tail of the inactive list.
+ */
+ list_move_tail(&page->lru, &zone->lru[lru].list);
+ mem_cgroup_rotate_reclaimable_page(page);
+ __count_vm_event(PGROTATED);
+ }
+
+ if (active)
+ __count_vm_event(PGDEACTIVATE);
update_page_reclaim_stat(zone, page, file, 0);
}

--
1.7.1

2011-02-21 08:30:05

by Johannes Weiner

Subject: Re: [PATCH v6 1/3] deactivate invalidated pages

On Sun, Feb 20, 2011 at 11:43:36PM +0900, Minchan Kim wrote:
> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By other approach, app developers use POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It makes for application programmer to use it since they always
> have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
> make sure the pages could be discardable. At last, they can't use
> deferred write of kernel so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head if we can't truncate
> it right now.
>
> Why I move page to head of lru on this patch, Dirty/Writeback page
> would be flushed sooner or later. It can prevent writeout of pageout
> which is less effective than flusher's writeout.
>
> Originally, I reused lru_demote of Peter with some change so added
> his Signed-off-by.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

Reviewed-by: Johannes Weiner <[email protected]>

2011-02-21 08:40:25

by Johannes Weiner

Subject: Re: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list

On Sun, Feb 20, 2011 at 11:43:37PM +0900, Minchan Kim wrote:
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
> mem_cgroup_del_lru_list(page, page_lru(page));
> }
>
> +/*
> + * Writeback is about to end against a page which has been marked for immediate
> + * reclaim. If it still appears to be reclaimable, move it to the tail of the
> + * inactive list.
> + */
> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +{
> + struct mem_cgroup_per_zone *mz;
> + struct page_cgroup *pc;
> + enum lru_list lru = page_lru(page);
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + pc = lookup_page_cgroup(page);
> + /*
> + * Used bit is set without atomic ops but after smp_wmb().
> + * For making pc->mem_cgroup visible, insert smp_rmb() here.
> + */
> + smp_rmb();
> + /* unused or root page is not rotated. */
> + if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
> + return;

The placement of this barrier is confused and has been fixed up in the
meantime in other places. It has to be between PageCgroupUsed() and
accessing pc->mem_cgroup. You can look at the other memcg lru
functions for reference.

2011-02-21 14:07:27

by Minchan Kim

Subject: Re: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list

On Mon, Feb 21, 2011 at 5:40 PM, Johannes Weiner <[email protected]> wrote:
> On Sun, Feb 20, 2011 at 11:43:37PM +0900, Minchan Kim wrote:
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
>>       mem_cgroup_del_lru_list(page, page_lru(page));
>>  }
>>
>> +/*
>> + * Writeback is about to end against a page which has been marked for immediate
>> + * reclaim.  If it still appears to be reclaimable, move it to the tail of the
>> + * inactive list.
>> + */
>> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
>> +{
>> +     struct mem_cgroup_per_zone *mz;
>> +     struct page_cgroup *pc;
>> +     enum lru_list lru = page_lru(page);
>> +
>> +     if (mem_cgroup_disabled())
>> +             return;
>> +
>> +     pc = lookup_page_cgroup(page);
>> +     /*
>> +      * Used bit is set without atomic ops but after smp_wmb().
>> +      * For making pc->mem_cgroup visible, insert smp_rmb() here.
>> +      */
>> +     smp_rmb();
>> +     /* unused or root page is not rotated. */
>> +     if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
>> +             return;
>
> The placement of this barrier is confused and has been fixed up in the
> meantime in other places.  It has to be between PageCgroupUsed() and
> accessing pc->mem_cgroup.  You can look at the other memcg lru
> functions for reference.

Yes. I saw your patch at the time but forgot about it.
I will resend a fixed version.
Thanks.

--
Kind regards,
Minchan Kim

2011-02-21 15:59:38

by Minchan Kim

Subject: Re: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list

Fixed version.

From be7d31f6e539bbad1ebedf52c6a51a4a80f7976a Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 22 Feb 2011 00:53:05 +0900
Subject: [PATCH v7 2/3] memcg: move memcg reclaimable page into tail of inactive list

The rotate_reclaimable_page function moves just written out
pages, which the VM wanted to reclaim, to the end of the
inactive list. That way the VM will find those pages first
next time it needs to free memory.
This patch applies the same rule to memcg.
It can help prevent unnecessary eviction of a memcg's working pages.

Acked-by: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---

Changelog since v6:
- adjust smp_rmb - suggested by Johannes

include/linux/memcontrol.h | 6 ++++++
mm/memcontrol.c | 26 ++++++++++++++++++++++++++
mm/swap.c | 3 ++-
3 files changed, 34 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3da48ae..5a5ce70 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
@@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
return ;
}

+static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+ return ;
+}
+
static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
{
return ;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 686f1ce..2fc97fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -813,6 +813,32 @@ void mem_cgroup_del_lru(struct page *page)
mem_cgroup_del_lru_list(page, page_lru(page));
}

+/*
+ * Writeback is about to end against a page which has been marked for immediate
+ * reclaim. If it still appears to be reclaimable, move it to the tail of the
+ * inactive list.
+ */
+void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;
+ enum lru_list lru = page_lru(page);
+
+ if (mem_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(page);
+ /* unused or root page is not rotated. */
+ if (!PageCgroupUsed(pc))
+ return;
+ /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+ smp_rmb();
+ if (mem_cgroup_is_root(pc->mem_cgroup))
+ return;
+ mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
+ list_move_tail(&pc->lru, &mz->lists[lru]);
+}
+
void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz;
diff --git a/mm/swap.c b/mm/swap.c
index 4aea806..1b9e4eb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,8 +200,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
spin_lock(&zone->lru_lock);
}
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
- int lru = page_lru_base_type(page);
+ enum lru_list lru = page_lru_base_type(page);
list_move_tail(&page->lru, &zone->lru[lru].list);
+ mem_cgroup_rotate_reclaimable_page(page);
pgmoved++;
}
}
--
1.7.1

2011-02-21 16:07:16

by Johannes Weiner

Subject: Re: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list

On Tue, Feb 22, 2011 at 12:59:25AM +0900, Minchan Kim wrote:
> Fixed version.
>
> >From be7d31f6e539bbad1ebedf52c6a51a4a80f7976a Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Tue, 22 Feb 2011 00:53:05 +0900
> Subject: [PATCH v7 2/3] memcg: move memcg reclaimable page into tail of inactive list
>
> The rotate_reclaimable_page function moves just written out
> pages, which the VM wanted to reclaim, to the end of the
> inactive list. That way the VM will find those pages first
> next time it needs to free memory.
> This patch apply the rule in memcg.
> It can help to prevent unnecessary working page eviction of memcg.
>
> Acked-by: Balbir Singh <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

Reviewed-by: Johannes Weiner <[email protected]>

2011-02-21 19:09:05

by Andrea Arcangeli

Subject: Re: [PATCH v6 0/4] fadvise(DONTNEED) support

Hello,

On Sun, Feb 20, 2011 at 11:43:35PM +0900, Minchan Kim wrote:
> Recently, there was a reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into

"recently" and "thrashing horribly" seem to signal a regression. Ok
that trying to have backup not messing up the VM working set, but by
any means running rsync in a loop shouldn't lead a server into
"trashing horribly" (other than for the additional disk I/O, just like
if rsync would be using O_DIRECT).

This effort to teach rsync to tell the VM that its cache accesses are
likely use-once is good (tar will need it too), but if this is a
regression, as the words above ("recently" and "thrashing horribly")
suggest, I suspect fixing a VM regression is a much higher priority
than adding fadvise support to rsync/tar. Likely, if the system didn't
start "thrashing horribly", they wouldn't need it in rsync.

Then fadvise becomes an improvement on top of that.

It'd be nice if it were at least tested whether an older kernel also
thrashes horribly after being left idle overnight. If it still thrashes
horribly with 2.6.18, ok... ignore this; otherwise we need a real fix.

I'm quite confident that older kernels would do perfectly ok with a
loop of rsync overnight while the system was idle. I also got people
asking me privately what they can do to keep the backup from causing
swapout, which further makes me believe something regressed recently,
as older VM code would never swap out on such a workload, even if the
pages are used twice or three times in a row. If it swaps out, that's
the real bug.

I've had questions about limiting the pagecache size to a certain
amount. That works too, but it's again a band-aid like fadvise, and a
really minor issue compared to fixing the VM so that you can at least
tell the kernel "nuke all clean cache first". Being able to tell the
kernel just that (even if some clever VM algorithm thinks swapping is
better and we want to swap by default) will fix it. We still need a
way to make the kernel behave perfectly, with zero swapping, without
fadvise and without limiting the cache. Maybe setting swappiness to 0
does just that; I suggested it and heard nothing back.

If you can reproduce it, I suggest making sure that at least nothing
gets swapped during the overnight workload, as swapping there would
signal a definitive problem.

2011-02-21 22:59:54

by Minchan Kim

Subject: Re: [PATCH v6 0/4] fadvise(DONTNEED) support

Hi Andrea,

On Tue, Feb 22, 2011 at 4:07 AM, Andrea Arcangeli <[email protected]> wrote:
> Hello,
>
> On Sun, Feb 20, 2011 at 11:43:35PM +0900, Minchan Kim wrote:
>> Recently, there was a reported problem about thrashing.
>> (http://marc.info/?l=rsync&m=128885034930933&w=2)
>> It happens by backup workloads(ex, nightly rsync).
>> That's because the workload makes just use-once pages
>> and touches pages twice. It promotes the page into
>
> "recently" and "thrashing horribly" seem to signal a regression. Ok
> that trying to have backup not messing up the VM working set, but by
> any means running rsync in a loop shouldn't lead a server into
> "trashing horribly" (other than for the additional disk I/O, just like
> if rsync would be using O_DIRECT).
>
> This effort in teaching rsync to tell the VM it's likely an used-once
> type of access to the cache is good (tar will need it too), but if
> this is a regression like it appears from the words above ("recently"
> and "trashing horribly"), I suspect it's much higher priority to fix a
> VM regression than to add fadvise support in rsync/tar. Likely if the
> system didn't start "trashing horribly", they wouldn't need rsync.
>
> Then fadvise becomes an improvement on top of that.
>
> It'd be nice if at least it was tested if older kernel wouldn't trash
> horribly after being left inactive overnight. If it still trashes
> horribly with 2.6.18 ok... ignore this, otherwise we need a real fix.

I don't have a reproducible experiment.
I started the series from Ben's complaint.
http://marc.info/?l=rsync&m=128885034930933&w=2
I don't know whether Ben can test it with an older kernel, since he has
gone silent recently.

I am not sure an older kernel handled such a workload well.
That's because Rik had already pointed out rsync's two-touch
problem (http://linux-mm.org/PageReplacementRequirements) and solved
part of it by keeping half of the file pages active (look at
inactive_file_is_low) in the big reclaim changes in 2.6.28.
So I think the problem with such workloads has been there for a while.

>
> I'm quite comfortable that older kernels would do perfectly ok with a
> loop of rsync overnight while the system was idle. I also got people
> asking me privately what to do to avoid the backup to swapout, that
> further make me believe something regressed recently as older VM code
> would never swapout on such a workload, even if you do used twice or 3
> times in a row. If it swapout that's the real bug.

Hmm,,,

>
> I had questions about limiting the pagecache size to a certain amount,
> that works too, but that's again a band aid like fadvise, and it's
> real minor issue compared to fixing the VM so that at least you can
> tell the kernel "nuke all clean cache first", being able to tell the
> kernel just that (even if some VM clever algorithm thinks swapping is
> better and we want to swap by default) will fix it. We still need a
> way to make the kernel behave perfect with zero swapping without
> fadvise and without limiting the cache. Maybe setting swappiness to 0
> just does that, I suggested that and I heard nothing back.

I don't think it's desirable to set swappiness to 0, since it changes
the system-wide reclaim policy just to work around one backup program's
bad behavior.
And this patch's benefit is not only protecting the working set.
Before this patch, an application had to sync its data before fadvise
for the hint to take effect, which is very inefficient because of the
slow sync; if the invalidation meets a dirty page, it skips that page.
After this patch, an application can call fadvise without syncing,
because we can move the page to the head or tail of the inactive list.

So I think the patch is valuable enough to merge, no?

>
> If you can reproduce I suggest making sure that at least it doesn't
> swap anything during the overnight workload as that would signal a
> definitive problem.
>
With swappiness = 0?
As I said, I don't like that solution.

And as I said, I need Ben's help, but I am not sure he can help.
Thanks for the review, Andrea.


--
Kind regards,
Minchan Kim

2011-02-22 13:29:38

by Andrea Arcangeli

Subject: Re: [PATCH v6 0/4] fadvise(DONTNEED) support

Hi Minchan,

On Tue, Feb 22, 2011 at 07:59:51AM +0900, Minchan Kim wrote:
> I don't have a reproducible experiment.
> I started the series with Ben's complain.
> http://marc.info/?l=rsync&m=128885034930933&w=2

Yes I've noticed.

> I don't know Ben could test it with older kernel since recently he got silent.

That's my point: we should check whether the "thrashing horribly" is
really "recent" or whether it has always happened with 2.6.18 and
earlier kernels.

> I am not sure older kernel worked well such as workload.
> That's because Rik had been pointed out rsync's two touch
> problem(http://linux-mm.org/PageReplacementRequirements) and solved
> part of the problem with remaining half of the active file pages(look
> at inactive_file_is_low) on 2.6.28's big change reclaim.
> So I think the problem about such workload have been in there.

It's possible it's an old problem, but frankly I doubt any swapping
would ever have happened before, no matter how much rsync you ran in a
loop. As I said, I also got private messages from users asking what
they can do to limit the pagecache because their systems swap overnight
under backup loads, and that's definitely not ok. Or at least there
must be a tweak to tell the VM "stop doing the swapping thing during
backups". I haven't tried to reproduce or debug this yet, as I'm busy
with other compaction/THP related bits.

> And this patch's benefit is not only for preventing working set.
> Before this patch, application should sync the data before fadvise to
> take a effect but it's very inefficient by slow sync operation. If the
> invalidation meet dirty page, it can skip the page. But after this
> patch, application could fdavise without sync because we could move
> the page of head/or tail of inactive list.

I agree, the objective of the patch definitely looks good. Before,
the fadvise would be ignored without a preceding fdatasync, as the
pages were dirty and couldn't be discarded.

> So I think the patch is valuable enough to merge?

The objective looks good. I didn't have time to review all the details
of the patches, but I'll try to review them later.

> And as I said, I need Ben's help but I am not sure he can.
> Thanks for the review, Andrea.

Thanks for the effort in improving fadvise.

I'd like to try to find the time to check whether we have a regression
without fadvise too. It's perfectly ok to use fadvise in rsync, but my
point is that the kernel should not thrash running apps regardless of
fadvise, and I think that is a higher priority to fix than the fadvise
improvement. Still, the fadvise improvement is very valuable to
increase overall performance and hopefully to avoid wasting most of the
filesystem cache on backups (but wasting cache shouldn't lead to
thrashing horribly, just more I/O from apps after the backup run, like
after starting an app for the first time; thrashing usually means we
swapped out or discarded mapped pages too, and that's wrong).

Thanks,
Andrea

2011-02-22 14:26:23

by Minchan Kim

Subject: Re: [PATCH v6 0/4] fadvise(DONTNEED) support

On Tue, Feb 22, 2011 at 02:28:04PM +0100, Andrea Arcangeli wrote:
> Hi Minchan,
>
> On Tue, Feb 22, 2011 at 07:59:51AM +0900, Minchan Kim wrote:
> > I don't have a reproducible experiment.
> > I started the series with Ben's complain.
> > http://marc.info/?l=rsync&m=128885034930933&w=2
>
> Yes I've noticed.
>
> > I don't know Ben could test it with older kernel since recently he got silent.
>
> That's my point, we should check if the "thrashing horribly" is really
> a "recently" or if it has always happened before with 2.6.18 and previous.
>
> > I am not sure older kernel worked well such as workload.
> > That's because Rik had been pointed out rsync's two touch
> > problem(http://linux-mm.org/PageReplacementRequirements) and solved
> > part of the problem with remaining half of the active file pages(look
> > at inactive_file_is_low) on 2.6.28's big change reclaim.
> > So I think the problem about such workload have been in there.
>
> It's possible it's an old problem, but frankly I doubt that any
> swapping would have ever happened before, no matter how much rsync you
> would run in a loop. As said I also got PM from users asking what they
> can do to limit the pagecache because their systems are swapping
> overnight because of backup loads, that's definitely not ok. Or at
> least there must be a tweak to tell the VM "stop doing the swapping
> thing with backup". I didn't yet try to reproduce or debug this as I'm
> busy with other compaction/THP related bits.
>
> > And this patch's benefit is not only for preventing working set.
> > Before this patch, application should sync the data before fadvise to
> > take a effect but it's very inefficient by slow sync operation. If the
> > invalidation meet dirty page, it can skip the page. But after this
> > patch, application could fdavise without sync because we could move
> > the page of head/or tail of inactive list.
>
> I agree, the objective of the patch definitely looks good. Before the
> fadvise would be ignored without a fdatasync before it as the pages
> were dirty and couldn't be discarded.
>
> > So I think the patch is valuable enough to merge?
>
> The objective looks good, I didn't have time to review all details of
> the patches but I'll try to review it later.
>
> > And as I said, I need Ben's help but I am not sure he can.
> > Thanks for the review, Andrea.
>
> Thanks for the effort in improving fadvise.
>
> I'd like to try to find the time to check if we've a regression
> without fadvise too. It's perfectly ok to use fadvise in rsync but my
> point is that the kernel should not lead to trashing of running apps
> regardless of fadvise or not, and I think that is higher priority to
> fix than the fadvise improvement. But still the fadvise improvement is
> very valuable to increase overall performance and hopefully to avoid
> wasting most of filesystem cache because of backups (but wasting cache
> shouldn't lead to trashing horribly, just more I/O from apps after the
> backup run, like after starting the app the first time, trashing
> usually means we swapped out or discarded mapped pages too and that's
> wrong).

I agree with your opinion, but I hope the patches go into 2.6.38 or 39.
That's because even if we find the regression's root cause, how would this series change?
I think it would make no difference either way.
Of course, if rsync-like applications start to use fadvise aggressively, the problem
could be hidden, but we still have older kernels and an older rsync, so we can
reproduce it and find the root cause.
What's the problem if the series is merged?
If there is a good reason, it's no problem to hold the series back.

I _totally_ agree with your opinion and I want to find the root cause of the regression, too.
But unfortunately, I don't have the time or the environment to reproduce it. ;(
I hope clever people like you will have time to find it and report it to linux-mm
in the future.

Ben, could you test your workload on an older 2.6.18 kernel if you see this thread?
It would help us very much.
Thanks.

>
> Thanks,
> Andrea

--
Kind regards,
Minchan Kim

2011-02-22 14:49:11

by Andrea Arcangeli

Subject: Re: [PATCH v6 0/4] fadvise(DONTNEED) support

On Tue, Feb 22, 2011 at 11:26:10PM +0900, Minchan Kim wrote:
> I agree your opinion but I hope the patch is going on 2.6.38 or 39.
> That's because if we find regression's root cause, how could this series be changed?
> I think it's no difference before and after.
> Of course, if rsync like applicatoin start to use fadvise agressively, the problem
> could be buried on toe but we still have a older kernel and older rsync so we can
> reproduce it then we can find the root cause.

No risk of hiding it (I do backups with tar ;). I'm also assuming your
modification to rsync isn't going to be on by default (for a small
working set, it's ok if rsync keeps the cache and doesn't discard it).

> What's the problem if the series is merged?
> If it is reasonable, it's no problem to pend the series.

I have absolutely no problem with the objective of this series. The
objective looks very good and it can increase performance (as shown by
your benchmark: saving 2 minutes from your workload under stress and
only running slightly slower without stress).

> I _totally_ agree your opinion and I want to find root cause of the regression, too.
> But unfortunatly, I don't have any time and enviroment to reproduce it. ;(
> I hope clever people like you would have a time to find it and report it to linux-mm
> in future.
>
> Ben. Could you test your workload on older 2.6.18 kernel if you see the thread?
> It could help us very much.

Exactly. I only wanted to suggest checking what happens with 2.6.18,
to at least know whether it's a regression or not. Because if we have
a regression, we need more work on this.

2011-02-22 17:14:10

by Andrea Arcangeli

Subject: Re: [PATCH v6 0/4] fadvise(DONTNEED) support

On Tue, Feb 22, 2011 at 11:03:51AM -0600, Jeffrey Hundstad wrote:
> On 02/22/2011 07:28 AM, Andrea Arcangeli wrote:
> > Hi Minchan,
> > That's my point, we should check if the "thrashing horribly" is really
> > a "recently" or if it has always happened before with 2.6.18 and previous.
>
> I would bet that "thrashing" isn't what he meant. He almost certainly
> meant "needless I/O" and nothing to do with swapping. I've seen this
> similar report before. This report is almost always gets answered with
> a "set your swappiness to an appropriate value" type answer. Or a "your

Hmm, but swappiness can't help if "needless I/O in the morning after
the backup" is the problem. So it wouldn't be the right suggestion if
needless I/O is the problem.

We need a "vmstat 1" during the "thrashing horribly"/"needless I/O" in
the morning to be sure.

> userspace app needs to get smarter" type answer. I think they are
> trying to do the latter, but need some help from the kernel, and that's
> what they're trying to do here.

That's a fine thing. But I got reports independent of this thread from
friends asking me how to prevent swapping overnight because of backups,
so I connected those reports to the "thrashing horribly"/"recently"
above. I had no problems personally, but my backup reads everything
once (not twice, so it doesn't activate/reference pages). I haven't had
time yet to investigate further and see whether things work ok for me
even with an rsync loop activating and referencing everything.

2011-02-22 17:29:36

by Jeffrey Hundstad

Subject: Re: [PATCH v6 0/4] fadvise(DONTNEED) support

On 02/22/2011 07:28 AM, Andrea Arcangeli wrote:
> Hi Minchan,
> That's my point, we should check if the "thrashing horribly" is really
> a "recently" or if it has always happened before with 2.6.18 and previous.

I would bet that "thrashing" isn't what he meant. He almost certainly
meant "needless I/O" and nothing to do with swapping. I've seen similar
reports before. Such a report almost always gets answered with a "set
your swappiness to an appropriate value" type answer, or a "your
userspace app needs to get smarter" type answer. I think they are
trying to do the latter, but need some help from the kernel, and that's
what they're trying to do here.

--
Jeffrey Hundstad

2011-03-28 12:51:52

by Eric Dumazet

Subject: [PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page

On Tuesday 22 February 2011 at 00:59 +0900, Minchan Kim wrote:
> Fixed version.
>
> From be7d31f6e539bbad1ebedf52c6a51a4a80f7976a Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Tue, 22 Feb 2011 00:53:05 +0900
> Subject: [PATCH v7 2/3] memcg: move memcg reclaimable page into tail of inactive list
>
> The rotate_reclaimable_page function moves just written out
> pages, which the VM wanted to reclaim, to the end of the
> inactive list. That way the VM will find those pages first
> next time it needs to free memory.
> This patch apply the rule in memcg.
> It can help to prevent unnecessary working page eviction of memcg.
>
> Acked-by: Balbir Singh <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---

Hmm... "inline inline" is an error on some gcc versions

CC arch/x86/kernel/asm-offsets.s
In file included from include/linux/swap.h:8,
from include/linux/suspend.h:4,
from arch/x86/kernel/asm-offsets.c:12:
include/linux/memcontrol.h:220: error: duplicate `inline'
make[1]: *** [arch/x86/kernel/asm-offsets.s] Error 1


> +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +{
> + return ;
> +}
> +

[PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page proto

commit 3f58a8294333 (move memcg reclaimable page into tail of inactive
list) added the inline keyword twice in its prototype.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..5e9840f5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -216,7 +216,7 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
return ;
}

-static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
{
return ;
}

2011-03-28 13:46:05

by Minchan Kim

Subject: Re: [PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page

On Mon, Mar 28, 2011 at 02:51:46PM +0200, Eric Dumazet wrote:
> On Tuesday 22 February 2011 at 00:59 +0900, Minchan Kim wrote:
> > Fixed version.
> >
> > From be7d31f6e539bbad1ebedf52c6a51a4a80f7976a Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <[email protected]>
> > Date: Tue, 22 Feb 2011 00:53:05 +0900
> > Subject: [PATCH v7 2/3] memcg: move memcg reclaimable page into tail of inactive list
> >
> > The rotate_reclaimable_page function moves just written out
> > pages, which the VM wanted to reclaim, to the end of the
> > inactive list. That way the VM will find those pages first
> > next time it needs to free memory.
> > This patch apply the rule in memcg.
> > It can help to prevent unnecessary working page eviction of memcg.
> >
> > Acked-by: Balbir Singh <[email protected]>
> > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> > Reviewed-by: Rik van Riel <[email protected]>
> > Cc: KOSAKI Motohiro <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
>
> Hmm... "inline inline" is an error on some gcc versions
>
> CC arch/x86/kernel/asm-offsets.s
> In file included from include/linux/swap.h:8,
> from include/linux/suspend.h:4,
> from arch/x86/kernel/asm-offsets.c:12:
> include/linux/memcontrol.h:220: error: duplicate `inline'
> make[1]: *** [arch/x86/kernel/asm-offsets.s] Error 1
>
>
> > +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
> > +{
> > + return ;
> > +}
> > +
>
> [PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page proto
>
> commit 3f58a8294333 (move memcg reclaimable page into tail of inactive
> list) added inline keyword twice in its prototype.
>
> Signed-off-by: Eric Dumazet <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

It was totally my fault.
Thanks very much.