2011-02-17 15:08:49

by Minchan Kim

Subject: [PATCH v5 0/4] fadvise(DONTNEED) support

Sorry for the delay. It's time to repost, this time with some test results.

Recently, a thrashing problem was reported.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It happens with backup workloads (e.g., a nightly rsync).
Such a workload creates use-once pages but touches each page
twice, which promotes the page to the active list and
ends up evicting the working set.
Application developers would like POSIX_FADV_NOREUSE for this, but Linux,
like other OSes, doesn't support it. (http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As an alternative, application developers use POSIX_FADV_DONTNEED.
But it has a problem: if the kernel encounters a page that is under
writeback during invalidate_mapping_pages, it can't drop that page.
That makes the advice hard to use, because applications always have to
sync their data before calling fadvise(..., POSIX_FADV_DONTNEED) to
make sure the pages are discardable. As a result, they can't benefit
from the kernel's deferred writes and see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
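
For illustration only (this snippet is not part of the series, and the helper
name is made up), the workaround applications are forced into today looks
roughly like this: write the data out first, then drop the now-clean pages.

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helper: flush first so POSIX_FADV_DONTNEED can actually
     * drop the pages; without the fsync, dirty/writeback pages are skipped
     * by invalidate_mapping_pages. */
    static void drop_cached_range(int fd, off_t offset, off_t len)
    {
            if (fsync(fd) == 0)
                    (void)posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }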

In fact, an invalidation is a strong hint to the reclaimer:
it means we don't intend to use the page any more. So the idea in this series
is to move invalidated, but not yet freed, pages onto the inactive list.
That improves reclaim efficiency considerably and prevents
eviction of the working set.

My experiment is as follows.

Test Environment :
DRAM : 2G, CPU : Intel(R) Core(TM)2 CPU
Rsync backup directory size : 16G

The rsync version is 3.0.7, with Ben's fadvise patch applied.
The stress scenario runs the following jobs in parallel:

1. git clone linux-2.6
2. make all -j4 linux-mmotm
3. rsync src dst

nrns : no-patched rsync + no stress
prns : patched rsync + no stress
nrs : no-patched rsync + stress
prs : patched rsync + stress

For profiling, I added some vmstat counters:
pginvalidate : the total number of pages moved by this patch set.
pgreclaim : among those, the number of pages moved to the tail of the inactive list via PG_reclaim.
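
With patch [4/4] applied, these show up as lines in /proc/vmstat. Just for
illustration (not part of the series), a throwaway userspace sketch of how
such counters can be read:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char name[64];
            unsigned long val;
            FILE *f = fopen("/proc/vmstat", "r");

            if (!f)
                    return 1;
            /* /proc/vmstat is a list of "name value" pairs, one per line. */
            while (fscanf(f, "%63s %lu", name, &val) == 2) {
                    if (!strcmp(name, "pginvalidate") || !strcmp(name, "pgreclaim"))
                            printf("%s %lu\n", name, val);
            }
            fclose(f);
            return 0;
    }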

NRNS PRNS NRS PRS
Elapsed time 36:01.49 37:13.58 01:23:24 01:21:45
nr_vmscan_write 184 1 296 509
pgactivate 76559 84714 445214 463143
pgdeactivate 19360 40184 74302 91423
pginvalidate 0 2240333 0 1769147
pgreclaim 0 1849651 0 1650796
pgfault 406208 421860 72485217 70334416
pgmajfault 212 334 5149 3688
pgsteal_dma 0 0 0 0
pgsteal_normal 2645174 1545116 2521098 1578651
pgsteal_high 5162080 2562269 6074720 3137294
pgsteal_movable 0 0 0 0
pgscan_kswapd_dma 0 0 0 0
pgscan_kswapd_normal 2641732 1545374 2499894 1557882
pgscan_kswapd_high 5143794 2567981 5999585 3051150
pgscan_kswapd_movable 0 0 0 0
pgscan_direct_dma 0 0 0 0
pgscan_direct_normal 3643 0 21613 21238
pgscan_direct_high 20174 1783 76980 87848
pgscan_direct_movable 0 0 0 0
pginodesteal 130 1029 3510 24100
slabs_scanned 1421824 1648128 1870720 1880320
kswapd_steal 7785153 4105620 8498332 4608372
kswapd_inodesteal 189432 474052 342835 472503
pageoutrun 100687 52282 145712 70946
allocstall 22 1 149 163
pgrotated 0 2231408 2932 1765393
unevictable_pgs_scanned 0 0 0 0

In the stress test (NRS vs PRS), pgsteal_[normal|high] are reduced by 37% and 48%
(2521098 -> 1578651 and 6074720 -> 3137294), and
pgscan_kswapd_[normal|high] are reduced by 37% and 49%.
In other words, even though the VM scans a smaller window, it reclaims enough pages
to keep working well and avoids evicting pages unnecessarily.
rsync's elapsed time is reduced by about 1.5 minutes, but I don't think rsync's
fadvise usage is ideal, because in the no-stress case [NRNS vs PRNS] the patched
rsync takes about one minute longer. I suspect that's due to unnecessary fadvise
system calls; if rsync's fadvise usage were smarter, the effect would be much
better than it is now.
The major fault count is reduced by 28%, which is good.
What I can't understand is why inode steal increases.
If anyone knows why, please explain it to me.
In any case, this patch series improves reclaim efficiency significantly.

Recently, Steven Barrett applied this series to his project kernel,
the "Liquorix kernel", and reported the following, along with one problem.
(That problem is solved by [3/4]; see its description.)

" I've been having really good results with your new patch set that
mitigates the problem where a backup utility or something like that
reads each file once and eventually evicting the original working set
out of the page cache.
...
...
These patches solved some problems on a friend's desktop.
He said that his wife wanted to send me kisses and hugs because their
computer was so responsive after the patches were applied.
"
So I think this patch series solves a real problem.

- [1/4] moves an invalidated dirty/writeback page from the active list
to the head of the inactive list.
- [2/4] moves a memcg reclaimable page to the tail of its inactive list.
- [3/4] moves an invalidated page to the tail of the inactive list when its
writeout completes, so it can be reclaimed as soon as possible.
- [4/4] adds profiling information for evaluation.

These patches are based on mmotm-02-04.

Changelog since v4:
- Removed the madvise-related patches and the swap.c cleanup patch
(I will split the madvise issue out of this series and repost it after this series is merged)

Minchan Kim (4):
[1/4] deactivate invalidated pages
[2/4] memcg: move memcg reclaimable page into tail of inactive list
[3/4] Reclaim invalidated page ASAP
[4/4] add profile information for invalidated page

include/linux/memcontrol.h | 6 ++
include/linux/swap.h | 1 +
include/linux/vmstat.h | 4 +-
mm/memcontrol.c | 27 ++++++++++
mm/page-writeback.c | 12 ++++-
mm/swap.c | 119 +++++++++++++++++++++++++++++++++++++++++++-
mm/truncate.c | 17 +++++--
mm/vmstat.c | 3 +
8 files changed, 180 insertions(+), 9 deletions(-)


2011-02-17 15:08:56

by Minchan Kim

Subject: [PATCH v5 1/4] deactivate invalidated pages

Recently, a thrashing problem was reported.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It happens with backup workloads (e.g., a nightly rsync).
Such a workload creates use-once pages but touches each page
twice, which promotes the page to the active list and
ends up evicting the working set.

Some application developers want POSIX_FADV_NOREUSE,
but other OSes don't support it either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As an alternative, application developers use POSIX_FADV_DONTNEED.
But it has a problem: if the kernel encounters a page that is under
writeback during invalidate_mapping_pages, it can't drop that page.
That makes the advice hard to use, because applications always have
to sync their data before calling fadvise(..., POSIX_FADV_DONTNEED) to
make sure the pages are discardable. As a result, they can't benefit
from the kernel's deferred writes and may see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)

In fact, an invalidation is a strong hint to the reclaimer:
it means we don't use the page any more. So let's move
a page that is still being written out to the head of the inactive list
if we can't truncate it right now.

The reason this patch moves the page to the head of the LRU is that a
dirty/writeback page will be flushed sooner or later anyway; keeping it at the
head avoids a pageout() writeout, which is less effective than the flusher's writeout.

Originally, I reused Peter's lru_demote with some changes, so I added
his Signed-off-by.

Reported-by: Ben Gamari <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
Changelog since v4:
- Change function comments - suggested by Johannes
- Change function name - suggested by Johannes
- Drop only dirty/writeback pages to the deactivate pagevec - suggested by Johannes
- Add acked-by

Changelog since v3:
- Change function comments - suggested by Johannes
- Change function name - suggested by Johannes
- add only dirty/writeback pages to the deactivate pagevec

Changelog since v2:
- mapped page leaves alone - suggested by Mel
- pass part related PG_reclaim in next patch.

Changelog since v1:
- modify description
- correct typo
- add some comment

include/linux/swap.h | 1 +
mm/swap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 17 ++++++++---
3 files changed, 91 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..c335055 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,7 @@ extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index c02f936..4aea806 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,7 @@ int page_cluster;

static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -347,6 +348,60 @@ void add_page_to_unevictable_list(struct page *page)
}

/*
+ * If the page can not be invalidated, it is moved to the
+ * inactive list to speed up its reclaim. It is moved to the
+ * head of the list, rather than the tail, to give the flusher
+ * threads some time to write it out, as this is much more
+ * effective than the single-page writeout from reclaim.
+ */
+static void lru_deactivate(struct page *page, struct zone *zone)
+{
+ int lru, file;
+
+ if (!PageLRU(page) || !PageActive(page))
+ return;
+
+ /* Some processes are using the page */
+ if (page_mapped(page))
+ return;
+
+ file = page_is_file_cache(page);
+ lru = page_lru_base_type(page);
+ del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(zone, page, lru);
+ __count_vm_event(PGDEACTIVATE);
+
+ update_page_reclaim_stat(zone, page, file, 0);
+}
+
+static void ____pagevec_lru_deactivate(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ lru_deactivate(page, zone);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
+
+/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
* disabled; or "cpu" is being hot-unplugged, and is already dead.
@@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
pagevec_move_tail(pvec);
local_irq_restore(flags);
}
+
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ ____pagevec_lru_deactivate(pvec);
+}
+
+/**
+ * deactivate_page - forcefully deactivate a page
+ * @page: page to deactivate
+ *
+ * This function hints the VM that @page is a good reclaim candidate,
+ * for example if its invalidation fails due to the page being dirty
+ * or under writeback.
+ */
+void deactivate_page(struct page *page)
+{
+ if (likely(get_page_unless_zero(page))) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ if (!pagevec_add(pvec, page))
+ ____pagevec_lru_deactivate(pvec);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
}

void lru_add_drain(void)
diff --git a/mm/truncate.c b/mm/truncate.c
index 4d415b3..9ec7bc5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -328,11 +328,12 @@ EXPORT_SYMBOL(truncate_inode_pages);
* pagetables.
*/
unsigned long invalidate_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t end)
+ pgoff_t start, pgoff_t end)
{
struct pagevec pvec;
pgoff_t next = start;
- unsigned long ret = 0;
+ unsigned long ret;
+ unsigned long count = 0;
int i;

pagevec_init(&pvec, 0);
@@ -359,8 +360,14 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (lock_failed)
continue;

- ret += invalidate_inode_page(page);
-
+ ret = invalidate_inode_page(page);
+ /*
+ * Invalidation is a hint that the page is no longer
+ * of interest and try to speed up its reclaim.
+ */
+ if (!ret)
+ deactivate_page(page);
+ count += ret;
unlock_page(page);
if (next > end)
break;
@@ -369,7 +376,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
mem_cgroup_uncharge_end();
cond_resched();
}
- return ret;
+ return count;
}
EXPORT_SYMBOL(invalidate_mapping_pages);

--
1.7.1

2011-02-17 15:09:01

by Minchan Kim

Subject: [PATCH v5 2/4] memcg: move memcg reclaimable page into tail of inactive list

The rotate_reclaimable_page function moves just written out
pages, which the VM wanted to reclaim, to the end of the
inactive list. That way the VM will find those pages first
next time it needs to free memory.
This patch applies the same rule to memcg.
It helps prevent unnecessary eviction of a memcg's working pages.

Acked-by: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
Changelog since v4:
- add acked-by and reviewed-by
- change description - suggested by Rik

include/linux/memcontrol.h | 6 ++++++
mm/memcontrol.c | 27 +++++++++++++++++++++++++++
mm/swap.c | 3 ++-
3 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3da48ae..5a5ce70 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
@@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
return ;
}

+static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+ return ;
+}
+
static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
{
return ;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 686f1ce..ab8bdff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
mem_cgroup_del_lru_list(page, page_lru(page));
}

+/*
+ * Writeback is about to end against a page which has been marked for immediate
+ * reclaim. If it still appears to be reclaimable, move it to the tail of the
+ * inactive list.
+ */
+void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct page_cgroup *pc;
+ enum lru_list lru = page_lru_base_type(page);
+
+ if (mem_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(page);
+ /*
+ * Used bit is set without atomic ops but after smp_wmb().
+ * For making pc->mem_cgroup visible, insert smp_rmb() here.
+ */
+ smp_rmb();
+ /* unused or root page is not rotated. */
+ if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+ return;
+ mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
+ list_move_tail(&pc->lru, &mz->lists[lru]);
+}
+
void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz;
diff --git a/mm/swap.c b/mm/swap.c
index 4aea806..1b9e4eb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,8 +200,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
spin_lock(&zone->lru_lock);
}
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
- int lru = page_lru_base_type(page);
+ enum lru_list lru = page_lru_base_type(page);
list_move_tail(&page->lru, &zone->lru[lru].list);
+ mem_cgroup_rotate_reclaimable_page(page);
pgmoved++;
}
}
--
1.7.1

2011-02-17 15:09:11

by Minchan Kim

Subject: [PATCH v5 3/4] Reclaim invalidated page ASAP

invalidate_mapping_pages is a strong hint to the reclaimer:
it means the user doesn't want to use the page any more.
So, in order to prevent working set eviction, this patch
moves such pages to the tail of the inactive list using PG_reclaim.

Please remember that pages on the inactive list are part of the working set,
just like those on the active list. If we don't move invalidated pages to the
inactive list's tail, pages near the tail can be evicted even though
we have a strong clue about which pages are useless. That is clearly bad.

Currently PG_readahead and PG_reclaim share one page flag bit.
Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io to
prevent a readahead marker page from being reclaimed too quickly.

In this series, PG_reclaim is used for invalidated pages, too.
If the VM finds that a page is invalidated and dirty, it sets PG_reclaim
so the page can be reclaimed as soon as possible. But when the dirty page is
later written back, clear_page_dirty_for_io clears PG_reclaim unconditionally,
which defeats this series' goal.

I think it's okay to clear PG_readahead when the page is dirtied rather than
at writeback time, so this patch moves the ClearPageReclaim call accordingly.
In v4, the ClearPageReclaim in set_page_dirty had a problem, reported by
Steven Barrett, with compound pages: some drivers (e.g., audio) call
set_page_dirty on compound pages that are not on the LRU, but my patch did
ClearPageReclaim on such compound pages. Without CONFIG_PAGEFLAGS_EXTENDED,
that corrupts the PageTail flag.

I don't think this affects THP, and it passes my tests with THP enabled,
but I Cc'd Andrea for a double check.

Reported-by: Steven Barrett <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
Changelog since v4:
- move ClearPageReclaim into mapping condition to fix compound page issue
- change comment - suggested by Johannes
- add pgrotated to lru_deactivate - suggested by Johannes

Changelog since v3:
- move page which ends up writeback in pagevec on inactive's tail
- suggested by Johannes

Changelog since v2:
- put ClearPageReclaim in set_page_dirty - suggested by Wu.

Changelog since v1:
- make the invalidated page reclaim asap - suggested by Andrew.

mm/page-writeback.c | 12 +++++++++++-
mm/swap.c | 41 ++++++++++++++++++++++++++++++++++++++---
2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..b437fe6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1211,6 +1211,17 @@ int set_page_dirty(struct page *page)

if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
+ /*
+ * readahead/lru_deactivate_page could remain
+ * PG_readahead/PG_reclaim due to race with end_page_writeback
+ * About readahead, if the page is written, the flags would be
+ * reset. So no problem.
+ * About lru_deactivate_page, if the page is redirty, the flag
+ * will be reset. So no problem. but if the page is used by readahead
+ * it will confuse readahead and make it restart the size rampup
+ * process. But it's a trivial problem.
+ */
+ ClearPageReclaim(page);
#ifdef CONFIG_BLOCK
if (!spd)
spd = __set_page_dirty_buffers;
@@ -1266,7 +1277,6 @@ int clear_page_dirty_for_io(struct page *page)

BUG_ON(!PageLocked(page));

- ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.
diff --git a/mm/swap.c b/mm/swap.c
index 1b9e4eb..0a33714 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -354,26 +354,61 @@ void add_page_to_unevictable_list(struct page *page)
* head of the list, rather than the tail, to give the flusher
* threads some time to write it out, as this is much more
* effective than the single-page writeout from reclaim.
+ *
+ * If the page isn't page_mapped and dirty/writeback, the page
+ * could reclaim asap using PG_reclaim.
+ *
+ * 1. active, mapped page -> none
+ * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
+ * 3. inactive, mapped page -> none
+ * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
+ * 5. inactive, clean -> inactive, tail
+ * 6. Others -> none
+ *
+ * In 4, why it moves inactive's head, the VM expects the page would
+ * be write it out by flusher threads as this is much more effective
+ * than the single-page writeout from reclaim.
*/
static void lru_deactivate(struct page *page, struct zone *zone)
{
int lru, file;
+ bool active;

- if (!PageLRU(page) || !PageActive(page))
+ if (!PageLRU(page))
return;

/* Some processes are using the page */
if (page_mapped(page))
return;

+ active = PageActive(page);
+
file = page_is_file_cache(page);
lru = page_lru_base_type(page);
- del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+ del_page_from_lru_list(zone, page, lru + active);
ClearPageActive(page);
ClearPageReferenced(page);
add_page_to_lru_list(zone, page, lru);
- __count_vm_event(PGDEACTIVATE);

+ if (PageWriteback(page) || PageDirty(page)) {
+ /*
+ * PG_reclaim could be raced with end_page_writeback
+ * It can make readahead confusing. But race window
+ * is _really_ small and it's non-critical problem.
+ */
+ SetPageReclaim(page);
+ } else {
+ /*
+ * The page's writeback ends up during pagevec
+ * We move the page to the tail of the inactive list.
+ */
+ list_move_tail(&page->lru, &zone->lru[lru].list);
+ mem_cgroup_rotate_reclaimable_page(page);
+ __count_vm_event(PGROTATED);
+ }
+
+ if (active)
+ __count_vm_event(PGDEACTIVATE);
update_page_reclaim_stat(zone, page, file, 0);
}

--
1.7.1

2011-02-17 15:09:16

by Minchan Kim

Subject: [PATCH v5 4/4] add profile information for invalidated page

This patch adds profiling information about invalidated page reclaim.
It's only for profiling during testing and is not intended for merging.

Acked-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vmstat.h | 4 ++--
mm/swap.c | 3 +++
mm/vmstat.c | 3 +++
3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 833e676..c38ad95 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -30,8 +30,8 @@

enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
- PGFREE, PGACTIVATE, PGDEACTIVATE,
- PGFAULT, PGMAJFAULT,
+ PGFREE, PGACTIVATE, PGDEACTIVATE, PGINVALIDATE,
+ PGRECLAIM, PGFAULT, PGMAJFAULT,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL),
FOR_ALL_ZONES(PGSCAN_KSWAPD),
diff --git a/mm/swap.c b/mm/swap.c
index 0a33714..980c17b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -397,6 +397,7 @@ static void lru_deactivate(struct page *page, struct zone *zone)
* is _really_ small and it's non-critical problem.
*/
SetPageReclaim(page);
+ __count_vm_event(PGRECLAIM);
} else {
/*
* The page's writeback ends up during pagevec
@@ -409,6 +410,8 @@ static void lru_deactivate(struct page *page, struct zone *zone)

if (active)
__count_vm_event(PGDEACTIVATE);
+
+ __count_vm_event(PGINVALIDATE);
update_page_reclaim_stat(zone, page, file, 0);
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0c3b504..cbe032b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -896,6 +896,9 @@ static const char * const vmstat_text[] = {
"pgactivate",
"pgdeactivate",

+ "pginvalidate",
+ "pgreclaim",
+
"pgfault",
"pgmajfault",

--
1.7.1

2011-02-17 15:56:58

by Kamezawa Hiroyuki

Subject: Re: [PATCH v5 1/4] deactivate invalidated pages

On Fri, 18 Feb 2011 00:08:19 +0900
Minchan Kim <[email protected]> wrote:

> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By other approach, app developers use POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It makes for application programmer to use it since they always
> have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
> make sure the pages could be discardable. At last, they can't use
> deferred write of kernel so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head if we can't truncate
> it right now.
>
> Why I move page to head of lru on this patch, Dirty/Writeback page
> would be flushed sooner or later. It can prevent writeout of pageout
> which is less effective than flusher's writeout.
>
> Originally, I reused lru_demote of Peter with some change so added
> his Signed-off-by.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>


Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

One question: it seems there is no flush code for the per-cpu pagevec
in this patch. Is it safe against CPU hotplug?

And from the memory hot-unplug point of view, I'd be glad if this pagevec
were flushed at the same time as the other per-cpu LRU pagevecs.
(Compaction is also affected by the page_count() elevation of a pagevec
that is only flushed when fadvise is called.)

Could you add add-on patches for the flushing and the hooks?

Thanks,
-Kame



> ---
> Changelog since v4:
> - Change function comments - suggested by Johannes
> - Change function name - suggested by Johannes
> - Drop only dirty/writeback pages to deactive pagevec - suggested by Johannes
> - Add acked-by
>
> Changelog since v3:
> - Change function comments - suggested by Johannes
> - Change function name - suggested by Johannes
> - add only dirty/writeback pages to deactive pagevec
>
> Changelog since v2:
> - mapped page leaves alone - suggested by Mel
> - pass part related PG_reclaim in next patch.
>
> Changelog since v1:
> - modify description
> - correct typo
> - add some comment
>
> include/linux/swap.h | 1 +
> mm/swap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++
> mm/truncate.c | 17 ++++++++---
> 3 files changed, 91 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4d55932..c335055 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -215,6 +215,7 @@ extern void mark_page_accessed(struct page *);
> extern void lru_add_drain(void);
> extern int lru_add_drain_all(void);
> extern void rotate_reclaimable_page(struct page *page);
> +extern void deactivate_page(struct page *page);
> extern void swap_setup(void);
>
> extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/mm/swap.c b/mm/swap.c
> index c02f936..4aea806 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -39,6 +39,7 @@ int page_cluster;
>
> static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>
> /*
> * This path almost never happens for VM activity - pages are normally
> @@ -347,6 +348,60 @@ void add_page_to_unevictable_list(struct page *page)
> }
>
> /*
> + * If the page can not be invalidated, it is moved to the
> + * inactive list to speed up its reclaim. It is moved to the
> + * head of the list, rather than the tail, to give the flusher
> + * threads some time to write it out, as this is much more
> + * effective than the single-page writeout from reclaim.
> + */
> +static void lru_deactivate(struct page *page, struct zone *zone)
> +{
> + int lru, file;
> +
> + if (!PageLRU(page) || !PageActive(page))
> + return;
> +
> + /* Some processes are using the page */
> + if (page_mapped(page))
> + return;
> +
> + file = page_is_file_cache(page);
> + lru = page_lru_base_type(page);
> + del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> + ClearPageActive(page);
> + ClearPageReferenced(page);
> + add_page_to_lru_list(zone, page, lru);
> + __count_vm_event(PGDEACTIVATE);
> +
> + update_page_reclaim_stat(zone, page, file, 0);
> +}
> +
> +static void ____pagevec_lru_deactivate(struct pagevec *pvec)
> +{
> + int i;
> + struct zone *zone = NULL;
> +
> + for (i = 0; i < pagevec_count(pvec); i++) {
> + struct page *page = pvec->pages[i];
> + struct zone *pagezone = page_zone(page);
> +
> + if (pagezone != zone) {
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> + zone = pagezone;
> + spin_lock_irq(&zone->lru_lock);
> + }
> + lru_deactivate(page, zone);
> + }
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> +
> + release_pages(pvec->pages, pvec->nr, pvec->cold);
> + pagevec_reinit(pvec);
> +}
> +
> +
> +/*
> * Drain pages out of the cpu's pagevecs.
> * Either "cpu" is the current CPU, and preemption has already been
> * disabled; or "cpu" is being hot-unplugged, and is already dead.
> @@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
> pagevec_move_tail(pvec);
> local_irq_restore(flags);
> }
> +
> + pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> + if (pagevec_count(pvec))
> + ____pagevec_lru_deactivate(pvec);
> +}
> +
> +/**
> + * deactivate_page - forcefully deactivate a page
> + * @page: page to deactivate
> + *
> + * This function hints the VM that @page is a good reclaim candidate,
> + * for example if its invalidation fails due to the page being dirty
> + * or under writeback.
> + */
> +void deactivate_page(struct page *page)
> +{
> + if (likely(get_page_unless_zero(page))) {
> + struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +
> + if (!pagevec_add(pvec, page))
> + ____pagevec_lru_deactivate(pvec);
> + put_cpu_var(lru_deactivate_pvecs);
> + }
> }
>
> void lru_add_drain(void)
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 4d415b3..9ec7bc5 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -328,11 +328,12 @@ EXPORT_SYMBOL(truncate_inode_pages);
> * pagetables.
> */
> unsigned long invalidate_mapping_pages(struct address_space *mapping,
> - pgoff_t start, pgoff_t end)
> + pgoff_t start, pgoff_t end)
> {
> struct pagevec pvec;
> pgoff_t next = start;
> - unsigned long ret = 0;
> + unsigned long ret;
> + unsigned long count = 0;
> int i;
>
> pagevec_init(&pvec, 0);
> @@ -359,8 +360,14 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (lock_failed)
> continue;
>
> - ret += invalidate_inode_page(page);
> -
> + ret = invalidate_inode_page(page);
> + /*
> + * Invalidation is a hint that the page is no longer
> + * of interest and try to speed up its reclaim.
> + */
> + if (!ret)
> + deactivate_page(page);
> + count += ret;
> unlock_page(page);
> if (next > end)
> break;
> @@ -369,7 +376,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> mem_cgroup_uncharge_end();
> cond_resched();
> }
> - return ret;
> + return count;
> }
> EXPORT_SYMBOL(invalidate_mapping_pages);
>
> --
> 1.7.1
>
>

2011-02-17 16:10:38

by Kamezawa Hiroyuki

Subject: Re: [PATCH v5 2/4] memcg: move memcg reclaimable page into tail of inactive list

On Fri, 18 Feb 2011 00:08:20 +0900
Minchan Kim <[email protected]> wrote:

> The rotate_reclaimable_page function moves just written out
> pages, which the VM wanted to reclaim, to the end of the
> inactive list. That way the VM will find those pages first
> next time it needs to free memory.
> This patch apply the rule in memcg.
> It can help to prevent unnecessary working page eviction of memcg.
>
> Acked-by: Balbir Singh <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> Changelog since v4:
> - add acked-by and reviewed-by
> - change description - suggested by Rik
>
> include/linux/memcontrol.h | 6 ++++++
> mm/memcontrol.c | 27 +++++++++++++++++++++++++++
> mm/swap.c | 3 ++-
> 3 files changed, 35 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3da48ae..5a5ce70 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
> extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
> +extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
> extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
> extern void mem_cgroup_del_lru(struct page *page);
> extern void mem_cgroup_move_lists(struct page *page,
> @@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
> return ;
> }
>
> +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +{
> + return ;
> +}
> +
> static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
> {
> return ;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 686f1ce..ab8bdff 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
> mem_cgroup_del_lru_list(page, page_lru(page));
> }
>
> +/*
> + * Writeback is about to end against a page which has been marked for immediate
> + * reclaim. If it still appears to be reclaimable, move it to the tail of the
> + * inactive list.
> + */
> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +{
> + struct mem_cgroup_per_zone *mz;
> + struct page_cgroup *pc;
> + enum lru_list lru = page_lru_base_type(page);
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + pc = lookup_page_cgroup(page);
> + /*
> + * Used bit is set without atomic ops but after smp_wmb().
> + * For making pc->mem_cgroup visible, insert smp_rmb() here.
> + */
> + smp_rmb();
> + /* unused or root page is not rotated. */
> + if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
> + return;
> + mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> + list_move_tail(&pc->lru, &mz->lists[lru]);
> +}
> +

Hmm, I'm sorry, I misunderstood this. IIUC, page_lru_base_type() always returns
LRU_INACTIVE_XXX, so this function may move a page from the active LRU to the inactive LRU.

Then the LRU counters for memcg should be updated.

Could you replace the code after the lookup with something like this?

    VM_BUG_ON(!PageCgroupAcctLRU(pc)); /* Implies this page must be on some LRU */
    if (!PageCgroupUsed(pc))
            return;
    /* Used bit check is not necessary, because there is a case where an unused
       page is lazily left on the LRU. We trust the AcctLRU bit. */
    mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
    MEM_CGROUP_ZSTAT(mz, page_lru(page)) -= 1 << compound_order(page);
    MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
    if (mem_cgroup_is_root(pc->mem_cgroup))
            return;
    list_move_tail(&pc->lru, &mz->lists[lru]);


Thanks,
-Kame
> void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
> {
> struct mem_cgroup_per_zone *mz;
> diff --git a/mm/swap.c b/mm/swap.c
> index 4aea806..1b9e4eb 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -200,8 +200,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
> spin_lock(&zone->lru_lock);
> }
> if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> - int lru = page_lru_base_type(page);
> + enum lru_list lru = page_lru_base_type(page);
> list_move_tail(&page->lru, &zone->lru[lru].list);
> + mem_cgroup_rotate_reclaimable_page(page);
> pgmoved++;
> }
> }
> --
> 1.7.1
>
>

2011-02-17 16:12:23

by Kamezawa Hiroyuki

Subject: Re: [PATCH v5 3/4] Reclaim invalidated page ASAP

On Fri, 18 Feb 2011 00:08:21 +0900
Minchan Kim <[email protected]> wrote:

> invalidate_mapping_pages is very big hint to reclaimer.
> It means user doesn't want to use the page any more.
> So in order to prevent working set page eviction, this patch
> move the page into tail of inactive list by PG_reclaim.
>
> Please, remember that pages in inactive list are working set
> as well as active list. If we don't move pages into inactive list's
> tail, pages near by tail of inactive list can be evicted although
> we have a big clue about useless pages. It's totally bad.
>
> Now PG_readahead/PG_reclaim is shared.
> fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
> preventing fast reclaiming readahead marker page.
>
> In this series, PG_reclaim is used by invalidated page, too.
> If VM find the page is invalidated and it's dirty, it sets PG_reclaim
> to reclaim asap. Then, when the dirty page will be writeback,
> clear_page_dirty_for_io will clear PG_reclaim unconditionally.
> It disturbs this serie's goal.
>
> I think it's okay to clear PG_readahead when the page is dirty, not
> writeback time. So this patch moves ClearPageReadahead.
> In v4, ClearPageReadahead in set_page_dirty has a problem which is reported
> by Steven Barrett. It's due to compound page. Some driver(ex, audio) calls
> set_page_dirty with compound page which isn't on LRU. but my patch does
> ClearPageRelcaim on compound page. In non-CONFIG_PAGEFLAGS_EXTENDED, it breaks
> PageTail flag.
>
> I think it doesn't affect THP and pass my test with THP enabling
> but Cced Andrea for double check.
>
> Reported-by: Steven Barrett <[email protected]>
> Reviewed-by: Johannes Weiner <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2011-02-17 16:14:26

by Kamezawa Hiroyuki

Subject: Re: [PATCH v5 4/4] add profile information for invalidated page

On Fri, 18 Feb 2011 00:08:22 +0900
Minchan Kim <[email protected]> wrote:

> This patch adds profile information about invalidated page reclaim.
> It's just for profiling for test so it is never for merging.
>
> Acked-by: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2011-02-18 00:14:41

by Minchan Kim

Subject: Re: [PATCH v5 2/4] memcg: move memcg reclaimable page into tail of inactive list

Hi Kame,

On Fri, Feb 18, 2011 at 1:04 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 18 Feb 2011 00:08:20 +0900
> Minchan Kim <[email protected]> wrote:
>
>> The rotate_reclaimable_page function moves just written out
>> pages, which the VM wanted to reclaim, to the end of the
>> inactive list.  That way the VM will find those pages first
>> next time it needs to free memory.
>> This patch apply the rule in memcg.
>> It can help to prevent unnecessary working page eviction of memcg.
>>
>> Acked-by: Balbir Singh <[email protected]>
>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>> Reviewed-by: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> ---
>> Changelog since v4:
>>  - add acked-by and reviewed-by
>>  - change description - suggested by Rik
>>
>>  include/linux/memcontrol.h |    6 ++++++
>>  mm/memcontrol.c            |   27 +++++++++++++++++++++++++++
>>  mm/swap.c                  |    3 ++-
>>  3 files changed, 35 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3da48ae..5a5ce70 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>>                                       gfp_t gfp_mask);
>>  extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
>>  extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
>> +extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
>>  extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
>>  extern void mem_cgroup_del_lru(struct page *page);
>>  extern void mem_cgroup_move_lists(struct page *page,
>> @@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
>>       return ;
>>  }
>>
>> +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
>> +{
>> +     return ;
>> +}
>> +
>>  static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
>>  {
>>       return ;
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 686f1ce..ab8bdff 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
>>       mem_cgroup_del_lru_list(page, page_lru(page));
>>  }
>>
>> +/*
>> + * Writeback is about to end against a page which has been marked for immediate
>> + * reclaim.  If it still appears to be reclaimable, move it to the tail of the
>> + * inactive list.
>> + */
>> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
>> +{
>> +     struct mem_cgroup_per_zone *mz;
>> +     struct page_cgroup *pc;
>> +     enum lru_list lru = page_lru_base_type(page);
>> +
>> +     if (mem_cgroup_disabled())
>> +             return;
>> +
>> +     pc = lookup_page_cgroup(page);
>> +     /*
>> +      * Used bit is set without atomic ops but after smp_wmb().
>> +      * For making pc->mem_cgroup visible, insert smp_rmb() here.
>> +      */
>> +     smp_rmb();
>> +     /* unused or root page is not rotated. */
>> +     if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
>> +             return;
>> +     mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
>> +     list_move_tail(&pc->lru, &mz->lists[lru]);
>> +}
>> +
>
> Hmm, I'm sorry I misunderstand this. IIUC, page_lru_base_type() always returns
> LRU_INACTIVE_XXX and this function may move page from active LRU to inactive LRU.
>
> Then, LRU counters for memcg should be updated.

The goal of mem_cgroup_rotate_reclaimable_page is the same as that of
rotate_reclaimable_page: the page is already on the inactive list.
Look at the !PageActive(page) check.

But if you want to make the function more general (i.e., support
active->inactive too), I don't mind; in that case, let's make
rotate_reclaimable_page a general function as well. For now, though, no
caller needs that.
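
For reference, a rough sketch of the difference we're talking about
(reconstructed for illustration, not copied from the tree; the real page_lru()
also handles the unevictable case):

    /* Sketch: how page_lru() relates to page_lru_base_type() here. */
    static enum lru_list page_lru_sketch(struct page *page)
    {
            enum lru_list lru = page_lru_base_type(page); /* LRU_INACTIVE_ANON/FILE */

            if (PageActive(page))
                    lru += LRU_ACTIVE;      /* active variant of the same list */
            return lru;
    }

Since this path only handles !PageActive(page) pages, both helpers name the
same list here.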

Thanks for the careful review.

>
> Could you replace after lookup like this ?
>
>     VM_BUG_ON(!PageCgroupAcctLRU(pc))  /* Implies this pages must be on some LRU */
>     if (!PageCgroupUsed(pc))
>           return;
>     /* Used bit check is not necessary, because there is a case Unused page
>        is lazily on LRU. We trust AcctLRU bit. */
>     mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
>     MEM_CGROUP_ZSTAT(mz, page_lru(page)) -= 1 << compound_order(page);
>     MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page)
>     if (mem_cgroup_is_root(pc->mem_cgroup))
>           return;
>     list_move_tail(&pc->lru, &mz->lists[lru])
>
>
> Thanks,
> -Kame
>>  void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
>>  {
>>       struct mem_cgroup_per_zone *mz;
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 4aea806..1b9e4eb 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -200,8 +200,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
>>                       spin_lock(&zone->lru_lock);
>>               }
>>               if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
>> -                     int lru = page_lru_base_type(page);
>> +                     enum lru_list lru = page_lru_base_type(page);
>>                       list_move_tail(&page->lru, &zone->lru[lru].list);
>> +                     mem_cgroup_rotate_reclaimable_page(page);
>>                       pgmoved++;
>>               }
>>       }
>> --
>> 1.7.1
>>
>>
>
>



--
Kind regards,
Minchan Kim

2011-02-18 00:33:59

by Minchan Kim

Subject: Re: [PATCH v5 1/4] deactivate invalidated pages

On Fri, Feb 18, 2011 at 12:50 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 18 Feb 2011 00:08:19 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Recently, there are reported problem about thrashing.
>> (http://marc.info/?l=rsync&m=128885034930933&w=2)
>> It happens by backup workloads(ex, nightly rsync).
>> That's because the workload makes just use-once pages
>> and touches pages twice. It promotes the page into
>> active list so that it results in working set page eviction.
>>
>> Some app developer want to support POSIX_FADV_NOREUSE.
>> But other OSes don't support it, either.
>> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>>
>> By other approach, app developers use POSIX_FADV_DONTNEED.
>> But it has a problem. If kernel meets page is writing
>> during invalidate_mapping_pages, it can't work.
>> It makes for application programmer to use it since they always
>> have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
>> make sure the pages could be discardable. At last, they can't use
>> deferred write of kernel so that they could see performance loss.
>> (http://insights.oetiker.ch/linux/fadvise.html)
>>
>> In fact, invalidation is very big hint to reclaimer.
>> It means we don't use the page any more. So let's move
>> the writing page into inactive list's head if we can't truncate
>> it right now.
>>
>> Why I move page to head of lru on this patch, Dirty/Writeback page
>> would be flushed sooner or later. It can prevent writeout of pageout
>> which is less effective than flusher's writeout.
>>
>> Originally, I reused lru_demote of Peter with some change so added
>> his Signed-off-by.
>>
>> Reported-by: Ben Gamari <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> Acked-by: Rik van Riel <[email protected]>
>> Acked-by: Mel Gorman <[email protected]>
>> Reviewed-by: KOSAKI Motohiro <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>
> One question is ....it seems there is no flush() code for percpu pagevec
> in this patch. Is it safe against cpu hot plug ?
>
> And from memory hot unplug point of view, I'm grad if pagevec for this
> is flushed at the same time as when we clear other per-cpu lru pagevecs.
> (And compaction will be affected by the page_count() magic by pagevec
>  which is flushed only when FADVISE is called.)
>
> Could you add add-on patches for flushing and hooks ?

Isn't my patch enough? If I've missed your point, could you please elaborate?

* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
* disabled; or "cpu" is being hot-unplugged, and is already dead.
@@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
pagevec_move_tail(pvec);
local_irq_restore(flags);
}
+
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ ____pagevec_lru_deactivate(pvec);
+}


>
> Thanks,
> -Kame
>
>
>
>> ---
>> Changelog since v4:
>>  - Change function comments - suggested by Johannes
>>  - Change function name - suggested by Johannes
>>  - Drop only dirty/writeback pages to deactive pagevec - suggested by Johannes
>>  - Add acked-by
>>
>> Changelog since v3:
>>  - Change function comments - suggested by Johannes
>>  - Change function name - suggested by Johannes
>>  - add only dirty/writeback pages to deactive pagevec
>>
>> Changelog since v2:
>>  - mapped page leaves alone - suggested by Mel
>>  - pass part related PG_reclaim in next patch.
>>
>> Changelog since v1:
>>  - modify description
>>  - correct typo
>>  - add some comment
>>
>>  include/linux/swap.h |    1 +
>>  mm/swap.c            |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  mm/truncate.c        |   17 ++++++++---
>>  3 files changed, 91 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 4d55932..c335055 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -215,6 +215,7 @@ extern void mark_page_accessed(struct page *);
>>  extern void lru_add_drain(void);
>>  extern int lru_add_drain_all(void);
>>  extern void rotate_reclaimable_page(struct page *page);
>> +extern void deactivate_page(struct page *page);
>>  extern void swap_setup(void);
>>
>>  extern void add_page_to_unevictable_list(struct page *page);
>> diff --git a/mm/swap.c b/mm/swap.c
>> index c02f936..4aea806 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -39,6 +39,7 @@ int page_cluster;
>>
>>  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
>>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>>
>>  /*
>>   * This path almost never happens for VM activity - pages are normally
>> @@ -347,6 +348,60 @@ void add_page_to_unevictable_list(struct page *page)
>>  }
>>
>>  /*
>> + * If the page can not be invalidated, it is moved to the
>> + * inactive list to speed up its reclaim.  It is moved to the
>> + * head of the list, rather than the tail, to give the flusher
>> + * threads some time to write it out, as this is much more
>> + * effective than the single-page writeout from reclaim.
>> + */
>> +static void lru_deactivate(struct page *page, struct zone *zone)
>> +{
>> +     int lru, file;
>> +
>> +     if (!PageLRU(page) || !PageActive(page))
>> +             return;
>> +
>> +     /* Some processes are using the page */
>> +     if (page_mapped(page))
>> +             return;
>> +
>> +     file = page_is_file_cache(page);
>> +     lru = page_lru_base_type(page);
>> +     del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
>> +     ClearPageActive(page);
>> +     ClearPageReferenced(page);
>> +     add_page_to_lru_list(zone, page, lru);
>> +     __count_vm_event(PGDEACTIVATE);
>> +
>> +     update_page_reclaim_stat(zone, page, file, 0);
>> +}
>> +
>> +static void ____pagevec_lru_deactivate(struct pagevec *pvec)
>> +{
>> +     int i;
>> +     struct zone *zone = NULL;
>> +
>> +     for (i = 0; i < pagevec_count(pvec); i++) {
>> +             struct page *page = pvec->pages[i];
>> +             struct zone *pagezone = page_zone(page);
>> +
>> +             if (pagezone != zone) {
>> +                     if (zone)
>> +                             spin_unlock_irq(&zone->lru_lock);
>> +                     zone = pagezone;
>> +                     spin_lock_irq(&zone->lru_lock);
>> +             }
>> +             lru_deactivate(page, zone);
>> +     }
>> +     if (zone)
>> +             spin_unlock_irq(&zone->lru_lock);
>> +
>> +     release_pages(pvec->pages, pvec->nr, pvec->cold);
>> +     pagevec_reinit(pvec);
>> +}
>> +
>> +
>> +/*
>>   * Drain pages out of the cpu's pagevecs.
>>   * Either "cpu" is the current CPU, and preemption has already been
>>   * disabled; or "cpu" is being hot-unplugged, and is already dead.
>> @@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
>>               pagevec_move_tail(pvec);
>>               local_irq_restore(flags);
>>       }
>> +
>> +     pvec = &per_cpu(lru_deactivate_pvecs, cpu);
>> +     if (pagevec_count(pvec))
>> +             ____pagevec_lru_deactivate(pvec);
>> +}
>> +
>> +/**
>> + * deactivate_page - forcefully deactivate a page
>> + * @page: page to deactivate
>> + *
>> + * This function hints the VM that @page is a good reclaim candidate,
>> + * for example if its invalidation fails due to the page being dirty
>> + * or under writeback.
>> + */
>> +void deactivate_page(struct page *page)
>> +{
>> +     if (likely(get_page_unless_zero(page))) {
>> +             struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
>> +
>> +             if (!pagevec_add(pvec, page))
>> +                     ____pagevec_lru_deactivate(pvec);
>> +             put_cpu_var(lru_deactivate_pvecs);
>> +     }
>>  }
>>
>>  void lru_add_drain(void)
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 4d415b3..9ec7bc5 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -328,11 +328,12 @@ EXPORT_SYMBOL(truncate_inode_pages);
>>   * pagetables.
>>   */
>>  unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> -                                    pgoff_t start, pgoff_t end)
>> +             pgoff_t start, pgoff_t end)
>>  {
>>       struct pagevec pvec;
>>       pgoff_t next = start;
>> -     unsigned long ret = 0;
>> +     unsigned long ret;
>> +     unsigned long count = 0;
>>       int i;
>>
>>       pagevec_init(&pvec, 0);
>> @@ -359,8 +360,14 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>>                       if (lock_failed)
>>                               continue;
>>
>> -                     ret += invalidate_inode_page(page);
>> -
>> +                     ret = invalidate_inode_page(page);
>> +                     /*
>> +                      * Invalidation is a hint that the page is no longer
>> +                      * of interest and try to speed up its reclaim.
>> +                      */
>> +                     if (!ret)
>> +                             deactivate_page(page);
>> +                     count += ret;
>>                       unlock_page(page);
>>                       if (next > end)
>>                               break;
>> @@ -369,7 +376,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>>               mem_cgroup_uncharge_end();
>>               cond_resched();
>>       }
>> -     return ret;
>> +     return count;
>>  }
>>  EXPORT_SYMBOL(invalidate_mapping_pages);
>>
>> --
>> 1.7.1
>>
>>
>
>



--
Kind regards,
Minchan Kim

2011-02-18 04:16:49

by Hiroyuki Kamezawa

Subject: Re: [PATCH v5 2/4] memcg: move memcg reclaimable page into tail of inactive list

2011/2/18 Minchan Kim <[email protected]>:
> Hi Kame,
>
> On Fri, Feb 18, 2011 at 1:04 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> On Fri, 18 Feb 2011 00:08:20 +0900
>> Minchan Kim <[email protected]> wrote:
>>
>>> The rotate_reclaimable_page function moves just written out
>>> pages, which the VM wanted to reclaim, to the end of the
>>> inactive list. ?That way the VM will find those pages first
>>> next time it needs to free memory.
>>> This patch apply the rule in memcg.
>>> It can help to prevent unnecessary working page eviction of memcg.
>>>
>>> Acked-by: Balbir Singh <[email protected]>
>>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>>> Reviewed-by: Rik van Riel <[email protected]>
>>> Cc: KOSAKI Motohiro <[email protected]>
>>> Cc: Johannes Weiner <[email protected]>
>>> Signed-off-by: Minchan Kim <[email protected]>
>>> ---
>>> Changelog since v4:
>>> ?- add acked-by and reviewed-by
>>> ?- change description - suggested by Rik
>>>
>>> ?include/linux/memcontrol.h | ? ?6 ++++++
>>> ?mm/memcontrol.c ? ? ? ? ? ?| ? 27 +++++++++++++++++++++++++++
>>> ?mm/swap.c ? ? ? ? ? ? ? ? ?| ? ?3 ++-
>>> ?3 files changed, 35 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 3da48ae..5a5ce70 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>>>                                       gfp_t gfp_mask);
>>> ?extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
>>> ?extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
>>> +extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
>>> ?extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
>>> ?extern void mem_cgroup_del_lru(struct page *page);
>>> ?extern void mem_cgroup_move_lists(struct page *page,
>>> @@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
>>>       return ;
>>> ?}
>>>
>>> +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
>>> +{
>>> +     return ;
>>> +}
>>> +
>>> ?static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
>>> ?{
>>>       return ;
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 686f1ce..ab8bdff 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
>>>       mem_cgroup_del_lru_list(page, page_lru(page));
>>> ?}
>>>
>>> +/*
>>> + * Writeback is about to end against a page which has been marked for immediate
>>> + * reclaim.  If it still appears to be reclaimable, move it to the tail of the
>>> + * inactive list.
>>> + */
>>> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
>>> +{
>>> +     struct mem_cgroup_per_zone *mz;
>>> +     struct page_cgroup *pc;
>>> +     enum lru_list lru = page_lru_base_type(page);
>>> +
>>> +     if (mem_cgroup_disabled())
>>> +             return;
>>> +
>>> +     pc = lookup_page_cgroup(page);
>>> +     /*
>>> +      * Used bit is set without atomic ops but after smp_wmb().
>>> +      * For making pc->mem_cgroup visible, insert smp_rmb() here.
>>> +      */
>>> +     smp_rmb();
>>> +     /* unused or root page is not rotated. */
>>> +     if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
>>> +             return;
>>> +     mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
>>> +     list_move_tail(&pc->lru, &mz->lists[lru]);
>>> +}
>>> +
>>
>> Hmm, I'm sorry I misunderstand this. IIUC, page_lru_base_type() always returns
>> LRU_INACTIVE_XXX and this function may move page from active LRU to inactive LRU.
>>
>> Then, LRU counters for memcg should be updated.
>
> Goal of mem_cgroup_rotate_reclaimable_page is same with rotate_reclaimable_page.
> It means the page was already in inactive list.
> Look at the check !PageActive(page).

Hmm, ok. If so, could you change

page_lru_base_type() -> page_lru() ?

It's misleading.

Thanks,
-Kame
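
For context on the rename being requested here: the two helpers differ only in
whether the active/unevictable state is folded into the returned LRU index.
Roughly, paraphrased from include/linux/mm_inline.h of that era (shown from
memory, so treat it as a sketch rather than the exact source):

static inline enum lru_list page_lru_base_type(struct page *page)
{
	/* Always returns an *inactive* index, even for an active page. */
	if (page_is_file_cache(page))
		return LRU_INACTIVE_FILE;
	return LRU_INACTIVE_ANON;
}

static inline enum lru_list page_lru(struct page *page)
{
	enum lru_list lru;

	if (PageUnevictable(page))
		return LRU_UNEVICTABLE;

	lru = page_lru_base_type(page);
	if (PageActive(page))
		lru += LRU_ACTIVE;
	return lru;
}

For a page that is known to be evictable and inactive, as in
mem_cgroup_rotate_reclaimable_page(), both return the same value; page_lru()
simply states that assumption explicitly, which is the point of the request.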

2011-02-18 04:18:40

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] deactivate invalidated pages

2011/2/18 Minchan Kim <[email protected]>:
> On Fri, Feb 18, 2011 at 12:50 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> On Fri, 18 Feb 2011 00:08:19 +0900
>> Minchan Kim <[email protected]> wrote:
>>
>>> Recently, there are reported problem about thrashing.
>>> (http://marc.info/?l=rsync&m=128885034930933&w=2)
>>> It happens by backup workloads(ex, nightly rsync).
>>> That's because the workload makes just use-once pages
>>> and touches pages twice. It promotes the page into
>>> active list so that it results in working set page eviction.
>>>
>>> Some app developer want to support POSIX_FADV_NOREUSE.
>>> But other OSes don't support it, either.
>>> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>>>
>>> By other approach, app developers use POSIX_FADV_DONTNEED.
>>> But it has a problem. If kernel meets page is writing
>>> during invalidate_mapping_pages, it can't work.
>>> It makes for application programmer to use it since they always
>>> have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
>>> make sure the pages could be discardable. At last, they can't use
>>> deferred write of kernel so that they could see performance loss.
>>> (http://insights.oetiker.ch/linux/fadvise.html)
>>>
>>> In fact, invalidation is very big hint to reclaimer.
>>> It means we don't use the page any more. So let's move
>>> the writing page into inactive list's head if we can't truncate
>>> it right now.
>>>
>>> Why I move page to head of lru on this patch, Dirty/Writeback page
>>> would be flushed sooner or later. It can prevent writeout of pageout
>>> which is less effective than flusher's writeout.
>>>
>>> Originally, I reused lru_demote of Peter with some change so added
>>> his Signed-off-by.
>>>
>>> Reported-by: Ben Gamari <[email protected]>
>>> Signed-off-by: Minchan Kim <[email protected]>
>>> Signed-off-by: Peter Zijlstra <[email protected]>
>>> Acked-by: Rik van Riel <[email protected]>
>>> Acked-by: Mel Gorman <[email protected]>
>>> Reviewed-by: KOSAKI Motohiro <[email protected]>
>>> Cc: Wu Fengguang <[email protected]>
>>> Cc: Johannes Weiner <[email protected]>
>>> Cc: Nick Piggin <[email protected]>
>>> Signed-off-by: Minchan Kim <[email protected]>
>>
>>
>> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>>
>> One question is ....it seems there is no flush() code for percpu pagevec
>> in this patch. Is it safe against cpu hot plug ?
>>
>> And from memory hot unplug point of view, I'm grad if pagevec for this
>> is flushed at the same time as when we clear other per-cpu lru pagevecs.
>> (And compaction will be affected by the page_count() magic by pagevec
>> ?which is flushed only when FADVISE is called.)
>>
>> Could you add add-on patches for flushing and hooks ?
>
> Isn't it enough in my patch? If I miss your point, Could you elaborate please?
>
>  * Drain pages out of the cpu's pagevecs.
>  * Either "cpu" is the current CPU, and preemption has already been
>  * disabled; or "cpu" is being hot-unplugged, and is already dead.
> @@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
>               pagevec_move_tail(pvec);
>               local_irq_restore(flags);
>       }
> +
> +     pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> +     if (pagevec_count(pvec))
> +             ____pagevec_lru_deactivate(pvec);
> +}
>

I'm sorry that I missed this line. It seems I was wrong.

Regards,
-Kame
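
The drain path Minchan points at above is hooked into the existing pagevec
draining, so every caller of lru_add_drain()/lru_add_drain_all() (including
the memory hot-unplug and compaction paths mentioned in the question) picks up
the new lru_deactivate_pvecs as well. Roughly, from mm/swap.c of that era
(shown from memory as a sketch, not copied from the patch):

void lru_add_drain(void)
{
	/* drain_cpu_pagevecs() now also empties lru_deactivate_pvecs */
	drain_cpu_pagevecs(get_cpu());
	put_cpu();
}

static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
	lru_add_drain();
}

/* Schedules lru_add_drain() on every online CPU and waits for completion. */
int lru_add_drain_all(void)
{
	return schedule_on_each_cpu(lru_add_drain_per_cpu);
}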

2011-02-18 06:18:09

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v5 0/4] fadvise(DONTNEED) support

* MinChan Kim <[email protected]> [2011-02-18 00:08:18]:

> Sorry for my laziness. It's time to repost with some test result.
>
> Recently, there was a reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
> So app developer want to support POSIX_FADV_NOREUSE but other OSes include linux
> don't support it. (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By other approach, app developers use POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is going on writing
> during invalidate_mapping_pages, it can't work.
> It makes application programmer to use it hard since they always
> consider sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
> make sure the pages couldn't be discardable. At last, they can't use
> deferred write of kernel so see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is very big hint to reclaimer.
> It means we don't use the page any more. So the idea in this series is that
> let's move invalidated pages but not-freed page until into inactive list.
> It can help relcaim efficiency very much so that it can prevent
> eviction working set.
>
> My exeperiment is folowing as.
>
> Test Environment :
> DRAM : 2G, CPU : Intel(R) Core(TM)2 CPU
> Rsync backup directory size : 16G
>
> rsync version is 3.0.7.
> rsync patch is Ben's fadivse.
> The stress scenario do following jobs with parallel.
>
> 1. git clone linux-2.6
> 1. make all -j4 linux-mmotm
> 3. rsync src dst
>
> nrns : no-patched rsync + no stress
> prns : patched rsync + no stress
> nrs : no-patched rsync + stress
> prs : patched rsync + stress
>
> For profiling, I add some vmstat.
> pginvalidate : the total number of pages which are moved by this patch.
> pgreclaim : the number of pages which are moved at inactive's tail by PG_reclaim of pginvalidate
>
> NRNS PRNS NRS PRS
> Elapsed time 36:01.49 37:13.58 01:23:24 01:21:45
> nr_vmscan_write 184 1 296 509
> pgactivate 76559 84714 445214 463143
> pgdeactivate 19360 40184 74302 91423
> pginvalidate 0 2240333 0 1769147
> pgreclaim 0 1849651 0 1650796
> pgfault 406208 421860 72485217 70334416
> pgmajfault 212 334 5149 3688
> pgsteal_dma 0 0 0 0
> pgsteal_normal 2645174 1545116 2521098 1578651
> pgsteal_high 5162080 2562269 6074720 3137294
> pgsteal_movable 0 0 0 0
> pgscan_kswapd_dma 0 0 0 0
> pgscan_kswapd_normal 2641732 1545374 2499894 1557882
> pgscan_kswapd_high 5143794 2567981 5999585 3051150
> pgscan_kswapd_movable 0 0 0 0
> pgscan_direct_dma 0 0 0 0
> pgscan_direct_normal 3643 0 21613 21238
> pgscan_direct_high 20174 1783 76980 87848
> pgscan_direct_movable 0 0 0 0
> pginodesteal 130 1029 3510 24100
> slabs_scanned 1421824 1648128 1870720 1880320
> kswapd_steal 7785153 4105620 8498332 4608372
> kswapd_inodesteal 189432 474052 342835 472503
> pageoutrun 100687 52282 145712 70946
> allocstall 22 1 149 163
> pgrotated 0 2231408 2932 1765393
> unevictable_pgs_scanned 0 0 0 0
>

The results do look impressive.


--
Three Cheers,
Balbir

2011-02-18 06:29:11

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] deactivate invalidated pages

* MinChan Kim <[email protected]> [2011-02-18 00:08:19]:

> Recently, there are reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into
> active list so that it results in working set page eviction.
>
> Some app developer want to support POSIX_FADV_NOREUSE.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> By other approach, app developers use POSIX_FADV_DONTNEED.
> But it has a problem. If kernel meets page is writing
> during invalidate_mapping_pages, it can't work.
> It makes for application programmer to use it since they always
> have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
> make sure the pages could be discardable. At last, they can't use
> deferred write of kernel so that they could see performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is very big hint to reclaimer.
> It means we don't use the page any more. So let's move
> the writing page into inactive list's head if we can't truncate
> it right now.
>
> Why I move page to head of lru on this patch, Dirty/Writeback page
> would be flushed sooner or later. It can prevent writeout of pageout
> which is less effective than flusher's writeout.
>
> Originally, I reused lru_demote of Peter with some change so added
> his Signed-off-by.
>
> Reported-by: Ben Gamari <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> Changelog since v4:
> - Change function comments - suggested by Johannes
> - Change function name - suggested by Johannes
> - Drop only dirty/writeback pages to deactive pagevec - suggested by Johannes
> - Add acked-by
>
> Changelog since v3:
> - Change function comments - suggested by Johannes
> - Change function name - suggested by Johannes
> - add only dirty/writeback pages to deactive pagevec
>
> Changelog since v2:
> - mapped page leaves alone - suggested by Mel
> - pass part related PG_reclaim in next patch.
>
> Changelog since v1:
> - modify description
> - correct typo
> - add some comment
>
>  include/linux/swap.h |    1 +
>  mm/swap.c            |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/truncate.c        |   17 ++++++++---
>  3 files changed, 91 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4d55932..c335055 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -215,6 +215,7 @@ extern void mark_page_accessed(struct page *);
> extern void lru_add_drain(void);
> extern int lru_add_drain_all(void);
> extern void rotate_reclaimable_page(struct page *page);
> +extern void deactivate_page(struct page *page);
> extern void swap_setup(void);
>
> extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/mm/swap.c b/mm/swap.c
> index c02f936..4aea806 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -39,6 +39,7 @@ int page_cluster;
>
> static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>
> /*
> * This path almost never happens for VM activity - pages are normally
> @@ -347,6 +348,60 @@ void add_page_to_unevictable_list(struct page *page)
> }
>
> /*
> + * If the page can not be invalidated, it is moved to the
> + * inactive list to speed up its reclaim. It is moved to the
> + * head of the list, rather than the tail, to give the flusher
> + * threads some time to write it out, as this is much more
> + * effective than the single-page writeout from reclaim.
> + */
> +static void lru_deactivate(struct page *page, struct zone *zone)
> +{
> +     int lru, file;
> +
> +     if (!PageLRU(page) || !PageActive(page))
> +             return;
> +
> +     /* Some processes are using the page */
> +     if (page_mapped(page))
> +             return;
> +
> +     file = page_is_file_cache(page);
> +     lru = page_lru_base_type(page);
> +     del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
> +     ClearPageActive(page);
> +     ClearPageReferenced(page);
> +     add_page_to_lru_list(zone, page, lru);
> +     __count_vm_event(PGDEACTIVATE);
> +
> +     update_page_reclaim_stat(zone, page, file, 0);
> +}
> +
> +static void ____pagevec_lru_deactivate(struct pagevec *pvec)
> +{
> +     int i;
> +     struct zone *zone = NULL;
> +
> +     for (i = 0; i < pagevec_count(pvec); i++) {
> +             struct page *page = pvec->pages[i];
> +             struct zone *pagezone = page_zone(page);
> +
> +             if (pagezone != zone) {
> +                     if (zone)
> +                             spin_unlock_irq(&zone->lru_lock);
> +                     zone = pagezone;
> +                     spin_lock_irq(&zone->lru_lock);
> +             }

The optimization to avoid taking locks if the zone does not change is
quite subtle

> +             lru_deactivate(page, zone);
> +     }
> +     if (zone)
> +             spin_unlock_irq(&zone->lru_lock);
> +
> +     release_pages(pvec->pages, pvec->nr, pvec->cold);
> +     pagevec_reinit(pvec);
> +}
> +
> +
> +/*
>   * Drain pages out of the cpu's pagevecs.
>   * Either "cpu" is the current CPU, and preemption has already been
>   * disabled; or "cpu" is being hot-unplugged, and is already dead.
> @@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
>               pagevec_move_tail(pvec);
>               local_irq_restore(flags);
>       }
> +
> +     pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> +     if (pagevec_count(pvec))
> +             ____pagevec_lru_deactivate(pvec);
> +}
> +
> +/**
> + * deactivate_page - forcefully deactivate a page
> + * @page: page to deactivate
> + *
> + * This function hints the VM that @page is a good reclaim candidate,
> + * for example if its invalidation fails due to the page being dirty
> + * or under writeback.
> + */
> +void deactivate_page(struct page *page)
> +{
> +     if (likely(get_page_unless_zero(page))) {
> +             struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +
> +             if (!pagevec_add(pvec, page))
> +                     ____pagevec_lru_deactivate(pvec);
> +             put_cpu_var(lru_deactivate_pvecs);
> +     }
> }
>
> void lru_add_drain(void)
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 4d415b3..9ec7bc5 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -328,11 +328,12 @@ EXPORT_SYMBOL(truncate_inode_pages);
> * pagetables.
> */
> unsigned long invalidate_mapping_pages(struct address_space *mapping,
> -                                    pgoff_t start, pgoff_t end)
> +             pgoff_t start, pgoff_t end)
>  {
>       struct pagevec pvec;
>       pgoff_t next = start;
> -     unsigned long ret = 0;
> +     unsigned long ret;
> +     unsigned long count = 0;
>       int i;
>
>       pagevec_init(&pvec, 0);
> @@ -359,8 +360,14 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>                       if (lock_failed)
>                               continue;
>
> -                     ret += invalidate_inode_page(page);
> -
> +                     ret = invalidate_inode_page(page);
> +                     /*
> +                      * Invalidation is a hint that the page is no longer
> +                      * of interest and try to speed up its reclaim.
> +                      */
> +                     if (!ret)
> +                             deactivate_page(page);

Do we need to do this under page_lock? Is there scope for us to reuse
rotate_reclaimable_page() logic?

> +                     count += ret;
>                       unlock_page(page);
>                       if (next > end)
>                               break;
> @@ -369,7 +376,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>               mem_cgroup_uncharge_end();
>               cond_resched();
>       }
> -     return ret;
> +     return count;
> }
> EXPORT_SYMBOL(invalidate_mapping_pages);

--
Three Cheers,
Balbir
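
For completeness, the user-space pattern that the patch description above
refers to (applications syncing before POSIX_FADV_DONTNEED so the pages are
clean enough to drop) looks roughly like the sketch below. It is illustrative
only; the file name and sizes are invented and it is not taken from Ben's
rsync patch.

/* Sketch: drop the page cache behind a just-written backup file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	static char buf[4096];
	int i, fd = open("/tmp/backup.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 1024; i++) {
		if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write");
			break;
		}
	}

	fsync(fd);	/* make the pages clean first ... */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);	/* ... so this can drop them */
	close(fd);
	return 0;
}

The fsync() defeats the kernel's deferred write, which is exactly the
performance loss this series tries to avoid.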

2011-02-18 10:39:34

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] deactivate invalidated pages

Hi Balbir,

On Fri, Feb 18, 2011 at 3:29 PM, Balbir Singh <[email protected]> wrote:
> * MinChan Kim <[email protected]> [2011-02-18 00:08:19]:
>
>> Recently, there are reported problem about thrashing.
>> (http://marc.info/?l=rsync&m=128885034930933&w=2)
>> It happens by backup workloads(ex, nightly rsync).
>> That's because the workload makes just use-once pages
>> and touches pages twice. It promotes the page into
>> active list so that it results in working set page eviction.
>>
>> Some app developer want to support POSIX_FADV_NOREUSE.
>> But other OSes don't support it, either.
>> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>>
>> By other approach, app developers use POSIX_FADV_DONTNEED.
>> But it has a problem. If kernel meets page is writing
>> during invalidate_mapping_pages, it can't work.
>> It makes for application programmer to use it since they always
>> have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
>> make sure the pages could be discardable. At last, they can't use
>> deferred write of kernel so that they could see performance loss.
>> (http://insights.oetiker.ch/linux/fadvise.html)
>>
>> In fact, invalidation is very big hint to reclaimer.
>> It means we don't use the page any more. So let's move
>> the writing page into inactive list's head if we can't truncate
>> it right now.
>>
>> Why I move page to head of lru on this patch, Dirty/Writeback page
>> would be flushed sooner or later. It can prevent writeout of pageout
>> which is less effective than flusher's writeout.
>>
>> Originally, I reused lru_demote of Peter with some change so added
>> his Signed-off-by.
>>
>> Reported-by: Ben Gamari <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> Acked-by: Rik van Riel <[email protected]>
>> Acked-by: Mel Gorman <[email protected]>
>> Reviewed-by: KOSAKI Motohiro <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> ---
>> Changelog since v4:
>>  - Change function comments - suggested by Johannes
>>  - Change function name - suggested by Johannes
>>  - Drop only dirty/writeback pages to deactive pagevec - suggested by Johannes
>>  - Add acked-by
>>
>> Changelog since v3:
>>  - Change function comments - suggested by Johannes
>>  - Change function name - suggested by Johannes
>>  - add only dirty/writeback pages to deactive pagevec
>>
>> Changelog since v2:
>>  - mapped page leaves alone - suggested by Mel
>>  - pass part related PG_reclaim in next patch.
>>
>> Changelog since v1:
>>  - modify description
>>  - correct typo
>>  - add some comment
>>
>>  include/linux/swap.h |    1 +
>>  mm/swap.c            |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  mm/truncate.c        |   17 ++++++++---
>>  3 files changed, 91 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 4d55932..c335055 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -215,6 +215,7 @@ extern void mark_page_accessed(struct page *);
>>  extern void lru_add_drain(void);
>>  extern int lru_add_drain_all(void);
>>  extern void rotate_reclaimable_page(struct page *page);
>> +extern void deactivate_page(struct page *page);
>>  extern void swap_setup(void);
>>
>>  extern void add_page_to_unevictable_list(struct page *page);
>> diff --git a/mm/swap.c b/mm/swap.c
>> index c02f936..4aea806 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -39,6 +39,7 @@ int page_cluster;
>>
>>  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
>>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>>
>>  /*
>>   * This path almost never happens for VM activity - pages are normally
>> @@ -347,6 +348,60 @@ void add_page_to_unevictable_list(struct page *page)
>>  }
>>
>>  /*
>> + * If the page can not be invalidated, it is moved to the
>> + * inactive list to speed up its reclaim.  It is moved to the
>> + * head of the list, rather than the tail, to give the flusher
>> + * threads some time to write it out, as this is much more
>> + * effective than the single-page writeout from reclaim.
>> + */
>> +static void lru_deactivate(struct page *page, struct zone *zone)
>> +{
>> +     int lru, file;
>> +
>> +     if (!PageLRU(page) || !PageActive(page))
>> +             return;
>> +
>> +     /* Some processes are using the page */
>> +     if (page_mapped(page))
>> +             return;
>> +
>> +     file = page_is_file_cache(page);
>> +     lru = page_lru_base_type(page);
>> +     del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
>> +     ClearPageActive(page);
>> +     ClearPageReferenced(page);
>> +     add_page_to_lru_list(zone, page, lru);
>> +     __count_vm_event(PGDEACTIVATE);
>> +
>> +     update_page_reclaim_stat(zone, page, file, 0);
>> +}
>> +
>> +static void ____pagevec_lru_deactivate(struct pagevec *pvec)
>> +{
>> +     int i;
>> +     struct zone *zone = NULL;
>> +
>> +     for (i = 0; i < pagevec_count(pvec); i++) {
>> +             struct page *page = pvec->pages[i];
>> +             struct zone *pagezone = page_zone(page);
>> +
>> +             if (pagezone != zone) {
>> +                     if (zone)
>> +                             spin_unlock_irq(&zone->lru_lock);
>> +                     zone = pagezone;
>> +                     spin_lock_irq(&zone->lru_lock);
>> +             }
>
> The optimization to avoid taking locks if the zone does not change is
> quite subtle

I just used it without much consideration, as it's a common technique for
handling arrays of pages that we already use elsewhere. So I want to keep
it as long as it doesn't add significant overhead.

>
>> +             lru_deactivate(page, zone);
>> +     }
>> +     if (zone)
>> +             spin_unlock_irq(&zone->lru_lock);
>> +
>> +     release_pages(pvec->pages, pvec->nr, pvec->cold);
>> +     pagevec_reinit(pvec);
>> +}
>> +
>> +
>> +/*
>>   * Drain pages out of the cpu's pagevecs.
>>   * Either "cpu" is the current CPU, and preemption has already been
>>   * disabled; or "cpu" is being hot-unplugged, and is already dead.
>> @@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
>>               pagevec_move_tail(pvec);
>>               local_irq_restore(flags);
>>       }
>> +
>> +     pvec = &per_cpu(lru_deactivate_pvecs, cpu);
>> +     if (pagevec_count(pvec))
>> +             ____pagevec_lru_deactivate(pvec);
>> +}
>> +
>> +/**
>> + * deactivate_page - forcefully deactivate a page
>> + * @page: page to deactivate
>> + *
>> + * This function hints the VM that @page is a good reclaim candidate,
>> + * for example if its invalidation fails due to the page being dirty
>> + * or under writeback.
>> + */
>> +void deactivate_page(struct page *page)
>> +{
>> +     if (likely(get_page_unless_zero(page))) {
>> +             struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
>> +
>> +             if (!pagevec_add(pvec, page))
>> +                     ____pagevec_lru_deactivate(pvec);
>> +             put_cpu_var(lru_deactivate_pvecs);
>> +     }
>>  }
>>
>>  void lru_add_drain(void)
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 4d415b3..9ec7bc5 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -328,11 +328,12 @@ EXPORT_SYMBOL(truncate_inode_pages);
>>   * pagetables.
>>   */
>>  unsigned long invalidate_mapping_pages(struct address_space *mapping,
>> -                                    pgoff_t start, pgoff_t end)
>> +             pgoff_t start, pgoff_t end)
>>  {
>>       struct pagevec pvec;
>>       pgoff_t next = start;
>> -     unsigned long ret = 0;
>> +     unsigned long ret;
>> +     unsigned long count = 0;
>>       int i;
>>
>>       pagevec_init(&pvec, 0);
>> @@ -359,8 +360,14 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>>                       if (lock_failed)
>>                               continue;
>>
>> -                     ret += invalidate_inode_page(page);
>> -
>> +                     ret = invalidate_inode_page(page);
>> +                     /*
>> +                      * Invalidation is a hint that the page is no longer
>> +                      * of interest and try to speed up its reclaim.
>> +                      */
>> +                     if (!ret)
>> +                             deactivate_page(page);
>
> Do we need to do this under page_lock? Is there scope for us to reuse
> rotate_reclaimable_page() logic?

Good point.
I don't think we need the page lock. Will fix.
As for rotate_reclaimable_page, its logic is somewhat similar, but the page
flag tests and irq disabling are different, so reusing it would end up
rather ugly as far as I can tell.

If you have a better idea, please do the refactoring after merging.

Thanks for the review, Balbir.

--
Kind regards,
Minchan Kim
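
For comparison with the reuse question above, rotate_reclaimable_page() as it
existed then looked roughly like the sketch below (from memory, not from this
series); the extra page-flag tests and the irq-disabled section around
lru_rotate_pvecs are the differences Minchan refers to:

void rotate_reclaimable_page(struct page *page)
{
	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
	    !PageUnevictable(page) && PageLRU(page)) {
		struct pagevec *pvec;
		unsigned long flags;

		page_cache_get(page);
		local_irq_save(flags);	/* pagevec_move_tail() runs with irqs off */
		pvec = &__get_cpu_var(lru_rotate_pvecs);
		if (!pagevec_add(pvec, page))
			pagevec_move_tail(pvec);
		local_irq_restore(flags);
	}
}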

2011-02-18 11:02:08

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] memcg: move memcg reclaimable page into tail of inactive list

On Fri, Feb 18, 2011 at 1:16 PM, Hiroyuki Kamezawa
<[email protected]> wrote:
> 2011/2/18 Minchan Kim <[email protected]>:
>> Hi Kame,
>>
>> On Fri, Feb 18, 2011 at 1:04 AM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>>> On Fri, 18 Feb 2011 00:08:20 +0900
>>> Minchan Kim <[email protected]> wrote:
>>>
>>>> The rotate_reclaimable_page function moves just written out
>>>> pages, which the VM wanted to reclaim, to the end of the
>>>> inactive list.  That way the VM will find those pages first
>>>> next time it needs to free memory.
>>>> This patch apply the rule in memcg.
>>>> It can help to prevent unnecessary working page eviction of memcg.
>>>>
>>>> Acked-by: Balbir Singh <[email protected]>
>>>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>>>> Reviewed-by: Rik van Riel <[email protected]>
>>>> Cc: KOSAKI Motohiro <[email protected]>
>>>> Cc: Johannes Weiner <[email protected]>
>>>> Signed-off-by: Minchan Kim <[email protected]>
>>>> ---
>>>> Changelog since v4:
>>>>  - add acked-by and reviewed-by
>>>>  - change description - suggested by Rik
>>>>
>>>>  include/linux/memcontrol.h |    6 ++++++
>>>>  mm/memcontrol.c            |   27 +++++++++++++++++++++++++++
>>>>  mm/swap.c                  |    3 ++-
>>>>  3 files changed, 35 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>>> index 3da48ae..5a5ce70 100644
>>>> --- a/include/linux/memcontrol.h
>>>> +++ b/include/linux/memcontrol.h
>>>> @@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>>>>                                       gfp_t gfp_mask);
>>>>  extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
>>>>  extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
>>>> +extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
>>>>  extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
>>>>  extern void mem_cgroup_del_lru(struct page *page);
>>>>  extern void mem_cgroup_move_lists(struct page *page,
>>>> @@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
>>>>       return ;
>>>>  }
>>>>
>>>> +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
>>>> +{
>>>> +     return ;
>>>> +}
>>>> +
>>>>  static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
>>>>  {
>>>>       return ;
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 686f1ce..ab8bdff 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
>>>>       mem_cgroup_del_lru_list(page, page_lru(page));
>>>>  }
>>>>
>>>> +/*
>>>> + * Writeback is about to end against a page which has been marked for immediate
>>>> + * reclaim.  If it still appears to be reclaimable, move it to the tail of the
>>>> + * inactive list.
>>>> + */
>>>> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
>>>> +{
>>>> +     struct mem_cgroup_per_zone *mz;
>>>> +     struct page_cgroup *pc;
>>>> +     enum lru_list lru = page_lru_base_type(page);
>>>> +
>>>> +     if (mem_cgroup_disabled())
>>>> +             return;
>>>> +
>>>> +     pc = lookup_page_cgroup(page);
>>>> +     /*
>>>> +      * Used bit is set without atomic ops but after smp_wmb().
>>>> +      * For making pc->mem_cgroup visible, insert smp_rmb() here.
>>>> +      */
>>>> +     smp_rmb();
>>>> +     /* unused or root page is not rotated. */
>>>> +     if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
>>>> +             return;
>>>> +     mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
>>>> +     list_move_tail(&pc->lru, &mz->lists[lru]);
>>>> +}
>>>> +
>>>
>>> Hmm, I'm sorry I misunderstand this. IIUC, page_lru_base_type() always returns
>>> LRU_INACTIVE_XXX and this function may move page from active LRU to inactive LRU.
>>>
>>> Then, LRU counters for memcg should be updated.
>>
>> Goal of mem_cgroup_rotate_reclaimable_page is same with rotate_reclaimable_page.
>> It means the page was already in inactive list.
>> Look at the check !PageActive(page).
>
> Hmm, ok. If so, could you change
>
> page_lru_base_type() -> page_lru() ?
>
> It's misleading.

No problem. Will resend.

>
> Thanks,
> -Kame
>



--
Kind regards,
Minchan Kim

2011-02-18 16:58:58

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] add profile information for invalidated page

On Fri, Feb 18, 2011 at 12:08:22AM +0900, Minchan Kim wrote:
> This patch adds profile information about invalidated page reclaim.
> It's just for profiling for test so it is never for merging.
>
> Acked-by: Rik van Riel <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
>  include/linux/vmstat.h |    4 ++--
>  mm/swap.c              |    3 +++
>  mm/vmstat.c            |    3 +++
>  3 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 833e676..c38ad95 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -30,8 +30,8 @@
>
>  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>               FOR_ALL_ZONES(PGALLOC),
> -             PGFREE, PGACTIVATE, PGDEACTIVATE,
> -             PGFAULT, PGMAJFAULT,
> +             PGFREE, PGACTIVATE, PGDEACTIVATE, PGINVALIDATE,
> +             PGRECLAIM, PGFAULT, PGMAJFAULT,
>               FOR_ALL_ZONES(PGREFILL),
>               FOR_ALL_ZONES(PGSTEAL),
>               FOR_ALL_ZONES(PGSCAN_KSWAPD),
> diff --git a/mm/swap.c b/mm/swap.c
> index 0a33714..980c17b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -397,6 +397,7 @@ static void lru_deactivate(struct page *page, struct zone *zone)
>                * is _really_ small and  it's non-critical problem.
>                */
>               SetPageReclaim(page);
> +             __count_vm_event(PGRECLAIM);
>       } else {
>               /*
>                * The page's writeback ends up during pagevec

Is this name potentially misleading?

Pages that are reclaimed are accounted for with _steal. It's not particularly
obvious but that's the name it was given. I'd worry that an administrator that
was not aware of *_steal would read pgreclaim as "pages that were reclaimed"
when this is not necessarily the case.

Is there a better name for this? pginvalidate_deferred
or pginvalidate_delayed maybe?

> @@ -409,6 +410,8 @@ static void lru_deactivate(struct page *page, struct zone *zone)
>
>       if (active)
>               __count_vm_event(PGDEACTIVATE);
> +
> +     __count_vm_event(PGINVALIDATE);
>       update_page_reclaim_stat(zone, page, file, 0);
> }
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0c3b504..cbe032b 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -896,6 +896,9 @@ static const char * const vmstat_text[] = {
> "pgactivate",
> "pgdeactivate",
>
> + "pginvalidate",
> + "pgreclaim",
> +
> "pgfault",
> "pgmajfault",
>
> --
> 1.7.1
>

--
Mel Gorman
SUSE Labs

2011-02-18 22:07:03

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] add profile information for invalidated page

Hi Mel,

On Sat, Feb 19, 2011 at 1:58 AM, Mel Gorman <[email protected]> wrote:
> On Fri, Feb 18, 2011 at 12:08:22AM +0900, Minchan Kim wrote:
>> This patch adds profile information about invalidated page reclaim.
>> It's just for profiling for test so it is never for merging.
>>
>> Acked-by: Rik van Riel <[email protected]>
>> Cc: KOSAKI Motohiro <[email protected]>
>> Cc: Wu Fengguang <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> ---
>>  include/linux/vmstat.h |    4 ++--
>>  mm/swap.c              |    3 +++
>>  mm/vmstat.c            |    3 +++
>>  3 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 833e676..c38ad95 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -30,8 +30,8 @@
>>
>>  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>               FOR_ALL_ZONES(PGALLOC),
>> -             PGFREE, PGACTIVATE, PGDEACTIVATE,
>> -             PGFAULT, PGMAJFAULT,
>> +             PGFREE, PGACTIVATE, PGDEACTIVATE, PGINVALIDATE,
>> +             PGRECLAIM, PGFAULT, PGMAJFAULT,
>>               FOR_ALL_ZONES(PGREFILL),
>>               FOR_ALL_ZONES(PGSTEAL),
>>               FOR_ALL_ZONES(PGSCAN_KSWAPD),
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 0a33714..980c17b 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -397,6 +397,7 @@ static void lru_deactivate(struct page *page, struct zone *zone)
>>                * is _really_ small and  it's non-critical problem.
>>                */
>>               SetPageReclaim(page);
>> +             __count_vm_event(PGRECLAIM);
>>       } else {
>>               /*
>>                * The page's writeback ends up during pagevec
>
> Is this name potentially misleading?
>
> Pages that are reclaimed are accounted for with _steal. It's not particularly
> obvious but that's the name it was given. I'd worry that an administrator that
> was not aware of *_steal would read pgreclaim as "pages that were reclaimed"
> when this is not necessarily the case.
>
> Is there a better name for this? pginvalidate_deferred
> or pginvalidate_delayed maybe?
>

Yep. Your suggestion is fair enough. But as I said in the description,
this patch is just for profiling during testing, not for merging, so I
didn't put much care into the name.
I don't think we need a new pginvalidate vmstat.

--
Kind regards,
Minchan Kim

2011-02-18 23:18:03

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] add profile information for invalidated page

On Sat, Feb 19, 2011 at 07:07:01AM +0900, Minchan Kim wrote:
> Hi Mel,
>
> On Sat, Feb 19, 2011 at 1:58 AM, Mel Gorman <[email protected]> wrote:
> > On Fri, Feb 18, 2011 at 12:08:22AM +0900, Minchan Kim wrote:
> >> This patch adds profile information about invalidated page reclaim.
> >> It's just for profiling for test so it is never for merging.
> >>
> >> Acked-by: Rik van Riel <[email protected]>
> >> Cc: KOSAKI Motohiro <[email protected]>
> >> Cc: Wu Fengguang <[email protected]>
> >> Cc: Johannes Weiner <[email protected]>
> >> Cc: Nick Piggin <[email protected]>
> >> Cc: Mel Gorman <[email protected]>
> >> Signed-off-by: Minchan Kim <[email protected]>
> >> ---
> >>  include/linux/vmstat.h |    4 ++--
> >>  mm/swap.c              |    3 +++
> >>  mm/vmstat.c            |    3 +++
> >>  3 files changed, 8 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> >> index 833e676..c38ad95 100644
> >> --- a/include/linux/vmstat.h
> >> +++ b/include/linux/vmstat.h
> >> @@ -30,8 +30,8 @@
> >>
> >>  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >>               FOR_ALL_ZONES(PGALLOC),
> >> -             PGFREE, PGACTIVATE, PGDEACTIVATE,
> >> -             PGFAULT, PGMAJFAULT,
> >> +             PGFREE, PGACTIVATE, PGDEACTIVATE, PGINVALIDATE,
> >> +             PGRECLAIM, PGFAULT, PGMAJFAULT,
> >>               FOR_ALL_ZONES(PGREFILL),
> >>               FOR_ALL_ZONES(PGSTEAL),
> >>               FOR_ALL_ZONES(PGSCAN_KSWAPD),
> >> diff --git a/mm/swap.c b/mm/swap.c
> >> index 0a33714..980c17b 100644
> >> --- a/mm/swap.c
> >> +++ b/mm/swap.c
> >> @@ -397,6 +397,7 @@ static void lru_deactivate(struct page *page, struct zone *zone)
> >>                * is _really_ small and  it's non-critical problem.
> >>                */
> >>               SetPageReclaim(page);
> >> +             __count_vm_event(PGRECLAIM);
> >>       } else {
> >>               /*
> >>                * The page's writeback ends up during pagevec
> >
> > Is this name potentially misleading?
> >
> > Pages that are reclaimed are accounted for with _steal. It's not particularly
> > obvious but that's the name it was given. I'd worry that an administrator that
> > was not aware of *_steal would read pgreclaim as "pages that were reclaimed"
> > when this is not necessarily the case.
> >
> > Is there a better name for this? pginvalidate_deferred
> > or pginvalidate_delayed maybe?
> >
>
> Yep. Your suggestion is fair enough. But as I said in the description,
> this patch is just for profiling during testing, not for merging, so I
> didn't put much care into the name. I don't think we need a new
> pginvalidate vmstat.
>

My bad. I was treating this piece of information as something we'd keep
around and did not read the introduction clearly enough. If it's just for
evaluation, the name does not matter as long as the reviewers know what it
is. The figures look good and I have no problem with the series. I didn't
ack the memcg parts, but only because memcg is not an area I'm familiar enough
with for my ack to carry proper meaning. If there are no other objections, I'd
suggest resubmitting minus this patch.

Thanks.

--
Mel Gorman
SUSE Labs

2011-02-18 23:38:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] add profile information for invalidated page

On Sat, Feb 19, 2011 at 8:17 AM, Mel Gorman <[email protected]> wrote:
> On Sat, Feb 19, 2011 at 07:07:01AM +0900, Minchan Kim wrote:
>> Hi Mel,
>>
>> On Sat, Feb 19, 2011 at 1:58 AM, Mel Gorman <[email protected]> wrote:
>> > On Fri, Feb 18, 2011 at 12:08:22AM +0900, Minchan Kim wrote:
>> >> This patch adds profile information about invalidated page reclaim.
>> >> It's just for profiling for test so it is never for merging.
>> >>
>> >> Acked-by: Rik van Riel <[email protected]>
>> >> Cc: KOSAKI Motohiro <[email protected]>
>> >> Cc: Wu Fengguang <[email protected]>
>> >> Cc: Johannes Weiner <[email protected]>
>> >> Cc: Nick Piggin <[email protected]>
>> >> Cc: Mel Gorman <[email protected]>
>> >> Signed-off-by: Minchan Kim <[email protected]>
>> >> ---
>> >>  include/linux/vmstat.h |    4 ++--
>> >>  mm/swap.c              |    3 +++
>> >>  mm/vmstat.c            |    3 +++
>> >>  3 files changed, 8 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> >> index 833e676..c38ad95 100644
>> >> --- a/include/linux/vmstat.h
>> >> +++ b/include/linux/vmstat.h
>> >> @@ -30,8 +30,8 @@
>> >>
>> >>  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> >>               FOR_ALL_ZONES(PGALLOC),
>> >> -             PGFREE, PGACTIVATE, PGDEACTIVATE,
>> >> -             PGFAULT, PGMAJFAULT,
>> >> +             PGFREE, PGACTIVATE, PGDEACTIVATE, PGINVALIDATE,
>> >> +             PGRECLAIM, PGFAULT, PGMAJFAULT,
>> >>               FOR_ALL_ZONES(PGREFILL),
>> >>               FOR_ALL_ZONES(PGSTEAL),
>> >>               FOR_ALL_ZONES(PGSCAN_KSWAPD),
>> >> diff --git a/mm/swap.c b/mm/swap.c
>> >> index 0a33714..980c17b 100644
>> >> --- a/mm/swap.c
>> >> +++ b/mm/swap.c
>> >> @@ -397,6 +397,7 @@ static void lru_deactivate(struct page *page, struct zone *zone)
>> >>                * is _really_ small and  it's non-critical problem.
>> >>                */
>> >>               SetPageReclaim(page);
>> >> +             __count_vm_event(PGRECLAIM);
>> >>       } else {
>> >>               /*
>> >>                * The page's writeback ends up during pagevec
>> >
>> > Is this name potentially misleading?
>> >
>> > Pages that are reclaimed are accounted for with _steal. It's not particularly
>> > obvious but that's the name it was given. I'd worry that an administrator that
>> > was not aware of *_steal would read pgreclaim as "pages that were reclaimed"
>> > when this is not necessarily the case.
>> >
>> > Is there a better name for this? pginvalidate_deferred
>> > or pginvalidate_delayed maybe?
>> >
>>
>> Yep. Your suggestion is fair enough. But as I said in the description,
>> this patch is just for profiling during testing, not for merging, so I
>> didn't put much care into the name. I don't think we need a new
>> pginvalidate vmstat.
>>
>
> My bad. I was treating this piece of information as something we'd keep
> around and did not read the introduction clearly enough. If it's just for
> evaluation, the name does not matter as long as the reviewers know what it
> is. The figures look good and I have no problem with the series. I didn't
> ack the memcg parts, but only because memcg is not an area I'm familiar enough
> with for my ack to carry proper meaning. If there are no other objections, I'd
> suggest resubmitting minus this patch.

Okay. I will wait for any further comments from others this week and resubmit
the series next week with the profiling patch removed.
Thanks, Mel.

>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs
>



--
Kind regards,
Minchan Kim